
Shell command to extract html source code content given an url

  • 14-03-2016 11:58am
    #1
    Registered Users, Registered Users 2 Posts: 262 ✭✭


    Trying curl and wget with little joy


Comments

  • Moderators, Society & Culture Moderators Posts: 17,643 Mod ✭✭✭✭Graham


    What have you tried and what are you getting back?


  • Registered Users, Registered Users 2 Posts: 262 ✭✭guylikeme


    curl -u <user>|<password><URL>

    gives output of page but not the html source, e.g. I can't find the text that's on the page


  • Registered Users, Registered Users 2 Posts: 10,948 ✭✭✭✭28064212


    guylikeme wrote: »
    curl -u <user>|<password><URL>

    gives output of page but not the html source, e.g. I can't find the text that's on the page
    What do you mean by "output of page"?
    curl www.boards.ie
    
    prints
    <!DOCTYPE html>
    <html>
            <head>
                    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
                    <meta name="viewport" content="width=device-width,initial-scale=1" />
                    <title>boards.ie - Now Ye're Talkin'</title>
    [...]
    
    i.e. the html source

    Boardsie Enhancement Suite - a browser extension to make using Boards on desktop a better experience (includes full-width display, keyboard shortcuts, dark mode, and more). Now available through your browser's extension store.

    Firefox: https://addons.mozilla.org/addon/boardsie-enhancement-suite/

    Chrome/Edge/Opera: https://chromewebstore.google.com/detail/boardsie-enhancement-suit/bbgnmnfagihoohjkofdnofcfmkpdmmce



  • Registered Users, Registered Users 2 Posts: 262 ✭✭guylikeme


    What I want is to get the text of the page.

    So on the boards example, I would get the thread titles.


  • Moderators, Computer Games Moderators Posts: 4,282 Mod ✭✭✭✭deconduo


    guylikeme wrote: »
    What I want is to get the text of the page.

    So on the boards example, I would get the thread titles.

    Can you give us an example of what you are getting, and what you think you should be getting?


  • Registered Users, Registered Users 2 Posts: 36,170 ✭✭✭✭ED E


    Think the OP has the terminology wrong, HTML source != HTML Text Content

    What you want is an HTML parser/scraper like Beautiful Soup, so you can find a tag and get its content.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")  # html holds the page source as a string
    threadName = soup.title.string  # text inside the <title> tag
    

    The syntax is something like that; it's really, really friendly. If you work in Python it's great, and if not, your language probably has an equivalent.
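
    To make the "find a tag and get its content" idea concrete, here is a minimal sketch that uses only Python's standard-library html.parser instead of Beautiful Soup, so it runs without installing anything. The names TagTextParser and tag_text are hypothetical helpers made up for this illustration, and the sample markup is a cut-down stand-in for a real page.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text inside every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while we are inside the target tag
        self.results = []   # one string per matched element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text is only kept while inside the target tag.
        if self.depth > 0:
            self.results[-1] += data


def tag_text(html, tag):
    """Return the stripped text content of each <tag> element in html."""
    parser = TagTextParser(tag)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# Cut-down sample page, not the real boards.ie source.
sample = (
    "<html><head><title>boards.ie - Now Ye're Talkin'</title></head>"
    "<body><p>Hi</p></body></html>"
)
print(tag_text(sample, "title"))
```

    With Beautiful Soup installed, the same lookup is shorter, but the standard-library version shows what the parser is doing: the source is the markup, and the text content is what sits between the tags.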


  • Registered Users, Registered Users 2 Posts: 262 ✭✭guylikeme


    Ok, the output is best explained by doing the following...

    1. Open Internet Explorer to a page that contains text
    2. Save file (html)
    3. Open the html in notepad

    The text of the page is in that file; this is what I want to obtain.


  • Registered Users, Registered Users 2 Posts: 10,948 ✭✭✭✭28064212


    guylikeme wrote: »
    What I want is to get the text of the page.

    So on the boards example, I would get the thread titles.
    What you're looking for is called "scraping". Doing it in the shell is not going to be easy, and it's a bad place to start. I would suggest using something like Python with the Beautiful Soup library.

    It's a complex area. Using the example above of getting boards thread titles from the home page: the thread titles are loaded by JavaScript after the page loads. If you get the homepage source using curl, it won't have the titles because curl doesn't run JavaScript.




  • Registered Users, Registered Users 2 Posts: 6,264 ✭✭✭Buford T Justice


    As has been said above, curl will give you the HTML content. You are looking to parse it and extract certain elements.

    I doubt shell will make that very easy for you at all. An alternative to Python is jsoup in Java.


  • Registered Users, Registered Users 2 Posts: 6,590 ✭✭✭daymobrew


    perl and the LWP::Simple module will download the html file and then you can parse it easily.


  • Registered Users, Registered Users 2 Posts: 6,336 ✭✭✭OfflerCrocGod


    Cheerio for Node provides a jQuery-like API if you are more used to that: https://www.npmjs.com/package/cheerio (example usage here: https://github.com/briandipalma/random-scripts/blob/master/request_videos.js#L41)


  • Moderators, Society & Culture Moderators Posts: 17,643 Mod ✭✭✭✭Graham


    OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

    Is this a one-off task or will you want to scrape content regularly?
    If it's scheduled, will you be scraping the same sites/pages all the time?
    What do you intend to do with the content once you have retrieved it (put it in a database etc.)?
    Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
    If you're a dev, what languages are you familiar with?


  • Registered Users, Registered Users 2 Posts: 262 ✭✭guylikeme


    Graham wrote: »
    OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

    Is this a one-off task or will you want to scrape content regularly?
    If it's scheduled, will you be scraping the same sites/pages all the time?
    What do you intend to do with the content once you have retrieved it (put it in a database etc.)?
    Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
    If you're a dev, what languages are you familiar with?

    One off.

    Put it in a String to simply check if it contains a certain String.

    Java/Python/Bash
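
    For a one-off check like the one described here, a minimal Python sketch might look like the following. It uses only the standard library. The function names fetch_html and page_contains are hypothetical, and the User-Agent header is an assumption (some sites reject the default one), not something required by the task.

```python
import sys
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page and return its HTML source as a string."""
    # Assumption: a browser-like User-Agent, since some sites refuse the default.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_contains(html, needle):
    """Case-insensitive check for needle anywhere in the HTML source."""
    return needle.lower() in html.lower()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_page.py <URL> <text to look for>
    url, needle = sys.argv[1], sys.argv[2]
    print(page_contains(fetch_html(url), needle))
```

    This searches the raw source, markup included, exactly as curl would return it. Text that the browser adds later by running JavaScript will not be found this way.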


  • Closed Accounts Posts: 2,267 ✭✭✭h57xiucj2z946q


    Can you post the url to the actual page?

    I'm guessing the page is using ajax calls or maybe even dreaded iframes, hence you are not finding the text you are looking for in the raw html source for the given url.
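
    One way to test that guess is to list the iframes and external scripts in the source that curl returns. The sketch below does this with Python's standard-library html.parser. EmbedFinder and find_embeds are hypothetical names, and the sample page with its paths is invented for illustration, not taken from any real site.

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Record the src of every <iframe> and external <script> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.iframes = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "iframe" and src:
            self.iframes.append(src)
        elif tag == "script" and src:
            self.scripts.append(src)


def find_embeds(html):
    """Return (iframe sources, script sources) found in the HTML source."""
    finder = EmbedFinder()
    finder.feed(html)
    finder.close()
    return finder.iframes, finder.scripts


# Invented sample: the "threads" container is empty in the raw source.
sample = """
<html><body>
  <div id="threads"></div>
  <iframe src="/widgets/latest.html"></iframe>
  <script src="/js/load-threads.js"></script>
</body></html>
"""
iframes, scripts = find_embeds(sample)
print("iframes:", iframes)
print("scripts:", scripts)
```

    If the text you want is missing from the source but an iframe is listed, fetching the iframe's own URL may return it. If only scripts are listed, the content is probably loaded by ajax and will not appear in anything a plain download returns.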


  • Registered Users, Registered Users 2 Posts: 262 ✭✭guylikeme


    daymobrew wrote: »
    perl and the LWP::Simple module will download the html file and then you can parse it easily.

    Can you show an example


  • Registered Users, Registered Users 2 Posts: 6,590 ✭✭✭daymobrew


    guylikeme wrote: »
    Can you show an example
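    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    
    # Usage: pass the URL as the first argument.
    if ($ARGV[0]) {
      # get() returns undef if the download fails.
      my $html = get($ARGV[0]);
      die "Could not fetch $ARGV[0]\n" unless defined $html;
      
      # Contents of the page are now in $html.
    }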
    

