
Shell command to extract HTML source code content given a URL

  • 14-03-2016 12:58pm
    #1
    Registered Users Posts: 262 ✭✭


    Trying curl and wget with little joy


Comments

  • Moderators, Society & Culture Moderators Posts: 17,642 Mod ✭✭✭✭Graham


    What have you tried and what are you getting back?


  • Registered Users Posts: 262 ✭✭guylikeme


    curl -u <user>:<password> <URL>

    gives output of the page but not the HTML source, e.g. I can't find the text that's on the page


  • Registered Users Posts: 10,494 ✭✭✭✭28064212


    guylikeme wrote: »
    curl -u <user>:<password> <URL>

    gives output of the page but not the HTML source, e.g. I can't find the text that's on the page
    What do you mean by "output of page"?
    curl www.boards.ie
    
    prints
    <!DOCTYPE html>
    <html>
            <head>
                    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
                    <meta name="viewport" content="width=device-width,initial-scale=1" />
                    <title>boards.ie - Now Ye're Talkin'</title>
    [...]
    
    i.e. the HTML source

    Boardsie Enhancement Suite - a browser extension to make using Boards on desktop a better experience (includes full-width display, keyboard shortcuts, dark mode, and more). Now available through your browser's extension store.

    Firefox: https://addons.mozilla.org/addon/boardsie-enhancement-suite/

    Chrome/Edge/Opera: https://chromewebstore.google.com/detail/boardsie-enhancement-suit/bbgnmnfagihoohjkofdnofcfmkpdmmce



  • Registered Users Posts: 262 ✭✭guylikeme


    What I want is to get the text of the page.

    So in the boards example, I would get the thread titles.


  • Moderators, Computer Games Moderators Posts: 4,281 Mod ✭✭✭✭deconduo


    guylikeme wrote: »
    What I want is to get the text of the page.

    So in the boards example, I would get the thread titles.

    Can you give us an example of what you are getting, and what you think you should be getting?


  • Registered Users Posts: 36,166 ✭✭✭✭ED E


    I think the OP has the terminology wrong: HTML source != HTML text content.

    What you want is an HTML parser/scraper like Beautiful Soup, so you can find a tag and get its content.
    soup = ....
    threadName = soup.title.string
    

    Syntax is something like that; it's really, really friendly. If you work in Python it's great, and if not, your language probably has an equivalent.
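
For anyone who wants something runnable, here is a minimal sketch of the "find a tag and get its content" idea. Note that it swaps Beautiful Soup for Python's built-in html.parser module so that it runs without installing anything. TitleParser, extract_title and the sample string are made-up names for illustration, not part of any library.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the text of the <title> tag, or "" if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical sample input, based on the curl output quoted earlier.
sample = (
    "<html><head><title>boards.ie - Now Ye're Talkin'</title></head>"
    "<body></body></html>"
)
print(extract_title(sample))
```

The same pattern should work for other tags by changing the tag name being watched, though a dedicated library is likely to cope better with messy real-world markup.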


  • Registered Users Posts: 262 ✭✭guylikeme


    Ok, the output is best explained by doing the following...

    1. Open Internet Explorer to a page that contains text
    2. Save file (html)
    3. Open the html in Notepad

    The text of the page is in that file - this is what I want to obtain.
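
If the goal is just the text from a saved HTML file, one possible approach is to strip the tags and skip anything inside script and style blocks. The sketch below assumes Python's standard html.parser is good enough for the page in question. TextExtractor, page_text and the sample markup are hypothetical names and data used only for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only nodes and anything inside a skipped tag.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def page_text(html):
    """Return the text nodes of the page joined by single spaces."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical sample standing in for the contents of a saved .html file.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Hello</h1><p>Some <b>text</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(page_text(sample))
```

For a file saved from the browser, the string passed to page_text could come from open(path, encoding="utf-8").read(), assuming the file really is UTF-8.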


  • Registered Users Posts: 10,494 ✭✭✭✭28064212


    guylikeme wrote: »
    What I want is to get the text of the page.

    So in the boards example, I would get the thread titles.
    What you're looking for is called "scraping". Doing it in the shell is not going to be easy, and the shell is a bad place to start. I would suggest using something like Python with the Beautiful Soup library.

    It's a complex area. Take the example above of getting boards thread titles from the home page: the thread titles are loaded by JavaScript after the page loads. If you get the homepage source using curl, it won't have the titles, because curl doesn't run JavaScript.




  • Registered Users Posts: 6,250 ✭✭✭Buford T Justice


    As has been said above, curl will give you the HTML content. You are looking to parse it and extract certain elements.

    I doubt the shell will make that very easy for you at all. An alternative to Python is jsoup in Java.


  • Registered Users Posts: 6,494 ✭✭✭daymobrew


    Perl and the LWP::Simple module will download the HTML file, and then you can parse it easily.


  • Registered Users Posts: 6,306 ✭✭✭OfflerCrocGod


    Cheerio for Node provides a jQuery-like API, if you are more used to that: https://www.npmjs.com/package/cheerio
    Example usage: https://github.com/briandipalma/random-scripts/blob/master/request_videos.js#L41


  • Moderators, Society & Culture Moderators Posts: 17,642 Mod ✭✭✭✭Graham


    OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

    Is this a one-off task or will you want to scrape content regularly?
    If it's scheduled, will you be scraping the same sites/pages all the time?
    What do you intend to do with the content once you have retrieved it (put it in a database, etc.)?
    Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
    If you're a dev, what languages are you familiar with?


  • Registered Users Posts: 262 ✭✭guylikeme


    Graham wrote: »
    OP, you might get some more useful/specific suggestions if you explain to us what problem you're trying to solve.

    Is this a one-off task or will you want to scrape content regularly?
    If it's scheduled, will you be scraping the same sites/pages all the time?
    What do you intend to do with the content once you have retrieved it (put it in a database, etc.)?
    Are you a developer looking to put together your own solution or are you just looking for the quickest/easiest way to grab the content?
    If you're a dev, what languages are you familiar with?

    One off.

    Put it in a String to simply check whether it contains a certain String.

    Java/Python/Bash
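
For a one-off "does the page contain this string" check, a rough Python sketch using only the standard library might look like the following. fetch_html and contains_text are hypothetical helper names, the URL must include its scheme (for example http://), and the sketch does not handle the login that the curl -u attempt suggests the real page needs.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Download the page at `url` and return its source as a string."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def contains_text(html, needle):
    """Case-insensitive check for `needle` anywhere in the page source."""
    return needle.lower() in html.lower()


# Example with an inline string, so no network access is needed.
sample = "<html><body><h1>Thread titles</h1></body></html>"
print(contains_text(sample, "thread titles"))
```

Against a live page this would be called as contains_text(fetch_html(url), "some text"). Bear in mind that it only sees the raw source, so text added later by JavaScript will not be found.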


  • Closed Accounts Posts: 2,267 ✭✭✭h57xiucj2z946q


    Can you post the url to the actual page?

    I'm guessing the page is using AJAX calls or maybe even the dreaded iframes, which is why you are not finding the text you are looking for in the raw HTML source for the given URL.
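
One way to test that guess is to list the iframes and scripts that the raw source refers to. The sketch below is only a diagnostic aid built on Python's html.parser. EmbedFinder, find_embeds and the sample markup are invented for illustration, and finding a script tag does not prove that the missing text is loaded by it.

```python
from html.parser import HTMLParser


class EmbedFinder(HTMLParser):
    """Records the src of every <iframe> and <script> tag in the source."""

    def __init__(self):
        super().__init__()
        self.iframes = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "iframe":
            self.iframes.append(src)
        elif tag == "script":
            # A src of None means the script is inline.
            self.scripts.append(src)


def find_embeds(html):
    """Return (iframe_srcs, script_srcs) found in the HTML source."""
    finder = EmbedFinder()
    finder.feed(html)
    finder.close()
    return finder.iframes, finder.scripts


# Hypothetical sample page with one iframe, one external and one inline script.
sample = (
    '<html><body><iframe src="/inner.html"></iframe>'
    '<script src="/app.js"></script><script>load()</script></body></html>'
)
iframes, scripts = find_embeds(sample)
print("iframes:", iframes)
print("scripts:", scripts)
```

If an iframe shows up, fetching its src separately may reveal the text. If only scripts show up, the content is probably being built in the browser.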


  • Registered Users Posts: 262 ✭✭guylikeme


    daymobrew wrote: »
    Perl and the LWP::Simple module will download the HTML file, and then you can parse it easily.

    Can you show an example?


  • Registered Users Posts: 6,494 ✭✭✭daymobrew


    guylikeme wrote: »
    Can you show an example?
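    #!/usr/bin/perl -w
    
    use strict;
    use LWP::Simple qw(get);
    
    # Usage: pass the URL as the first command-line argument.
    if ($ARGV[0]) {
      # get() returns undef if the download fails.
      my $html = get $ARGV[0];
      die "Could not fetch $ARGV[0]\n" unless defined $html;
      
      # Contents of the page are now in $html.
    }
    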

