
[Java]-how to read information off webpage?

  • 12-03-2013 10:18am
    #1
    Closed Accounts Posts: 799 ✭✭✭


    Hi,
    How would you go about getting some information (but not all) off a web page?

    Say for example I only wanted the lotto plus results from this page on the lotto website http://www.lottery.ie/Accessible-Results/

    Is there some way to parse what's on that page?


Comments

  • Registered Users Posts: 11,262 ✭✭✭✭jester77


    Are you allowed to republish the results?

    You could use something like Selenium with the HtmlUnit driver.


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    This is typically known as "web scraping".

    First thing I would do is check the website to see if they have any kind of feed that you can access instead of the plain web page. The feed is usually much cleaner and easier to collect the data from.

    If that doesn't exist, then it's a matter of parsing the HTML to find and extract the data that you need.

    Libraries probably exist that will take the HTML document and build a tree-like object out of it, e.g.

    document.element("body").element("table").element("td").value();

    Or similar.
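    As a rough sketch of that tree idea using only the JDK: the built-in DOM parser and XPath can walk a document much like the chained calls above. The class name, method name and sample markup here are made up for illustration. One big assumption: the markup must be well-formed XML, which real web pages often are not, so a tolerant HTML parser is usually the safer choice.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TreeWalkSketch {
    // Parses well-formed markup into a DOM tree and returns the text of the first table cell.
    static String firstCell(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Walk html -> body -> table -> tr -> td, similar to chained element() calls.
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/html/body/table/tr/td[1]", doc).trim();
        } catch (Exception e) {
            // Covers parser configuration, malformed markup and XPath errors.
            throw new IllegalArgumentException("Could not parse the markup", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page, not the real lottery markup.
        String page = "<html><body><table><tr><td>4</td><td>11</td></tr></table></body></html>";
        System.out.println(firstCell(page)); // prints 4
    }
}
```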


  • Registered Users Posts: 1,414 ✭✭✭Fluffy88


    There are lots of ways to parse what's on that page :)
    http://stackoverflow.com/questions/2168610/which-html-parser-is-best

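    One option is to use an HTML parser library so you can use simple methods to get the results you need, e.g. using the jsoup parser:
    // needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.nodes.Element
    String html = /* the raw HTML you get from an HTTP request */;
    Document doc = Jsoup.parse(html);
    // first() returns null if nothing on the page matches the selector
    Element lottoPlusOne = doc.select("div.LottoPlus1").first();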
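    For the "raw HTML from an HTTP request" part, here is one possible sketch that uses only the JDK. The class and method names are hypothetical, it assumes the page is UTF-8 encoded, and the address is just the one from the opening post, which may not respond the same way today.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class FetchSketch {
    // Downloads whatever the URL points at and returns it as a String.
    // Assumes UTF-8; a real page may declare a different charset.
    static String fetchHtml(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = fetchHtml("http://www.lottery.ie/Accessible-Results/");
        System.out.println(html.length() + " characters downloaded");
    }
}
```

    The returned string is what you would hand to the parser, or to the manual approach.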
    


    Using an HTML parser is the easiest way to do this, as it will do most of the work for you. Sometimes, though, a parser might be overkill for what you need, so you could just parse the page yourself and pull out what you want.
    There are lots of different approaches to this; probably one of the easiest is to use the indexOf() and substring() methods of the String class.
    String html = /* the raw HTML you get from an HTTP request */;
    
    // get the starting position of the text we're interested in
    int beginningOfText = html.indexOf("<div class=\"latest-results LottoPlus1\">");
    
    // get the ending position: the first closing tag after the start
    int endOfText = html.indexOf("</div>", beginningOfText);
    
    // pull out the text we're interested in
    // (indexOf returns -1 when nothing is found, so check both positions first in real code)
    String lottoPlusOneHTML = html.substring(beginningOfText, endOfText);
    
    That would give you the raw HTML of the LottoPlus1 latest results (starting at the opening div tag and stopping just before the first closing </div>), and you could pull out the info you need from there.
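    A self-contained version of the manual approach might look like the sketch below. The helper name between() and the sample markup are invented for illustration, and unlike the snippet above it skips past the opening tag so only the inner text comes back. It also assumes there are no nested divs inside the block, since it stops at the first closing tag it finds.

```java
class SubstringSketch {
    // Returns the text between the first startMarker and the next endMarker,
    // or null if either marker is missing.
    static String between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start == -1) {
            return null; // marker not found, the page layout has probably changed
        }
        start += startMarker.length(); // skip past the opening tag itself
        int end = html.indexOf(endMarker, start);
        if (end == -1) {
            return null;
        }
        return html.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Made-up sample, not the real lottery markup.
        String html = "<div class=\"latest-results LottoPlus1\"> 3, 9, 17, 22, 30, 41 </div>";
        System.out.println(between(html, "<div class=\"latest-results LottoPlus1\">", "</div>"));
        // prints 3, 9, 17, 22, 30, 41
    }
}
```

    Returning null when a marker is missing at least makes a layout change show up as an obvious failure instead of an exception from substring().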

    I would advise the HTML parser route. Doing it yourself is pointless unless it's for a college project or something: if the HTML changes slightly it may break a manual approach, whereas the HTML parser would be far more robust.


  • Closed Accounts Posts: 799 ✭✭✭Logical_Bear


    thanks all!
    Seamus, by feed I take it you mean RSS?

    Fluffy88, thanks as well. It's not a college project, more just something to do that I haven't done before, so on that front I might manually code it. But that's a good point: if, say, lotto.ie changes the page setup, I could be parsing gobbledygook as the results!


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    Logical_Bear wrote: »
    thanks all!
    Seamus, by feed I take it you mean RSS?
    Could be RSS, but many sites provide more modern things like SOAP services or minimal APIs.

    The main benefit of them is that they won't suddenly change one day and break your application.

    With so many sites offering mobile versions, you can also exploit them as they will be simpler and smaller pages.
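    To show why a feed is easier to work with than a full page, here is a small sketch that pulls the item titles out of an RSS 2.0 document with the JDK's DOM parser. The feed content, class name and method name are all made up; whether a given site offers a feed at all, and what it puts in it, is something you would have to check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FeedSketch {
    // Returns the <title> of every <item> in an RSS 2.0 document.
    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            // Covers parser configuration and malformed XML.
            throw new IllegalArgumentException("Could not parse the feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed for illustration; a real feed's layout is up to the site.
        String feed = "<rss version=\"2.0\"><channel><title>Results</title>"
                + "<item><title>Draw A: 1 2 3</title></item>"
                + "<item><title>Draw B: 4 5 6</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed)); // prints [Draw A: 1 2 3, Draw B: 4 5 6]
    }
}
```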


  • Moderators, Technology & Internet Moderators Posts: 1,333 Mod ✭✭✭✭croo




  • Registered Users Posts: 2,021 ✭✭✭ChRoMe


    This can be dodgy in that it can violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.


  • Closed Accounts Posts: 799 ✭✭✭Logical_Bear


    ChRoMe wrote: »
    This can be dodgy in that it can violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.
    should I mail the webmaster or something?
    Was just using that lotto link as an example


  • Registered Users Posts: 2,021 ✭✭✭ChRoMe


    Logical_Bear wrote: »
    should I mail the webmaster or something?
    Was just using that lotto link as an example

    Yeah that might be an idea, check if there are any terms and conditions listed on the site. Generally web scraping is something that is really frowned upon.


  • Closed Accounts Posts: 799 ✭✭✭Logical_Bear


    ChRoMe wrote: »
    Yeah that might be an idea, check if there are any terms and conditions listed on the site. Generally web scraping is something that is really frowned upon.
    thanks ChRoMe.

