[Java] How to read information off a webpage?

Logical_Bear · 2013-03-12 09:18:13

#1 12-03-2013 10:18am
Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

Hi,
how would you go about getting some information (but not all) off a web page?

Say, for example, I only wanted the Lotto Plus results from this page on the lotto website: http://www.lottery.ie/Accessible-Results/

Is there some way to parse what's on that page?

#2 12-03-2013 10:28am
jester77

Registered Users Posts: 11,262 ✭✭✭✭

Join Date: October 2003

Posts: 10877

Are you allowed to republish the results?

You could use something like Selenium with an HtmlUnit driver.

#3 12-03-2013 10:32am
seamus

Registered Users Posts: 68,317 ✭✭✭✭

Join Date: July 2001

Posts: 67070

This is typically known as "web scraping".

First thing I would do is check the website to see if they have any kind of feed that you can access instead of the plain web page. The feed is usually much cleaner and easier to collect the data from.

If that doesn't exist, then it's a matter of parsing the HTML to find and extract the data that you need.

Libraries probably exist that will take the HTML document and build a tree-like object out of it, e.g.

document.element("body").element("table").element("td").value();

Or similar.

#4 12-03-2013 10:46am
Fluffy88

Registered Users Posts: 1,414 ✭✭✭

Join Date: October 2009

Posts: 1394

There are lots of ways to parse what's on that page:
http://stackoverflow.com/questions/2168610/which-html-parser-is-best

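One option is to use an HTML parser library, so you can use simple methods to get the results you need, e.g. using the jsoup parser:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = /* the raw HTML you get from an HTTP request */;
Document doc = Jsoup.parse(html);
// first <div> whose class list contains LottoPlus1 (null if there is none)
Element lottoPlusOne = doc.select("div.LottoPlus1").first();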

Using an HTML parser is the easiest way to do this, as it will do most of the work for you. Sometimes, though, a parser might be too much for the purpose intended, so you could just parse the page yourself and pull out what you need.
You can use lots of different approaches to do this; probably one of the easiest is to use the indexOf() and substring() methods of the String class.

String html = /* the raw HTML you get from an HTTP request */;
// get the starting position of the text of interest (indexOf returns -1 if it is not found)
int beginningOfText = html.indexOf("<div class=\"latest-results LottoPlus1\">");
// get the ending position of the text of interest
int endOfText = html.indexOf("</div>", beginningOfText);
// pull out the text of interest
String lottoPlusOneHTML = html.substring(beginningOfText, endOfText);

That would give you the raw HTML of the LottoPlus1 latest results, starting at the opening <div> tag, and you could pull out the info you need from there.

I would advise the HTML parser route. Doing it yourself is pointless unless it's for a college project or something: if the HTML changes slightly it may break a manual approach, whereas the HTML parser would be far more robust.
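
As a rough illustration of the manual approach and of why it is fragile, here is a minimal, self-contained sketch using only the standard library. The class name `ManualExtract`, the helper `between`, and the sample markup are all made up for the example; the real page's markup may well differ. The helper returns an empty `Optional` when a marker is missing, rather than letting `substring()` throw.

```java
import java.util.Optional;

/** Hypothetical helper illustrating the manual indexOf()/substring() approach. */
class ManualExtract {

    /**
     * Returns the trimmed text between the first occurrence of startMarker and
     * the next occurrence of endMarker, or empty if either marker is missing.
     */
    static Optional<String> between(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty(); // marker not found, e.g. the page layout changed
        }
        int contentStart = start + startMarker.length();
        int end = html.indexOf(endMarker, contentStart);
        if (end < 0) {
            return Optional.empty(); // no closing marker after the start marker
        }
        return Optional.of(html.substring(contentStart, end).trim());
    }

    public static void main(String[] args) {
        // Made-up sample markup and numbers, assumed only for this example.
        String html = "<div class=\"latest-results LottoPlus1\"> 3 11 24 </div>";
        Optional<String> result =
                between(html, "<div class=\"latest-results LottoPlus1\">", "</div>");
        System.out.println(result.orElse("not found")); // prints "3 11 24"
    }
}
```

Note that this stops at the first closing tag after the start marker, so it would cut the content short if the block contained nested `<div>` elements. That is one of the cases where a real HTML parser is the safer choice.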

#5 12-03-2013 10:55am
Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

Thanks all!
Seamus, by feed I take it you mean RSS?

Fluffy88, thanks as well. It's not a college project, more just something to do that I haven't done before, so on that front I might code it manually. But that's a good point: if, say, lotto.ie changes the page setup, I could be parsing gobbledygook as the results!

#6 12-03-2013 10:58am
seamus

Registered Users Posts: 68,317 ✭✭✭✭

Join Date: July 2001

Posts: 67070

Logical_Bear wrote: »

Thanks all!
Seamus, by feed I take it you mean RSS?

Could be RSS, but many sites provide more modern things like SOAP services or minimal APIs.

Their main benefit is that they are far less likely to change suddenly one day and break your application.

With so many sites offering mobile versions, you can also scrape those instead, as they tend to be simpler and smaller pages.
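
If a site does offer an RSS feed, reading it can be sketched with the XML parser that ships with Java, as below. This is only an illustration under assumptions: the class name `FeedTitles`, the helper `itemTitles`, and the sample feed are invented for the example, and nothing in this thread confirms that the lottery site publishes a feed at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper that lists the <title> of every <item> in an RSS document. */
class FeedTitles {

    static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Ask the parser to apply its secure-processing limits to untrusted input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Only look at titles inside this item, not the channel title.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up feed content, assumed only for this example.
        String rss = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First entry</title></item>"
                + "<item><title>Second entry</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(rss)); // prints [First entry, Second entry]
    }
}
```

Because a feed is well-formed XML with a fixed set of element names, this kind of code tends to survive site redesigns that would break a scraper tied to the page's HTML.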

#7 12-03-2013 11:01am
croo

Moderators, Technology & Internet Moderators Posts: 1,333 Mod ✭✭✭✭

Join Date: August 2006

Posts: 1311

http://www.boards.ie/vbulletin/showthread.php?p=76424705

#8 12-03-2013 12:47pm
ChRoMe

Registered Users Posts: 2,021 ✭✭✭

Join Date: February 1998

Posts: 1961

This can be dodgy, as it may violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.

#9 12-03-2013 1:42pm
Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

ChRoMe wrote: »

This can be dodgy, as it may violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.

Should I mail the webmaster or something?
I was just using that lotto link as an example.

#10 12-03-2013 1:43pm
ChRoMe

Registered Users Posts: 2,021 ✭✭✭

Join Date: February 1998

Posts: 1961

Logical_Bear wrote: »

Should I mail the webmaster or something?
I was just using that lotto link as an example.

Yeah, that might be an idea. Check whether there are any terms and conditions listed on the site. Generally, web scraping is really frowned upon.

#11 12-03-2013 2:07pm
Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

ChRoMe wrote: »

Yeah, that might be an idea. Check whether there are any terms and conditions listed on the site. Generally, web scraping is really frowned upon.

Thanks, ChRoMe.
