[Java]-how to read information off webpage?

Logical_Bear · 2013-03-12 09:18:13

Hi, how would you go about getting some (but not all) of the information off a web page? Say, for example, I only wanted the Lotto Plus results from this page on the lotto website: http://www.lottery.ie/Accessible-Results/ Is there some way to parse what's on that page?

12-03-2013 09:18AM

#1

Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

Hi,
how would you go about getting some (but not all) of the information off a web page?

Say, for example, I only wanted the Lotto Plus results from this page on the lotto website: http://www.lottery.ie/Accessible-Results/

Is there some way to parse what's on that page?


Comments

#2 12-03-2013 09:28AM

jester77

Registered Users, Registered Users 2 Posts: 11,264 ✭✭✭✭

Join Date: October 2003

Posts: 10879

Are you allowed to republish the results?

You could use something like Selenium with an HtmlUnit driver.

#3 12-03-2013 09:32AM

seamus

Registered Users, Registered Users 2 Posts: 68,173 ✭✭✭✭

Join Date: July 2001

Posts: 66931

This is typically known as "web scraping".

First thing I would do is check the website to see if they have any kind of feed that you can access instead of the plain web page. The feed is usually much cleaner and easier to collect the data from.

If that doesn't exist, then it's a matter of parsing the HTML to find and extract the data that you need.

Libraries exist that will take the HTML document and build a tree-like object out of it, so you can write something like:

document.element("body").element("table").element("td").value();

Or similar.
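
As a rough illustration of the tree idea above, here is a minimal sketch using the XML DOM parser that ships with the JDK. It assumes the markup is well-formed (real-world HTML often is not, which is where a dedicated HTML parser helps), and the class name `DomTreeSketch`, the method `firstCellText` and the sample page are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomTreeSketch {

    // Builds a DOM tree from well-formed markup and walks body -> table -> td,
    // returning the trimmed text of the first <td> found.
    static String firstCellText(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element body = (Element) doc.getElementsByTagName("body").item(0);
            Element table = (Element) body.getElementsByTagName("table").item(0);
            Element cell = (Element) table.getElementsByTagName("td").item(0);
            return cell.getTextContent().trim();
        } catch (Exception e) {
            // Covers malformed markup as well as a missing body/table/td element.
            throw new IllegalArgumentException("Could not read a table cell from the markup", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page, not the real lottery.ie markup.
        String sample = "<html><body><table><tr><td>example value</td></tr></table></body></html>";
        System.out.println(firstCellText(sample));
    }
}
```

This only works on input that parses as XML; for pages as they appear in the wild, a lenient HTML parser is usually the safer choice.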

#4 12-03-2013 09:46AM
Fluffy88

Registered Users, Registered Users 2 Posts: 1,414 ✭✭✭

Join Date: October 2009

Posts: 1394
There are lots of ways to parse what's on that page:
http://stackoverflow.com/questions/2168610/which-html-parser-is-best

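One option is to use an HTML parser library so you can use simple methods to get the results you need, e.g. using the jsoup parser:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = /* the raw HTML you get from an HTTP request */;
Document doc = Jsoup.parse(html);
// first() returns null if nothing matches the selector
Element lottoPlusOne = doc.select("div.LottoPlus1").first();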

Using an HTML parser is the easiest way to do this, as it will do most of the work for you. Sometimes, though, a parser might be too much for the purpose intended, so you could just parse the page yourself and pull out what you need.
You can use lots of different approaches to do this; probably one of the easiest is to use the indexOf() and substring() methods of the String class.
String html = /* the raw HTML you get from an HTTP request */;
// get the starting position of the text of interest
int beginningOfText = html.indexOf("<div class=\"latest-results LottoPlus1\">");
// get the ending position of the text of interest
int endOfText = html.indexOf("</div>", beginningOfText);
// pull out the text of interest (indexOf returns -1 if a marker is missing, so check before calling substring)
String lottoPlusOneHTML = html.substring(beginningOfText, endOfText);
That would give you the raw HTML of the LottoPlus1 latest results, and you could pull out the info you need from there.

I would advise the HTML parser route. Doing it yourself is pointless unless it's for a college project or something: if the HTML changes slightly it may break a manual approach, whereas the HTML parser would be far more robust.
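
For anyone wanting to try the manual route end to end, here is a minimal sketch that fills in the "raw HTML from an HTTP request" step and guards against missing markers. It assumes Java 11 or later for `java.net.http`; the class name `ManualScrapeSketch`, the helper names `fetch` and `extractBetween`, and the sample markup are made up for illustration, and the real page's markup may differ.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ManualScrapeSketch {

    // Downloads the page body as a String (requires Java 11+ for java.net.http).
    static String fetch(String url) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Returns the text from startMarker up to (not including) the next endMarker,
    // or null if either marker is missing.
    static String extractBetween(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        int end = html.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample markup so the example runs without touching the network.
        String sample = "<p>intro</p><div class=\"latest-results LottoPlus1\">1, 2, 3</div><p>footer</p>";
        System.out.println(extractBetween(sample, "<div class=\"latest-results LottoPlus1\">", "</div>"));
    }
}
```

To run it against a live page, you would pass the result of `fetch(...)` to `extractBetween(...)`. Note that stopping at the first `</div>` cuts the result short if the target div contains nested divs, which is one more reason the parser route tends to hold up better.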

#5 12-03-2013 09:55AM

Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

Thanks all!
Seamus, by feed I take it you mean RSS?

Fluffy88, thanks as well. It's not a college project, more just something to do that I haven't done before, so on that front I might manually code it. But that's a good point: if, say, lotto.ie changes the page setup, I could be parsing gobbledygook as the results!

#6 12-03-2013 09:58AM

seamus

Registered Users, Registered Users 2 Posts: 68,173 ✭✭✭✭

Join Date: July 2001

Posts: 66931

Logical_Bear wrote: »

Thanks all!
Seamus, by feed I take it you mean RSS?

Could be RSS, but many sites provide more modern things like SOAP services or minimal APIs.

The main benefit of them is that they won't suddenly change one day and break your application.

With so many sites offering mobile versions, you can also scrape those instead, as they will be simpler and smaller pages.
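
To show why a feed tends to be easier to work with than a full page, here is a minimal sketch that pulls the item titles out of an RSS 2.0 document using the JDK's built-in XML parser. The class name `FeedSketch`, the method `itemTitles` and the sample feed are made up for illustration; whether a given site publishes a feed at all is something you would need to check.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FeedSketch {

    // Returns the <title> text of every <item> in an RSS document, in document order.
    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                titles.add(item.getElementsByTagName("title").item(0).getTextContent().trim());
            }
            return titles;
        } catch (Exception e) {
            // Covers malformed XML as well as an <item> without a <title>.
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical feed content, not taken from any real site.
        String sample = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First entry</title></item>"
                + "<item><title>Second entry</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(sample));
    }
}
```

Because the feed has a fixed, documented structure, the code depends on element names rather than on page layout, which is the stability benefit mentioned above.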

#7 12-03-2013 10:01AM

croo

Moderators, Technology & Internet Moderators Posts: 1,338 Mod ✭✭✭✭

Join Date: August 2006

Posts: 1316

http://www.boards.ie/vbulletin/showthread.php?p=76424705

#8 12-03-2013 11:47AM

ChRoMe

Registered Users, Registered Users 2 Posts: 2,021 ✭✭✭

Join Date: February 1998

Posts: 1961

This can be dodgy in that it may violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.

#9 12-03-2013 12:42PM

Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

ChRoMe wrote: »

This can be dodgy in that it may violate a site's T&Cs. Make sure you can't get into trouble for reusing data that does not belong to you.

Should I mail the webmaster or something?
I was just using that lotto link as an example.

#10 12-03-2013 12:43PM

ChRoMe

Registered Users, Registered Users 2 Posts: 2,021 ✭✭✭

Join Date: February 1998

Posts: 1961

Logical_Bear wrote: »

Should I mail the webmaster or something?
I was just using that lotto link as an example.

Yeah, that might be an idea. Also check whether there are any terms and conditions listed on the site. Generally, web scraping is something that is really frowned upon.

#11 12-03-2013 01:07PM

Logical_Bear

Closed Accounts Posts: 799 ✭✭✭

Join Date: July 2012

Posts: 768

ChRoMe wrote: »

Yeah, that might be an idea. Also check whether there are any terms and conditions listed on the site. Generally, web scraping is something that is really frowned upon.

Thanks, ChRoMe.
