Finding position of certain data

Lmao_Man · 05-07-2013 12:04am #1

I've started messing around with some code just to try to learn a few things about Java, but I've run into a problem.

Basically, what I need to do is find the position of the "registered on" date from the following page: http://www.whois.com/whois/example.co.uk (I think this is the correct site; I can't check the HTML at the moment because I'm on my mobile, and some sites don't display the data in pure HTML). So, for example, I want to get "before-Aug-1996" from the link I gave and nothing else.

I figured out how to print the whole line that contains the date, using if (nextLine.contains("registered on")) etc., but that gives far more info than I need, and I'm kind of stuck.

I probably haven't explained this well enough, but I'm just wondering if there are any tips you have for doing this. I assume I'll need a loop anyway? It would probably also be good to start checking for the string "registered on" halfway down the page, to avoid wasting time searching the start of the page.
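For reference, a minimal sketch of the line-by-line idea described above: loop over the lines, find the position of the marker with indexOf, and keep only what follows it. It assumes the line holding the date is plain text of the form "Registered on: <date>"; the class name, method name, and sample line are made up for illustration, and the real page markup may differ.

```java
import java.util.Locale;

final class RegisteredOnPosition {

    private static final String MARKER = "registered on";

    /**
     * Returns the text that follows the marker on the first line containing it,
     * or null if no line contains the marker.
     */
    static String textAfterMarker(String page) {
        for (String line : page.split("\\R")) {
            // Search a lower-cased copy so "Registered on" also matches.
            // For ASCII text the lower-cased copy has the same length,
            // so the position is valid in the original line too.
            int pos = line.toLowerCase(Locale.ROOT).indexOf(MARKER);
            if (pos >= 0) {
                String rest = line.substring(pos + MARKER.length());
                // Drop a leading colon, then any surrounding whitespace.
                if (rest.startsWith(":")) {
                    rest = rest.substring(1);
                }
                return rest.trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed sample line, not copied from the real page.
        String page = "Relevant dates:\n    Registered on: before-Aug-1996\n";
        System.out.println(textAfterMarker(page));
    }
}
```

If the line still contains HTML tags, the returned text will include them, so this only works as-is on plain-text output.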

damned_junkie · 05-07-2013 1:54am

Sounds like a job for regular expressions.

http://www.vogella.com/articles/JavaRegularExpressions/article.html

ChRoMe · 05-07-2013 7:20am

damned_junkie wrote: »

Sounds like a job for regular expressions.

http://www.vogella.com/articles/JavaRegularExpressions/article.html

now he has two problems

Atomic Pineapple · 05-07-2013 12:06pm

Jsoup can help you parse the HTML of the page

http://jsoup.org/

Not sure if that's actually what you're after?

Lmao_Man · 05-07-2013 4:33pm

draffodx wrote: »

Jsoup can help you parse the HTML of the page

http://jsoup.org/

Not sure if that's actually what you're after?

Thanks everyone, saw this mentioned before actually.

In simple terms, all I want to do is print what date the domain was registered on if that makes sense?

Atomic Pineapple · 05-07-2013 5:00pm

Lmao_Man wrote: »

Thanks everyone, saw this mentioned before actually.

In simple terms, all I want to do is print what date the domain was registered on if that makes sense?

I haven't looked at the HTML of the site but Jsoup may be able to help you parse the exact tag that holds the value, then you should be able to cleanly print it.

fergalr · 05-07-2013 8:38pm

Regular expressions are the right tool for this.

Even if you parsed the HTML, you'd probably still use regular expressions to strip out everything but the actual date.

If the pages are in a regular format, I wouldn't even bother parsing the HTML; I'd just use regular expressions to do everything.
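As a rough sketch of what that could look like in Java, the pattern below matches the "registered on" label and captures the token right after it. The class name, method name, and the sample HTML line are assumptions made for illustration; the pattern also assumes the date contains no spaces and ends at whitespace or the next tag.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegisteredOnRegex {

    // Label, optional colon, optional whitespace, then capture everything
    // up to the next whitespace character or '<' (the start of a tag).
    private static final Pattern REGISTERED_ON =
            Pattern.compile("registered on:?\\s*([^\\s<]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the captured date token, or an empty Optional if the label is absent. */
    static Optional<String> findDate(String text) {
        Matcher m = REGISTERED_ON.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed markup, not copied from the real page.
        String line = "<div>Registered on: before-Aug-1996<br></div>";
        System.out.println(findDate(line).orElse("not found"));
    }
}
```

The capture group is what strips out everything but the date, so no separate clean-up step is needed once the pattern fits the page.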

BUT:
Not every whois page returns information in that form.
So, to extract the dates generically, you would have to write one regex per type of whois page. That could be a big job.
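To illustrate the "one regex per type of page" point, here is a small sketch that tries a list of patterns in order and returns the first match. The two labels used ("Registered on:" and "Creation Date:") are assumptions about what two different formats might look like, the names are hypothetical, and a real list would need an entry for every format you care about.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhoisDatePatterns {

    // One pattern per known output format; both labels are assumed examples.
    private static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("Registered on:\\s*(\\S+)"),
            Pattern.compile("Creation Date:\\s*(\\S+)"));

    /** Tries each pattern in turn and returns the first captured date token. */
    static Optional<String> findRegistrationDate(String whoisText) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(whoisText);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample input in the first format.
        String sample = "Registered on: before-Aug-1996";
        System.out.println(findRegistrationDate(sample).orElse("unknown format"));
    }
}
```

Every new format means another pattern in the list, which is where the "big job" comes from.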

If this is just a learning experience, then go and learn about regular expressions. Be aware that if you wanted to process more complicated pages, you should use an HTML parsing library.

But if you actually want a structured database of whois data, pay someone else who has already done it.

amen · 05-07-2013 9:47pm

Silly question but why are you looking at the html page?

Nearly all top-level domain registries (.com, .gov, .co.uk, .ie etc.) allow you to call their registration service with a URL, and they will return the data to you in a known standard format.

You can just extract the data you want. Much easier than parsing an HTML page.
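A rough sketch of that idea, assuming the service meant here is the plain-text WHOIS protocol on TCP port 43 (RFC 3912) rather than an HTTP URL. The server name whois.nic.uk and the "Registered on" label are assumptions to check against the registry's own documentation, and the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class WhoisClient {

    /** Sends one query to a WHOIS server on TCP port 43 and returns the raw reply. */
    static String query(String server, String domain) throws IOException {
        try (Socket socket = new Socket(server, 43)) {
            OutputStream out = socket.getOutputStream();
            // The request is a single line of text terminated by CRLF.
            out.write((domain + "\r\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // The server sends its answer and then closes the connection.
            StringBuilder reply = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    reply.append(line).append('\n');
                }
            }
            return reply.toString();
        }
    }

    /** Returns the value after "label:" on the first line starting with it, or null. */
    static String field(String reply, String label) {
        for (String line : reply.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(label + ":")) {
                return trimmed.substring(label.length() + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.out.println("usage: java WhoisClient <server> <domain>");
            return;
        }
        // Example (assumed values): java WhoisClient whois.nic.uk example.co.uk
        System.out.println(field(query(args[0], args[1]), "Registered on"));
    }
}
```

Keeping the network call and the field extraction in separate methods means the parsing can be tried on saved replies without hitting the server each time.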

jmcc · 06-07-2013 3:35pm

amen wrote: »

Silly question but why are you looking at the html page?

Nearly all top-level domain registries (.com, .gov, .co.uk, .ie etc.) allow you to call their registration service with a URL, and they will return the data to you in a known standard format.

You can just extract the data you want. Much easier than parsing an HTML page.

Actually it is not that simple. With the .com and other gTLDs, the registry maintains a thin WHOIS which just redirects you to the WHOIS maintained by the registrars (accredited resellers). The text and structure on those results varies considerably.

Regards...jmcc

amen · 06-07-2013 4:13pm

Ohh, it's changed a bit then since I last looked at it.

"The text and structure on those results varies considerably"

If that's true then he has even more problems, as you can't assume that each registrar will use the same language/wording as the others.

jmcc · 06-07-2013 4:19pm

amen wrote: »

Ohh, it's changed a bit then since I last looked at it.

If it is only a registration date, then the full recursive lookup is not needed. That might make it relatively simple, as it will be the same set of data being parsed. However, it can vary from TLD to TLD. With a thin WHOIS, the registry generally just maintains the basic data and the redirection URL. With some versions of 'whois', the -n switch prevents the redirect being followed and just provides the bare registration details.

Regards...jmcc
