
Finding position of certain data

  • 05-07-2013 12:04am
    #1
    Registered Users Posts: 605 ✭✭✭Lmao_Man


    I've started messing around with some code just to try to learn a few things about Java, but I've run into a problem.

    Basically, what I need to do is find the position of the "registered on" date on the following page: http://www.whois.com/whois/example.co.uk (I think this is the correct site, but I can't check the HTML at the moment because I'm on my mobile, and some sites don't display the data in pure HTML). So for example I want to get "before-Aug-1996" from the link I gave and nothing else.

    I figured out how to print the whole line that includes the date, using if (nextLine.contains("registered on")) etc., but that gives far more info than I need, so I'm kind of stuck.

    I probably haven't explained this well enough, but I'm just wondering if there are any tips you have for doing this. I assume I'll need a loop anyway? It would probably also be good to start checking for the string "registered on" halfway down the page, to stop wasting time searching the start of the page.
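
For the "position" part of the question, a minimal sketch using only String.indexOf and substring might look like the following. The class name, the method name and the sample line are made up for illustration, since the real markup of the page was not checked in the thread. The match is case-sensitive, so the label has to be passed exactly as it appears in the line.

```java
class RegisteredOnIndexOf {

    // Returns the text that follows `label` on the line, or null if the
    // label does not occur. Hypothetical helper, not part of any library.
    static String valueAfter(String line, String label) {
        int pos = line.indexOf(label);            // position of the label, -1 if absent
        if (pos < 0) {
            return null;
        }
        String rest = line.substring(pos + label.length());
        if (rest.startsWith(":")) {               // drop the colon after the label
            rest = rest.substring(1);
        }
        int tag = rest.indexOf('<');              // cut at the next HTML tag, if any
        if (tag >= 0) {
            rest = rest.substring(0, tag);
        }
        return rest.trim();
    }

    public static void main(String[] args) {
        // Assumed sample line; the real page may be laid out differently.
        String nextLine = "Registered on: before Aug-1996<br>";
        System.out.println(valueAfter(nextLine, "Registered on")); // prints: before Aug-1996
    }
}
```

String.indexOf also has a two-argument form, indexOf(str, fromIndex), which is one way to start searching partway through a string, although for a single page the saving is unlikely to matter.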


Comments

  • Registered Users Posts: 44 damned_junkie




  • Registered Users Posts: 2,021 ✭✭✭ChRoMe



    now he has two problems :p


  • Registered Users Posts: 18,272 ✭✭✭✭Atomic Pineapple


    Jsoup can help you parse the HTML of the page

    http://jsoup.org/

    Not sure if that's actually what you're after?


  • Registered Users Posts: 605 ✭✭✭Lmao_Man


    draffodx wrote: »
    Jsoup can help you parse the HTML of the page

    http://jsoup.org/

    Not sure if that's actually what you're after?


    Thanks everyone, saw this mentioned before actually.

    In simple terms, all I want to do is print what date the domain was registered on if that makes sense?


  • Registered Users Posts: 18,272 ✭✭✭✭Atomic Pineapple


    Lmao_Man wrote: »
    Thanks everyone, saw this mentioned before actually.

    In simple terms, all I want to do is print what date the domain was registered on if that makes sense?

    I haven't looked at the HTML of the site but Jsoup may be able to help you parse the exact tag that holds the value, then you should be able to cleanly print it.


  • Registered Users Posts: 1,922 ✭✭✭fergalr


    Regular expressions are the right tool for this.

    Even if you parsed the HTML, you'd probably still use regular expressions to strip out everything but the actual date.

    If the pages are in a regular format, I wouldn't even bother parsing the HTML, I'd just use regular expressions to do everything.

    BUT:
    Not every whois page returns information in that form.
    So, to extract the dates generically, you would have to write one regex per type of whois page. That could be a big job.

    If this is just a learning experience, then go and learn about regular expressions. Be aware that if you wanted to process more complicated pages, you should use a HTML parsing library.

    But if you actually want a structured database of whois data, pay someone else who has already done it.
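
As a rough sketch of the regular-expression approach described above, using java.util.regex from the standard library: the pattern below assumes dates shaped like "before Aug-1996", "before-Aug-1996" or "14-Feb-2005" directly after the words "registered on". That shape, the class name and the sample text are assumptions for illustration only; other whois pages would need their own patterns, as the post says.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegisteredOnRegex {

    // "registered on", an optional colon, then an optional "before",
    // an optional day, a three-letter month and a four-digit year.
    private static final Pattern REGISTERED_ON = Pattern.compile(
            "registered on:?\\s*((?:before[ -])?(?:\\d{1,2}-)?[a-z]{3}-\\d{4})",
            Pattern.CASE_INSENSITIVE);

    // Returns the first date found after "registered on", if any.
    static Optional<String> extract(String text) {
        Matcher m = REGISTERED_ON.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed sample; the real page markup was not checked.
        String page = "<p>Registered on: before Aug-1996</p>";
        System.out.println(extract(page).orElse("not found")); // prints: before Aug-1996
    }
}
```

Because the capture group only accepts the date itself, any surrounding HTML is ignored without parsing it, which is the trade-off the post describes: quick for one regular page format, brittle across many.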


  • Registered Users Posts: 2,781 ✭✭✭amen


    Silly question but why are you looking at the html page?

    Nearly all top-level domain registries (.com, .gov, .co.uk, .ie etc.) allow you to call their registration service with a URL and they will return the data to you in a known standard format.

    You can just extract the data you want. Much easier than parsing a html page.


  • Registered Users Posts: 7,292 ✭✭✭jmcc


    amen wrote: »
    Silly question but why are you looking at the html page?

    Nearly all top-level domain registries (.com, .gov, .co.uk, .ie etc.) allow you to call their registration service with a URL and they will return the data to you in a known standard format.

    You can just extract the data you want. Much easier than parsing a html page.

    Actually it is not that simple. With .com and other gTLDs, the registry maintains a thin WHOIS which just redirects you to the WHOIS maintained by the registrars (accredited resellers). The text and structure of those results vary considerably.

    Regards...jmcc


  • Registered Users Posts: 2,781 ✭✭✭amen


    Ohh, it's changed a bit then since I last looked at it.

    jmcc wrote: »
    The text and structure of those results vary considerably.

    If that's true then he has even more problems, as you can't assume that each registrar will use the same language/wording as the others.


  • Registered Users Posts: 7,292 ✭✭✭jmcc


    amen wrote: »
    Ohh, it's changed a bit then since I last looked at it.

    If it is only a registration date, then the full recursive lookup is not needed. That might make it relatively simple, as it will be the same set of data being parsed. However, it can vary from TLD to TLD. With a thin WHOIS, the registry generally just maintains the basic data and the redirection URL. With some versions of 'whois' the -n switch prevents the redirect being followed and just provides the bare registration details.

    Regards...jmcc
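
To illustrate querying a registry's WHOIS service directly rather than scraping an HTML page, here is a minimal sketch of a single, non-recursive lookup over the plain WHOIS protocol (TCP port 43: send the domain followed by CRLF, then read until the server closes the connection). The server name whois.nic.uk for .co.uk is an assumption, as are the class and method names, the timeout and the choice of UTF-8 for decoding the reply. No redirect is followed, so for thin registries this only returns the basic data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class WhoisSketch {

    // Writes one query line and reads the whole reply from the given streams.
    static String exchange(InputStream in, OutputStream out, String domain) {
        try {
            out.write((domain + "\r\n").getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // Reply encoding is assumed to be UTF-8 here.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) { // null once the stream ends
                reply.append(line).append('\n');
            }
            return reply.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens a connection to the WHOIS server on port 43 and runs one query.
    static String query(String server, String domain) {
        try (Socket socket = new Socket(server, 43)) {
            socket.setSoTimeout(10_000); // arbitrary 10 second read timeout
            return exchange(socket.getInputStream(), socket.getOutputStream(), domain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // whois.nic.uk is assumed to be the right server for .co.uk names.
        System.out.println(query("whois.nic.uk", "example.co.uk"));
    }
}
```

The reply is plain text, so the registration date can then be picked out of it line by line. As noted in the thread, the wording of that text can differ from TLD to TLD.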

