
Can you recommend a good Java HTML parser?

  • 07-01-2012 5:17pm
    #1
    Registered Users Posts: 2,345 ✭✭✭


    Hi

    Can anybody recommend a good, bug-free Java HTML parser?

    I am looking to make an Android app in my spare time, as a little project to help me learn a bit about Android development and keep up my Java programming at the same time.

    The app will need to parse an HTML file that builds a timetable out of <table> elements, with more <table> elements embedded inside the <td> tags of the parent <table>.
    The timetable uses attributes such as colspan to define how long something is on for, so I would need all of these to be parsed correctly and made available to read in some way.
    There are also empty tags dotted around the place which are crucial to the layout of the timetable, so I would also need these to be included and not stripped out.

    Additional Things I'd Like:
    • Simple to use, with good documentation
    • Very reliable and bug-free
    • Able to parse and fix bad HTML
    • Implements DOM
    • Able to parse XML (Optional)

    Any recommendations you could give me would be much appreciated.

    Thank you.


Comments

  • Registered Users Posts: 37 corman007


    Maybe you could parse it as an XML file using SAX, the Java API for parsing XML.


  • Registered Users Posts: 37 corman007


    That is, treat the file as an XML file, but your HTML would need to be well formed, like XHTML.
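A minimal sketch of the SAX suggestion, assuming the page really is well-formed XHTML. It uses only the JDK's `javax.xml.parsers.SAXParser`; the class name `ColspanSaxSketch`, the method `colspans`, and the sample markup are made up for illustration and are not from the thread. The handler records the colspan of every <td>, including cells in nested tables and empty cells.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
final class ColspanSaxSketch {

    /**
     * Returns the colspan of every <td> in document order.
     * Cells without a colspan attribute are reported as 1.
     */
    static List<Integer> colspans(String xhtml) {
        List<Integer> spans = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // The default factory is not namespace aware, so match on qName.
                if ("td".equals(qName)) {
                    String value = attrs.getValue("colspan");
                    spans.add(value == null ? 1 : Integer.parseInt(value.trim()));
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            // Covers malformed markup as well as non-numeric colspan values.
            throw new IllegalArgumentException("Could not parse input", e);
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed sample: an outer table with an empty cell and a nested table.
        String sample = "<table><tr><td colspan=\"2\">Maths</td><td></td>"
                + "<td><table><tr><td colspan=\"3\">Lab</td></tr></table></td>"
                + "</tr></table>";
        System.out.println(colspans(sample)); // prints [2, 1, 1, 3]
    }
}
```

SAX reports elements as a stream of events and does not build a tree, so any table nesting has to be tracked by the handler itself (for example with a depth counter) if the structure matters.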


  • Registered Users Posts: 1,127 ✭✭✭smcelhinney


    Ahem, what feed are you "acquiring"? :D

    I echo what corman says, but if you don't know whether your HTML is well formed (it would be better as XHTML), then it's hard to work with. You're into the quite murky waters of regular expressions to extract strings, etc.

    Look at this link, under Grabbing HTML tags. It might give you a starting point.


  • Registered Users Posts: 33 frezzabelle


    Have a look at Doj or jsoup; they're pretty handy, and CSS selectors are great for scraping.
    http://code.google.com/p/hue/wiki/Doj
    http://jsoup.org


  • Registered Users Posts: 2,345 ✭✭✭Kavrocks


    corman007 wrote: »
    Maybe you could parse it as an XML file using SAX, the Java API for parsing XML.
    corman007 wrote: »
    That is, treat the file as an XML file, but your HTML would need to be well formed, like XHTML.
    When the page is being displayed correctly it's valid XHTML, bar the missing </body></html>. There are times when the service generating the page isn't working, though, and I don't know if it's valid then.

    I'll definitely look into SAX. Is it javax.xml.parsers.SAXParser?
    smcelhinney wrote: »
    Ahem, what feed are you "acquiring"? :D

    I echo what corman says, but if you don't know whether your HTML is well formed (it would be better as XHTML), then it's hard to work with. You're into the quite murky waters of regular expressions to extract strings, etc.

    Look at this link, under Grabbing HTML tags. It might give you a starting point.
    Specifically, I'm looking to grab timetables from that service, but it doesn't always play nice as it is now.

    It's a feed outside my control and it isn't very reliable either. When it's not working correctly I wouldn't be grabbing anything, as there isn't any usable information on it, but I was hoping the parser could fix whatever HTML there is so that I could quickly scan it for anything usable and, if there is nothing, just discard it.

    If I treat it as XML and it's bad HTML, would an XML parser throw an exception? If it did, I could catch that and then discard the page, which would probably be easier and less work?

    Parsing it as XML does seem like a better idea than parsing it as HTML.
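The catch-and-discard idea can be sketched as follows, using only the JDK's DOM parser. This is a sketch under assumptions, not a tested recipe for the feed in question: the class name `StrictParseSketch` and the method `tryParse` are hypothetical. A conforming XML parser raises a fatal error (a `SAXParseException`) on markup that is not well formed, and that includes a page whose closing </body></html> tags are missing, so the page as described above would be rejected too unless the tags are appended first.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
final class StrictParseSketch {

    /**
     * Parses the markup as XML and returns the DOM tree,
     * or an empty Optional if the markup is not well formed.
     */
    static Optional<Document> tryParse(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return Optional.of(doc);
        } catch (SAXException e) {
            // Not well-formed XML: the caller can simply discard this page.
            return Optional.empty();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("XML parser unavailable", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed samples: one complete page, one cut off before </body></html>.
        String complete = "<html><body><table><tr><td></td></tr></table></body></html>";
        String truncated = "<html><body><table><tr><td></td></tr></table>";
        System.out.println(tryParse(complete).isPresent());
        System.out.println(tryParse(truncated).isPresent());
    }
}
```

One caveat worth checking against the real feed: HTML entities such as `&nbsp;` are not predefined in XML, so a page that uses them only parses this way if the parser can resolve the XHTML DTD that declares them.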
    frezzabelle wrote: »
    Have a look at Doj or jsoup; they're pretty handy, and CSS selectors are great for scraping.
    http://code.google.com/p/hue/wiki/Doj
    http://jsoup.org
    I had a quick look at jsoup after somebody told me to take a look at BeautifulSoup, which they said they had used for a similar project. I'll look more deeply into it as I wasn't sure what it was like, thanks. The unfortunate part is that there is no CSS used in the HTML I'm looking to grab.

    Thanks for the replies.


  • Moderators, Technology & Internet Moderators Posts: 1,333 Mod ✭✭✭✭croo


    What about XPath/XQuery?

    PS: while trying to remind myself whether it was XPath and/or XQuery, I happened across this and thought it might be of interest.
    http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
    It's a novel approach to the badly formatted HTML issue!


  • Registered Users Posts: 2,345 ✭✭✭Kavrocks


    croo wrote: »
    What about XPath/XQuery?

    PS: while trying to remind myself whether it was XPath and/or XQuery, I happened across this and thought it might be of interest.
    http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
    It's a novel approach to the badly formatted HTML issue!
    That's a good link, very interesting comparison, thanks.

    I'll check out XPath/XQuery too.
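For the XPath half of that suggestion, here is a minimal sketch using the `javax.xml.xpath` API that ships with the JDK (XQuery has no equivalent in the standard library, so it is left out). It assumes the page has already been made well-formed XML; the class name `TimetableXPathSketch`, the method `select`, and the sample markup are hypothetical. The two expressions pick out every cell that carries a colspan, at any nesting depth, and every completely empty cell.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
final class TimetableXPathSketch {

    /** Parses the markup into a DOM tree and returns the nodes matching the expression. */
    static NodeList select(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: an outer table with an empty cell and a nested table.
        String sample = "<table><tr><td colspan=\"2\">Maths</td><td></td>"
                + "<td><table><tr><td colspan=\"3\">Lab</td></tr></table></td>"
                + "</tr></table>";

        // Every cell with a colspan attribute, including cells in nested tables.
        NodeList spanning = select(sample, "//td[@colspan]");
        for (int i = 0; i < spanning.getLength(); i++) {
            Element td = (Element) spanning.item(i);
            System.out.println(td.getAttribute("colspan") + " -> " + td.getTextContent());
        }

        // Cells with no child nodes at all: the empty cells stay in the tree.
        NodeList empty = select(sample, "//td[not(node())]");
        System.out.println("empty cells: " + empty.getLength());
    }
}
```

Because the whole document is held as a DOM tree, empty elements and attributes are kept exactly as parsed, which matches the requirements in the opening post.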


  • Registered Users Posts: 339 ✭✭duffman85


    Have you looked at JTidy (http://jtidy.sourceforge.net/)?

    I used it for a project to parse a downloaded webpage and find all links, images, etc.

    I found it straightforward enough to use. It works with messy HTML and XHTML, and I think it can do XML too.


  • Registered Users Posts: 2,345 ✭✭✭Kavrocks


    duffman85 wrote: »
    Have you looked at JTidy (http://jtidy.sourceforge.net/)?

    I used it for a project to parse a downloaded webpage and find all links, images, etc.

    I found it straightforward enough to use. It works with messy HTML and XHTML, and I think it can do XML too.
    No, I haven't. If it can do all that very well I will certainly take a look at it, thanks.

