Can you recommend a good Java HTML parser?

Kavrocks · 07-01-2012 5:17pm #1

Hi

Can anybody recommend a good, bug-free Java HTML parser?

I am looking to make an Android App in my spare time as a little project to help me learn a bit about Android development and keep up with Java programming at the same time too.

The App will need to parse an HTML file which makes a timetable out of <table>s, with more <table>s embedded inside the <td> tags of the parent <table>.
The timetable includes attributes like colspan to define the length of time something is on for, so I would need all of these to be parsed correctly and made available to be read in some way.
There are also empty tags dotted around the place which are crucial to the design of the timetable so I would also need these to be included and not stripped out.

Additional Things I'd Like:

Simple to use, with good documentation
Very reliable and bug free
Able to parse and fix bad HTML
Implements DOM
Able to parse XML (Optional)

Any recommendations you could give me would be much appreciated.

Thank you.

corman007 · 08-01-2012 1:13am

Maybe you could parse it as an XML file using SAX (a Java API for parsing XML).

corman007 · 08-01-2012 1:16am

That is, treat the file as an XML file, but your HTML would need to be well formed, like XHTML.
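A minimal sketch of that idea, assuming the page really is well-formed XML and has no DOCTYPE or HTML-only entities such as &nbsp; to resolve. The class and method names (TimetableSaxSketch, readColspans) are made up for illustration; only the javax.xml.parsers and org.xml.sax calls are standard. It walks every <td>, including empty ones and those in nested tables, and records its colspan:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class TimetableSaxSketch {

    /** Returns the colspan of every td in document order (1 when the attribute is absent). */
    static List<Integer> readColspans(String xhtml) {
        final List<Integer> spans = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if ("td".equals(qName)) {
                    String colspan = attrs.getValue("colspan");
                    spans.add(colspan == null ? 1 : Integer.parseInt(colspan));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Anything that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Page is not well-formed XML", e);
        }
        return spans;
    }
}
```

Calling readColspans on `<table><tr><td colspan="2"><table><tr><td/></tr></table></td><td></td></tr></table>` gives [2, 1, 1]: the outer cell, the empty cell in the nested table, then the empty outer cell.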

smcelhinney · 08-01-2012 1:25am

Ahem, what feed are you "acquiring"?

I echo what corman says, but if you don't know whether your HTML is well formed (it would be better as XHTML), then it's hard to work with. You're into the quite murky waters of regular expressions to extract strings etc.

Look at this link, under Grabbing HTML tags. It might give you a starting point.

frezzabelle · 08-01-2012 1:53am

Have a look at doj or jsoup. They're pretty handy, and CSS selectors are great for scraping. http://code.google.com/p/hue/wiki/Doj
http://jsoup.org

Kavrocks · 08-01-2012 2:27pm

corman007 wrote: »

Maybe you could parse it as an XML file using SAX (a Java API for parsing XML).

corman007 wrote: »

That is, treat the file as an XML file, but your HTML would need to be well formed, like XHTML.

It's valid XHTML, bar the missing </body></html>, when it is being displayed correctly. There are times when the service generating the page isn't working, though, and I don't know if it's valid then.

I'll definitely look into sax. Is it javax.xml.parsers.SAXParser?

smcelhinney wrote: »

Ahem, what feed are you "acquiring"?

I echo what corman says, but if you don't know whether your HTML is well formed (it would be better as XHTML), then it's hard to work with. You're into the quite murky waters of regular expressions to extract strings etc.

Look at this link, under Grabbing HTML tags. It might give you a starting point.

Specifically, I'm looking to grab timetables from that service, but it doesn't always play nice as it is now.

It's a feed out of my control and it isn't a very reliable one either. When it's not working correctly I wouldn't be grabbing anything, as there isn't any usable information on it. I was hoping the parser could fix whatever HTML there is, so I could quickly scan it to see if there is anything usable and, if not, just discard it.

If I treat it as XML and it's bad HTML, would an XML parser throw an exception? If it did, I could catch that and then discard the page, which would probably be easier and less work?

Parsing it like XML does seem like a better idea than as HTML.
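A rough sketch of that catch-and-discard idea, using the DOM parser that ships with Java. TimetableDomSketch and parseOrNull are names invented here for illustration. One thing to watch: a strict XML parser treats a missing </body></html> as a fatal error too, so a page with those tags cut off would be discarded unless they are appended before parsing.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class TimetableDomSketch {

    /** Parses the page as XML; returns null when it is not well formed so the caller can discard it. */
    static Document parseOrNull(String page) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(page)));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Malformed markup (for example an unclosed tag) lands here.
            return null;
        }
    }
}
```

If parseOrNull returns a Document, calls such as getElementsByTagName("td") and getAttribute("colspan") give DOM access to the cells, and empty tags are kept as empty elements rather than stripped.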

frezzabelle wrote: »

Have a look at doj or jsoup. They're pretty handy, and CSS selectors are great for scraping. http://code.google.com/p/hue/wiki/Doj
http://jsoup.org

I had a quick look at jsoup after somebody told me to take a look at BeautifulSoup, which they said they used for a similar project. I wasn't sure what it was like, so I'll look into it more deeply, thanks. The unfortunate part is there is no CSS used in the HTML I'm looking to grab.

Thanks for the replies.

croo · 09-01-2012 10:53pm

What about XPath/XQuery?

PS: While trying to remind myself whether it was XPath and/or XQuery, I happened across this and thought it might be of interest.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
It's a novel approach to the badly formatted HTML issue!
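For the XPath half of that suggestion, here is a small sketch using the javax.xml.xpath package bundled with Java (which implements XPath 1.0; XQuery is not part of the standard library). It assumes the page is well-formed XML, and TimetableXPathSketch and select are illustrative names only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class TimetableXPathSketch {

    /** Returns the text content of every node matched by the XPath expression. */
    static List<String> select(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> texts = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xhtml)), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression and also when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
        return texts;
    }
}
```

Expressions that might suit a timetable like the one described: `//td[@colspan > 1]` for cells spanning more than one slot, `//td/table//td` for cells of the nested tables, and `//td[not(node())]` for the empty cells.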

Kavrocks · 10-01-2012 9:57am

croo wrote: »

What about XPath/XQuery?

PS: While trying to remind myself whether it was XPath and/or XQuery, I happened across this and thought it might be of interest.
http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/
It's a novel approach to the badly formatted HTML issue!

That's a good link, very interesting comparison, thanks.

I'll check out xpath/xquery too.

duffman85 · 10-01-2012 10:37am

Have you looked at JTidy (http://jtidy.sourceforge.net/)?

I used it for a project to parse a downloaded webpage and find all links, images, etc.

I found it straightforward enough to use. It works with messy HTML and XHTML - I think it can do XML.

Kavrocks · 10-01-2012 11:08am

duffman85 wrote: »

Have you looked at JTidy (http://jtidy.sourceforge.net/)?

I used it for a project to parse a downloaded webpage and find all links, images, etc.

I found it straightforward enough to use. It works with messy HTML and XHTML - I think it can do XML.

No, I haven't. If it can do all that well, I will certainly take a look at it, thanks.
