python - parse HTML into XML/JSON

The_B_Man · 14-04-2011 01:21AM #1

Hi,
So I'm looking to convert some HTML on an external website (that I don't control) into a readable format. I'm looking to get the actual values from the tags, not the tags themselves. See here:

<div class="myClass1">
	<h4>Area1</h4>
	<div class="name">john</div>
	<div class="address">1 meh st</div>
	<div class="name">paul</div>
	<div class="address">12 meh st</div>
</div>

<div class="myClass2">
	<h4>Area2</h4>
	<div class="name">mickey</div>
	<div class="address">23 bleh st</div>
	<div class="name">joe</div>
	<div class="address">123 Some St</div>
</div>

So out of this, I'll want my python to output just:

Area1
john
1 meh st
paul
12 meh st

Area2
mickey
23 bleh st
joe
123 Some St

Sorry for the stupid fake values!

So anyway, what I already have is some code that reads the HTML, so I have a string that holds the full page source. Now I need some way to navigate through the tags, like XML, and access the values themselves. I couldn't figure it out with BeautifulSoup. I managed to strip some tags, but ended up with all the div tags still intact.

See here:

import re
import urllib2

# value is defined earlier (the query parameter)
file = urllib2.urlopen("http://www.mysite.com/?get="+value).read(200000)
nameData = re.findall('<div class="name">.*?</div>',file)
addrData = re.findall('<div class="address">.*?</div>',file)

dataString = ""
for n, a in zip(nameData,addrData):
	dataString=dataString+n+":"+a+","

This reads in the data in "Area1" only. I'm guessing that's because the first outer <div> closes before "Area2" starts, so it gets treated as if the "root" element had been closed.

Can anyone help?

Thanks.

PS I already have code elsewhere that I use to parse xml like this:
dom.getElementsByTagName("root")[0].getElementsByTagName("person")[0].getElementsByTagName("details")
and then I use the "getAttribute()" method.
So if anyone knows how I can convert it to some usable XML, that'll be perfect too!!

Procasinator · 14-04-2011 05:39PM

I'm not a regular Python developer, but I had to do a similar task before (the XML wasn't well-formed), and I think BeautifulSoup really is the best solution.

What problems were you having with BeautifulSoup?
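
In the meantime, if all you need is the text between the tags, here is a rough sketch that uses only the standard library's html.parser module (Python 3) as a stand-in for BeautifulSoup. The names SAMPLE_HTML, TextCollector and extract_values are made up for illustration, and the sample is pasted in as a string instead of being fetched from the site, so treat it as a starting point, not a finished solution.

```python
from html.parser import HTMLParser

# Sample markup copied from the first post (hypothetical stand-in for the real page)
SAMPLE_HTML = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>

<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""


class TextCollector(HTMLParser):
    """Collects every non-blank piece of text found between tags."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_data(self, data):
        # Whitespace-only chunks (indentation, newlines) are skipped
        text = data.strip()
        if text:
            self.values.append(text)


def extract_values(html):
    """Return the text values of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.values


for value in extract_values(SAMPLE_HTML):
    print(value)
```

This prints the values one per line, without any of the tags. It does not care whether the markup is well-formed XML, which is the main reason to prefer an HTML parser over an XML one for pages you don't control.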

As for your current code, when I run it I get 8 div items (4 from Area1, 4 from Area2) as a result using your specified HTML.
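
A check along these lines is one way to reproduce that count. It is only a sketch: it is written for Python 3, the sample HTML is pasted in as a string (SAMPLE_HTML is a made-up name) instead of being fetched with urllib2, and the real page's markup may differ from the sample. The second half shows that a capture group in the same pattern returns just the text, without the surrounding div tags.

```python
import re

# Sample markup copied from the first post (hypothetical stand-in for the real page)
SAMPLE_HTML = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>

<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""

# Same patterns as the original code: each match includes the div tags
name_data = re.findall('<div class="name">.*?</div>', SAMPLE_HTML)
addr_data = re.findall('<div class="address">.*?</div>', SAMPLE_HTML)
print(len(name_data) + len(addr_data))  # 8 matches in total

# With a capture group, findall returns only the text inside the tags
names = re.findall('<div class="name">(.*?)</div>', SAMPLE_HTML)
addresses = re.findall('<div class="address">(.*?)</div>', SAMPLE_HTML)

data_string = ",".join(n + ":" + a for n, a in zip(names, addresses))
print(data_string)
```

If the live page still only gives you "Area1", it may be worth printing the fetched string to see whether the second block is actually in it.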
