python - parse HTML into XML/JSON

  • 14-04-2011 01:21AM
    #1
    Registered Users, Registered Users 2, Paid Member Posts: 8,000 ✭✭✭
    Something about sandwiches


    Hi,
    So I'm looking to convert some HTML on an external website (that I don't control) into a readable format. I'm looking to get the actual values from the tags, not the tags themselves. See here:
    <div class="myClass1">
    	<h4>Area1</h4>
    	<div class="name">john</div>
    	<div class="address">1 meh st</div>
    	<div class="name">paul</div>
    	<div class="address">12 meh st</div>
    </div>
    
    <div class="myClass2">
    	<h4>Area2</h4>
    	<div class="name">mickey</div>
    	<div class="address">23 bleh st</div>
    	<div class="name">joe</div>
    	<div class="address">123 Some St</div>
    </div>
    

    So out of this, I'll want my python to output just:
    Area1
    john
    1 meh st
    paul
    12 meh st
    
    Area2
    mickey
    23 bleh st
    joe
    123 Some St
    

    Sorry for the stupid fake values! :p

    So anyway, what I have already is some code that reads the HTML, so I have a string that holds the full page source. Now I need some way to navigate through the tags, like XML, and access the values themselves. I couldn't figure it out with BeautifulSoup. I managed to strip some tags, but ended up with all the div tags still intact.

    See here:
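    import re
    import urllib2  # Python 2; in Python 3 use urllib.request instead

    # 'value' is set elsewhere; read at most 200000 bytes of the page.
    html = urllib2.urlopen("http://www.mysite.com/?get=" + value).read(200000)
    nameData = re.findall('<div class="name">.*?</div>', html)
    addrData = re.findall('<div class="address">.*?</div>', html)

    # Join each name with its address as "name:address,".
    dataString = ""
    for n, a in zip(nameData, addrData):
        dataString = dataString + n + ":" + a + ","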
    

    This reads in the data in "Area1" only. I'm guessing that's because the <div> closes before Area2, so it gets treated as if the "root" were being closed.

    Can anyone help?

    Thanks.


    PS: I already have code elsewhere that I use to parse XML like this:
    dom.getElementsByTagName("root")[0].getElementsByTagName("person")[0].getElementsByTagName("details")
    and then I use the "getAttribute()" method.
    So if anyone knows how I can convert it to some usable XML, that'll be perfect too!!


Comments

  • Registered Users, Registered Users 2 Posts: 1,311 ✭✭✭Procasinator


    I'm not a regular Python developer, but I had to do a similar task before (the XML wasn't well-formed), and I think BeautifulSoup really is the best solution.

    What problems were you having with BeautifulSoup?
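
If BeautifulSoup keeps getting in the way, the standard library's HTMLParser can pull the values out as well. This is only a rough sketch, assuming Python 3 and that the wanted values are plain text inside the h4 tags and the "name"/"address" divs, as in the sample you posted. The names SAMPLE_HTML, ValueCollector and extract_values are made up for the example.

```python
from html.parser import HTMLParser

# Sample markup copied from the question (stands in for the fetched page).
SAMPLE_HTML = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>

<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""


class ValueCollector(HTMLParser):
    """Collects the text inside <h4> and <div class="name|address"> tags."""

    WANTED_CLASSES = {"name", "address"}

    def __init__(self):
        super().__init__()
        self.capturing = False  # True while inside a tag whose text we want
        self.values = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "h4" or (tag == "div" and css_class in self.WANTED_CLASSES):
            self.capturing = True

    def handle_endtag(self, tag):
        # The wanted tags hold only text, so any closing tag ends the capture.
        self.capturing = False

    def handle_data(self, data):
        text = data.strip()
        if self.capturing and text:
            self.values.append(text)


def extract_values(html):
    """Return the wanted text values in document order."""
    parser = ValueCollector()
    parser.feed(html)
    parser.close()
    return parser.values


values = extract_values(SAMPLE_HTML)
print("\n".join(values))
```

The outer myClass1/myClass2 divs are ignored because their class is not in WANTED_CLASSES, so both areas get read no matter where the wrapping div closes.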

    As for your current code, when I run it against the HTML you posted I get 8 div items (4 from Area1, 4 from Area2).
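
A quick way to reproduce that check, sketched in Python 3 with the sample HTML pasted into a string rather than fetched from the site. SAMPLE_HTML and the capture groups are additions for illustration and not part of the original code. The capture group makes findall return only the text between the tags, which is what the question was after.

```python
import re

# Sample markup copied from the question (stands in for the fetched page).
SAMPLE_HTML = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>

<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""

# Same patterns as in the question, plus a capture group so that findall
# returns the inner text instead of the whole <div>...</div> match.
names = re.findall(r'<div class="name">(.*?)</div>', SAMPLE_HTML)
addresses = re.findall(r'<div class="address">(.*?)</div>', SAMPLE_HTML)

# Pair each name with its address as "name:address", comma separated.
pairs = ",".join(n + ":" + a for n, a in zip(names, addresses))

print(len(names) + len(addresses))  # total number of matched divs
print(pairs)
```

If the real page only gives you Area1, it may be worth checking whether the read(200000) limit cuts the page off before Area2, or whether the live markup differs from the sample (for example values spanning several lines, which `.` does not match unless re.DOTALL is passed).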

