
python - parse HTML into XML/JSON

  • 14-04-2011 12:21am
    #1
    Registered Users, Registered Users 2 Posts: 7,893 ✭✭✭


    Hi,
    So I'm looking to convert some HTML on an external website (that I don't control) into a readable format. I'm looking to get the actual values from the tags, not the tags themselves. See here:
    <div class="myClass1">
    	<h4>Area1</h4>
    	<div class="name">john</div>
    	<div class="address">1 meh st</div>
    	<div class="name">paul</div>
    	<div class="address">12 meh st</div>
    </div>
    
    <div class="myClass2">
    	<h4>Area2</h4>
    	<div class="name">mickey</div>
    	<div class="address">23 bleh st</div>
    	<div class="name">joe</div>
    	<div class="address">123 Some St</div>
    </div>
    

    So out of this, I'll want my python to output just:
    Area1
    john
    1 meh st
    paul
    12 meh st
    
    Area2
    mickey
    23 bleh st
    joe
    123 Some St
    

    Sorry for the stupid fake values! :p

    So anyway, what I had already was some code that reads the HTML, so I have a string that holds the full page source. Now I need some way to navigate through the tags, like XML, and access the values themselves. I couldn't figure it out with BeautifulSoup. I managed to strip some tags, but ended up with all the div tags still intact.

    See here:
    import re
    import urllib2  # Python 2 module

    # 'value' is set elsewhere in my script
    page = urllib2.urlopen("http://www.mysite.com/?get=" + value).read(200000)
    nameData = re.findall('<div class="name">.*?</div>', page)
    addrData = re.findall('<div class="address">.*?</div>', page)

    dataString = ""
    for n, a in zip(nameData, addrData):
        dataString = dataString + n + ":" + a + ","
    

    This reads in the data in "Area1" only. I'm guessing that's because the first <div> closes before "Area2" starts, so it gets read as if the "root" is being closed.

    Can anyone help?

    Thanks.


    PS I already have code elsewhere that I use to parse xml like this:
    dom.getElementsByTagName("root")[0].getElementsByTagName("person")[0].getElementsByTagName("details")
    and then I use the "getAttribute()" method.
    So if anyone knows how I can convert it to some usable XML, that'll be perfect too!!
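    One possible route to "usable XML", as a minimal sketch only: if the fragment happens to be well-formed (the sample above is, but pages you don't control often are not), it can be wrapped in a single root element and handed to xml.dom.minidom, after which the same getElementsByTagName() / getAttribute() style works. The <root> wrapper, the function names and the SAMPLE string are made up for illustration, and the sketch assumes Python 3.

```python
from xml.dom import minidom

# Sample markup copied from the post; stands in for the downloaded page.
SAMPLE = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>
<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""


def parse_fragment(fragment):
    # Several top-level <div>s are not a valid XML document on their own,
    # so wrap them in a hypothetical <root> element before parsing.
    return minidom.parseString("<root>" + fragment + "</root>")


def text_of(node):
    # Join the text nodes that sit directly inside an element.
    return "".join(c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE)


def people_by_area(dom):
    # Map each <h4> heading to a list of (name, address) pairs.
    areas = {}
    for area in dom.documentElement.childNodes:
        if area.nodeType != area.ELEMENT_NODE:
            continue  # skip the whitespace between the area <div>s
        heading = text_of(area.getElementsByTagName("h4")[0])
        divs = area.getElementsByTagName("div")
        names = [text_of(d) for d in divs if d.getAttribute("class") == "name"]
        addrs = [text_of(d) for d in divs if d.getAttribute("class") == "address"]
        areas[heading] = list(zip(names, addrs))
    return areas


areas = people_by_area(parse_fragment(SAMPLE))
for heading, people in areas.items():
    print(heading, people)
```

    minidom raises an ExpatError on anything that is not well-formed (an unclosed <br>, a bare & and so on), so this only helps when the markup is clean.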


Comments

  • Registered Users, Registered Users 2 Posts: 1,311 ✭✭✭Procasinator


    I'm not a regular Python developer, but I had to achieve a similar task before (the XML wasn't well-formed), and I think BeautifulSoup really is the best solution.

    What problems were you having with BeautifulSoup?
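    For comparison, here is a rough sketch of pulling out only the text between the tags. It swaps in the standard library's html.parser (Python 3) in place of BeautifulSoup so that it runs without installing anything; it is not BeautifulSoup's API. The class name TextCollector, the helper extract_text and the SAMPLE string are invented for this example, and starting a new group at each <h4> is an assumption based on the sample markup.

```python
from html.parser import HTMLParser

# Sample markup copied from the first post.
SAMPLE = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>
<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""


class TextCollector(HTMLParser):
    """Collects the non-blank text found between tags, in document order."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Put a blank line before every heading except the first one.
        if tag == "h4" and self.lines:
            self.lines.append("")

    def handle_data(self, data):
        # Ignore the whitespace-only runs between tags.
        text = data.strip()
        if text:
            self.lines.append(text)


def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


result = extract_text(SAMPLE)
print(result)
```

    Unlike an XML parser, html.parser does not insist on well-formed input, which matters when the site is not under your control.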

    As for your current code, when I run it I get 8 div items (4 from Area1, 4 from Area2) as a result using your specified HTML.
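    A small check along those lines, written for Python 3 and run against the sample HTML from the first post rather than the real page (SAMPLE is just that snippet pasted into a string). The only change to the patterns is a capture group, an addition of mine, so that findall() returns the values without the surrounding div tags.

```python
import re

# Sample markup copied from the first post.
SAMPLE = """
<div class="myClass1">
    <h4>Area1</h4>
    <div class="name">john</div>
    <div class="address">1 meh st</div>
    <div class="name">paul</div>
    <div class="address">12 meh st</div>
</div>
<div class="myClass2">
    <h4>Area2</h4>
    <div class="name">mickey</div>
    <div class="address">23 bleh st</div>
    <div class="name">joe</div>
    <div class="address">123 Some St</div>
</div>
"""

# With one capture group, findall() returns only the captured text.
names = re.findall(r'<div class="name">(.*?)</div>', SAMPLE)
addresses = re.findall(r'<div class="address">(.*?)</div>', SAMPLE)

pairs = list(zip(names, addresses))
print(len(names) + len(addresses))
print(pairs)
```

    Both areas are matched here, so the closing </div> of Area1 does not stop the search. If the live page behaves differently, two things worth checking (guesses, not confirmed): whether read(200000) cuts the page off before Area2, and whether a div's contents span several lines there, since "." does not match a newline unless re.DOTALL is passed.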

