
Encoding problem trying to web scrape in python using BeautifulSoup

  • 01-11-2016 1:01pm
    #1
    Registered Users, Registered Users 2 Posts: 120 ✭✭


    Hi,
    I have been following online tutorials that use requests / urllib and BeautifulSoup. Depending on the site, print(soup.prettify()) throws an error. The code below works properly inside PyCharm but not in cmd.

    # this is the code I use
    import requests
    from bs4 import BeautifulSoup

    r = requests.get('https://www.facebook.com/')

    soup = BeautifulSoup(r.content, 'html5lib')  # I have tried .encode here

    print(soup.prettify())

    I get this error:

    line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 27842: character maps to <undefined>

    I spent hours searching online for a solution. I tried

    soup = BeautifulSoup(r.content,'html5lib').encode('utf-8')

    which returns another error AttributeError: 'bytes' object has no attribute 'prettify'.

    r is encoded as ISO-8859-1, but when I tried r.encoding = 'utf-8' it threw the same error as above.


    If anyone could help me out, it would be greatly appreciated.


Comments

  • Registered Users, Registered Users 2 Posts: 13 rwsz365


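    I presume you are on Windows? The Windows console typically uses a legacy code page rather than Unicode, so any character outside that code page can't be printed; see http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console. You can encode to ASCII and drop the characters it can't represent if you like: `print(soup.prettify().encode('ascii', 'ignore').decode('ascii'))`. The final decode turns the bytes back into a string, so Python 3 prints the text rather than a b'...' literal.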

    Your other error happens because `.encode('utf-8')` returns a bytes object. After that call `soup` is no longer an instance of BeautifulSoup, so it has no method called prettify.
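
The encode-then-print idea from this reply can be sketched with the standard library alone. This is only an illustration: `console_safe` is a hypothetical helper name (not part of BeautifulSoup or requests), and the fallback to `sys.stdout.encoding` and then ASCII is an assumption about what a sensible default might be.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    # Assumption: when no encoding is given, use the one stdout reports,
    # and fall back to ASCII if stdout reports none.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # errors="replace" substitutes '?' instead of raising UnicodeEncodeError.
    # Decoding again gives a str, so print() shows text, not a b'...' literal.
    return text.encode(encoding, errors="replace").decode(encoding)


if __name__ == "__main__":
    sample = "It\u2019s a test"  # contains U+2019, the character from the traceback
    print(console_safe(sample, "cp850"))                     # It?s a test
    print(sample.encode("ascii", "ignore").decode("ascii"))  # Its a test
```

With the original code, something like `print(console_safe(soup.prettify()))` would presumably avoid the exception, at the cost of losing the characters the console cannot show.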


  • Registered Users, Registered Users 2 Posts: 1,275 ✭✭✭bpmurray


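    U+2019 is the right single quotation mark. Since this isn't available in the console's code page (which is CP 850), you need
    print(soup.prettify("cp850").decode("cp850"))
    In Python 3, prettify("cp850") returns bytes, so decode it before printing.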
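
To check the claim about U+2019 and CP 850 without involving BeautifulSoup, a small standard-library sketch may help. The helper names `can_encode` and `encode_with_refs` are hypothetical, and the choice of `xmlcharrefreplace` is just one reasonable way to keep the lost character recoverable in HTML output.

```python
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def can_encode(text, encoding):
    """Report whether every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_with_refs(text, encoding="cp850"):
    """Swap unencodable characters for numeric character references."""
    # xmlcharrefreplace is a built-in codec error handler; U+2019 becomes &#8217;
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)


if __name__ == "__main__":
    for name in ("cp850", "cp1252", "utf-8"):
        print(name, can_encode(RIGHT_QUOTE, name))
    print(encode_with_refs("It\u2019s"))  # It&#8217;s
```

The probe shows why the same script can behave differently between environments: whether the print succeeds depends on which encoding the output stream uses.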
    

