I've spent the last few hours trying to figure out how to do the following:
1. fetch a website using urllib2
2. filter out some of its content using SGMLParser
3. decode the result of step 2 using the charset set in the website's <META http-equiv="Content-Type" ...> tag
4. encode the whole thing as UTF-8
5. stuff it into MySQL
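A minimal sketch of steps 3 and 4, written in Python 3 syntax rather than the Python 2.4 used here. The helper name `to_utf8`, the regular expression, the ISO-8859-1 fallback, and the sample page are all assumptions made up for illustration, not code from the actual script:

```python
import re

# Hypothetical pattern: pull the charset out of a
# <META http-equiv="Content-Type" content="...; charset=..."> tag.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def to_utf8(raw, default="iso-8859-1"):
    """Decode raw page bytes using the declared charset, then re-encode as UTF-8."""
    match = META_CHARSET.search(raw)
    charset = match.group(1).decode("ascii") if match else default
    text = raw.decode(charset)    # step 3: bytes -> unicode text
    return text.encode("utf-8")   # step 4: unicode text -> UTF-8 bytes

# Made-up sample page that declares windows-1252 and contains byte 0x97.
page = (b'<META http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252"><p>a \x97 b</p>')
utf8_page = to_utf8(page)
```

The point of the sketch is that the decode uses whatever charset the page declares and the encode always targets UTF-8, so no step goes through the ASCII codec.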
    File "c:\python24\lib\site-packages\MySQLdb\cursors.py", line 147, in execute
      query = query.encode(charset)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 8584: ordinal not in range(128)
Secondly, it seems to make little difference whether or not I do any decode/encode in steps 3 and 4 above. It just won't go.
Right now I'm on the verge of claiming that Python's decode() doesn't do its job. As an example:
    websiteContent.decode('iso-8859-1').encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 8033: ordinal not in range(128)
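The same error can be reproduced with a single 0x97 byte, which may help narrow things down. The sketch below uses Python 3 syntax and a made-up byte string (`raw` is hypothetical, not the real page content). It suggests that decoding succeeds and that only the final ASCII encode fails, because ISO-8859-1 maps 0x97 to U+0097, a character outside ASCII. If the page were actually windows-1252 (an assumption, not something the traceback confirms), 0x97 would be an em dash instead:

```python
# Made-up sample standing in for websiteContent.
raw = b"caf\x97 bar"

# Decoding as ISO-8859-1 never fails: every byte maps to a code point.
text = raw.decode("iso-8859-1")

# Encoding to ASCII fails because U+0097 is outside the 0-127 range.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding the same text to UTF-8 works.
utf8_bytes = text.encode("utf-8")

# Under windows-1252 the same byte decodes to U+2014 (em dash).
cp1252_text = raw.decode("cp1252")
```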
Any hints on how to do the 1-2-3-4-5 sequence above will be appreciated.
PTB