Fredrik Lundh wrote:
thebjorn wrote:
I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
...
first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1.
That would be the way to go, but it looks like they've managed to get
Latin-1 data in exactly two columns in the entire database (this is a
commercial product of course, so there's no way for us to fix things).
And just to make things more interesting, I think I'm running into an
ADO bug where capital letters (outside the U+0000 to U+007F range) are
returning strange values:
>>> c.execute('create table utf8 (f1 varchar(15))')
>>> u'ÆØÅÉ'.encode('utf-8')
'\xc3\x86\xc3\x98\xc3\x85\xc3\x89'
>>> x = _
>>> c.execute('insert into utf8 (f1) values (?)', (x,))
>>> c.execute('select * from utf8')
>>> c.fetchall()
((u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030',),)
>>>
I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is...
Anyway, sorry for venting :-)
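One possible reading of the fetchall() output above, offered as an assumption rather than a diagnosis of ADO: in Windows-1252 the bytes 0x86, 0x98, 0x85 and 0x89 map to U+2020, U+02DC, U+2026 and U+2030, which are exactly the "strange" values returned. That would mean the varchar bytes are being decoded as Windows-1252 rather than ISO-8859-1. A minimal sketch of undoing that, where undo_cp1252_mojibake is a hypothetical helper name and not part of any adapter:

```python
def undo_cp1252_mojibake(text):
    """Hypothetical helper: recover text whose UTF-8 bytes were
    decoded as Windows-1252 (assumed, not confirmed, for this adapter)."""
    raw = bytearray()
    for ch in text:
        try:
            raw.extend(ch.encode("cp1252"))
        except UnicodeEncodeError:
            # Python's cp1252 codec leaves 0x81, 0x8d, 0x8f, 0x90 and 0x9d
            # undefined; assume such bytes came through as U+0081 etc.
            raw.extend(ch.encode("iso-8859-1"))
    return bytes(raw).decode("utf-8")

# The value returned by fetchall() in the session above.
garbled = u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030'
fixed = undo_cp1252_mojibake(garbled)
```

If the assumption holds, fixed is the original four capital letters; if the adapter uses some other code page, the final decode will most likely raise UnicodeDecodeError instead of silently returning wrong text.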
if that's not possible, you can roundtrip via ISO-8859-1 yourself:
>>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
....
>>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy
That's very nice!
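Since some columns apparently hold genuine Latin-1 text, the round trip could be wrapped so that values which are not mis-decoded UTF-8 pass through untouched. This is only a sketch: fix_utf8_read_as_latin1 is a made-up name, and the fallback rule (leave the value alone if either step fails) is an assumption about what is wanted here.

```python
def fix_utf8_read_as_latin1(text):
    """Hypothetical helper: undo UTF-8 bytes that were decoded as ISO-8859-1."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # assume the value was decoded correctly in the first place.
        return text

garbled = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
fixed = fix_utf8_read_as_latin1(garbled)      # the repaired string
unchanged = fix_utf8_read_as_latin1(fixed)    # real Latin-1 text is left as-is
```

Pure ASCII values come back unchanged either way. A genuine Latin-1 string whose bytes happen to form valid UTF-8 would be altered, which should be rare for ordinary text but is worth checking against the real data.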
-- bjorn