Is there a way to get utf-8 out of a Unicode string?

I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

Then I read the data out using adodbapi (which returns all strings as
Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
find any way to get back to the original short of:

def unfk(s):
    return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

-- bjorn

Oct 30 '06 #1
thebjorn wrote:
> I've got a database (ms sqlserver) that's (way) out of my control,
> where someone has stored utf-8 encoded Unicode data in regular varchar
> fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
> as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
>
> Then I read the data out using adodbapi (which returns all strings as
> Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
> find any way to get back to the original short of:
>
> def unfk(s):
>     return eval(repr(s)[1:]).decode('utf-8')
>
> i.e. chopping off the u in the repr of a unicode string, and relying on
> eval to interpret the \xHH sequences.
>
> Is there a less hack'ish way to do this?

first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1. if that's not possible, you
can roundtrip via ISO-8859-1 yourself:
>>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u
u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1")
'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1").decode("utf-8")
u'Bl\xe5b\xe6rsyltet\xf8y'
>>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy

</F>
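
For readers on Python 3, where `str` is already Unicode and `encode` returns `bytes`, the same round trip might be sketched as below. The helper name `fix_mojibake` is hypothetical, and the sketch assumes the string really was produced by decoding UTF-8 bytes as ISO-8859-1.

```python
def fix_mojibake(s: str) -> str:
    """Undo a UTF-8 -> ISO-8859-1 mis-decoding (hypothetical helper name)."""
    # ISO-8859-1 maps code points U+0000..U+00FF one-to-one onto the bytes
    # 0x00..0xFF, so encoding recovers the original byte sequence exactly.
    raw = s.encode("iso-8859-1")
    # Those bytes were UTF-8 all along, so decode them as such.
    return raw.decode("utf-8")


garbled = "Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y"
print(fix_mojibake(garbled))
```

The function raises `UnicodeEncodeError` if the input contains a character above U+00FF, and `UnicodeDecodeError` if the recovered bytes are not valid UTF-8, so strings that were never mis-decoded in this way tend to fail loudly rather than being silently altered.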

Oct 30 '06 #2
Hei,

On 2006-10-30 08:25:41 +0100, thebjorn wrote:
> def unfk(s):
>     return eval(repr(s)[1:]).decode('utf-8')
>
> i.e. chopping off the u in the repr of a unicode string, and relying on
> eval to interpret the \xHH sequences.
>
> Is there a less hack'ish way to do this?

Slightly less hackish:

    return ''.join(chr(ord(c)) for c in s)

Gerrit.
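
The one-liner above relies on Python 2, where `chr` returns a one-byte `str`, so the join yields a byte string that still has to be decoded as UTF-8. In Python 3, `chr` returns a Unicode character and the join would simply rebuild the input. A rough Python 3 counterpart, with the hypothetical name `code_points_to_bytes`, could look like this:

```python
def code_points_to_bytes(s: str) -> bytes:
    """Turn each character into the byte with the same numeric value."""
    # bytes() accepts an iterable of ints in range(256) and raises
    # ValueError if any code point is larger than 255.
    return bytes(ord(c) for c in s)


garbled = "Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y"
print(code_points_to_bytes(garbled).decode("utf-8"))
```

For input limited to U+0000..U+00FF this produces the same bytes as `s.encode("iso-8859-1")`.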
Oct 30 '06 #3
Fredrik Lundh wrote:
> thebjorn wrote:
>> I've got a database (ms sqlserver) that's (way) out of my control,
>> where someone has stored utf-8 encoded Unicode data in regular varchar
>> fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
>> as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/
>> ...
> first, check if you can get your database adapter to understand that the
> database contains UTF-8 and not ISO-8859-1.

It would be the way to go, however it looks like they've managed to get
Latin-1 data in exactly two columns in the entire database (this is a
commercial product of course, so there's no way for us to fix things).
And just to make things more interesting, I think I'm running into an
ADO bug where capital letters (outside the U+0000 to U+007F range) are
returning strange values:
>>> c.execute('create table utf8 (f1 varchar(15))')
>>> u'ÆØÅÉ'.encode('utf-8')
'\xc3\x86\xc3\x98\xc3\x85\xc3\x89'
>>> x = _
>>> c.execute('insert into utf8 (f1) values (?)', (x,))
>>> c.execute('select * from utf8')
>>> c.fetchall()
((u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030',),)
>>>
I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is...
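
The values returned by `fetchall()` are consistent with the bytes having been decoded as Windows-1252 (cp1252) rather than ISO-8859-1: cp1252 maps the bytes 0x86, 0x98, 0x85 and 0x89 to U+2020, U+02DC, U+2026 and U+2030. That would also explain why only the capital letters look odd, since their UTF-8 continuation bytes fall in 0x80..0x9F, the one range where the two encodings differ. Assuming that is what happens here (not verified against ADO itself), a Python 3 sketch of a round trip via cp1252 might be the following. The name `fix_cp1252_mojibake` is hypothetical.

```python
def fix_cp1252_mojibake(s: str) -> str:
    """Undo a UTF-8 -> cp1252 mis-decoding (hypothetical helper name)."""
    raw = bytearray()
    for ch in s:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and
            # 0x9D undefined. Assumption: such bytes come back as the code
            # point with the same number, so fall back to ISO-8859-1.
            raw += ch.encode("iso-8859-1")
    return raw.decode("utf-8")


print(fix_cp1252_mojibake("\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030"))
```

Because cp1252 and ISO-8859-1 agree outside 0x80..0x9F, the same function also handles the lowercase example from the original question.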

Anyway, sorry for venting :-)
> if that's not possible, you can roundtrip via ISO-8859-1 yourself:
> >>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
> ....
> >>> print u.encode("iso-8859-1").decode("utf-8")
> Blåbærsyltetøy

That's very nice!

-- bjorn

Oct 30 '06 #4
Gerrit Holl wrote:
> Hei,
>
> On 2006-10-30 08:25:41 +0100, thebjorn wrote:
>> def unfk(s):
>>     return eval(repr(s)[1:]).decode('utf-8')
>> ....
>> Is there a less hack'ish way to do this?
>
> Slightly less hackish:
>
>     return ''.join(chr(ord(c)) for c in s)

Much less hackish :-)

-- bjorn

Oct 30 '06 #5
