Is there a way to get utf-8 out of a Unicode string?

thebjorn

I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

Then I read the data out using adodbapi (which returns all strings as
Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
find any way to get back to the original short of:

def unfk(s):
    return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

-- bjorn

Oct 30 '06 #1


Fredrik Lundh

thebjorn wrote:

I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

Then I read the data out using adodbapi (which returns all strings as
Unicode) and I get u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'. I couldn't
find any way to get back to the original short of:

def unfk(s):
    return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1. if that's not possible, you
can roundtrip via ISO-8859-1 yourself:

>>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u
u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1")
'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'
>>> u.encode("iso-8859-1").decode("utf-8")
u'Bl\xe5b\xe6rsyltet\xf8y'
>>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy

</F>

Oct 30 '06 #2
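For readers on Python 3, where str is always Unicode and the interactive session above will not run as written, the same roundtrip can be sketched as follows. The variable names are illustrative only. The trick relies on ISO-8859-1 mapping code points 0-255 to the identical byte values, so encoding with it recovers the raw bytes unchanged.

```python
# Python 3 sketch of the ISO-8859-1 roundtrip (variable names are illustrative).

# What the database adapter returned: UTF-8 bytes misread as one
# character per byte.
mangled = 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'

# ISO-8859-1 maps each code point 0-255 to the same byte value, so this
# recovers the original UTF-8 byte sequence.
raw = mangled.encode('iso-8859-1')

# Decode the bytes with the encoding they were actually written in.
fixed = raw.decode('utf-8')  # 'Blåbærsyltetøy'
```

If any character in the input lies above U+00FF, the encode step raises UnicodeEncodeError, which is a useful signal that the data was not mangled in this particular way.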

Gerrit Holl

Hei,

On 2006-10-30 08:25:41 +0100, thebjorn wrote:

def unfk(s):
    return eval(repr(s)[1:]).decode('utf-8')

i.e. chopping off the u in the repr of a unicode string, and relying on
eval to interpret the \xHH sequences.

Is there a less hack'ish way to do this?

Slightly less hackish:

return ''.join(chr(ord(c)) for c in s)

Gerrit.

Oct 30 '06 #3
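Note that the one-liner above only replaces the eval(repr(s)[1:]) part: it rebuilds the byte string, which still has to be decoded as UTF-8. A rough Python 3 equivalent of the complete function, reusing the unfk name from the original post, might look like this:

```python
def unfk(s):
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    # ord(c) is used directly as a byte value; bytes() raises ValueError
    # if any code point is above 255.
    return bytes(ord(c) for c in s).decode('utf-8')


result = unfk('Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y')  # 'Blåbærsyltetøy'
```

This is the same operation as encoding with ISO-8859-1, just spelled out one character at a time.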

thebjorn

Fredrik Lundh wrote:

thebjorn wrote:

I've got a database (ms sqlserver) that's (way) out of my control,
where someone has stored utf-8 encoded Unicode data in regular varchar
fields, so that e.g. the string 'Blåbærsyltetøy' is in the database
as 'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y' :-/

...

first, check if you can get your database adapter to understand that the
database contains UTF-8 and not ISO-8859-1.

It would be the way to go, however it looks like they've managed to get
Latin-1 data in exactly two columns in the entire database (this is a
commercial product of course, so there's no way for us to fix things).
And just to make things more interesting, I think I'm running into an
ADO bug where capital letters (outside the U+0000 to U+007F range) are
returning strange values:

>>> c.execute('create table utf8 (f1 varchar(15))')
>>> u'ÆØÅÉ'.encode('utf-8')
'\xc3\x86\xc3\x98\xc3\x85\xc3\x89'
>>> x = _
>>> c.execute('insert into utf8 (f1) values (?)', (x,))
>>> c.execute('select * from utf8')
>>> c.fetchall()
((u'\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030',),)

>>>

I haven't tested this through C[#/++] to verify that it's an ADO issue,
but it seems unlikely that MS would view this as anything but incorrect
usage no matter where the issue is...

Anyway, sorry for venting :-)

if that's not possible, you can roundtrip via ISO-8859-1 yourself:

>>> u = u'Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y'

....

>>> print u.encode("iso-8859-1").decode("utf-8")
Blåbærsyltetøy

That's very nice!

-- bjorn

Oct 30 '06 #4
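The "strange values" for capital letters in the post above match the Windows-1252 code page rather than ISO-8859-1: in Windows-1252, bytes 0x86, 0x98, 0x85 and 0x89 are U+2020, U+02DC, U+2026 and U+2030. That suggests, though the thread does not confirm it, that the data is being decoded as Windows-1252 somewhere along the way. Under that assumption, a Python 3 sketch of a more tolerant repair is shown below. The function name fix_mojibake is hypothetical, and the fallback to ISO-8859-1 covers the five byte values that Python's cp1252 codec leaves undefined.

```python
def fix_mojibake(s):
    """Recover text whose UTF-8 bytes were decoded as Windows-1252.

    Hypothetical helper: assumes the mis-decoding was cp1252, with
    ISO-8859-1 behaviour for the bytes cp1252 does not define.
    """
    raw = bytearray()
    for ch in s:
        try:
            # Most characters map back through Windows-1252.
            raw += ch.encode('cp1252')
        except UnicodeEncodeError:
            # Code points such as U+0081 have no cp1252 byte in Python's
            # codec; treat them as the identical ISO-8859-1 byte instead.
            raw += ch.encode('iso-8859-1')
    return raw.decode('utf-8')


capitals = fix_mojibake('\xc3\u2020\xc3\u02dc\xc3\u2026\xc3\u2030')  # 'ÆØÅÉ'
lowercase = fix_mojibake('Bl\xc3\xa5b\xc3\xa6rsyltet\xc3\xb8y')  # 'Blåbærsyltetøy'
```

The lowercase example still works because Windows-1252 and ISO-8859-1 agree on every byte from 0xA0 upwards; only the 0x80-0x9F range differs.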

thebjorn

Gerrit Holl wrote:

Hei,

On 2006-10-30 08:25:41 +0100, thebjorn wrote:
def unfk(s):
    return eval(repr(s)[1:]).decode('utf-8')

....

Is there a less hack'ish way to do this?

Slightly less hackish:

return ''.join(chr(ord(c)) for c in s)

Much less hackish :-)

-- bjorn

Oct 30 '06 #5
