Unicode perplex

I've got an interesting little problem that I can't find an
answer to after hunting through the doc (2.3.3). I've
got a string that contains something that kind of
resembles an HTML document. On looking through
it, I find a <meta http-equiv="content-type"
content="text/html; charset=UTF-8"> tag.

The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.

I don't want to have to write a C language
extension, and I also don't want to have to write
it out to a file and read it back in. The product
involved (FIT) is distributed under the GPL[1], so
packages that don't have the same license (or
that aren't maintained across all systems which
support Python) aren't eligible.

It's also not possible to ask the service caller to
properly specify the string when they pass it to me.

Any ideas?

John Roth

[1] That wasn't my choice, so political comments
aren't relevant. Bitch at Ward Cunningham if you
want to bitch.
Jul 18 '05 #1
John Roth wrote:
Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.)


Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...

--Irmen
Jul 18 '05 #2
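
A minimal sketch of why the byte stream and the Unicode string are not the same data. It is written in Python 3 terms, which is an assumption: the thread concerns Python 2.3, where the byte string is a `str` and the decoded result is a `unicode` object. The sample text is made up.

```python
# Hypothetical sample: "café" stored as UTF-8 bytes (the é takes two bytes).
raw = b"caf\xc3\xa9"

# Pushing the bytes through the UTF-8 codec yields a string of characters.
text = raw.decode("utf-8")

print(len(raw))   # counts bytes
print(len(text))  # counts characters
```

The two lengths differ because one object holds encoded bytes and the other holds characters, so there is no way to relabel one as the other without decoding.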
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick


does

str2 = str.decode('utf-8')

work?
--
What part of "Ph'nglui mglw'nath Cthulhu R'lyeh wgah'nagl fhtagn" don't
you understand?
Jul 18 '05 #3
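
A short sketch of the `decode` call suggested above, again in Python 3 terms (an assumption; in Python 2.3 the same method exists on `str` and returns `unicode`). The helper name `to_unicode` and the page fragment are hypothetical, chosen only for illustration.

```python
def to_unicode(raw, encoding="utf-8"):
    """Decode a byte string into a Unicode string (hypothetical helper)."""
    return raw.decode(encoding)

# Made-up fragment; the bytes really are UTF-8, as the meta tag claims.
page = (b'<meta http-equiv="content-type" '
        b'content="text/html; charset=UTF-8"><p>na\xc3\xafve</p>')
decoded = to_unicode(page)

# With the default error handling, bytes that are not valid UTF-8
# raise UnicodeDecodeError instead of being silently accepted.
try:
    to_unicode(b"\xff\xfe broken")
    failed = False
except UnicodeDecodeError:
    failed = True
```

Since the caller cannot be asked to declare the encoding, catching `UnicodeDecodeError` is one way to notice input that does not match what the meta tag promises.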

"Irmen de Jong" <irmen@-nospam-remove-this-xs4all.nl> wrote in message
news:40*********************@news.xs4all.nl...
John Roth wrote:
Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.)
Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...


I see. I'm really very much a novice at unicode and all
the codec stuff. If I understand you, I need to get the
utf-8 codec and use the decode function to turn it into
a unicode string, and then use the encode function to
turn it back to a standard 8-bit string so I can write
it out (or send it down the pipe or socket...)

Thanks. Now that you point it out, it does look kind
of obvious - the second time.

John Roth

Jul 18 '05 #4
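
The decode-then-encode round trip described above can be sketched as follows. This is Python 3 syntax (an assumption; Python 2.3 uses the same `decode` and `encode` method names on `str` and `unicode`), and the sample data and the uppercase step are invented stand-ins for whatever processing happens in between.

```python
raw_in = b"r\xc3\xa9sum\xc3\xa9"   # made-up UTF-8 input bytes

text = raw_in.decode("utf-8")      # bytes -> Unicode characters
text = text.upper()                # work on characters, not on bytes
raw_out = text.encode("utf-8")     # characters -> bytes for a file, pipe or socket
```

Decoding at the input boundary and encoding at the output boundary keeps everything in the middle working on characters.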

"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:cb**********@bagan.srce.hr...
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
does

str2 = str.decode('utf-8')

work?


[dirty word]. Thanks. I knew I'd seen it before
somewhere; it just didn't occur to me to look in
the obvious place. It sure ought to.

Thanks.

John Roth

--
What part of "Ph'nglui mglw'nath Cthulhu R'lyeh wgah'nagl fhtagn" don't
you understand?

Jul 18 '05 #5
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.


you're making more assumptions about things you don't know anything
about than is really good for you. had you read any article on Python's
Unicode system, you'd have learned that UTF-8 is an encoding, while Python's
Unicode string type contains sequences of Unicode characters.

or in other words, if you have something that isn't a Python Unicode
string, and you want a Python Unicode string, you need to convert it.

more reading:

http://www.effbot.org/zone/unicode-objects.htm
http://www.reportlab.com/i18n/python..._tutorial.html
(slightly outdated; ignore installation/setup parts)
http://www.egenix.com/files/python/U...C2002-Talk.pdf

</F>


Jul 18 '05 #6
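
A small sketch of the distinction between an encoding and a sequence of characters, in Python 3 terms (an assumption; the thread is about Python 2.3). The sample character is made up. One Unicode string produces different byte sequences under different encodings, and each decodes back to the same characters.

```python
text = "\u20ac"  # made-up sample: a single character, the euro sign

utf8 = text.encode("utf-8")        # three bytes
utf16 = text.encode("utf-16-be")   # two bytes, big-endian, no byte order mark

# Both byte strings decode back to the same one-character Unicode string.
same = utf8.decode("utf-8") == utf16.decode("utf-16-be") == text
```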
