Unicode perplex

I've got an interesting little problem that I can't find an
answer to after hunting through the doc (2.3.3). I've
got a string that contains something that kind of
resembles an HTML document. On looking through
it, I find a <meta http-equiv="content-type"
content="text/html; charset=UTF-8"> tag.

The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.

I don't want to have to write a C language
extension, and I also don't want to have to write
it out to a file and read it back in. The product
involved (FIT) is distributed under the GPL[1], so
packages that don't have the same license (or
that aren't maintained across all systems which
support Python) aren't eligible.

It's also not possible to ask the service caller to
properly specify the string when they pass it to me.

Any ideas?

John Roth

[1] That wasn't my choice, so political comments
aren't relevant. Bitch at Ward Cunningham if you
want to bitch.
Jul 18 '05 #1
John Roth wrote:
Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.)


Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...

--Irmen
Jul 18 '05 #2
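
Irmen's point, that the UTF-8 byte stream and the Unicode string are different things, can be seen by comparing lengths before and after decoding. This is only a minimal sketch and it assumes Python 3, where `bytes` plays the role of the thread's "normal string" and `str` plays the role of the 2.3-era `unicode` type. The sample value is made up for illustration.

```python
# Python 3 sketch: `bytes` here corresponds to Python 2's `str`,
# and `str` here corresponds to Python 2's `unicode`.
raw = b"caf\xc3\xa9"        # UTF-8 encoded bytes; the last character takes 2 bytes
text = raw.decode("utf-8")  # push the bytes through the UTF-8 codec

print(len(raw))              # number of bytes
print(len(text))             # number of Unicode characters
print(text == "caf\u00e9")   # the two trailing bytes became one character
```

The byte count and the character count differ, so there is no way to get a Unicode string by relabelling the same bytes; the codec has to do the conversion.
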
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick


does

str2 = str.decode('utf-8')

work?
--
What part of "Ph'nglui mglw'nath Cthulhu R'lyeh wgah'nagl fhtagn" don't
you understand?
Jul 18 '05 #3
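
The `decode` call suggested above raises an error if the bytes are not valid UTF-8, which can happen when the declared charset is wrong. The sketch below shows one possible way to handle that, assuming Python 3 (in Python 2.3 the same method exists on `str` and returns `unicode`). The helper name `to_text` is hypothetical, and falling back to replacement characters is just one policy among several.

```python
def to_text(raw, encoding="utf-8"):
    """Decode bytes to text (hypothetical helper).

    Tries a strict decode first; if the input is not valid in the
    given encoding, decodes again with undecodable bytes replaced
    by U+FFFD instead of raising.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace")


good = to_text(b"caf\xc3\xa9")  # valid UTF-8
bad = to_text(b"caf\xe9")       # Latin-1 byte, not valid UTF-8
print(repr(good))
print(repr(bad))
```

Whether to replace, ignore, or reject bad input depends on the caller; a strict decode that fails loudly is often the safer default.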

"Irmen de Jong" <irmen@-nospam-remove-this-xs4all.nl> wrote in message
news:40*********************@news.xs4all.nl...
John Roth wrote:
Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.)
Which it isn't.

AFAIK Python's storage format for Unicode strings is
some form of 2-byte representation, it certainly isn't
UTF-8.

So if you want to turn your string into a Python Unicode
object, you really have to push it through the UTF-8 codec...


I see. I'm really very much a novice at unicode and all
the codec stuff. If I understand you, I need to get the
utf-8 codec and use the decode function to turn it into
a unicode string, and then use the encode function to
turn it back to a standard 8-bit string so I can write
it out (or send it down the pipe or socket...)

Thanks. Now that you point it out, it does look kind
of obvious - the second time.

John Roth
--Irmen

Jul 18 '05 #4
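
The round trip John describes (decode on the way in, work on Unicode, encode on the way out) might look like the sketch below. It assumes Python 3 and UTF-8 at both ends; the function name `process` and the upper-casing step are placeholders for whatever the real program does with the text.

```python
def process(raw):
    """Decode UTF-8 bytes, transform the text, and re-encode (hypothetical helper)."""
    text = raw.decode("utf-8")   # bytes -> Unicode string
    text = text.upper()          # stand-in for the real work, done on characters
    return text.encode("utf-8")  # Unicode string -> bytes for a file, pipe, or socket


out = process(b"caf\xc3\xa9")
print(out)
```

Doing the work on the decoded string matters: the upper-casing here changes a two-byte character correctly, which byte-level manipulation would not.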

"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:cb**********@bagan.srce.hr...
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
does

str2 = str.decode('utf-8')

work?


[dirty word]. Thanks. I knew I'd seen it before
somewhere; it just didn't occur to me to look in
the obvious place. It sure ought to.

Thanks.

John Roth

--
What part of "Ph'nglui mglw'nath Cthulhu R'lyeh wgah'nagl fhtagn" don't
you understand?

Jul 18 '05 #5
John Roth wrote:
The problem is that I've got a normal string where
the byte stream is actually UTF-8. How do I turn
it into a Unicode string? Remember that the trick
is that it's still going to have the *same* stream of
bytes (at least if the Unicode string is implemented
in UTF-8.) I don't need to convert it with a codec,
I need to change the class under the data.


you're making more assumptions about things you don't know anything
about than is really good for you. had you read any article on Python's
Unicode system, you'd have learned that UTF-8 is an encoding, while Python's
Unicode string type contains sequences of Unicode characters.

or in other words, if you have something that isn't a Python Unicode
string, and you want a Python Unicode string, you need to convert it.

more reading:

http://www.effbot.org/zone/unicode-objects.htm
http://www.reportlab.com/i18n/python..._tutorial.html
(slightly outdated; ignore installation/setup parts)
http://www.egenix.com/files/python/U...C2002-Talk.pdf

</F>


Jul 18 '05 #6
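
Since the original question mentions a `<meta ... charset=UTF-8>` tag inside the data, one way to pick the codec is to read the charset from the bytes before converting. This is a rough sketch assuming Python 3; the name `decode_html`, the regular expression, and the UTF-8 default are all illustrative assumptions, and a real HTML parser would be more robust than this pattern.

```python
import re

# Looks for charset=NAME in the raw bytes (illustrative pattern only).
META_CHARSET = re.compile(rb"""charset=["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def decode_html(raw, default="utf-8"):
    """Decode HTML-like bytes using the charset named in the data, if any."""
    match = META_CHARSET.search(raw)
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except LookupError:
        # The declared charset is not a codec Python knows; use the default.
        return raw.decode(default)


doc = (b'<meta http-equiv="content-type" '
       b'content="text/html; charset=UTF-8"><p>caf\xc3\xa9</p>')
page = decode_html(doc)
print(page)
```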
