
Q: a simple(?) raw-utf-8 conversion to internal type unicode "\304\246\311\231\316\257\316\271\303\222"

Hi,

Apologies first, as I am not a unicode expert... indeed the details
probably totally elude me. Notwithstanding: how can I convert a
binary string containing UTF-8 bytes into a python unicode string?

cutdown example:
$ cat ./uc.py
#!/usr/bin/env python
imported="\304\246\311\231\316\257\316\271\303\222 \317\216\317\203\305\224\304\271\304\220"
print "English/ASCII quoting:",'"'+imported+'"',"SUCCEEDS :-)" # xterm encoding if UTF8
print "German/ALCOR quoting:",u"\N{runic cross punctuation}"+"test"+u"\N{runic cross punctuation}","AOK :-)"
print "German/ALCOR quoting:",u"\N{runic cross punctuation}"+imported+u"\N{runic cross punctuation}","FAILS :-("

$ ./uc.py
English/ASCII quoting: "ĦəίιÒ ώσŔĹĐ" SUCCEEDS :-)
German/ALCOR quoting: *test* AOK :-)
German/ALCOR quoting:
Traceback (most recent call last):
  File "./uc.py", line 5, in <module>
    print "German/ALCOR quoting:",u"\N{runic cross punctuation}"+imported+u"\N{runic cross punctuation}","FAILS :-("
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

The last print statement fails because "imported" is a plain 8-bit
string whose bytes are UTF-8 encoded, but Python doesn't know that and
tries to decode them as ascii when joining them to the unicode strings.
How do I tell Python that "imported" is actually already UTF-8?

Cheers
NevilleDNZ

Jan 1 '07 #1
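
For anyone wanting to see the failure in isolation: the traceback above seems to come from Python implicitly decoding the byte string with the ascii codec during the concatenation. A minimal sketch, assuming that reading is right, does the same decode explicitly. The helper name try_decode is made up for illustration, and the b"" prefix is only there so the snippet also runs on Python 3.

```python
from __future__ import print_function  # lets the same file run on Python 2.6+ and 3

# First two characters of the string from the question, as raw UTF-8 bytes.
raw = b"\304\246\311\231"


def try_decode(data, codec):
    """Hypothetical helper: return (True, text) or (False, error message)."""
    try:
        return True, data.decode(codec)
    except UnicodeDecodeError as err:
        return False, str(err)


# Decoding as ascii fails on the very first byte (0xc4), as in the traceback.
ok_ascii, detail_ascii = try_decode(raw, "ascii")
print(ok_ascii, detail_ascii)

# Decoding as utf-8 succeeds: 4 bytes become 2 characters.
ok_utf8, text = try_decode(raw, "utf-8")
print(ok_utf8, len(raw), len(text))
```

Naming the codec explicitly, rather than relying on the default, is what avoids the error.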


It was just TOO easy... after posting my message to google groups, I
re-read the posting there and found that google had pointed me to a
python-unicode tutorial...
http://www.reportlab.com/i18n/python..._tutorial.html - exercise one :-)

Gosh, sometimes a google is worth so much more than ₁₀¹⁰⁰!

Happy New Year
NevilleD

It works now:
$ ./uc.py
English/ASCII quoting: "ĦəίιÒ ώσŔĹĐ" SUCCEEDS :-)
German/ALCOR quoting: *test* AOK :-)
German/ALCOR quoting: *ĦəίιÒ ώσŔĹĐ* FAILS :-(
nevilled@alfa:/root0/home/nevilled/Project/20 $ vi ./uc.py
nevilled@alfa:/root0/home/nevilled/Project/20 $ cat ./uc.py
#!/usr/bin/env python
imported=unicode("\304\246\311\231\316\257\316\271\303\222 \317\216\317\203\305\224\304\271\304\220","utf-8")
print "English/ASCII quoting:",'"'+imported+'"',"SUCCEEDS :-)" # xterm encoding if UTF8
print "German/ALCOR quoting:",u"\N{runic cross punctuation}test\N{runic cross punctuation}","AOK :-)"
print "German/ALCOR quoting:",u"\N{runic cross punctuation}"+imported+u"\N{runic cross punctuation}","Just TOO easy :-)"

$ ./uc.py
English/ASCII quoting: "ĦəίιÒ ώσŔĹĐ" SUCCEEDS :-)
German/ALCOR quoting: *test* AOK :-)
German/ALCOR quoting: *ĦəίιÒ ώσŔĹĐ* Just TOO easy :-)
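
The same fix can also be written with the decode method of the byte string, which as far as I know gives the same result as unicode(..., "utf-8") on Python 2 and is the spelling that carries over to Python 3. This is only a sketch: the variable names raw, text, cross and quoted are mine, not from the script above, and it prints only ASCII so it does not depend on the terminal encoding.

```python
from __future__ import print_function  # same file runs on Python 2.6+ and 3
import unicodedata

# The UTF-8 bytes from the script above, written as one byte-string literal.
raw = b"\304\246\311\231\316\257\316\271\303\222 \317\216\317\203\305\224\304\271\304\220"

# Tell Python explicitly which encoding the bytes are in.
text = raw.decode("utf-8")

# Look the quoting character up by name instead of using a \N{...} escape.
cross = unicodedata.lookup("RUNIC CROSS PUNCTUATION")
quoted = cross + text + cross  # all unicode now, so no implicit ascii decode

print(len(raw), len(text))  # 21 bytes, 11 characters
for ch in text[:3]:
    # Show the code point and official name of the first few characters.
    print("U+%04X" % ord(ch), unicodedata.name(ch))

# Encoding back to UTF-8 should give the original bytes.
print(text.encode("utf-8") == raw)
```

Once everything is unicode, the bytes only need to be encoded again at the point where they are written out.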

Jan 1 '07 #2
