Need help on UNICODE conversion

Hi,

Today I (a Python beginner) ran into a problem:

I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
(using Komodo).

How do I convert such a string to a real unicode string and then to
windows_1252 or latin1? I know it is text with
German umlauts.

I tried this:
if rawdata[:7] == "UNICODE":
    ustring = rawdata[7:]
    us2 = unicode(ustring, "windows_1252")
    as2 = us2.encode("windows_1252")
    self.dic["ComUNI"] = rawdata

But all I get at each stage is a normal string with lots of \\0x00.

TIA
Bernd

Jul 18 '05 #1


Bernd Preusing <b.********@web.de> writes:
After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
(using Komodo).


Can you find out what the real value of that string is? I very much
doubt that it contains literal backslashes. Also, I find it strange
that it has the letter 'o' after one backslash, but the number '0'
after all other backslashes.

Regards,
Martin
Jul 18 '05 #2

Bernd Preusing wrote:
I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
(using Komodo).


Seems that this is not properly cut and pasted :-(

I suppose that "\\0x00" is just a complicated replacement for "\x00" used by
the debugger. As long as all characters are in the range 0..255, you could
simply remove every other character:
>>> "XHXeXlXlXoX XWXoXrXlXd"[1::2]
'Hello World'


Use 8 instead of 1 as start index to also remove "UNICODE".
That might eliminate the need for a unicode string, or you could easily
create one from the "normal" string.
Peter
Jul 18 '05 #3
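
A short sketch of the slicing idea from the post above, written for Python 3 (the thread itself uses Python 2). The byte values are taken from the hex dump Bernd posts later in this thread, and the variable names are only illustrative. With those bytes the first payload character sits at index 9, so the slice starts there. The trick only holds while every character is below 256.

```python
# Bytes as shown in the hex dump later in this thread: an 8-byte
# "UNICODE\x00" header followed by big-endian UTF-16 text.
raw = b"UNICODE\x00" + (
    "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")
)

# Each character is stored as a zero high byte plus one low byte, so
# taking every second byte from index 9 keeps only the low bytes.
low_bytes = raw[9::2]

# The low bytes are valid Latin-1 here because all code points are < 256.
text = low_bytes.decode("latin-1")
print(repr(text))
```

If the comment ever contained a character outside Latin-1 (a euro sign, for instance), the dropped high bytes would no longer be zero and this shortcut would silently corrupt the text.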

Bernd Preusing wrote:
I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
(using Komodo).


As others have pointed out, this seems to be an unfaithful cut and
paste; to really tell what it is we'd have to see the actual contents of
the string. If it is really Unicode, however, it looks like it might be
a UTF-16 encoding. Try 'utf-16' for the encoding name.

--
Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \ You're wasting time / Asking what if / You linger on too long
\__/ Chante Moore
Jul 18 '05 #4

Erik Max Francis <ma*@alcyone.com> wrote:
Bernd Preusing wrote:
I have a JPG file which contains some comment as unicode.

After reading in the string with s=file.read(70) from file offset 4
I get a string which is shown as
'UNICODE\\0x00\\ox00K\\0x00o' and so forth in the debugger
(using Komodo).


As others have pointed out, this seems to be an unfaithful cut and
paste; to really tell what it is we'd have to see the actual contents of
the string. If it is really Unicode, however, it looks like it might be
a UTF-16 encoding. Try 'utf-16' for the encoding name.


Yes, sorry. Cut & paste was not possible, so I wrote it down
with some errors, very tired and frustrated :-(
I had tried to attach a small screenshot, but this is no binary news
group...

My first mistake was cutting off only the first 7 bytes; I had to
remove 8.

The byte array is
0000: 55 4e 49 43 4f 44 45 00 00 4b 00 6f 00 6d 00 6d UNICODE..K.o.m.m
0010: 00 65 00 6e 00 74 00 61 00 72 00 20 00 55 00 6e .e.n.t.a.r. .U.n
0020: 00 69 00 63 00 6f 00 64 00 65 00 20 00 2a 00 e4 .i.c.o.d.e. .*..
0030: 00 f6 00 fc 00 c4 00 d6 00 dc 00 df 00 2a 00 0d
0040: 00 0a 00 0d 00 0a

I had to cut off the beginning, which is "UNICODE\x00".
The remainder means "Kommentar Unicode *äöüÄÖÜß*"
(this contains German umlauts at the end)

Now I have a string
ustring = "\x00K\x00o\x00m....."

us2 = unicode(ustring, "utf_16")
yields: UnicodeDecodeError: 'utf16' codec can't decode bytes in
position 48-49: illegal encoding

Strange, because that position is at "00 dc" and not earlier!?

According to your tips I stripped off all remaining \x00 and got
"Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n"

I can go on with that string now :-))
But what would have been the "right" way?

Thanks again
Bernd

Jul 18 '05 #5
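
One plausible explanation for the error position, sketched in Python 3 (the thread uses Python 2's unicode()): without a byte order mark, the "utf-16" codec falls back to the machine's native byte order, which is little-endian on most PCs. Read that way, the byte pair 00 dc becomes the code unit 0xDC00, a lone low surrogate and the first invalid unit in the data, whereas earlier pairs such as 00 e4 merely decode to the wrong characters. The names below are illustrative, and "utf-16-le" is spelled out so the result does not depend on the host.

```python
# The 62 payload bytes that follow the "UNICODE\x00" header in the dump.
payload = "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

# Decode with the wrong (little-endian) byte order and record where it fails.
try:
    payload.decode("utf-16-le")
    error_span = None
except UnicodeDecodeError as exc:
    error_span = (exc.start, exc.end)
print(error_span)

# The offending pair, interpreted as a little-endian 16-bit code unit.
bad_unit = int.from_bytes(payload[48:50], "little")
print(hex(bad_unit))

# With the byte order stated explicitly the same bytes decode cleanly.
text = payload.decode("utf-16-be")
print(repr(text))
```

Code units in the range 0xD800 to 0xDFFF are reserved for surrogate pairs, which is why this pair is rejected while the swapped units before it are accepted.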

Erik Max Francis <ma*@alcyone.com> writes:
>>> u = unicode(codecs.BOM_UTF16_BE + u, 'utf-16')
>>> u
u'Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n'

... which I can convert to Latin-1 and print to then see the umlauts and
the double S.


It is better to use "utf-16-be" as the codec name in the first place,
instead of artificially prepending a BOM, and letting the UTF-16 codec
determine byte order.

Regards,
Martin

Jul 18 '05 #6
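
A minimal Python 3 sketch comparing the two approaches discussed above, followed by the conversion to Latin-1 and windows-1252 that the original question asked for. The thread's own code is Python 2; the variable names here are illustrative, and the payload is rebuilt from the hex dump in post #5.

```python
import codecs

# Payload bytes after the 8-byte "UNICODE\x00" header (big-endian UTF-16).
payload = "Kommentar Unicode *\xe4\xf6\xfc\xc4\xd6\xdc\xdf*\r\n\r\n".encode("utf-16-be")

# Approach 1: prepend a big-endian BOM and let the generic codec detect it.
with_bom = (codecs.BOM_UTF16_BE + payload).decode("utf-16")

# Approach 2: name the byte order directly, as Martin recommends.
direct = payload.decode("utf-16-be")

# Both routes yield the same text; the second needs no extra bytes.
same = with_bom == direct

# Encode the text into the single-byte encodings from the question.
latin1_bytes = direct.encode("latin-1")
cp1252_bytes = direct.encode("cp1252")
print(same, latin1_bytes == cp1252_bytes)
```

For this particular text the Latin-1 and windows-1252 results are byte-for-byte identical, since the two encodings only differ in the 0x80 to 0x9F range.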
