
convert string with raw binary data to unicode

Hi,

I want to pass raw binary data from a file to a COM object. I read the data
from file like this:

data = file('path_to_file','rb').read()

When passed to a COM object, data is converted to unicode the way one would
expect for strings, i.e. a lot of zero bytes are filled in. Instead, I want
every two characters (bytes) of data to be interpreted as one unicode
character. I read the documentation on codecs but cannot find a suitable
codec. I also tried to read the data like this:

data = codecs.open('path_to_file','rb','???').read()

I tried to use UCS2 for the ???, but this encoding does not exist. A posting
found via Google suggests using UTF-16, but this is not the same and raises
an error.

This shouldn't be a big problem, but I can't figure out how to solve it. Can
anybody help?

regards,
Achim
Jul 18 '05 #1



If I understand your problem correctly, you want to construct a unicode
object containing arbitrary data in its internal buffer.

And if I understand Python's unicode implementation correctly, then I
would say it isn't possible, since unicode objects do not contain
binary data; they contain characters (or whatever they are called in the
unicode world).

OTOH, it should be possible to write a small extension wrapping the
PyUnicode_FromUnicode() function to accept arbitrary data.

Is there also a possibility to write a codec which does this?

Note that the 'if's above are probably big 'if's...
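
The point about characters versus binary data can be illustrated with a small sketch. It is written for Python 3 (an assumption; the thread predates it), and the helper names `code_units` and `surrogate_units` are hypothetical. It views the raw bytes as 16-bit values and picks out those in the surrogate range 0xD800 to 0xDFFF, which are not characters on their own:

```python
import struct


def code_units(raw, little_endian=True):
    """Interpret raw bytes as a tuple of unsigned 16-bit code units."""
    if len(raw) % 2:
        raise ValueError("need an even number of bytes")
    fmt = ("<" if little_endian else ">") + "%dH" % (len(raw) // 2)
    return struct.unpack(fmt, raw)


def surrogate_units(raw, little_endian=True):
    """Return the code units that fall in the surrogate range."""
    return [u for u in code_units(raw, little_endian) if 0xD800 <= u <= 0xDFFF]


# 'A' followed by a lone high surrogate (0xD800), stored little-endian.
sample = b"A\x00\x00\xd8"
print([hex(u) for u in code_units(sample)])
print([hex(u) for u in surrogate_units(sample)])
```

Arbitrary binary data can contain such values, which is one reason a strict UTF-16 decode may reject it.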

Thomas
Jul 18 '05 #2

Achim Domma:
> data = codecs.open('path_to_file','rb','???').read()
>
> I tried to use UCS2 for the ???, but this encoding does not exist. A posting
> found via google supposes to use UTF-16 but this is not the same and raises
> an error.


It is better to include the error message when posting questions to a
newsgroup. You may want to look at the 'errors' argument, which can be one of:

'strict' Raise ValueError (or a subclass); this is the default.
'ignore' Ignore the character and continue with the next.
'replace' Replace with a suitable replacement character.
'xmlcharrefreplace' Replace with the appropriate XML character reference.
'backslashreplace' Replace with backslashed escape sequences.

Take a look at the results after using, say, 'backslashreplace' and you
may find that much of your file is not UTF-16 or that it is byte swapped or
that there are just a few bad characters in a header or similar.
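
A minimal sketch of that experiment, assuming Python 3.5 or later (where 'backslashreplace' is also accepted when decoding; 'xmlcharrefreplace' only applies to encoding). The sample bytes and the helper name `try_decode` are made up for illustration:

```python
# 'A', a lone high surrogate (invalid on its own in UTF-16), then 'B'.
raw = b"A\x00" + b"\x00\xd8" + b"B\x00"


def try_decode(data, errors):
    """Decode little-endian UTF-16 and report what happened."""
    try:
        return data.decode("utf-16-le", errors)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc


# Compare how each error handler treats the invalid code unit.
for handler in ("strict", "ignore", "replace", "backslashreplace"):
    print(handler, repr(try_decode(raw, handler)))
```

Running the same loop over the first few hundred bytes of the real file should show whether the failures are isolated or spread throughout.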

Neil
Jul 18 '05 #3


Try utf-16-le or utf-16-be (depending on the endianness of the data) as the
encoding.
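
A minimal sketch of this suggestion, assuming Python 3 and a file whose length is an even number of bytes. The helper name `read_utf16` and the temporary demo file are hypothetical. Unlike plain 'utf-16', the '-le' and '-be' variants do not look for a byte order mark, so every two bytes are decoded as one code unit in the stated order:

```python
import os
import tempfile


def read_utf16(path, little_endian=True):
    """Read a file's raw bytes and decode each 2-byte unit as one character."""
    encoding = "utf-16-le" if little_endian else "utf-16-be"
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding)


# Demo with a temporary file holding "Hi" as little-endian 16-bit units.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"H\x00i\x00")
text = read_utf16(demo_path)
os.remove(demo_path)
print(repr(text), len(text))
```

This only works when the data really is valid UTF-16; byte pairs that form unpaired surrogates will still raise an error under the default 'strict' handling.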
--
Sjoerd Mullender <sj****@acm.org>

Jul 18 '05 #4
