Binary strings, unicode and encodings

Maybe you have a minute to clarify the following matter...

Consider:

---

from cStringIO import StringIO

def bencode_rec(x, b):
    t = type(x)

    if t is str:
        b.write('%d:%s' % (len(x), x))
    else:
        assert 0

def bencode(x):
    b = StringIO()

    bencode_rec(x, b)

    return b.getvalue()

---

Now, if I write bencode('failure reason') into a socket, what will I get
on the other side of the connection?

a) A sequence of bytes where each byte represents an ASCII character

b) A sequence of bytes where each byte represents the UTF-8 encoding of a
Unicode character

c) It depends on the system locale/it depends on what the site module
specifies using setdefaultencoding(name)

---

So, if a Python client in China connects to a Python server in Europe,
must they be careful to specify a common encoding on both sides of the
connection?

Regards,

L.
Jul 18 '05 #1
Laurent Therond wrote:

> Consider:
> ---
> from cStringIO import StringIO
>
> def bencode_rec(x, b):
>     t = type(x)
>     if t is str:
>         b.write('%d:%s' % (len(x), x))
>     else:
>         assert 0

The above is confusing. Why not just do

def bencode_rec(x, b):
    assert type(x) is str
    b.write(.....)

Why the if/else etc?

> def bencode(x):
>     b = StringIO()
>     bencode_rec(x, b)
>     return b.getvalue()
>
> ---
> Now, if I write bencode('failure reason') into a socket, what will I get
> on the other side of the connection?

This is Python. Why not try it and see? I wrote a quick test at
the interactive prompt and concluded that StringIO converts to
strings, so if your input is Unicode it has to be encodeable or
you'll get the usual exception.

> a) A sequence of bytes where each byte represents an ASCII character

Yes, provided your input is exclusively ASCII (7-bit) data.

> b) A sequence of bytes where each byte represents the UTF-8 encoding of a
> Unicode character

Yes, if UTF-8 is your default encoding and you're using Unicode input.

> c) It depends on the system locale/it depends on what the site module
> specifies using setdefaultencoding(name)

Yes, as it always does if you are using Unicode but converting to byte strings
as it appears StringIO does.

-Peter
Jul 18 '05 #2
Peter Hansen <pe***@engcorp.com> wrote in message news:<40***************@engcorp.com>...
> The above is confusing. Why not just do
>
> def bencode_rec(x, b):
>     assert type(x) is str
>     b.write(.....)
>
> Why the if/else etc?

That's a code extract. The real code was more complicated.

> This is Python. Why not try it and see? I wrote a quick test at
> the interactive prompt and concluded that StringIO converts to
> strings, so if your input is Unicode it has to be encodeable or
> you'll get the usual exception.

Good point. Sorry, I don't have those good reflexes--I am new to
Python.

So, your test revealed that StringIO converts to byte strings.
Does that mean:
- If the input string contains characters that cannot be encoded
in ASCII, bencode_rec will fail?

Yet, if your locale specifies UTF-8 as the default encoding, it should
not fail, right?

Hence, I conclude your test was made on a system that uses ASCII/ISO
8859-1 as its default encoding. Is that right?

> > a) A sequence of bytes where each byte represents an ASCII character
>
> Yes, provided your input is exclusively ASCII (7-bit) data.

OK.

> > b) A sequence of bytes where each byte represents the UTF-8 encoding of a
> > Unicode character
>
> Yes, if UTF-8 is your default encoding and you're using Unicode input.

OK.

> > c) It depends on the system locale/it depends on what the site module
> > specifies using setdefaultencoding(name)
>
> Yes, as it always does if you are using Unicode but converting to byte strings
> as it appears StringIO does.

Umm...not sure here...I think StringIO must behave differently
depending on your locale and depending on how you assigned the string.

Thanks for your help!

L.
Jul 18 '05 #3
I forgot to ask something else...

If a client and a server run on locales/platforms that use different
encodings, they are bound to wrongly interpret string bytes. Correct?
Jul 18 '05 #4
I used the interpreter on my system:
>>> import sys
>>> sys.getdefaultencoding()
'ascii'

OK

>>> from cStringIO import StringIO
>>> b = StringIO()
>>> b.write('%d:%s' % (len('string'), 'string'))
>>> print b.getvalue()
6:string

OK

>>> c = StringIO()
>>> c.write('%d:%s' % (len('stringé'), 'stringé'))
>>> print c.getvalue()
7:stringé

OK

Did StringIO just recognize Extended ASCII?
Did StringIO just recognize ISO 8859-1?

é belongs to Extended ASCII AND ISO 8859-1.

>>> print c.getvalue().decode('US-ASCII')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal not in range(128)
>>> print c.getvalue().decode('ISO-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8: character maps to <undefined>


OK

It must have been Extended ASCII, then.

I must do other tests.
Jul 18 '05 #5
On Thu, Jan 15, 2004 at 11:38:39AM -0800, Laurent Therond wrote:
> Maybe you have a minute to clarify the following matter...
>
> Consider:
>
> ---
>
> from cStringIO import StringIO
>
> def bencode_rec(x, b):
>     t = type(x)
>
>     if t is str:
>         b.write('%d:%s' % (len(x), x))
>     else:
>         assert 0
>
> def bencode(x):
>     b = StringIO()
>
>     bencode_rec(x, b)
>
>     return b.getvalue()
>
> ---
>
> Now, if I write bencode('failure reason') into a socket, what will I get
> on the other side of the connection?
>
> a) A sequence of bytes where each byte represents an ASCII character

Yes.

> b) A sequence of bytes where each byte represents the UTF-8 encoding of a
> Unicode character

Coincidentally, yes. This is not because the unicode you wrote to the
socket is encoded as UTF-8 before it is sent, but because the *non*-unicode
you wrote to the socket *happened* to be a valid UTF-8 byte string (All
ASCII byte strings fall into this coincidental case).

> c) It depends on the system locale/it depends on what the site module
> specifies using setdefaultencoding(name)

Not at all. 'failure reason' isn't unicode, there are no unicode
transformations going on in the example program, the default encoding is
never used and has no effect on the program's behavior.

bencode_rec has an assert in it for a reason. *Only* byte strings can be
sent using it. If you want to send unicode, you'll have to encode it
yourself and send the encoded bytes, then decode it on the other end. If
you choose to depend on the default system encoding, you'll probably end up
with problems, but if you explicitly select an encoding yourself, you won't.

Jp

Jul 18 '05 #6
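
Jp's advice (encode explicitly, send bytes, decode with the same encoding on the other end) can be sketched roughly as follows. This is only an illustration in Python 3 syntax, whereas the thread itself concerns Python 2.3. The helper names bencode_bytes, bencode_text and bdecode_text are hypothetical and not part of the original code, and the choice of UTF-8 is an assumption; any encoding works as long as both ends agree on it.

```python
def bencode_bytes(data: bytes) -> bytes:
    """Length-prefix a byte string in the same shape bencode_rec writes: <len>:<data>."""
    if not isinstance(data, bytes):
        raise TypeError("only byte strings can be bencoded")
    return str(len(data)).encode("ascii") + b":" + data


def bencode_text(text: str, encoding: str = "utf-8") -> bytes:
    """Sender side (hypothetical helper): encode explicitly, then length-prefix."""
    return bencode_bytes(text.encode(encoding))


def bdecode_text(payload: bytes, encoding: str = "utf-8") -> str:
    """Receiver side (hypothetical helper): strip the prefix, decode with the agreed encoding."""
    length, _, rest = payload.partition(b":")
    return rest[: int(length)].decode(encoding)


wire = bencode_text("stringé")
print(wire)                # the length prefix counts bytes, not characters
print(bdecode_text(wire))
```

Note that the length prefix is computed after encoding, so it describes the bytes that actually travel over the socket.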
Laurent Therond wrote:
> Now, if I write bencode('failure reason') into a socket, what will I get
> on the other side of the connection?

Jp has already explained this, but let me stress his observations.

> a) A sequence of bytes where each byte represents an ASCII character

A sequence of bytes, period. 'failure reason' is a byte string. The
bytes in this string are literally copied from the source code .py file
to the cStringIO object.

If your source code was in an encoding that is an ASCII superset
(such as ascii, iso-8859-1, cp1252), then yes: the text 'failure reason'
will come out as a byte string representing ASCII characters.

Python has a second, independent string type, called unicode. Literals
of that type are not simply written in quotes, but with a leading u,
as in u''.

You should never use the unicode type in a place where byte strings
are expected. Python will apply the system default encoding to these,
which gives exceptions if the Unicode characters are outside the
characters supported in the system default encoding (which is us-ascii).

You also should avoid byte string literals with non-ASCII characters
such as 'stringé'; use unicode literals. The user invoking your script
may use a different encoding on his system, so he would get mojibake,
as the last character in the string literal does *not* denote
LATIN SMALL LETTER E WITH ACUTE, but instead denotes the byte '\xe9'
(which is that character only if you use a latin-1-like encoding).

HTH,
Martin

Jul 18 '05 #7
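
Martin's point that a byte is not a character can be checked with a small sketch. It is written in Python 3 syntax (an assumption, since the thread predates it) and uses a hypothetical helper, interpretations, to decode the same byte under latin-1 and cp437, the two encodings that show up in this thread.

```python
def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under two codecs and return both readings."""
    return {codec: raw.decode(codec) for codec in ("latin-1", "cp437")}


# 0xE9 is e-acute under latin-1; 0x82 is e-acute under cp437.
for raw in (b"\xe9", b"\x82"):
    readings = interpretations(raw)
    print(repr(raw), {codec: ascii(ch) for codec, ch in readings.items()})
```

The byte values never change; only the codec chosen by the reader decides which character they stand for.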
Laurent Therond wrote:

> So, your test revealed that StringIO converts to byte strings.
> Does that mean:
> - If the input string contains characters that cannot be encoded
> in ASCII, bencode_rec will fail?

Yes, if your default encoding is ASCII.

> Yet, if your locale specifies UTF-8 as the default encoding, it should
> not fail, right?

True, provided you are actually creating UTF-8 strings... just sticking
in a character that has the 8th bit set doesn't mean the string is UTF-8
of course.

> Hence, I conclude your test was made on a system that uses ASCII/ISO
> 8859-1 as its default encoding. Is that right?

Correct, Windows 98, sys.getdefaultencoding() returns 'ascii'.

> > > c) It depends on the system locale/it depends on what the site module
> > > specifies using setdefaultencoding(name)
> >
> > Yes, as it always does if you are using Unicode but converting to byte strings
> > as it appears StringIO does.
>
> Umm...not sure here...I think StringIO must behave differently
> depending on your locale and depending on how you assigned the string.

It's always possible that StringIO takes locale into account in some
special way, but I suspect it does not. As for "how you assigned the string"
I'm not sure I understand what that might mean. How many ways do you know
to assign a string in Python?

-Peter
Jul 18 '05 #8
Laurent Therond wrote:

> I forgot to ask something else...
>
> If a client and a server run on locales/platforms that use different
> encodings, they are bound to wrongly interpret string bytes. Correct?


Since the byte strings are by definition *encoded* forms of the Unicode
data, they definitely need to have a shared frame of reference or they
will misinterpret the data, as you surmise. You can't decode something
if you don't know how it was encoded.

-Peter
Jul 18 '05 #9
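
A short sketch of what "misinterpret the data" looks like in practice. It uses Python 3 syntax (an assumption; the thread is about Python 2.3), and the sender and receiver here are just local variables standing in for the two ends of a socket.

```python
text = "stringé"

# Hypothetical sender that encodes as UTF-8 before writing to the socket.
sent = text.encode("utf-8")

# A receiver that agrees on the encoding recovers the original text.
agreed = sent.decode("utf-8")

# A receiver that assumes latin-1 gets no exception at all,
# only silently wrong characters.
mismatched = sent.decode("latin-1")

print(agreed == text)
print(ascii(mismatched))
```

The mismatch case is the dangerous one because nothing fails: every byte sequence is valid latin-1, so the error only shows up as garbled text.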
Laurent Therond wrote:

> I used the interpreter on my system:
>
> >>> c = StringIO()
> >>> c.write('%d:%s' % (len('stringé'), 'stringé'))
> >>> print c.getvalue()
> 7:stringé
>
> OK
>
> Did StringIO just recognize Extended ASCII?
> Did StringIO just recognize ISO 8859-1?
>
> é belongs to Extended ASCII AND ISO 8859-1.

No, StringIO didn't "recognize" anything but a simple string. There is
no issue of codecs and encoding and such going on here, because you are
sending in a string (as it happens, one that's not pure 7-bit ASCII, but
that's irrelevant though it may be the cause of your confusion) and getting
out a string. StringIO does not make any attempt to "encode" something that
is already a string.

> >>> print c.getvalue().decode('US-ASCII')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 8: ordinal not in range(128)
> >>> print c.getvalue().decode('ISO-8859-1')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "C:\Python23\lib\encodings\cp437.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\x82' in position 8: character maps to <undefined>
>
> OK
>
> It must have been Extended ASCII, then.

Hmm... note that when you decode that string, you are then
attempting to print a unicode rather than a string. When you try to
print that on your console, it must first be encoded again, using the
console's encoding (cp437, according to your traceback). I think you
know this, but in case you didn't: it explains why you got a DecodeError
in the first case, but an EncodeError in the second. The second example
worked, treating the string as having been encoded using ISO-8859-1, and
returned a unicode. If you had assigned it instead of printing it, you
should have seen no errors.

-Peter
Jul 18 '05 #10
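
The session discussed in this post can be replayed step by step. The sketch below uses Python 3 syntax and io.BytesIO as a stand-in for cStringIO.StringIO; both are assumptions, since the thread concerns Python 2.3 and the two classes are not identical. The explicit encode("cp437") imitates what print had to do on the cp437 console named in the traceback.

```python
from io import BytesIO

# The bytes from the session: "string" plus byte 0x82,
# which is how a cp437 console stores e-acute.
payload = b"string\x82"

# Bytes in, bytes out: the buffer stores what it is given, no codec involved.
buf = BytesIO()
buf.write(b"%d:%s" % (len(payload), payload))  # bytes %-formatting needs Python 3.5+
data = buf.getvalue()

# Decoding as ASCII fails because 0x82 is outside range(128).
decode_error = None
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding as ISO-8859-1 succeeds: every byte maps to some code point.
text = data.decode("iso-8859-1")

# Printing on a cp437 console means encoding to cp437, and U+0082
# has no cp437 representation, hence the second traceback.
encode_error = None
try:
    text.encode("cp437")
except UnicodeEncodeError as exc:
    encode_error = exc

print(data)
print(type(decode_error).__name__)
print(type(encode_error).__name__)
```

Decoding the same bytes with cp437, the encoding they were actually typed in, is the only choice here that yields the intended text.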
Peter, thank you for taking the time to answer.

I will need some time to digest this information.

From where I stand, a Python newbie who knows more about Java, this
concept of binary string is puzzling. I wish Python dealt in Unicode
natively, as Java does. It makes things a lot easier to comprehend.
Having strings be byte arrays, on the other hand, seems to confuse me.
Jul 18 '05 #11

"Laurent Therond" <go****@axiomatize.com> wrote in message news:26**************************@posting.google.c om...
> Peter, thank you for taking the time to answer.
>
> I will need some time to digest this information.
>
> From where I stand, a Python newbie who knows more about Java, this
> concept of binary string is puzzling. I wish Python dealt in Unicode
> natively, as Java does. It makes things a lot easier to comprehend.

Python does deal with Unicode natively. You just need to put a u
character before the string. This is of course a violation of the rule
"There should be one-- and preferably only one --obvious way to do it."
'a' == u'a'. But remember that Python appeared before Unicode,
so strings in Python could not be unicode strings from the beginning.

> Having strings be byte arrays, on the other hand, seems to confuse me.

Use unicode strings only.

-- Serge.
Jul 18 '05 #12
