possible unicode bug in implicit string concatenation?

Hi team! While troubleshooting a crash in BitTorrent, where the
torrent's target file names didn't fall into the ASCII range, I was
playing around in the interpreter and noticed this behaviour:
>>> u'\u12345' + 'foo'
u'\u12345foo'
>>> u'\u12345' u'foo'
u'\u12345foo'
>>> u'\u12345' + u'foo'.encode('ascii')
u'\u12345foo'
>>> u'\u12345' u'foo'.encode('ascii')
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in
position 0: ordinal not in range(128)


Is this a bug, or is my understanding of how Python works flawed? I
tried tracing it within the interpreter itself but got lost after a
little while... I'm familiar with the interpreter loop, but not the
parser, and I suspect this is something to do with implicit string
concatenation being parsed differently from the explicit version, i.e.
the explicit version uses the + operator slot, while the implicit
version does something else. Any ideas?

Fahd Khan
ICON | Clinical Research
W: 281-295-4834
Jul 18 '05 #1


Fahd Khan wrote:
> u'\u12345' u'foo'.encode('ascii')
>
> Traceback (most recent call last):
> File "<interactive input>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1234' in
> position 0: ordinal not in range(128)
> Is this a bug, or is my understanding of how Python works flawed?


Yes :-) Your understanding is flawed.
> I
> tried tracing it within the interpreter itself but got lost after a
> little while... I'm familiar with the interpreter loop, but not the
> parser, and I suspect this is something to do with implicit string
> concatenation being parsed differently from the explicit version, i.e.
> the explicit version uses the + operator slot, while the implicit
> version does something else. Any ideas?


During parsing, adjacent string literals are concatenated into a
single literal, and the result is the same as with +. So the
expression at the top of this message is the same as
u'\u12345foo'.encode('ascii'). That fails because \u1234 cannot be
encoded in ASCII. Now,

u'\u12345'+u'foo'.encode('ascii')

is something completely different: .encode binds more tightly than +,
so it applies only to u'foo', and the concatenation does not happen
during parsing, but only at execution. The expression is evaluated
as follows:

u'foo'.encode('ascii') gives 'foo'
u'\u12345'+'foo' finds that Unicode and byte strings are to be
added. This causes the byte string to be coerced to Unicode,
computing
'foo'.decode(sys.getdefaultencoding())
sys.getdefaultencoding() gives 'ascii'
'foo'.decode('ascii') gives u'foo'
u'\u12345'+u'foo' gives u'\u12345foo'
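
For anyone who wants to check this without reading the parser source, here is a rough sketch using the standard ast module. It is written for Python 3.8 or later, not the Python 2 interpreter used in this thread, so two things are assumptions on my part: the literal is read through the node's .value attribute, and the implicit byte-string-to-Unicode coercion (which Python 3 no longer performs) is spelled out by hand with a hard-coded 'ascii' standing in for Python 2's sys.getdefaultencoding(). The variable names are made up for illustration.

```python
import ast

# Parse (without running) the two spellings from the thread.
implicit = ast.parse(r"u'\u12345' u'foo'.encode('ascii')", mode="eval").body
explicit = ast.parse(r"u'\u12345' + u'foo'.encode('ascii')", mode="eval").body

# Implicit form: one merged literal, and .encode is called on all of it.
print(type(implicit).__name__)             # Call
merged_literal = implicit.func.value.value

# Explicit form: a + node whose right operand is the .encode call on u'foo'.
print(type(explicit).__name__)             # BinOp
encoded_operand = explicit.right.func.value.value

# Runtime steps for the explicit form, spelled out by hand.
DEFAULT_ENCODING = "ascii"   # assumption: Python 2's sys.getdefaultencoding()
right_bytes = u"foo".encode("ascii")               # byte string
right_text = right_bytes.decode(DEFAULT_ENCODING)  # coerced back to Unicode
result = u"\u12345" + right_text

# The implicit form encodes the merged literal, which includes \u1234.
try:
    merged_literal.encode("ascii")
    implicit_fails = False
except UnicodeEncodeError:
    implicit_fails = True
```

In the implicit case the parser has already glued the two literals together before .encode is looked up, so the non-ASCII character is part of what gets encoded. In the explicit case only u'foo' is encoded, and the + runs afterwards.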

Regards,
Martin
Jul 18 '05 #2
