Bytes IT Community

Why does the "".join(r) do this?

Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc
and def is a circle-R, while between the second two is a black oval
with a white question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)
Jul 18 '05 #1
16 Replies


Jim Hefferon wrote:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?


It can't just concatenate because your list contains other
items which are unicode strings. Python is attempting to convert
your strings to unicode strings to do the join, and it fails
because your strings contain characters which don't have
meaning to the default decoder.
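[Editor's note: in today's Python 3 the same rule survives, but the implicit decode is gone entirely: bytes and str never mix, and you decode explicitly. This sketch (variable names are mine, not from the thread) shows the ASCII decode that Python 2's join attempted behind the scenes:]

```python
# Python 3 sketch of the decode Python 2's join performed implicitly.
raw = b"abc\xaedef"  # a byte string with a "high" byte, like chr(174)

# An explicit codec succeeds:
text = raw.decode("latin-1")
print(repr(text))

# The default ASCII codec fails on byte 0xAE, just as in the traceback:
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii codec failed at position", exc.start)  # position 3
```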

-Peter
Jul 18 '05 #2


Jim> I'm building up a web page by stuffing an array and then doing
Jim> "".join(r) at the end. I intend to later encode it as 'latin1', so
Jim> I'd like it to just concatenate. While I can work around this
Jim> error, the reason for it escapes me.

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

Skip

Jul 18 '05 #3

Jim Hefferon wrote:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?


Let's reduce the problem to its simplest case:
>>> unichr(174) + chr(174)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'


Use either unicode or str, but don't mix them. That should keep you out of
trouble.

Peter

Jul 18 '05 #4

Skip Montanaro wrote:
Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.


This is bound to fail when the first non-ascii str occurs:
>>> u"".join(["a", "b"])
u'ab'
>>> u"".join(["a", chr(174)])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

Apart from that, Python automatically switches to unicode if the list
contains unicode items:

>>> "".join(["a", u"o"])
u'ao'

Peter

Jul 18 '05 #5

Jim Hefferon wrote:
Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc
and def is a circle-R, while between the second two is a black oval
with a white question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)


What about unichr() ?
#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"
print t
print(u"next: %s :there" % (t),)
print t
r=["x",'y',u'z']
r.append(t)
# k=u"".join(r)
k="".join(r)
print k
// moma
http://www.futuredesktop.org
Jul 18 '05 #7


Peter> Skip Montanaro wrote:
Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.


Peter> This is bound to fail when the first non-ascii str occurs:

...

Yeah I realized that later. I missed that he was appending non-ASCII
strings to his list. I thought he was only appending unicode objects and
ASCII strings (in which case what he was trying should have worked). Serves
me right for trying to respond with a head cold.

Skip

Jul 18 '05 #8

Peter Otten wrote:
Skip Montanaro wrote:

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:


Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)
--
C isn't that hard: void (*(*f[])())() defines f as an array of
unspecified size, of pointers to functions that return pointers to
functions that return void.
Jul 18 '05 #9

"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:c8**********@bagan.srce.hr...
Peter Otten wrote:
Skip Montanaro wrote:

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:


Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)


Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.

John Roth

--
C isn't that hard: void (*(*f[])())() defines f as an array of
unspecified size, of pointers to functions that return pointers to
functions that return void.

Jul 18 '05 #10

John Roth wrote:
"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:c8**********@bagan.srce.hr...

Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)


Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.


As a str does not preserve information about the encoding, the
# -*- coding: XXX -*-
comment does not help here. It does, however, control how unicode
literals in the source are decoded. I suppose using unicode for
non-ascii literals plus the above coding comment is as close as you can
get to the desired effect.

With some more work you could probably automate string conversion like it is
done with quixote's htmltext. Not sure if that would be worth the effort,
though.

Peter

Jul 18 '05 #11

Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:
>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'

Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.
Use either unicode or str, but don't mix them. That should keep you out of
trouble.


Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim
Jul 18 '05 #12

"Jim Hefferon" <jh*******@smcvt.edu> wrote in message
news:54**************************@posting.google.c om...
Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


Maybe I can simplify it? The result has to be a single type, which will
be a unicode string if any of the items is a unicode string. Plain
ASCII decodes to unicode unambiguously, so there is no difficulty with
the concatenation. The 8-bit encodings are not unambiguous, so the
concatenation checks that any normal strings are, in fact, pure ASCII.
The decode is actually doing a validity check, not an encoding
conversion.

The only way the system could do a clean concatenation between
unicode and one of the 8-bit encodings is to know beforehand which
of the 8-bit encodings it is dealing with, and there is no way that it
currently has of knowing that.

The people who implemented unicode (in 2.0, I believe) seem to
have decided not to guess. That's in line with the "explicit is better
than implicit" principle.
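[Editor's note: the validity check John describes can be emulated in current Python 3, where mixing bytes and str is a hard TypeError instead. This sketch (the helper name is mine, not from the thread) reproduces Python 2's rule of decoding plain strings as ASCII during a join:]

```python
# Emulate Python 2's join rule in Python 3: byte strings are decoded
# with the ASCII codec, real text passes through unchanged.
def join_py2_style(items):
    return "".join(
        s.decode("ascii") if isinstance(s, bytes) else s for s in items
    )

# Pure-ASCII bytes mixed with text work, as they did in Python 2:
print(join_py2_style(["x", "y", "z", b"abc"]))  # xyzabc

# A high byte is refused, matching the UnicodeDecodeError in the thread:
try:
    join_py2_style(["x", b"abc\xaedef"])
except UnicodeDecodeError:
    print("byte 0xae refused by the ascii codec")
```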
Use either unicode or str, but don't mix them. That should keep you out of trouble.


Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.


Ah. The issue then is rather simple: what is the encoding of the normal
strings? I'd presume Latin-1. So simply run the list of strings through a
function that converts any normal string to unicode using the Latin-1
codec, and then they should concatenate fine.
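[Editor's note: a minimal version of the converter John suggests, restated for modern Python 3 where the filenames would arrive as bytes; the function name is my own, not from the thread:]

```python
def decode_all(items, encoding="latin-1"):
    """Decode any byte strings with the given codec; pass text through."""
    return [s.decode(encoding) if isinstance(s, bytes) else s for s in items]

r = ["x", "y", "z", b"abc\xaedef"]  # mixed text and Latin-1 bytes
k = "".join(decode_all(r))
print(repr(k))
```

Encoding the final result for the web is then one more explicit step, e.g. `k.encode("utf-8")`.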

As far as the web goes, I'd suggest you make sure you specify UTF-8
in both the HTTP headers and in a <meta> tag in the HTML header,
and make sure that what you write out is, indeed, UTF-8.

John Roth

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim

Jul 18 '05 #13

Jim Hefferon wrote:
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now?
Because you're mixing normal strings and Unicode strings. To do that,
it needs to convert the normal strings to Unicode, and to do that it has
to know what encoding you want.
As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


It's the process by which you turn an arbitrary string into a Unicode
string and back. When you're adding normal strings and Unicode strings,
you end up with a Unicode string, which means the normal strings have to
be implicitly converted. That's why you're getting the error.

Work with strings or Unicode strings, not a mixture, and you won't have
this problem.

--
__ Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
/ \ San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
\__/ She glanced at her watch ... It was 9:23.
-- James Clavell
Jul 18 '05 #14

Jim Hefferon wrote:
Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing
how to properly decode chr(174) or any other non-ascii character to
unicode:
>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


Perhaps another example will help in addition to the answers already given:

>>> 1 + 2.0
3.0

In the above 1 is converted to 1.0 before it can be added to 2.0, i. e. we
have

>>> float(1) + 2.0
3.0

In the same spirit

>>> u"a" + "b"
u'ab'

"b" is converted to unicode before u"a" and u"b" can be concatenated. The
same goes for string formatting:

>>> "a%s" % u"b"
u'ab'
>>> u"a%s" % "b"
u'ab'

The following might be the conversion function:

>>> def tounicode(s, encoding="ascii"):
...     return s.decode(encoding)
...
>>> u"a" + tounicode("b")
u'ab'

Of course it would fail with non-ascii characters in the string that shall
be converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:
>>> u"a" + tounicode(chr(174), "latin1")
u'a\xae'
>>> u"a" + tounicode(chr(174), "latin2")
u'a\u017d'

By the way, in the real conversion routine the encoding isn't hardcoded, see
sys.get/setdefaultencoding() for the details. Therefore you _could_ modify
site.py to assume e. g. latin1 as the encoding of 8 bit strings. The
practical benefit of that is limited as you cannot make assumptions about
machines not under your control and therefore are stuck with ascii as the
least common denominator for scripts meant to be portable - which brings us
back to:
Use either unicode or str, but don't mix them. That should keep you out
of trouble.


Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.


While details are often helpful to identify a problem that is different from
the poster's guess, unicode handling is pretty general, and it was rather
my post that was lacking clarity.

Peter

Jul 18 '05 #15

Peter Otten <__*******@web.de> wrote:
Of course it would fail with non-ascii characters in the string that shall
be converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:

Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.

I will write to the documentation person with the suggestion that the
documentation of .join(seq) at
http://docs.python.org/lib/string-methods.html#l2h-188 might be
updated from:
"Return a string which is the concatenation of the strings in the
sequence seq."
Use either unicode or str, but don't mix them. That should keep you out
of trouble.


Or make all conversions explicit with the str.decode()/unicode.encode()
methods.

Now I only have to figure out which codecs are available and
appropriate.
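[Editor's note: on Jim's closing point, whether a codec name is available can be checked with codecs.lookup, which also reveals the codec's canonical name; a sketch, not from the thread:]

```python
import codecs

# Probe a few codec names; unknown names raise LookupError.
for name in ("latin-1", "utf-8", "no-such-codec"):
    try:
        info = codecs.lookup(name)
        print(name, "->", info.name)  # e.g. latin-1 -> iso8859-1
    except LookupError:
        print(name, "is not a known codec")
```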
Thanks again,

Jim
Jul 18 '05 #16


"Jim Hefferon" <jh*******@smcvt.edu> wrote in message
news:54*************************@posting.google.co m...
Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.


Abstractly, byte strings and unicode strings are different types of beasts.
If you forget what you know about the CPython computer implementation and
linear computer memories, it makes little sense to combine them. The result
would have to be some currently nonexistent byte-unicode string.

Terry J. Reedy


Jul 18 '05 #17
