Bytes IT Community

Why does the "".join(r) do this?

Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc
and def is a circle-R, while between the second two is a black oval
with a white question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)
Jul 18 '05 #1
16 Replies


Jim Hefferon wrote:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?


It can't just concatenate because your list contains other
items which are unicode strings. Python is attempting to convert
your strings to unicode strings to do the join, and it fails
because your strings contain characters which don't have
meaning to the default decoder.
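[Editor's note: in today's Python 3 the same rule survives, but the implicit decode is gone entirely: bytes and str never mix, and you decode explicitly. This sketch (variable names are mine, not from the thread) shows the ASCII decode that Python 2's join attempted behind the scenes:]

```python
# Python 3 sketch of the decode Python 2's join performed implicitly.
raw = b"abc\xaedef"  # a byte string with a "high" byte, like chr(174)

# An explicit codec succeeds:
text = raw.decode("latin-1")
print(repr(text))

# The default ASCII codec fails on byte 0xAE, just as in the traceback:
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii codec failed at position", exc.start)  # position 3
```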

-Peter
Jul 18 '05 #2


Jim> I'm building up a web page by stuffing an array and then doing
Jim> "".join(r) at the end. I intend to later encode it as 'latin1', so
Jim> I'd like it to just concatenate. While I can work around this
Jim> error, the reason for it escapes me.

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

Skip

Jul 18 '05 #3

Jim Hefferon wrote:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?


Let's reduce the problem to its simplest case:
>>> unichr(174) + chr(174)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'


Use either unicode or str, but don't mix them. That should keep you out of
trouble.

Peter

Jul 18 '05 #4

Skip Montanaro wrote:
Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.


This is bound to fail when the first non-ascii str occurs:
>>> u"".join(["a", "b"])
u'ab'
>>> u"".join(["a", chr(174)])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

Apart from that, Python automatically switches to unicode if the list
contains unicode items:

>>> "".join(["a", u"o"])
u'ao'

Peter

Jul 18 '05 #5

Jim Hefferon wrote:
Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc
and def is a circle-R, while between the second two is a black oval
with a white question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)


What about unichr() ?
#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"
print t
print(u"next: %s :there" % (t),)
print t
r=["x",'y',u'z']
r.append(t)
# k=u"".join(r)
k="".join(r)
print k
// moma
http://www.futuredesktop.org
Jul 18 '05 #7


Peter> Skip Montanaro wrote:
Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.


Peter> This is bound to fail when the first non-ascii str occurs:

...

Yeah I realized that later. I missed that he was appending non-ASCII
strings to his list. I thought he was only appending unicode objects and
ASCII strings (in which case what he was trying should have worked). Serves
me right for trying to respond with a head cold.

Skip

Jul 18 '05 #8

Peter Otten wrote:
Skip Montanaro wrote:

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:


Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)
--
C isn't that hard: void (*(*f[])())() defines f as an array of
unspecified size, of pointers to functions that return pointers to
functions that return void.
Jul 18 '05 #9

"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:c8**********@bagan.srce.hr...
Peter Otten wrote:
Skip Montanaro wrote:

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:


Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)


Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.

John Roth

--
C isn't that hard: void (*(*f[])())() defines f as an array of
unspecified size, of pointers to functions that return pointers to
functions that return void.

Jul 18 '05 #10

John Roth wrote:
"Ivan Voras" <ivoras@__geri.cc.fer.hr> wrote in message
news:c8**********@bagan.srce.hr...

Is there a way to change the default codec in a part of a program?
(Meaning that different parts of program deal with strings they know are
in a specific different code pages?)


Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.


As a str does not preserve information about the encoding, the
# -*- coding: XXX -*-
comment does not help here. It does, however, control how unicode
literals in the source are decoded. I suppose using unicode for
non-ascii literals plus the above coding comment is as close as you can
get to the desired effect.

With some more work you could probably automate string conversion like it is
done with quixote's htmltext. Not sure if that would be worth the effort,
though.

Peter

Jul 18 '05 #11

Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:
>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'

Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.
Use either unicode or str, but don't mix them. That should keep you out of
trouble.


Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim
Jul 18 '05 #12

"Jim Hefferon" <jh*******@smcvt.edu> wrote in message
news:54**************************@posting.google.c om...
Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


Maybe I can simplify it? The result has to be a single type, which will
be a unicode string if any of the items is a unicode string. Plain
ASCII decodes to unicode unambiguously, so there is no difficulty with
the concatenation. The 8-bit encodings are not unambiguous, so the
concatenation checks that any normal strings are, in fact, pure ASCII.
The decode is actually doing a validity check, not an encoding
conversion.

The only way the system could do a clean concatenation between
unicode and one of the 8-bit encodings is to know beforehand which
of the 8-bit encodings it is dealing with, and there is no way that it
currently has of knowing that.

The people who implemented unicode (in 2.0, I believe) seem to
have decided not to guess. That's in line with the "explicit is better
than implicit" principle.
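[Editor's note: the validity check John describes can be emulated in current Python 3, where mixing bytes and str is a hard TypeError instead. This sketch (the helper name is mine, not from the thread) reproduces Python 2's rule of decoding plain strings as ASCII during a join:]

```python
# Emulate Python 2's join rule in Python 3: byte strings are decoded
# with the ASCII codec, real text passes through unchanged.
def join_py2_style(items):
    return "".join(
        s.decode("ascii") if isinstance(s, bytes) else s for s in items
    )

# Pure-ASCII bytes mixed with text work, as they did in Python 2:
print(join_py2_style(["x", "y", "z", b"abc"]))  # xyzabc

# A high byte is refused, matching the UnicodeDecodeError in the thread:
try:
    join_py2_style(["x", b"abc\xaedef"])
except UnicodeDecodeError:
    print("byte 0xae refused by the ascii codec")
```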
Use either unicode or str, but don't mix them. That should keep you out of trouble.


Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.


Ah. The issue then is rather simple: what is the encoding of the normal
strings? I'd presume Latin-1. So simply run the list of strings through a
function that converts any normal string to unicode using the Latin-1
codec, and then they should concatenate fine.
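[Editor's note: a minimal version of the converter John suggests, restated for modern Python 3 where the filenames would arrive as bytes; the function name is my own, not from the thread:]

```python
def decode_all(items, encoding="latin-1"):
    """Decode any byte strings with the given codec; pass text through."""
    return [s.decode(encoding) if isinstance(s, bytes) else s for s in items]

r = ["x", "y", "z", b"abc\xaedef"]  # mixed text and Latin-1 bytes
k = "".join(decode_all(r))
print(repr(k))
```

Encoding the final result for the web is then one more explicit step, e.g. `k.encode("utf-8")`.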

As far as the web goes, I'd suggest you make sure you specify UTF-8
in both the HTTP headers and in a <meta> tag in the HTML header,
and make sure that what you write out is, indeed, UTF-8.

John Roth

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim

Jul 18 '05 #13

Jim Hefferon wrote:
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now?
Because you're mixing normal strings and Unicode strings. To do that,
it needs to convert the normal strings to Unicode, and to do that it has
to know what encoding you want.
As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


It's the process by which you turn an arbitrary string into a Unicode
string and back. When you're adding normal strings and Unicode strings,
you end up with a Unicode string, which means the normal strings have to
be implicitly converted. That's why you're getting the error.

Work with strings or Unicode strings, not a mixture, and you won't have
this problem.

--
__ Erik Max Francis && ma*@alcyone.com && http://www.alcyone.com/max/
/ \ San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
\__/ She glanced at her watch ... It was 9:23.
-- James Clavell
Jul 18 '05 #14

Jim Hefferon wrote:
Peter Otten <__*******@web.de> wrote
So why doesn't it just concatenate? Because there is no way of knowing
how to properly decode chr(174) or any other non-ascii character to
unicode:
>>> chr(174).decode("latin1")
u'\xae'
>>> chr(174).decode("latin2")
u'\u017d'
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.


Perhaps another example will help in addition to the answers already given:

>>> 1 + 2.0
3.0

In the above 1 is converted to 1.0 before it can be added to 2.0, i. e. we
have

>>> float(1) + 2.0
3.0

In the same spirit

>>> u"a" + "b"
u'ab'

"b" is converted to unicode before u"a" and u"b" can be concatenated. The
same goes for string formatting:

>>> "a%s" % u"b"
u'ab'
>>> u"a%s" % "b"
u'ab'

The following might be the conversion function:

>>> def tounicode(s, encoding="ascii"):
...     return s.decode(encoding)
...
>>> u"a" + tounicode("b")
u'ab'

Of course it would fail with non-ascii characters in the string that shall
be converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:
>>> u"a" + tounicode(chr(174), "latin1")
u'a\xae'
>>> u"a" + tounicode(chr(174), "latin2")
u'a\u017d'

By the way, in the real conversion routine the encoding isn't hardcoded, see
sys.get/setdefaultencoding() for the details. Therefore you _could_ modify
site.py to assume e. g. latin1 as the encoding of 8 bit strings. The
practical benefit of that is limited as you cannot make assumptions about
machines not under your control and therefore are stuck with ascii as the
least common denominator for scripts meant to be portable - which brings us
back to:
Use either unicode or str, but don't mix them. That should keep you out
of trouble.


Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.


While details are often helpful to identify a problem that is different from
the poster's guess, unicode handling is pretty general, and it was rather
my post that was lacking clarity.

Peter

Jul 18 '05 #15

Peter Otten <__*******@web.de> wrote:
Of course it would fail with non-ascii characters in the string that shall
be converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:

Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.

I will write to the documentation person with the suggestion that the
documentation of .join(seq) at
http://docs.python.org/lib/string-methods.html#l2h-188 might be
updated from:
"Return a string which is the concatenation of the strings in the
sequence seq."
Use either unicode or str, but don't mix them. That should keep you out
of trouble.


Or make all conversions explicit with the str.decode()/unicode.encode()
methods.

Now I only have to figure out which codecs are available and
appropriate.
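[Editor's note: on Jim's closing point, whether a codec name is available can be checked with codecs.lookup, which also reveals the codec's canonical name; a sketch, not from the thread:]

```python
import codecs

# Probe a few codec names; unknown names raise LookupError.
for name in ("latin-1", "utf-8", "no-such-codec"):
    try:
        info = codecs.lookup(name)
        print(name, "->", info.name)  # e.g. latin-1 -> iso8859-1
    except LookupError:
        print(name, "is not a known codec")
```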
Thanks again,

Jim
Jul 18 '05 #16


"Jim Hefferon" <jh*******@smcvt.edu> wrote in message
news:54*************************@posting.google.co m...
Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.


Abstractly, byte strings and unicode strings are different types of beasts.
If you forget what you know about the CPython computer implementation and
linear computer memories, it makes little sense to combine them. The result
would have to be some currently nonexistent byte-unicode string.

Terry J. Reedy


Jul 18 '05 #17
