Character Encodings and display of strings

I am trying to understand why, with non-Western strings, I sometimes get
a hex display and sometimes get the string printed as characters.

With my Python locale set to Japanese and with or without a # coding of
cp932 (this is Windows) at the top of the file, I read a list of
Japanese strings into a list, say, catlis.

With this code

    for item in catlis:
        print item
    print catlis
    print " ".join(catlis)

the first print (print item) displays Japanese text as characters.
The second print (print catlis) displays a list with the double-byte
characters in hex notation.
The third print (print " ".join(catlis)) prints a combined string of
Japanese characters properly.

According to the print documentation,
"If an object is not a string, it is first converted to a string using
the rules for string conversions"

but the result is different with a list of strings.

The hex display looks like this:
['id', '\x90\xab\x95\xca', '\x90\xb6\x94N\x8c\x8e\x93\xfa',
'\x8fA\x8aw\x94N\x90\x94', '\x90E\x8e\xed', '\x8b\x8b\x97^',
'\x8f\x89\x94C\x8b\x8b', '\x8d\xdd\x90\xd0\x8c\x8e\x90\x94',
'\x90E\x96\xb1\x8co\x97\xf0', '\x90l\x8e\xed']

and correctly shows the hex values of the Japanese characters.

Why are these different?

TIA,
Jon Peck

Nov 13 '06 #1
6 Replies


"JKPeck" wrote:
> According to the print documentation,
> "If an object is not a string, it is first converted to a string using
> the rules for string conversions"
>
> but the result is different with a list of strings.

a list is not a string, so it's converted to one using the standard list representation
rules -- which is to do repr() on all the items, and add brackets and commas as
necessary.
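
The rule above can be sketched as follows. This is only an illustration, written for Python 3 (where the thread's byte strings become bytes objects); the helper name show_items and the decision to decode with cp932 are assumptions made for the example, not something taken from the thread.

```python
# Bytes copied from the hex display in the question (cp932 data, per the poster).
catlis = [b"id", b"\x90\xab\x95\xca"]

# str() on a list builds "[...]" from repr() of each item, so the
# non-ASCII bytes come out as \x escapes.
list_text = str(catlis)
item_reprs = "[" + ", ".join(repr(item) for item in catlis) + "]"
print(list_text == item_reprs)

def show_items(items, encoding="cp932"):
    """Hypothetical helper: format the items without going through repr()."""
    return "[" + ", ".join(item.decode(encoding) for item in items) + "]"
```

Printing show_items(catlis) should display the characters themselves, provided the console can render the decoded text.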

for some more tips on printing, see:

http://effbot.org/zone/python-list.htm#printing

</F>

Nov 13 '06 #2

Thanks for the quick answer. I thought repr was involved here, but
when I use repr explicitly I get a notation where the backslashes are
escaped. I also thought that, with the encoding explicitly declared in
the source, repr would take that into account and use the
character form, but obviously it doesn't.
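
One plausible reason for the doubled backslashes (an assumption, since the thread does not show the exact session) is that the result of repr was itself displayed through repr, for example by echoing it at the interactive prompt. A small Python 3 sketch of that effect:

```python
raw = b"\x90\xab"

once = repr(raw)    # text with single backslashes: b'\x90\xab'
twice = repr(once)  # repr of that text: every backslash is doubled

print(once)
print(twice)
```

Printing the result of a single repr call shows each backslash once; only the second layer of repr escapes them again.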
Nov 13 '06 #3

JKPeck wrote:
> Thanks for the quick answer. I thought repr was involved here, but
> when I use repr explicitly I get a notation where the backslashes are
> escaped. I also thought that, with the encoding explicitly declared in
> the source, repr would take that into account and use the
> character form, but obviously it doesn't.

The encoding in the source has nothing to do with that. How should an
encoding (and possibly a gazillion different ones in gazillion other
sourcefiles of yours) influence the list repr code?

The encoding declared in the source file is used solely to parse unicode
literals correctly: the parser needs it to decode the bytes of the literal,
as they appear in the source code, into a unicode object.
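
A rough way to see this, sketched in Python 3 (where every string literal is a unicode literal): the same four bytes from the question are placed inside a string literal and compiled under two different coding declarations. The helper literal_value and the file name "<demo>" are made up for the illustration.

```python
# The same four bytes inside a string literal, compiled under two different
# coding declarations. compile() honors the declaration when given bytes.
LITERAL_BYTES = b"'\x90\xab\x95\xca'"

def literal_value(coding):
    """Hypothetical helper: compile a tiny module and return its string."""
    source = (b"# -*- coding: " + coding.encode("ascii") + b" -*-\n"
              b"s = " + LITERAL_BYTES + b"\n")
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]

as_cp932 = literal_value("cp932")        # two double-byte characters
as_latin1 = literal_value("iso-8859-1")  # four single-byte characters

print(len(as_cp932), len(as_latin1))
```

The declaration changes how the literal's bytes are decoded when the file is parsed; it plays no part in how repr later displays an object.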

Diez
Nov 13 '06 #4

It seemed to me that this sentence

"For many types, this function makes an attempt to return a string that
would yield an object with the same value when passed to eval()."

might mean that the encoding setting of the source file might influence
how repr represented the contents of the string. Nothing to do with
Unicode. If a source file could have a declared encoding of, say,
cp932 via the # coding comment, I thought there was a chance that eval
would respond to that, too.
Nov 13 '06 #5


JKPeck wrote:
> It seemed to me that this sentence
>
> For many types, this function makes an attempt to return a string that
> would yield an object with the same value when passed to eval().
>
> might mean that the encoding setting of the source file might influence
> how repr represented the contents of the string. Nothing to do with
> Unicode. If a source file could have a declared encoding of, say,
> cp932 via the # coding comment, I thought there was a chance that eval
> would respond to that, too.

Not a chance :) Encoding is a property of an input/output object
(console, web page, plain text file, MS Word file, etc.). Every
input/output object has specific rules that determine its encoding, and
there is no connection between the encoding of the source file
and that of any other input/output object.

repr escapes bytes 128..255 because it doesn't know where you're going
to output its result, so it uses the safest encoding: ASCII.
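
A short Python 3 sketch of that behaviour, using bytes from the question (in Python 3 the thread's byte strings correspond to bytes objects): the repr text is pure ASCII and, as the documentation sentence quoted above suggests, evaluates back to the same value.

```python
raw = b"\x90\xab\x95\xca"

text = repr(raw)

# Every character of the repr is plain ASCII, whatever the bytes were.
is_ascii = all(ord(ch) < 128 for ch in text)

# The escaped form evaluates back to the original bytes.
round_trip = eval(text)

print(text, is_ascii, round_trip == raw)
```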

-- Leo

Nov 13 '06 #6

It is possible to derive your own string class from the built-in one and
override what 'repr' does (and make it do whatever you want). Here's an
example of what I mean:

##### Sample #####

# -*- coding: iso-8859-1 -*-

# Special string class to override the default
# representation method. Main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord >= 128
class MsgStr(str):

    def __repr__(self):
        asciispace = ord(' ')
        # Quote with whichever quote character occurs less often in the
        # string; ties go to double quotes.
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"

        rep = [quotechar]
        for ch in self:
            if ord(ch) < asciispace:
                # Control characters keep their escaped form, e.g. \t
                rep.append(repr(str(ch)).strip("'"))
            elif ch == quotechar:
                # Escape the quote character used as the delimiter
                rep.append("\\" + ch)
            else:
                rep.append(ch)
        rep.append(quotechar)

        return "".join(rep)

if __name__ == "__main__":
    s = MsgStr("\tWürttemberg\"")
    print s
    print repr(s)
    print str(s)
    print repr(str(s))

Nov 14 '06 #7
