Changing the default text codec

P: n/a
Sorry if my terminology is wrong, but I'm having intermittent
problems dealing with accented characters in Python. (Only from the 8
bit latin-1 character set, I think.)

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

It's particularly difficult for me to understand what is happening,
because Python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")... It crashes when it comes to the
e-acute.

*However* - if I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it to find anagrams of 'degré hello ma' then
it strips out the e-acute, thinking it's an invalid character, and
finds anagrams of the rest) :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1, I
think?) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e, for
example) *or* treating them as separate characters.

I can't work out how to change the default codec (no matter what the
locale) ?

Anyone able to help - or point me to a useful resource ?? (I've tried
Google - before you suggest it )

Fuzzy
Jul 18 '05 #1
5 Replies


P: n/a
Fuzzyman wrote:
> I can't work out how to change the default codec (no matter what the
> locale) ?
>
> Anyone able to help - or point me to a useful resource ?? (I've tried
> google - b4 u suggest it )


You can either explicitly convert your unicode strings:

unicodeword.encode("latin-1")

or try to modify your site.py from the default

encoding = "ascii"

to

encoding = "latin-1"
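A minimal sketch of the first option, assuming the text arrives as a Unicode string (as it seems to from the Tkinter entry box in this thread) and that latin-1 is the target codec. The names `word`, `encoded`, `euro` and `fallback` are made up for illustration. It is written so that it also runs on Python 3, where the explicit `u""` and `b""` prefixes keep the two string types apart:

```python
# -*- coding: utf-8 -*-
# Hypothetical input: a Unicode string such as a GUI widget might return.
word = u"degr\xe9"                      # "degré"; \xe9 is e-acute

# Encode explicitly instead of relying on the default (ASCII) codec.
encoded = word.encode("latin-1")        # one byte per character in latin-1

# A character outside latin-1 cannot be encoded, so pick an error policy.
euro = u"\u20ac"                        # euro sign, not part of latin-1
try:
    fallback = euro.encode("latin-1")
except UnicodeEncodeError:
    # "replace" substitutes a question mark for the unencodable character.
    fallback = euro.encode("latin-1", "replace")
```

The explicit call only affects the one string it is applied to, which is why it is usually less surprising than editing site.py for the whole installation.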

Peter
Jul 18 '05 #2

P: n/a
Fuzzyman wrote:
> Sorry if my terminology is wrong..... but I'm having intermittent
> problems dealing with accented characters in python. (Only from the 8
> bit latin-1 character set I think..)

I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

> For example - if I run my program from IDLE and give it the word
> 'degré' (containing e-acute) then I get the error :

What do you mean by "give it the word"? Through raw_input()? Through a file?

However you are getting this information, it seems to me that in IDLE
you are getting a Unicode object rather than an 8-bit string object.
Convert it to an 8-bit string:

mydata.encode("latin-1")

> if letter in self.valid_letters:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> 26: ordinal not in range(128)

Something looks suspicious here. I wouldn't expect self.valid_letters to
have a 0x83 character in it because I would expect it to be hard-coded
to ASCII in your program like:

valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."

On the other hand I wouldn't expect "letter" to have more than one
character so how could it have a problem at position 26?

> What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> think ?????) and then offer the user the choice of either mapping the
> accented characters to their nearest equivalent (e-acute to e for
> example) *or* treating them as separate characters.............

Why change the default codec rather than explicitly using the codec you
care about? If you want to work in the 8-bit world rather than the
Unicode world, just use the "encode" function on the Unicode object. If
you want to work in the Unicode world.
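As a rough sketch of the two behaviours the original poster asks for, done on Unicode strings rather than by changing the default codec: the standard `unicodedata` module can split an accented letter into a base letter plus a combining mark, which can then be dropped. The function name `strip_accents` and the sample word are assumptions for illustration only.

```python
import unicodedata

def strip_accents(text):
    """Map accented letters to their base letter, e.g. e-acute to e."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks that NFD splits off from each base letter.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = u"degr\xe9"                      # "degré"
folded = strip_accents(word)            # option 1: accents mapped away
kept = word                             # option 2: accents kept as distinct letters
```

Letters that have no decomposition (o-slash, for example) pass through unchanged, so a real program might still want a small lookup table of its own.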
> I can't work out how to change the default codec (no matter what the
> locale) ?


I'd advise against fixing the problem in that way. Convert data
appropriately when you bring it from the outside world into the Python
program and ignore the default codec.
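One way to read "convert data appropriately when you bring it from the outside world" is a small helper applied at every input point. This is only a sketch: the helper name `to_text`, the latin-1 default and the sample values are assumptions, not anything taken from the program under discussion.

```python
def to_text(value, encoding="latin-1"):
    """Return value as a Unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical inputs: bytes read from a word-list file, text from a GUI.
from_file = to_text(b"degr\xe9")
from_gui = to_text(u"degr\xe9")

# Both values are now the same type, so comparing them never involves
# the default codec.
same = (from_file == from_gui)
```

With everything converted on the way in, the rest of the program can compare and iterate over strings without caring where they came from.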

Paul Prescod
Jul 18 '05 #3

P: n/a
Paul Prescod <pa**@prescod.net> wrote in message news:<ma**************************************@python.org>...
> Fuzzyman wrote:
>> Sorry if my terminology is wrong..... but I'm having intermittent
>> problems dealing with accented characters in python. (Only from the 8
>> bit latin-1 character set I think..)
>
> I would say that if you get a 100% failure rate in IDLE and a 100%
> success rate from a console program then your problem is not
> intermittent but environment specific.


If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)

(in a few other test cases the success-fail pattern was the opposite
way round)

>> For example - if I run my program from IDLE and give it the word
>> 'degré' (containing e-acute) then I get the error :
>
> What do you mean "give it the word". Through raw_input()? Through a file?


Right - it is fetching the words from a Tkinter entry box using the
get() method.

> However you are getting this information, it seems to me that in IDLE
> you are getting a Unicode object rather than an 8-bit string object.
> Convert it to an 8-bit string:
>
> mydata.encode("latin-1")

Great - that might do the job.
I'll try it.
Thanks.

>> if letter in self.valid_letters:
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
>> 26: ordinal not in range(128)
>
> Something looks suspicious here. I wouldn't expect self.valid_letters to
> have a 0x83 character in it because I would expect it to be hard-coded
> to ASCII in your program like:

self.valid_letters *in fact* is string.lowercase - which I thought
included the 8 bit latin-1 letters as well. (The letters are converted
to lowercase by using the .lower() string method.)

> valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."
>
> On the other hand I wouldn't expect "letter" to have more than one
> character so how could it have a problem at position 26?

I'm iterating over the string.
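One possible explanation for "position 26", offered as a guess rather than a diagnosis: in Python 2, string.lowercase depends on the locale, and once a locale is set it can contain bytes above 0x7F after the 26 ASCII letters. Testing a Unicode letter against that byte string makes Python 2 decode the whole byte string with the default ASCII codec, which would fail at the first non-ASCII byte, index 26. The sketch below performs that decode explicitly; the contents of `valid_letters` are an assumed stand-in, not the real value from the program.

```python
# Assumed stand-in for a locale-dependent string.lowercase:
# 26 ASCII letters (indexes 0-25) followed by two non-ASCII bytes.
valid_letters = b"abcdefghijklmnopqrstuvwxyz\x83\x9a"

try:
    # Python 2 performs this decode implicitly when a unicode object
    # is tested against a byte string.
    valid_letters.decode("ascii")
except UnicodeDecodeError as err:
    failed_at = err.start               # index of the offending byte
    message = str(err)

print(failed_at)
print(message)
```

If this is what is happening, it would also explain why the failure depends on how the program is started, since the locale in effect decides what string.lowercase contains.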
>> What I'd like to do is switch by default to an 8 bit codec (latin-1 I
>> think ?????) and then offer the user the choice of either mapping the
>> accented characters to their nearest equivalent (e-acute to e for
>> example) *or* treating them as separate characters.............
>
> Why change the default codec rather than explicitly using the codec you
> care about? If you want to work in the 8-bit world rather than the
> Unicode world, just use the "encode" function on the Unicode object. If
> you want to work in the Unicode world.

Great - sounds good.
>> I can't work out how to change the default codec (no matter what the
>> locale) ?
>
> I'd advise against fixing the problem in that way. Convert data
> appropriately when you bring it from the outside world into the Python
> program and ignore the default codec.
>
> Paul Prescod


Thanks for your help.

Fuzzyman

http://www.voidspace.org.uk/atlantib...thonutils.html
Jul 18 '05 #4

P: n/a
Peter Otten <__*******@web.de> wrote in message news:<c1*************@news.t-online.com>...
> Fuzzyman wrote:
>> [snip..]
>> I can't work out how to change the default codec (no matter what the
>> locale) ?
>>
>> Anyone able to help - or point me to a useful resource ?? (I've tried
>> google - b4 u suggest it )
>
> You can either explicitly convert your unicode strings:
>
> unicodeword.encode("latin-1")


I'll try this.
Some of the errors said (something to the effect of) 'character not in
range(128)', which sounds like some standard 'methods' (or functions)
are only prepared to deal with the default 7-bit ascii. That could be
a bugger.

> or try to modify your site.py from the default
>
> encoding = "ascii"
>
> to
>
> encoding = "latin-1"
>
> Peter

Short of me actually looking... where is site.py :-)

Thanks

Fuzzyman
http://www.voidspace.org.uk/atlantib...thonutils.html

Jul 18 '05 #5

P: n/a
Fuzzyman wrote:
> Short of me actually looking... where is site.py :-)

>>> import site
>>> site.__file__
'/usr/local/lib/python2.3/site.py'
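Before editing site.py it may be worth checking what the interpreter currently reports. This is a small sketch using only the standard `sys` module; the variable names are made up, and the output is deliberately not shown because it depends on the Python version and installation.

```python
import sys

# The codec Python reports as its default; Python 2 uses it for
# implicit conversions between byte strings and unicode objects.
default_codec = sys.getdefaultencoding()
print(default_codec)

# site.py removes sys.setdefaultencoding() during startup in Python 2,
# and Python 3 does not provide it, so this is normally False.
can_change = hasattr(sys, "setdefaultencoding")
print(can_change)
```

That removal is the reason the setting lives in site.py in the first place: by the time a program's own code runs, the function to change it is normally gone.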

Jul 18 '05 #6
