Changing the default text codec

Fuzzyman

Sorry if my terminology is wrong..... but I'm having intermittent
problems dealing with accented characters in python. (Only from the 8
bit latin-1 character set I think..)

I've written an anagram finder that produces anagrams from a
dictionary of words. The user can load their own dictionary.

( http://www.voidspace.org.uk/atlantibots/nanagram.html )

It's particularly difficult for me to understand what is happening -
because python's behaviour *seems* intermittent.

For example - if I run my program from IDLE and give it the word
'degré' (containing e-acute) then I get the error :

Exception in Tkinter callback
Traceback (most recent call last):
[snip..]
  File "D:\Python Projects\Nanagram1.3\Nanagram-GUI.pyw", line 123, in prepare
    if letter in self.valid_letters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 26: ordinal not in range(128)
Traceback (most recent call last):

It is testing each character of the user's input to remove invalid
characters (like "-" and "'")... It crashes when it comes to the
e-acute.

*However* - if I run it by double clicking on the file then it appears
to work fine (e.g. if I ask it to find anagrams of 'degré hello ma' then
it strips out the e-acute, thinking it's an invalid character, and
finds anagrams of the rest) :

gleam holder
hallo merged

What I'd like to do is switch by default to an 8 bit codec (latin-1, I
think?) and then offer the user the choice of either mapping the
accented characters to their nearest equivalent (e-acute to e, for
example) *or* treating them as separate characters.

I can't work out how to change the default codec (no matter what the
locale) ?
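
The two options described above can be sketched as follows. This is only an illustration, assuming the input is already a Unicode string; `strip_accents` is a hypothetical helper and not part of Nanagram. It is written to run on a current Python interpreter, whereas the thread concerns Python 2.

```python
import unicodedata


def strip_accents(text):
    """Map accented letters to their unaccented base letters."""
    # NFKD splits e-acute into a plain 'e' followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base characters.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = u"degr\u00e9"           # 'degré', written with an escape sequence

folded = strip_accents(word)   # option 1: e-acute is treated as plain e
distinct = word                # option 2: e-acute stays a separate letter
```

Neither option needs a different default codec; both operate on Unicode text once the input has been decoded.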

Anyone able to help - or point me to a useful resource ?? (I've tried
google - b4 u suggest it )

Fuzzy

Jul 18 '05 #1

Peter Otten

Fuzzyman wrote:

> I can't work out how to change the default codec (no matter what the
> locale) ?
>
> Anyone able to help - or point me to a useful resource ?? (I've tried
> google - b4 u suggest it )

You can either explicitly convert your unicode strings:

unicodeword.encode("latin-1")

or try to modify your site.py from the default

encoding = "ascii"

to

encoding = "latin-1"
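
A small sketch of the first suggestion, explicit conversion, with the 'ascii' codec shown for contrast. The variable names are illustrative only, and the snippet is written to run on a current Python interpreter.

```python
word = u"degr\u00e9"                    # Unicode string containing e-acute

# Explicit conversion: latin-1 stores each of these characters in one byte.
latin1_bytes = word.encode("latin-1")

# The 'ascii' codec only covers code points 0..127, so e-acute is rejected.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```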

Peter

Jul 18 '05 #2

Paul Prescod

Fuzzyman wrote:

> Sorry if my terminology is wrong..... but I'm having intermittent
> problems dealing with accented characters in python. (Only from the 8
> bit latin-1 character set I think..)

I would say that if you get a 100% failure rate in IDLE and a 100%
success rate from a console program then your problem is not
intermittent but environment specific.

> For example - if I run my program from IDLE and give it the word
> 'degré' (containing e-acute) then I get the error :

What do you mean by "give it the word"? Through raw_input()? Through a file?

However you are getting this information, it seems to me that in IDLE
you are getting a Unicode object rather than an 8-bit string object.
Convert it to an 8-bit string:

mydata.encode("latin-1")
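
If the same call can return either type depending on the environment, one hedged way to apply this advice is a small helper that accepts both. `to_latin1` and the two sample values are hypothetical; on Python 2 the `bytes` name is simply an alias for the 8-bit `str` type.

```python
def to_latin1(value):
    """Return value as an 8-bit latin-1 string, whichever type came in."""
    if isinstance(value, bytes):       # already an 8-bit string
        return value
    return value.encode("latin-1")     # Unicode object: encode explicitly


from_idle = u"degr\u00e9"      # hypothetical: a Unicode object
from_console = b"degr\xe9"     # hypothetical: an 8-bit string
```

Both inputs then compare equal after conversion, so later code sees a single type.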

> if letter in self.valid_letters:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> 26: ordinal not in range(128)

Something looks suspicious here. I wouldn't expect self.valid_letters to
have a 0x83 character in it because I would expect it to be hard-coded
to ASCII in your program like:

valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."

On the other hand I wouldn't expect "letter" to have more than one
character so how could it have a problem at position 26?

> What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> think ?????) and then offer the user the choice of either mapping the
> accented characters to their nearest equivalent (e-acute to e for
> example) *or* treating them as separate characters.............

Why change the default codec rather than explicitly using the codec you
care about? If you want to work in the 8-bit world rather than the
Unicode world, just use the "encode" method on the Unicode object. If
you want to work in the Unicode world.

> I can't work out how to change the default codec (no matter what the
> locale) ?

I'd advise against fixing the problem in that way. Convert data
appropriately when you bring it from the outside world into the Python
program and ignore the default codec.
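
A minimal sketch of converting at the boundary, assuming latin-1 input. `clean_letters` and `VALID_LETTERS` are hypothetical names, and including e-acute among the valid letters is only one possible choice.

```python
# Hypothetical set of accepted letters: a-z plus e-acute.
VALID_LETTERS = u"abcdefghijklmnopqrstuvwxyz\u00e9"


def clean_letters(raw, encoding="latin-1"):
    """Decode input once, then filter it entirely as Unicode."""
    # Boundary in: 8-bit input is decoded; Unicode input passes through.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # Inside the program both operands are Unicode, so no implicit
    # conversion through the default codec takes place.
    return u"".join(ch for ch in text.lower() if ch in VALID_LETTERS)


cleaned = clean_letters(b"Degr\xe9 hello-ma")
# Boundary out: encode only when the data leaves the program.
outgoing = cleaned.encode("latin-1")
```

With this layout the value of the default codec never matters, because every conversion names its codec.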

Paul Prescod

Jul 18 '05 #3

Fuzzyman

Paul Prescod wrote:

> Fuzzyman wrote:
> > Sorry if my terminology is wrong..... but I'm having intermittent
> > problems dealing with accented characters in python. (Only from the 8
> > bit latin-1 character set I think..)
>
> I would say that if you get a 100% failure rate in IDLE and a 100%
> success rate from a console program then your problem is not
> intermittent but environment specific.

If that was the case then I'm sure you'd be right... good not to
quibble about terminology eh ;-)

(in a few other test cases the success-fail pattern was the opposite
way round)

> > For example - if I run my program from IDLE and give it the word
> > 'degré' (containing e-acute) then I get the error :
>
> What do you mean "give it the word". Through raw_input()? Through a file?

Right - it is fetching the words from a Tkinter entry box using the
get() method.

> However you are getting this information, it seems to me that in IDLE
> you are getting a Unicode object rather than an 8-bit string object.
> Convert it to an 8-bit string:
>
> mydata.encode("latin-1")

Great - that might do the job.
I'll try it.
Thanks.

> > if letter in self.valid_letters:
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position
> > 26: ordinal not in range(128)
>
> Something looks suspicious here. I wouldn't expect self.valid_letters to
> have a 0x83 character in it because I would expect it to be hard-coded
> to ASCII in your program like:

Self.valid_letters *in fact* is string.lowercase - which I thought
included the 8 bit latin-1 letters as well. (the letters are converted
to lowercase by using the .lower() string method )
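
One possible reading of "position 26", offered as an assumption that the thread does not confirm: in Python 2, string.lowercase is locale-dependent, so after a locale.setlocale() call it can contain non-ASCII bytes following the 26 ASCII letters. Testing a Unicode letter against such a byte string makes Python 2 decode the byte string with the 'ascii' codec, and that decode fails at the first non-ASCII byte. The sketch below simulates this with a hand-built byte string; the 0x83 byte comes from the traceback, everything else is hypothetical.

```python
import string

# Stand-in for a locale-extended string.lowercase: the 26 ASCII letters
# followed by one non-ASCII byte.
valid_letters = string.ascii_lowercase.encode("ascii") + b"\x83"

try:
    valid_letters.decode("ascii")      # the implicit step, made explicit
    failure = None
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec could not decode.
    failure = (exc.start, valid_letters[exc.start:exc.start + 1])
```

If this reading is right, the position refers to valid_letters and not to the single letter being tested.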

> valid_letters = "abcdefghijklmnopqrstuvwxyzABCDEF..."
>
> On the other hand I wouldn't expect "letter" to have more than one
> character so how could it have a problem at position 26?

I'm iterating over the string.

> > What I'd like to do is switch by default to an 8 bit codec (latin-1 I
> > think ?????) and then offer the user the choice of either mapping the
> > accented characters to their nearest equivalent (e-acute to e for
> > example) *or* treating them as separate characters.............
>
> Why change the default codec rather than explicitly using the codec you
> care about? If you want to work in the 8-bit world rather than the
> Unicode world, just use the "encode" function on the Unicode object. If
> you want to work in the Unicode world.

Great - sounds good.

> > I can't work out how to change the default codec (no matter what the
> > locale) ?
>
> I'd advise against fixing the problem in that way. Convert data
> appropriately when you bring it from the outside world into the Python
> program and ignore the default codec.
>
> Paul Prescod

Thanks for your help.

Fuzzyman

http://www.voidspace.org.uk/atlantib...thonutils.html

Jul 18 '05 #4

Fuzzyman

Peter Otten wrote:

> Fuzzyman wrote:
> > [snip..]
> > I can't work out how to change the default codec (no matter what the
> > locale) ?
> >
> > Anyone able to help - or point me to a useful resource ?? (I've tried
> > google - b4 u suggest it )
>
> You can either explicitly convert your unicode strings:
>
> unicodeword.encode("latin-1")

I'll try this.
Some of the errors said (something to the effect of) 'character not in
range(128)' which sounds like some standard 'methods' (or functions)
are only prepared to deal with the default 7-bit ascii. That could be
a bugger.
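
The "not in range(128)" wording comes from the 'ascii' codec itself, which any conversion falls back on when no codec is named and the default is ASCII. A short sketch, written for a current Python interpreter, that triggers the same message explicitly and then avoids it by naming the codec:

```python
data = b"degr\xe9"             # 8-bit latin-1 string containing e-acute

try:
    data.decode("ascii")       # the 7-bit codec rejects any byte >= 0x80
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

text = data.decode("latin-1")  # naming the codec avoids the error
```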

> or try to modify your site.py from the default
>
> encoding = "ascii"
>
> to
>
> encoding = "latin-1"

Short of me actually looking... where is site.py :-)

Thanks

Fuzzyman

http://www.voidspace.org.uk/atlantib...thonutils.html

Jul 18 '05 #5

Peter Otten

Fuzzyman wrote:

> Short of me actually looking... where is site.py :-)

>>> import site
>>> site.__file__
'/usr/local/lib/python2.3/site.py'
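
The same lookup as a script, together with a check of which default codec the interpreter is currently using. The path shown above is specific to that machine; the values printed here will differ between installations and Python versions.

```python
import site
import sys

site_path = site.__file__                  # location of this interpreter's site module
default_codec = sys.getdefaultencoding()   # 'ascii' on Python 2, 'utf-8' on Python 3

print(site_path)
print(default_codec)
```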

Jul 18 '05 #6
