
Problem with national characters

I'm developing a routine that will parse user input. For simplicity, I'm
converting the entire input string to upper case. One of the words that
will have special meaning for the parser is the word "før", (before in
English). However, this word is not recognized. A test in the
interactive shell reveals this:

leif@balapapa leif $ python
Python 2.3.4 (#1, Feb 7 2005, 21:31:38)
[GCC 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'før'.upper()
'F\xf8R'
>>> 'FØR'
'F\xd8R'

In Windows, the result is slightly different, but no better:

C:\Python23>python
ActivePython 2.3.2 Build 232 (ActiveState Corp.) based on
Python 2.3.2 (#49, Nov 13 2003, 10:34:54) [MSC v.1200 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'før'.upper()
'F\x9bR'
>>> 'FØR'
'F\x9dR'


Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure which character set the command shell is using.
--
Leif Biberg Kristensen
http://solumslekt.org/
Jul 18 '05 #1
Is there a way around this problem?


Put

import sys
sys.setdefaultencoding('UTF-8')

into sitecustomize.py in the top level of your PYTHONPATH.

Jul 18 '05 #2
da**********@gmail.com wrote:
put

import sys
sys.setdefaultencoding('UTF-8')

into sitecustomize.py in the top level of your PYTHONPATH .


Uh ... it doesn't seem like I've got PYTHONPATH defined on my system in
the first place:

leif@balapapa leif $ env |grep -i python
PYTHONDOCS=/usr/share/doc/python-docs-2.3.4/html

I ran this small snippet that I found after a search on Gentoo-forums:
>>> import sys
>>> import string
>>> path_list = sys.path
>>> for eachPath in path_list:
...     print eachPath
...

/usr/lib/python23.zip
/usr/lib/python2.3
/usr/lib/python2.3/plat-linux2
/usr/lib/python2.3/lib-tk
/usr/lib/python2.3/lib-dynload
/usr/lib/portage/pym
/usr/lib/python2.3/site-packages
/usr/lib/python2.3/site-packages/gtk-2.0


What should my PYTHONPATH look like, and where do you suggest I put the
sitecustomize.py file?
--
Leif Biberg Kristensen
http://solumslekt.org/
Jul 18 '05 #3
> da**********@gmail.com wrote:
put

import sys
sys.setdefaultencoding('UTF-8')

into sitecustomize.py in the top level of your PYTHONPATH .


I figured it out, sort of. I now have a PYTHONPATH that points to my
home directory, and I followed your instructions. The first time I got an
error message due to a typo. I corrected it, and now Python starts
without an error message. But it didn't solve my problem with the
uppercase Ø at all. Is there something else I have to do?
--
Leif Biberg Kristensen
http://solumslekt.org/
Jul 18 '05 #4
Leif B. Kristensen wrote:
Is there something else I have to do?


Please forgive me for talking with myself here :-) I should have looked
up Unicode in "Learning Python" before I asked. This seems to work:
>>> u'før'.upper()
u'F\xd8R'
>>> u'FØR'
u'F\xd8R'
>>> 'FØR'
'F\xd8R'

So far, so good. Note that the Unicode representation of the uppercase
version is identical to the default. But when I try the builtin
function unicode(), weird things happen:
>>> s='FØR'
>>> s
'F\xd8R'
>>> unicode(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data

The ActivePython 2.3.2 doesn't even seem to understand the 'u' prefix.
So even if I can get this to work on my own Linux machine, it hardly
looks like a portable solution.

Seems like the "solution" is to keep away from letters above ASCII-127,
like we've done since the dawn of computing ...
--
Leif Biberg Kristensen
http://solumslekt.org/
Jul 18 '05 #5
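
The traceback in the post above appears consistent with the default encoding having been switched to UTF-8 by the sitecustomize.py change: the byte string holds Latin-1 data, and the lone byte 0xD8 is not valid UTF-8. Below is a minimal sketch of the same failure. It is written for Python 3, which is an assumption (the thread uses Python 2.3, where unicode(s) plays the role of bytes.decode), and the variable names are hypothetical.

```python
# Python 3 sketch; the bytes object stands in for the Python 2 byte string 'FØR'.
latin1_bytes = b"F\xd8R"  # 'FØR' encoded as ISO-8859-1 (Latin-1)

try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    # 0xD8 would start a two-byte UTF-8 sequence, but 'R' is not a
    # valid continuation byte, so the decode fails.
    decoded_as_utf8 = False

# Naming the encoding the data is really in succeeds.
text = latin1_bytes.decode("latin-1")

print(decoded_as_utf8)  # False
print(ascii(text))      # 'F\xd8R'
```

In other words, the decode step has to be told the encoding the bytes were actually written in; a process-wide default of UTF-8 does not help when the input is Latin-1.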
Leif B. Kristensen wrote:
Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure about which character set the command shell is using.


The unicode methods seem to handle it correctly. So you can decode your
strings to unicode, do the transformation, and encode the result back as
Latin-1.

print repr('før'.decode('latin-1').upper().encode('latin-1'))  # 'F\xd8R'
print repr('FØR'.decode('latin-1').encode('latin-1'))  # 'F\xd8R'

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
Jul 18 '05 #6
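
The round trip suggested above can be sketched in Python 3 as follows. This is an illustration under assumptions: the helper name upper_latin1 is hypothetical, and Python 3 bytes objects stand in for the Python 2 byte strings used in the thread.

```python
def upper_latin1(raw: bytes) -> bytes:
    """Upper-case Latin-1 encoded bytes by going through a text string."""
    text = raw.decode("latin-1")            # bytes -> text
    return text.upper().encode("latin-1")   # upper-case, then text -> bytes


lower = b"f\xf8r"  # 'før' encoded as Latin-1

ascii_only = lower.upper()       # bytes.upper() only changes ASCII letters
converted = upper_latin1(lower)  # the non-ASCII letter is converted too

print(ascii_only)  # b'F\xf8R'
print(converted)   # b'F\xd8R'
```

One caveat with this sketch: a few Latin-1 letters, such as ÿ, have an upper-case form outside Latin-1, so the final encode step would raise UnicodeEncodeError for input containing them.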
Leif B. Kristensen wrote:
Is there a way around this problem? My character set in Linux is
ISO-8859-1. In Windows 2000 it should be the equivalent Latin-1, though
I'm not sure about which character set the command shell is using.


You need to do locale.setlocale(locale.LC_ALL, "") to get
locale-specific upper-casing.

Notice that things are more difficult in the Windows terminal window,
as this uses an encoding different from the one that the system's
locale functions expect.

Regards,
Martin
Jul 18 '05 #7
"Martin v. Löwis" wrote:
You need to do locale.setlocale(locale.LC_ALL, "") to get
locale-specific upper-casing.
That makes a lot of sense. Thank you.

>>> 'før'.upper()
'F\xf8R'
>>> 'FØR'
'F\xd8R'
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'no_NO'
>>> 'før'.upper()
'F\xd8R'
>>> 'FØR'
'F\xd8R'

I must make a note of the LC_ALL variable in the installation README.
I for one have been running Gentoo Linux for two years without ever
setting the locale, but now I've finally gotten around to writing my
own /etc/env.d/02locale file :-)

Notice that things are more difficult in the Windows terminal window,
as this uses an encoding different from the one that the system's
locale functions expect.


The real input will come from a GUI or a browser interface, so the
Windows terminal problem isn't really an issue.
--
Leif Biberg Kristensen
http://solumslekt.org/
Jul 18 '05 #8
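
On the Windows terminal point: the byte values in the Windows session from the first post (\x9b and \x9d) match what code page 850, a common Western European console code page, uses for ø and Ø. That the console in question really used cp850 is an assumption; the Python 3 sketch below only shows that the numbers line up, and the variable names are hypothetical.

```python
# Compare how ø and Ø are encoded by a console code page (cp850, assumed
# here) and by the Windows "ANSI" code page cp1252, which agrees with
# Latin-1 for these two letters.
for ch in ("\u00f8", "\u00d8"):  # ø, Ø
    print(ascii(ch), ch.encode("cp850"), ch.encode("cp1252"), ch.encode("latin-1"))

console_lower = "\u00f8".encode("cp850")   # ø as the console would send it
console_upper = "\u00d8".encode("cp850")   # Ø as the console would send it
ansi_lower = "\u00f8".encode("cp1252")     # ø in the ANSI code page
```

Under that assumption, upper-casing console input would need the bytes decoded with the console's code page first, not with Latin-1.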
