i18n: looking for expertise

klappnase

Hello all,

I am trying to internationalize my Tkinter program using gettext and
encountered various problems, so it looks like it's not a trivial
task.
After some "research" I made up a few rules for a concept that I hope
lets me avoid further encoding trouble, but I would feel more
confident if some of the experts here would have a look at the
thoughts I made so far and told me if I'm still going wrong somewhere
(BTW, the program is supposed to run on linux only). So here is what I
have so far:

1. use unicode instead of byte strings wherever possible. This can be
a little tricky, because in some situations I cannot know in advance
if a certain string is unicode or byte string; I wrote a helper module
for this which defines convenience methods for fail-safe
decoding/encoding of strings and a Tkinter.UnicodeVar class which I
use to convert user input to unicode on the fly (see the code below).

2. so I will have to call gettext.install() with unicode=1

3. make sure to NEVER mix unicode and byte strings within one
expression

4. in order to maintain code readability it's better to risk excess
decode/encode cycles than having one too few.

5. file operations seem to be delicate; at least I got an error when I
passed a filename that contains special characters as unicode to
os.access(), so I guess that whenever I do file operations
(os.remove(), shutil.copy() ...) the filename should be encoded back
into system encoding before; The filename manipulations by the os.path
methods seem to be simply string manipulations so encoding the
filenames doesn't seem to be necessary.

6. messages that are printed to stdout should be encoded first, too;
the same with strings I use to call external shell commands.

############ file UnicodeHandler.py ##################################
# -*- coding: iso-8859-1 -*-
import Tkinter
import sys
import locale
import codecs

def _find_codec(encoding):
    # return True if the requested codec is available, else return False
    try:
        codecs.lookup(encoding)
        return 1
    except LookupError:
        print 'Warning: codec %s not found' % encoding
        return 0

def _sysencoding():
    # try to guess the system default encoding
    try:
        enc = locale.getpreferredencoding().lower()
        if _find_codec(enc):
            print 'Setting locale to %s' % enc
            return enc
    except AttributeError:
        # our python is too old, try something else
        pass
    # getdefaultlocale() may report no encoding at all (None)
    enc = (locale.getdefaultlocale()[1] or '').lower()
    if enc and _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # the last try; sys.stdin.encoding may be missing or None
    enc = (getattr(sys.stdin, 'encoding', None) or '').lower()
    if enc and _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # aargh, nothing good found, fall back to latin1 and hope for the best
    print 'Warning: cannot find usable locale, using latin-1'
    return 'iso-8859-1'

sysencoding = _sysencoding()

def fsdecode(input, errors='strict'):
    '''Fail-safe decodes a string into unicode.'''
    if not isinstance(input, unicode):
        return unicode(input, sysencoding, errors)
    return input

def fsencode(input, errors='strict'):
    '''Fail-safe encodes a unicode string into system default encoding.'''
    if isinstance(input, unicode):
        return input.encode(sysencoding, errors)
    return input

class UnicodeVar(Tkinter.StringVar):
    def __init__(self, master=None, errors='strict'):
        Tkinter.StringVar.__init__(self, master)
        self.errors = errors
        # convert the value to unicode whenever the variable is written
        self.trace('w', self._str2unicode)

    def _str2unicode(self, *args):
        old = self.get()
        if not isinstance(old, unicode):
            new = fsdecode(old, self.errors)
            self.set(new)
######################################################################

So before I start to mess up all of my code, maybe someone can give me
a hint if I still forgot something I should keep in mind or if I am
completely wrong somewhere.

Thanks in advance

Michael

Jul 18 '05 #1
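
For readers on a current interpreter, the "decode at input, encode at output" idea behind fsdecode()/fsencode() above can be sketched in Python 3 terms. This is only an illustration, not the poster's module: the names to_text and to_bytes and the sample file name are made up for the example, and the encoding is passed in explicitly instead of being guessed from the locale.

```python
def to_text(value, encoding, errors="strict"):
    """Return value as text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


def to_bytes(value, encoding, errors="strict"):
    """Return value as bytes, encoding it only if it is text."""
    if isinstance(value, str):
        return value.encode(encoding, errors)
    return value


# Hypothetical round trip: decode at the input boundary, encode at output.
raw_name = b"caf\xe9"                      # Latin-1 bytes, as read from outside
name = to_text(raw_name, "iso-8859-1")     # text inside the program
backup = name + "_bak2"                    # plain text manipulation, no mixing
encoded = to_bytes(backup, "iso-8859-1")   # bytes again for the outside world
```

Both helpers pass through values that already have the requested type, which is what makes an "excess" decode/encode cycle harmless.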


Neil Hodgson

Michael:

> 5. file operations seem to be delicate; at least I got an error when I
> passed a filename that contains special characters as unicode to
> os.access(), so I guess that whenever I do file operations
> (os.remove(), shutil.copy() ...) the filename should be encoded back
> into system encoding before;

This can lead to failure on Windows when the true Unicode file name can
not be encoded in the current system encoding.

Neil

Jul 18 '05 #2

klappnase

"Neil Hodgson" <nh******@bigpond.net.au> wrote in message news:<6O********************@news-server.bigpond.net.au>...

> Michael:
>
>> 5. file operations seem to be delicate; at least I got an error when I
>> passed a filename that contains special characters as unicode to
>> os.access(), so I guess that whenever I do file operations
>> (os.remove(), shutil.copy() ...) the filename should be encoded back
>> into system encoding before;
>
> This can lead to failure on Windows when the true Unicode file name can
> not be encoded in the current system encoding.
>
> Neil

Like I said, it's only supposed to run on linux; anyway, is it likely
that problems will arise when filenames I have to handle have
basically three sources:

1. already existing files

2. automatically generated filenames, which result from adding an
ascii-only suffix to an existing filename (like xy --> xy_bak2)

3. filenames created by user input

?
If yes, how to avoid these?

Any hints are appreciated

Michael

Jul 18 '05 #3

Neil Hodgson

Michael:

> Like I said, it's only supposed to run on linux; anyway, is it likely
> that problems will arise when filenames I have to handle have
> basically three sources:
> ...
> 3. filenames created by user input

Have you worked out how you want to handle user input that is not
representable in the encoding? It is easy for users to input any characters
into a Unicode enabled UI either through invoking an input method or by
copying and pasting from another application or character chooser applet.

Neil

Jul 18 '05 #4

klappnase

"Neil Hodgson" <nh******@bigpond.net.au> wrote in message news:<Pq*****************@news-server.bigpond.net.au>...

> Michael:
>
>> Like I said, it's only supposed to run on linux; anyway, is it likely
>> that problems will arise when filenames I have to handle have
>> basically three sources:
>> ...
>> 3. filenames created by user input
>
> Have you worked out how you want to handle user input that is not
> representable in the encoding? It is easy for users to input any characters
> into a Unicode enabled UI either through invoking an input method or by
> copying and pasting from another application or character chooser applet.
>
> Neil

I must admit, no. I just couldn't imagine that anyone would really do this.

I guess I could add a test like (pseudo code):

    try:
        test = fsdecode(input) # convert to unicode
        test.encode(sysencoding)
    except UnicodeError:
        # show a message box with something like "Invalid file name"
        pass

Please tell me if you find any other possible gotchas.

Thanks so far

Michael

Jul 18 '05 #5
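
The check proposed in the pseudo code above can be written as a small predicate. This is a Python 3 sketch under assumptions: is_encodable is a hypothetical helper name, and 'iso-8859-15' is used only because it is the encoding that appears elsewhere in this thread.

```python
def is_encodable(name, encoding):
    """Return True if the text `name` can be represented in `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical user inputs checked against the encoding used in the thread.
ok_plain = is_encodable("xy_bak2", "iso-8859-15")       # ASCII only
ok_euro = is_encodable("\u20ac", "iso-8859-15")         # euro sign, part of Latin-9
bad_cjk = is_encodable("\u65e5\u672c", "iso-8859-15")   # not in Latin-9
```

A caller would show the "Invalid file name" message box whenever the predicate returns False, before any file operation is attempted.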

Serge Orlov

klappnase wrote:

> Hello all,
>
> I am trying to internationalize my Tkinter program using gettext and
> encountered various problems, so it looks like it's not a trivial
> task.

Considering that you decided to support old python versions, that's true.
Unicode support has gradually improved. If you choose to target an old
python version, you are basically dealing with years-old unicode
support.

> After some "research" I made up a few rules for a concept that I hope
> lets me avoid further encoding trouble, but I would feel more
> confident if some of the experts here would have a look at the
> thoughts I made so far and told me if I'm still going wrong somewhere
> (BTW, the program is supposed to run on linux only). So here is what
> I have so far:
>
> 1. use unicode instead of byte strings wherever possible. This can be
> a little tricky, because in some situations I cannot know in advance
> if a certain string is unicode or byte string; I wrote a helper
> module for this which defines convenience methods for fail-safe
> decoding/encoding of strings and a Tkinter.UnicodeVar class which I
> use to convert user input to unicode on the fly (see the code below).

I've never used tkinter, but I heard good things about it. Are you
sure it's not you who made it return a byte string sometimes?
Anyway, your idea is right: make IO libraries always return unicode.

> 3. make sure to NEVER mix unicode and byte strings within one
> expression

As a rule of thumb you should convert byte strings into unicode
strings at input and back to byte strings at output. This way
the core of your program will have to deal only with unicode
strings.

> 4. in order to maintain code readability it's better to risk excess
> decode/encode cycles than having one too few.

I don't think so. Either you need decode/encode or you don't.

> 5. file operations seem to be delicate;

You should be ready to handle unicode errors at file operations, just
as you handle, for example, the ENAMETOOLONG error. Any file system call
with a path argument can raise it; I don't think anything changed here
with the introduction of unicode. For example, access() can return 11
different error codes (on my linux system); consider a unicode error to
be the twelfth.

> at least I got an error when I
> passed a filename that contains special characters as unicode to
> os.access(), so I guess that whenever I do file operations
> (os.remove(), shutil.copy() ...) the filename should be encoded back
> into system encoding before;

I think python 2.3 handles that for you. (I'm not sure about the
version.) If you have to support older versions, you have to do it
yourself.

> 6. messages that are printed to stdout should be encoded first, too;
> the same with strings I use to call external shell commands.

If you use stdout as a dump device, just install the encoder at the
beginning of your program, something like

    sys.stdout = codecs.getwriter(...) ...
    sys.stderr = codecs.getwriter(...) ...

Serge.

Jul 18 '05 #6
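
What codecs.getwriter() does in the suggestion above can be seen without touching the real sys.stdout. The sketch below is a Python 3 illustration under assumptions: it wraps an in-memory io.BytesIO instead of a terminal stream, make_encoding_writer is a hypothetical helper name, and the encoding and error policy are chosen only for the example.

```python
import codecs
import io


def make_encoding_writer(byte_stream, encoding, errors="strict"):
    """Wrap a byte stream so that text written to it is encoded on the way out."""
    writer_class = codecs.getwriter(encoding)
    return writer_class(byte_stream, errors)


# In-memory stand-in for a byte-oriented stdout.
sink = io.BytesIO()
out = make_encoding_writer(sink, "iso-8859-15", errors="replace")
out.write("Gr\u00fc\u00dfe\n")   # representable: encoded to Latin-9 bytes
out.write("\u65e5\n")            # not representable: replaced by "?"
written = sink.getvalue()
```

With errors="replace" an unencodable character degrades to a question mark instead of raising, which may be preferable for pure diagnostic output.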

klappnase

"Serge Orlov" <Se*********@gmail.com> wrote in message news:<11*********************@o13g2000cwo.googlegroups.com>...

> I've never used tkinter, but I heard good things about it. Are you
> sure it's not you who made it to return byte string sometimes?

Yes, I used a Tkinter.StringVar to keep track of the contents of an
Entry widget; as long as I entered only ascii characters get() returns
a byte string, as soon as a special character is entered it returns
unicode.
Anyway, my UnicodeVar() class seems to be a handy way to avoid
problems here.

>> 4. in order to maintain code readability it's better to risk excess
>> decode/encode cycles than having one too few.
>
> I don't think so. Either you need decode/encode or you don't.

I use a bunch of modules that contain helper functions for frequently
repeated tasks. So it sometimes happens for example that I call one of
my module functions to convert user input into unicode and then call
the next module function to convert it back to byte string to start
some file operation; that's what I meant with "excess decode/encode
cycles". However, trying to avoid these ended in totally messing up
the code.

>> 5. file operations seem to be delicate;
>
> You should be ready to handle unicode errors at file operations as
> well as for example ENAMETOOLONG error. Any file system with path
> argument can throw it, I don't think anything changed here with
> introduction of unicode. For example access can return 11 (on
> my linux system) error codes, consider unicode error to be twelfth.
>
>> at least I got an error when I
>> passed a filename that contains special characters as unicode to
>> os.access(), so I guess that whenever I do file operations
>> (os.remove(), shutil.copy() ...) the filename should be encoded back
>> into system encoding before;
>
> I think python 2.3 handles that for you. (I'm not sure about the
> version)
> If you have to support older versions, you have to do it yourself.

I am using python-2.3.4 and get unicode errors:

>>> f = os.path.join(u'/home/pingu/phonoripper', u'\xc3\u20ac')
>>> os.path.isfile(f)
True
>>> os.access(f, os.R_OK)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position
24-25: ordinal not in range(128)
>>> f = f.encode('iso-8859-15')
>>> os.access(f, os.R_OK)
True

Thanks for the feedback

Michael

Jul 18 '05 #7

Martin v. Löwis

klappnase wrote:

> I am using python-2.3.4 and get unicode errors:
>
> >>> f = os.path.join(u'/home/pingu/phonoripper', u'\xc3\u20ac')
> >>> os.path.isfile(f)
> True
> >>> os.access(f, os.R_OK)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 24-25: ordinal not in range(128)

That's apparently a bug in os.access, which doesn't support Unicode file
names. As a work around, do

    import os, sys

    def access(name, mode, orig=os.access):
        try:
            return orig(name, mode)
        except UnicodeError:
            # encode only the name; mode is passed on to orig unchanged
            return orig(name.encode(sys.getfilesystemencoding()), mode)
    os.access = access

Apparently, access is used so rarely that nobody has noticed yet (or
didn't bother to report). os.path.isfile() builds on os.stat(), which
does support Unicode file names.

Regards,
Martin

Jul 18 '05 #8
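
The same retry-with-encoded-name pattern can be applied to any path-taking function. The sketch below is a Python 3 illustration under assumptions: retry_with_encoded_name is a hypothetical helper, and picky is a made-up stand-in that mimics the old os.access behaviour, since a current os.access accepts text file names and would never trigger the fallback.

```python
import sys


def retry_with_encoded_name(func, encoding=None):
    """Wrap func(name, *args) so a UnicodeError triggers a retry with bytes."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()

    def wrapper(name, *args):
        try:
            return func(name, *args)
        except UnicodeError:
            # Encode only the file name; all other arguments pass through.
            return func(name.encode(encoding), *args)

    return wrapper


def picky(name, mode):
    """Stand-in for a function that only copes with ASCII text names."""
    if isinstance(name, str):
        name.encode("ascii")  # raises UnicodeEncodeError for non-ASCII text
    return (name, mode)


wrapped = retry_with_encoded_name(picky, "iso-8859-15")
plain_result = wrapped("abc", 4)        # first call succeeds, no retry
retry_result = wrapped("\u00e4", 4)     # first call fails, bytes are retried
```

Taking the encoding as a parameter keeps the wrapper usable on interpreters where sys.getfilesystemencoding() is unavailable or returns something unexpected.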

klappnase

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message news:<42**********************@news.freenet.de>...

> That's apparently a bug in os.access, which doesn't support Unicode file
> names. As a work around, do
>
>     def access(name, mode, orig=os.access):
>         try:
>             return orig(name, mode)
>         except UnicodeError:
>             return orig(name.encode(sys.getfilesystemencoding()), mode)
>     os.access = access
>
> Apparently, access is used so rarely that nobody has noticed yet (or
> didn't bother to report). os.path.isfile() builds on os.stat(), which
> does support Unicode file names.
>
> Regards,
> Martin

Ah, thanks!

Now another question arises: you use sys.getfilesystemencoding() to
encode the file name, which looks like it's the preferred method.
However when I tried to find out how this works I got a little confused
again (from the library reference):

    getfilesystemencoding()

    Return the name of the encoding used to convert Unicode filenames into
    system file names, or None if the system default encoding is used. The
    result value depends on the operating system:
    (...)
    * On Unix, the encoding is the user's preference according to the
    result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET)
    failed.

On my box (mandrake-10.1) sys.getfilesystemencoding() returns
'ISO-8859-15', however:

>>> locale.nl_langinfo(locale.CODESET)
'ANSI_X3.4-1968'

Anyway, my app currently runs with python-2.2 and I would like to keep
it that way if possible, so I wonder which is the preferred
replacement for sys.getfilesystemencoding() on versions < 2.3, or in
particular, will the method I use to determine "sysencoding" I
described in my original post be safe or are there any traps I missed
(it's supposed to run on linux only)?

Thanks and best regards

Michael

Jul 18 '05 #9
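
The fallback chain asked about above amounts to "take the first candidate encoding that Python has a codec for". A Python 3 sketch, under assumptions: first_usable_codec is a hypothetical helper, and the two candidate sources shown are just examples of places an encoding name might come from.

```python
import codecs
import locale
import sys


def first_usable_codec(candidates, default="iso-8859-1"):
    """Return the first candidate name that Python has a codec for."""
    for name in candidates:
        if not name:
            continue  # a source reported None or an empty string
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # unknown to this interpreter, try the next source
        return name.lower()
    return default


# Example candidate sources; any of them may be None or unusable.
candidates = [locale.getpreferredencoding(False), sys.getfilesystemencoding()]
chosen = first_usable_codec(candidates)
```

Skipping empty values before the lookup matters because several of these sources can legitimately report no encoding at all.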

stewart.midwinter

Michael:

on my box (winXP SP2), sys.getfilesystemencoding() returns 'mbcs'.

If you post your revised solution to this unicode problem, I'd be
delighted to test it on Windows. I'm working on a Tkinter front-end
for Vivian deSmedt's rsync.py and would like to address the issue of
accented characters in folder names.

thanks
Stewart
stewart dot midwinter at gmail dot com

Jul 18 '05 #10
