i18n: looking for expertise

Hello all,

I am trying to internationalize my Tkinter program using gettext and
encountered various problems, so it looks like it's not a trivial
task.
After some "research" I made up a few rules for a concept that I hope
lets me avoid further encoding trouble, but I would feel more
confident if some of the experts here would have a look at the
thoughts I made so far and told me if I'm still going wrong somewhere
(BTW, the program is supposed to run on linux only). So here is what I
have so far:

1. use unicode instead of byte strings wherever possible. This can be
a little tricky, because in some situations I cannot know in advance
if a certain string is unicode or byte string; I wrote a helper module
for this which defines convenience methods for fail-safe
decoding/encoding of strings and a Tkinter.UnicodeVar class which I
use to convert user input to unicode on the fly (see the code below).

2. so I will have to call gettext.install() with unicode=1

3. make sure to NEVER mix unicode and byte strings within one
expression

4. in order to maintain code readability it's better to risk excess
decode/encode cycles than having one too few.

5. file operations seem to be delicate; at least I got an error when I
passed a filename that contains special characters as unicode to
os.access(), so I guess that whenever I do file operations
(os.remove(), shutil.copy() ...) the filename should be encoded back
into system encoding before; The filename manipulations by the os.path
methods seem to be simply string manipulations so encoding the
filenames doesn't seem to be necessary.

6. messages that are printed to stdout should be encoded first, too;
the same with strings I use to call external shell commands.
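
Rule 2 refers to Python 2, where gettext.install() accepted a unicode flag. As a rough sketch of the same idea in Python 3 syntax (an assumption, since the thread targets Python 2.2/2.3), translated messages always come back as text, so no flag is involved. The domain name "myapp" and the "locale" directory are hypothetical names used only for illustration.

```python
import gettext

# "myapp" and "locale" are hypothetical names used only for illustration.
# fallback=True returns a NullTranslations object when no catalog is found,
# so the program keeps working without any .mo files installed.
translation = gettext.translation("myapp", localedir="locale", fallback=True)
_ = translation.gettext

# In Python 3 the result is text (str); it is translated only if a catalog exists.
label = _("Open file")
```

Without an installed catalog the message is passed through unchanged, which makes it easy to add the .mo files later.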

############ file UnicodeHandler.py ##################################
# -*- coding: iso-8859-1 -*-
import Tkinter
import sys
import locale
import codecs

def _find_codec(encoding):
    # return True if the requested codec is available, else return False
    try:
        codecs.lookup(encoding)
        return 1
    except LookupError:
        print 'Warning: codec %s not found' % encoding
        return 0

def _sysencoding():
    # try to guess the system default encoding
    try:
        enc = locale.getpreferredencoding().lower()
        if _find_codec(enc):
            print 'Setting locale to %s' % enc
            return enc
    except AttributeError:
        # our python is too old, try something else
        pass
    enc = locale.getdefaultlocale()[1].lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # the last try
    enc = sys.stdin.encoding.lower()
    if _find_codec(enc):
        print 'Setting locale to %s' % enc
        return enc
    # aargh, nothing good found, fall back to latin1 and hope for the best
    print 'Warning: cannot find usable locale, using latin-1'
    return 'iso-8859-1'

sysencoding = _sysencoding()

def fsdecode(input, errors='strict'):
    '''Fail-safe decodes a string into unicode.'''
    if not isinstance(input, unicode):
        return unicode(input, sysencoding, errors)
    return input

def fsencode(input, errors='strict'):
    '''Fail-safe encodes a unicode string into system default encoding.'''
    if isinstance(input, unicode):
        return input.encode(sysencoding, errors)
    return input

class UnicodeVar(Tkinter.StringVar):
    def __init__(self, master=None, errors='strict'):
        Tkinter.StringVar.__init__(self, master)
        self.errors = errors
        self.trace('w', self._str2unicode)

    def _str2unicode(self, *args):
        old = self.get()
        if not isinstance(old, unicode):
            new = fsdecode(old, self.errors)
            self.set(new)
######################################################################

So before I start to mess up all of my code, maybe someone can give me
a hint if I still forgot something I should keep in mind or if I am
completely wrong somewhere.

Thanks in advance

Michael
Jul 18 '05 #1
Michael:
> 5. file operations seem to be delicate; at least I got an error when I
> passed a filename that contains special characters as unicode to
> os.access(), so I guess that whenever I do file operations
> (os.remove(), shutil.copy() ...) the filename should be encoded back
> into system encoding before;


This can lead to failure on Windows when the true Unicode file name can
not be encoded in the current system encoding.

Neil
Jul 18 '05 #2
"Neil Hodgson" <nh******@bigpond.net.au> wrote in message news:<6O********************@news-server.bigpond.net.au>...
> Michael:
> > 5. file operations seem to be delicate; at least I got an error when I
> > passed a filename that contains special characters as unicode to
> > os.access(), so I guess that whenever I do file operations
> > (os.remove(), shutil.copy() ...) the filename should be encoded back
> > into system encoding before;
>
> This can lead to failure on Windows when the true Unicode file name can
> not be encoded in the current system encoding.
>
> Neil


Like I said, it's only supposed to run on linux; anyway, is it likely
that problems will arise when filenames I have to handle have
basically three sources:

1. already existing files

2. automatically generated filenames, which result from adding an
ascii-only suffix to an existing filename (like xy --> xy_bak2)

3. filenames created by user input

?
If yes, how to avoid these?

Any hints are appreciated

Michael
Jul 18 '05 #3
Michael:
> Like I said, it's only supposed to run on linux; anyway, is it likely
> that problems will arise when filenames I have to handle have
> basically three sources:
> ...
> 3. filenames created by user input


Have you worked out how you want to handle user input that is not
representable in the encoding? It is easy for users to input any characters
into a Unicode enabled UI either through invoking an input method or by
copying and pasting from another application or character chooser applet.

Neil
Jul 18 '05 #4
"Neil Hodgson" <nh******@bigpond.net.au> wrote in message news:<Pq*****************@news-server.bigpond.net.au>...
> Michael:
> > Like I said, it's only supposed to run on linux; anyway, is it likely
> > that problems will arise when filenames I have to handle have
> > basically three sources:
> > ...
> > 3. filenames created by user input
>
> Have you worked out how you want to handle user input that is not
> representable in the encoding? It is easy for users to input any characters
> into a Unicode enabled UI either through invoking an input method or by
> copying and pasting from another application or character chooser applet.
>
> Neil


I must admit, no. I just couldn't imagine that someone would really do this.

I guess I could add a test like (pseudo code):

try:
    test = fsdecode(input) # convert to unicode
    test.encode(sysencoding)
except UnicodeError:
    # show a message box with something like "Invalid file name"
    pass
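
The check sketched above could be wrapped in a small helper. What follows is only an illustration in Python 3 syntax (the thread itself uses Python 2); the name is_representable is hypothetical, and the encodings are passed in explicitly rather than taken from the sysencoding variable of the original module.

```python
def is_representable(name: str, encoding: str) -> bool:
    """Return True if every character of `name` can be encoded in `encoding`."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The euro sign exists in ISO-8859-15 but not in ISO-8859-1,
# and a CJK character exists in neither.
euro_in_latin9 = is_representable("5\u20ac.txt", "iso-8859-15")
euro_in_latin1 = is_representable("5\u20ac.txt", "iso-8859-1")
cjk_in_latin9 = is_representable("\u4e2d.txt", "iso-8859-15")
```

A GUI could call such a helper before any file operation and show the "Invalid file name" message when it returns False.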

Please tell me if you find any other possible gotchas.

Thanks so far

Michael
Jul 18 '05 #5
klappnase wrote:
> Hello all,
>
> I am trying to internationalize my Tkinter program using gettext and
> encountered various problems, so it looks like it's not a trivial
> task.

Considering that you decided to support old Python versions, that's true.
Unicode support has gradually improved. If you choose to target an old
Python version, you're basically dealing with years-old unicode
support.

> After some "research" I made up a few rules for a concept that I hope
> lets me avoid further encoding trouble, but I would feel more
> confident if some of the experts here would have a look at the
> thoughts I made so far and told me if I'm still going wrong somewhere
> (BTW, the program is supposed to run on linux only). So here is what
> I have so far:
>
> 1. use unicode instead of byte strings wherever possible. This can be
> a little tricky, because in some situations I cannot know in advance
> if a certain string is unicode or byte string; I wrote a helper
> module for this which defines convenience methods for fail-safe
> decoding/encoding of strings and a Tkinter.UnicodeVar class which I
> use to convert user input to unicode on the fly (see the code below).

I've never used Tkinter, but I've heard good things about it. Are you
sure it's not you who made it return a byte string sometimes?
Anyway, your idea is right: make IO libraries always return unicode.

> 3. make sure to NEVER mix unicode and byte strings within one
> expression

As a rule of thumb you should convert byte strings into unicode
strings at input and back to byte strings at output. This way
the core of your program will have to deal only with unicode
strings.
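
The rule of thumb above (decode at input, encode at output) might look roughly like this. It is a minimal sketch in Python 3 syntax, not code from the thread; the function names and the fixed ENCODING value are assumptions made for illustration.

```python
ENCODING = "iso-8859-15"  # assumed here; a real program would detect it


def read_input(raw: bytes) -> str:
    """Input boundary: turn incoming bytes into text exactly once."""
    return raw.decode(ENCODING)


def add_backup_suffix(name: str) -> str:
    """Core logic: works on text only and never sees bytes."""
    return name + "_bak2"


def write_output(text: str) -> bytes:
    """Output boundary: turn text back into bytes exactly once."""
    return text.encode(ENCODING)


raw_name = "caf\u00e9".encode(ENCODING)  # bytes as they might arrive from outside
result = write_output(add_backup_suffix(read_input(raw_name)))
```
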

> 4. in order to maintain code readability it's better to risk excess
> decode/encode cycles than having one too few.

I don't think so. Either you need decode/encode or you don't.

> 5. file operations seem to be delicate;

You should be ready to handle unicode errors in file operations, just
as you would, for example, an ENAMETOOLONG error. Any file system call
with a path argument can raise it; I don't think anything changed here
with the introduction of unicode. For example, access can return 11
error codes (on my linux system); consider a unicode error to be the
twelfth.

> at least I got an error when I
> passed a filename that contains special characters as unicode to
> os.access(), so I guess that whenever I do file operations
> (os.remove(), shutil.copy() ...) the filename should be encoded back
> into system encoding before;

I think python 2.3 handles that for you. (I'm not sure about the
version.)
If you have to support older versions, you have to do it yourself.

> 6. messages that are printed to stdout should be encoded first, too;
> the same with strings I use to call external shell commands.

If you use stdout as a dump device, just install the encoder at the
beginning of your program, something like

sys.stdout = codecs.getwriter(...) ...
sys.stderr = codecs.getwriter(...) ...
Serge.
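
The elided arguments in the suggestion above are left as they are. As a rough illustration of how codecs.getwriter() behaves, the sketch below (Python 3 syntax, an assumption) wraps an in-memory byte stream instead of sys.stdout; the encoding and the errors="replace" policy are choices made for the example, not something the post specifies.

```python
import codecs
import io

# A BytesIO object stands in for a byte-oriented output stream.
raw = io.BytesIO()

# getwriter() returns a StreamWriter class; instantiating it wraps the stream.
writer = codecs.getwriter("iso-8859-15")(raw, errors="replace")

writer.write("Preis: 5 \u20ac\n")  # the euro sign is representable in ISO-8859-15
writer.write("\u4e2d\n")           # not representable, replaced by '?'

encoded = raw.getvalue()
```

With errors="strict" the second write would raise UnicodeEncodeError instead, which may be preferable when silent data loss is not acceptable.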

Jul 18 '05 #6
"Serge Orlov" <Se*********@gmail.com> wrote in message news:<11*********************@o13g2000cwo.googlegroups.com>...

> I've never used tkinter, but I heard good things about it. Are you
> sure it's not you who made it to return byte string sometimes?

Yes, I used a Tkinter.StringVar to keep track of the contents of an
Entry widget; as long as I entered only ascii characters get() returns
a byte string, as soon as a special character is entered it returns
unicode.
Anyway, my UnicodeVar() class seems to be a handy way to avoid
problems here.

> > 4. in order to maintain code readability it's better to risk excess
> > decode/encode cycles than having one too few.
>
> I don't think so. Either you need decode/encode or you don't.

I use a bunch of modules that contain helper functions for frequently
repeated tasks. So it sometimes happens for example that I call one of
my module functions to convert user input into unicode and then call
the next module function to convert it back to byte string to start
some file operation; that's what I meant with "excess decode/encode
cycles". However, trying to avoid these ended in totally messing up
the code.

> > 5. file operations seem to be delicate;
>
> You should be ready to handle unicode errors at file operations as
> well as for example ENAMETOOLONG error. Any file system with path
> argument can throw it, I don't think anything changed here with
> introduction of unicode. For example access can return 11 (on
> my linux system) error codes, consider unicode error to be twelfth.
>
> > at least I got an error when I
> > passed a filename that contains special characters as unicode to
> > os.access(), so I guess that whenever I do file operations
> > (os.remove(), shutil.copy() ...) the filename should be encoded back
> > into system encoding before;
>
> I think python 2.3 handles that for you. (I'm not sure about the
> version)
> If you have to support older versions, you have to do it yourself.


I am using python-2.3.4 and get unicode errors:

>>> f = os.path.join(u'/home/pingu/phonoripper', u'\xc3\u20ac')
>>> os.path.isfile(f)
True
>>> os.access(f, os.R_OK)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)
>>> f = f.encode('iso-8859-15')
>>> os.access(f, os.R_OK)
True


Thanks for the feedback

Michael
Jul 18 '05 #7
klappnase wrote:
> I am using python-2.3.4 and get unicode errors:
>
> >>> f = os.path.join(u'/home/pingu/phonoripper', u'\xc3\u20ac')
> >>> os.path.isfile(f)
> True
> >>> os.access(f, os.R_OK)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 24-25: ordinal not in range(128)


That's apparently a bug in os.access, which doesn't support Unicode file
names. As a work around, do

import os
import sys

def access(name, mode, orig=os.access):
    try:
        return orig(name, mode)
    except UnicodeError:
        # retry with the name encoded to the file system encoding
        return orig(name.encode(sys.getfilesystemencoding()), mode)
os.access = access

Apparently, access is used so rarely that nobody has noticed yet (or
didn't bother to report). os.path.isfile() builds on os.stat(), which
does support Unicode file names.
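
The try-then-encode fallback described above can be exercised without a buggy os.access at hand. The sketch below is in Python 3 syntax (an assumption; Python 3's own os.access accepts text directly), so legacy_access is a hypothetical stand-in that rejects non-ASCII text the way the Python 2.3 call did in this thread.

```python
import os
import sys

calls = []  # records the argument types that reach the stand-in


def legacy_access(name, mode):
    """Hypothetical stand-in for a call that rejects non-ASCII text."""
    calls.append(type(name).__name__)
    if isinstance(name, str):
        name.encode("ascii")  # raises UnicodeEncodeError for non-ASCII text
    return True


def access(name, mode, orig=legacy_access):
    """Try the call as-is; on a Unicode error, retry with an encoded name."""
    try:
        return orig(name, mode)
    except UnicodeError:
        encoded = name.encode(sys.getfilesystemencoding())
        return orig(encoded, mode)


ok_plain = access("notes.txt", os.R_OK)
ok_accented = access("caf\u00e9.txt", os.R_OK)
```

Note that mode is passed to the wrapped function, not to encode(); the ASCII name goes through on the first attempt, while the accented name triggers one retry with bytes.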

Regards,
Martin
Jul 18 '05 #8
"Martin v. Löwis" <ma****@v.loewis.de> wrote in message news:<42**********************@news.freenet.de>...
> That's apparently a bug in os.access, which doesn't support Unicode file
> names. As a work around, do
>
> def access(name, mode, orig=os.access):
>     try:
>         return orig(name, mode)
>     except UnicodeError:
>         return orig(name.encode(sys.getfilesystemencoding()), mode)
> os.access = access
>
> Apparently, access is used so rarely that nobody has noticed yet (or
> didn't bother to report). os.path.isfile() builds on os.stat(), which
> does support Unicode file names.
>
> Regards,
> Martin


Ah, thanks!

Now another question arises: you use sys.getfilesystemencoding() to
encode the file name, which looks like it's the preferred method.
However, when I tried to find out how this works I got a little
confused again (from the library reference):

getfilesystemencoding()

Return the name of the encoding used to convert Unicode filenames into
system file names, or None if the system default encoding is used. The
result value depends on the operating system:
(...)
* On Unix, the encoding is the user's preference according to the
result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET)
failed.

On my box (mandrake-10.1) sys.getfilesystemencoding() returns
'ISO-8859-15', however:

>>> locale.nl_langinfo(locale.CODESET)
'ANSI_X3.4-1968'

Anyway, my app currently runs with python-2.2 and I would like to keep
it that way if possible, so I wonder which is the preferred
replacement for sys.getfilesystemencoding() on versions < 2.3, or in
particular, will the method I use to determine "sysencoding" I
described in my original post be safe or are there any traps I missed
(it's supposed to run on linux only)?
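
One trap worth checking in a fallback chain like the one in the original post: locale.getdefaultlocale()[1] and sys.stdin.encoding can both be None, in which case calling .lower() on them raises AttributeError. The sketch below shows a None-tolerant variant in Python 3 syntax (an assumption, since the thread targets Python 2.2); the names codec_exists and guess_encoding are hypothetical.

```python
import codecs
import locale
import sys


def codec_exists(name):
    """Return True if `name` is a non-empty string naming a known codec."""
    if not name:
        return False
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def guess_encoding(candidates=None, default="iso-8859-1"):
    """Return the first usable candidate in lower case, else `default`."""
    if candidates is None:
        candidates = [
            locale.getpreferredencoding(False),
            getattr(sys.stdin, "encoding", None),  # stdin or its encoding may be None
        ]
    for name in candidates:
        if codec_exists(name):
            return name.lower()
    return default


picked = guess_encoding(["no-such-codec", None, "ISO-8859-15"])
fallback = guess_encoding([None, ""])
```

Passing the candidates in as a list keeps the detection order explicit and makes the function easy to test with made-up values.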

Thanks and best regards

Michael
Jul 18 '05 #9
Michael:

On my box (winXP SP2), sys.getfilesystemencoding() returns 'mbcs'.

If you post your revised solution to this unicode problem, I'd be
delighted to test it on Windows. I'm working on a Tkinter front-end
for Vivian deSmedt's rsync.py and would like to address the issue of
accented characters in folder names.

thanks
Stewart
stewart dot midwinter at gmail dot com

Jul 18 '05 #10
