Unicode support in Python

sonald

Hi,
I am using Python 2.4.1.

I need to pass Russian text into Python and validate it.
Can you please guide me on how to make my existing code support
Russian text?

Is there any module that can be used for Unicode support in Python?

In the case of decimal numbers, how do I handle a comma used as the
decimal point within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal

Oct 20 '06 #1


Fredrik Lundh

"sonald" wrote:

> I need to pass Russian text into Python and validate it.
> Can you please guide me on how to make my existing code support
> Russian text?
>
> Is there any module that can be used for Unicode support in Python?

Python has built-in Unicode support (which you would probably have noticed
if you'd looked "Unicode" up in the documentation index). for a list of tutorials
and other documentation, see

http://www.google.com/search?q=python+unicode

</F>

Oct 20 '06 #2

Fredrik Lundh

http://www.google.com/search?q=python+unicode

(and before anyone starts screaming about how they hate RTFM replies, look
at the search result)

</F>

Oct 20 '06 #3

sonald

Fredrik Lundh wrote:

> http://www.google.com/search?q=python+unicode
>
> (and before anyone starts screaming about how they hate RTFM replies, look
> at the search result)
>
> </F>

Thanks!! but i have already tried this...
and let me tell you what i am trying now...

I have added the following line in the script

# -*- coding: utf-8 -*-

I have also modified the site.py in ./Python24/Lib as
def setencoding():
    """Set the string encoding used by the Unicode implementation. The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

Now when I try to validate the data in a text file,
say abc.txt (saved with UTF-8 encoding), containing either English or
Russian text,

some junk character (box-like) is added as the first character.
What could be the reason for this,
and how do I handle it?

Oct 20 '06 #4

Diez B. Roggisch

sonald wrote:

> Fredrik Lundh wrote:
>
>> http://www.google.com/search?q=python+unicode
>>
>> (and before anyone starts screaming about how they hate RTFM replies, look
>> at the search result)
>>
>> </F>
>
> Thanks!! but i have already tried this...

Tried - might be. But you certainly didn't understand it. So I suggest
that you read it again.

> and let me tell you what i am trying now...
>
> I have added the following line in the script
>
> # -*- coding: utf-8 -*-

This will _only_ affect unicode literals inside the script itself -
nothing else! No files read, no files written, and additionally the path
of sun, earth and moon are unaffected as well - just in case you wondered.

This is an example of what is affected now:
--------
# -*- coding: utf-8 -*-
# this string is a byte string. it is created as such,
# regardless of the above encoding. instead, only
# what is in the bytes of the file itself is taken into account
some_string = "büchsenböller"

# this is a unicode literal (note the leading u).
# it will be _decoded_ using the above
# mentioned encoding. So make sure, your file is written in the
# proper encoding
some_unicode_object = u"büchsenböller"
---------
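
The difference between the two literals can be checked by measuring them. The following is a minimal sketch written for a current Python 3 interpreter, where the byte string of the example above corresponds to `bytes` and the unicode object to `str`; the variable names are illustrative and not part of the original post.

```python
# Python 3 sketch: text (str) versus its UTF-8 encoded bytes.
text = u"büchsenböller"            # unicode text
raw = text.encode("utf-8")         # byte string, as stored in a UTF-8 file

char_count = len(text)             # counts characters
byte_count = len(raw)              # counts bytes; ü and ö take two bytes each
round_trip = raw.decode("utf-8")   # decoding restores the original text
```

The two counts differ because each non-ASCII letter occupies more than one byte once encoded, which is why a byte string and a unicode object cannot be treated as interchangeable.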

> I have also modified the site.py in ./Python24/Lib as
> def setencoding():
>     """Set the string encoding used by the Unicode implementation. The
>     default is 'ascii', but if you're willing to experiment, you can
>     change this."""
>     encoding = "utf-8" # Default value set by _PyUnicode_Init()
>     if 0:
>         # Enable to support locale aware default string encodings.
>         import locale
>         loc = locale.getdefaultlocale()
>         if loc[1]:
>             encoding = loc[1]
>     if 0:
>         # Enable to switch off string to Unicode coercion and implicit
>         # Unicode to string conversion.
>         encoding = "undefined"
>     if encoding != "ascii":
>         # On Non-Unicode builds this will raise an AttributeError...
>         sys.setdefaultencoding(encoding) # Needs Python Unicode build !
>
> Now when I try to validate the data in the text file
> say abc.txt (saved as with utf-8 encoding) containing either english or
> russian text,
>
> some junk character (box like) is added as the first character
> what must be the reason for this?
> and how do I handle it?

You shouldn't tamper with the site-wide encoding: in the best case it
only masks errors you made, and it may well produce new ones.

And how do you think it would help you anyway? Python's unicode support
would be stupid, to say the least, if it required the installation to be
changed before dealing with files of different encodings - don't you think?

As you don't show us the code you actually use to read that file, I'm
down to guessing here, but if you just open it as a binary file with

content = open("test.txt").read()

there won't be any magic decoding happening.

What you need to do instead is this (if you happen to know that test.txt
is encoded in utf-8):

content = open("test.txt").read().decode("utf-8")

Then you have a unicode object. Now if you need that to be written to a
terminal (or wherever your "boxes" appear - guessing here too, no code,
you remember?), you need to make sure that

- you know the terminal's encoding

- you properly encode the unicode content to that encoding before
printing, as otherwise the default encoding will be used

So, in case your terminal uses utf-8, you do

print content.encode("utf-8")

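
The read, decode, encode sequence above can be tried end to end. This is a minimal sketch for a current Python 3 interpreter, not the poster's code: the temporary file and its Russian sample text are made up for illustration, and the fallback to utf-8 when the terminal encoding is unknown is an assumption.

```python
import io
import os
import sys
import tempfile

# Hypothetical sample file, created here so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "wb") as handle:
    handle.write(u"тест, счёт\n".encode("utf-8"))

# Step 1: read raw bytes. No decoding happens at this point.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Step 2: decode explicitly, because the file is known to be UTF-8.
content = raw.decode("utf-8")

# Step 3: encode explicitly for the destination. The terminal encoding is
# looked up; falling back to utf-8 is an assumption of this sketch.
terminal_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"
printable = content.encode(terminal_encoding, "replace")
```

On Python 3, `print(content)` performs the last step itself; the explicit `encode` call mirrors what has to be done by hand on Python 2.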
Diez

Oct 20 '06 #5

Fredrik Lundh

"sonald" wrote:

> I have added the following line in the script
>
> # -*- coding: utf-8 -*-

that's good.

> I have also modified the site.py

that's bad, because this means that your code won't work on standard
Python installations.

> Now when I try to validate the data in the text file
> say abc.txt (saved as with utf-8 encoding) containing either english or
> russian text,

what does the word "validate" mean here?

> some junk character (box like) is added as the first character
> what must be the reason for this?

what did you do to determine that there's a box-like character at the start
of the file?

can you post some code?

</F>

Oct 20 '06 #6

John Roth

sonald wrote:

> Hi,
> I am using Python 2.4.1.
>
> I need to pass Russian text into Python and validate it.
> Can you please guide me on how to make my existing code support
> Russian text?
>
> Is there any module that can be used for Unicode support in Python?
>
> In the case of decimal numbers, how do I handle a comma used as the
> decimal point within a number?
>
> Currently the existing code is working fine for English text.
> Please help.
>
> Thanks in advance.
>
> regards
> sonal

As both of the other responders have said, the
coding comment at the front only affects source
text; it has absolutely no effect at run time. In
particular, it's not even necessary to use it to
handle non-English languages as long as you
don't want to write literals in those languages.

What seems to be missing is the notion that
external files are _always_ byte files, and have to
be _explicitly_ decoded into unicode strings,
and then encoded back to whatever the external
encoding needs to be, each and every time you
read or write a file, or copy string data from
byte strings to unicode strings and back.
There is no good way of handling this implicitly:
you can't simply say "utf-8" or "iso-8859-whatever"
in one place and expect it to work.

You've got to specify the encoding on each and
every open, or else use the encode and decode
string methods. This is a great motivation for
eliminating duplication and centralizing your
code!
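
One way to centralize this is a pair of small helpers that every read and write goes through. The sketch below targets a current Python 3 interpreter and uses `io.open` with an explicit `encoding`; the helper names, the `FILE_ENCODING` constant and the sample records are hypothetical, and on Python 2.4 `codecs.open` would be the closest equivalent.

```python
import io
import os
import tempfile

FILE_ENCODING = "utf-8"  # assumption: all data files share this encoding


def read_lines(path, encoding=FILE_ENCODING):
    """Return the lines of a file as decoded text, without line endings."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip(u"\r\n") for line in handle]


def write_lines(path, lines, encoding=FILE_ENCODING):
    """Encode text lines and write them, one per line."""
    with io.open(path, "w", encoding=encoding) as handle:
        for line in lines:
            handle.write(line + u"\n")


# Round trip through a temporary file (hypothetical sample records).
records = [u"A001, тестовый счёт, 100000", u"A002, test account2, 500000"]
demo_path = os.path.join(tempfile.mkdtemp(), "acc.txt")
write_lines(demo_path, records)
restored = read_lines(demo_path)
```

With the encoding named in exactly one place, the rest of the program only ever sees decoded text.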

For your other question: the general words
are localization and locale. Look up locale in
the index. It's a strange subject which I don't
know much about, but that should get you
started.
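
For the comma question specifically, `locale.atof` is the locale-based route, but it depends on which locales are installed on the machine. As a locale-independent alternative, here is a minimal sketch that accepts only a comma as the decimal point. The function name is hypothetical, and the choices to ignore spaces (as thousands separators) and to reject a dot are assumptions, not requirements stated in the thread.

```python
import re
from decimal import Decimal

# Optional sign, digits, then optionally a comma followed by more digits.
_COMMA_DECIMAL = re.compile(r"^[+-]?[0-9]+(,[0-9]+)?$")


def parse_decimal_comma(text):
    """Parse a number that uses a comma as the decimal point.

    Returns a Decimal, or None when the text is not in that format.
    """
    # Drop surrounding whitespace and any spaces used as thousands separators.
    cleaned = text.strip().replace(u"\u00a0", u"").replace(u" ", u"")
    if not _COMMA_DECIMAL.match(cleaned):
        return None
    return Decimal(cleaned.replace(u",", u"."))
```

In a comma-separated file, such a value would have to be quoted (or the file would need a different delimiter) so that the decimal comma is not read as a field separator.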

John Roth

Oct 20 '06 #7

sonald

Fredrik Lundh wrote:

> what does the word "validate" mean here?

Let me explain our module.
We receive text files (with comma-separated values, in a
predefined format) from a third party.
For example, an account file comes as "abc.acc" (.acc is the extension for
account files in our code).
It must contain account_code, account_description, account_balance in
that order.

So the text file ("abc.acc") we receive, for 2 or more records,
will look like
A001, test account1, 100000
A002, test account2, 500000

We may have multiple .acc files.

Our job is to validate the incoming data on the basis of its datatype,
field number, etc. and copy all the error-free records to acc.txt.

For this, we use a schema as follows
----------------------------------------------------------------------------------------------------------
if account_flg == 1:
    start = time()

    # the input fields
    acct_schema = {
        0: Text('AccountCode', 50),
        1: Text('AccountDescription', 100),
        2: Text('AccountBalance', 50)
    }

    validate( schema = acct_schema,
              primary_keys = [acct_pk],
              infile = '../data/ACC/*.acc',
              outfile = '../data/acc.txt',
              update_freq = 10000)
----------------------------------------------------------------------------------------------------------
In core.py, we have defined a function validate, which checks the
datatypes and performs other validations.
All the erroneous records are copied to an error log file, and the
correct records are copied to a clean acc.txt file.
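
A length limit like the 50 or 100 in the schema behaves differently on byte strings and on decoded text, which matters once Russian data arrives. The sketch below illustrates this on a current Python 3 interpreter; `TextField` is a hypothetical stand-in, since the real `Text` class is not shown in the thread, and the sample value is made up.

```python
class TextField(object):
    """Hypothetical stand-in for the thread's Text(name, max_length) type."""

    def __init__(self, name, max_length):
        self.name = name
        self.max_length = max_length

    def validate(self, value):
        # value is expected to be decoded text, so len() counts characters.
        return 0 < len(value.strip()) <= self.max_length


field = TextField("AccountDescription", 10)
russian = u"тестовый"                       # 8 characters
as_bytes = russian.encode("utf-8")          # 16 bytes once encoded as UTF-8

fits_as_text = field.validate(russian)                  # checked per character
fits_as_bytes = len(as_bytes) <= field.max_length       # checked per byte
```

Whether the limits in the schema are meant as characters or bytes is not stated in the thread; if they are character limits, the fields need to be decoded before the check.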

The validate function is as given below...
---------------------------------------------------------------------------------------------------------------------------
def validate(infile, outfile, schema, primary_keys=[], foreign_keys=[],
             record_checks=[], buffer_size=0, update_freq=0):

    show("initializing ... ")

    # find matching input files
    all_files = glob.glob(infile)
    if not all_files:
        raise ValueError('No input files were found.')

    # initialize data structures
    freq = update_freq or DEFAULT_UPDATE
    input = fileinput.FileInput(all_files,
                                bufsize = buffer_size or DEFAULT_BUFFER)
    output = open(outfile, 'wb+')
    logs = {}
    for name in all_files:
        logs[name] = open(name + DEFAULT_SUFFIX, 'wb+')
        #logs[name] = open(name + DEFAULT_SUFFIX, 'a+')

    errors = []
    num_fields = len(schema)
    pk_length = range(len(primary_keys))
    fk_length = range(len(foreign_keys))
    rc_length = range(len(record_checks))

    # initialize the PKs and FKs with the given schema
    for idx in primary_keys:
        idx.setup(schema)
    for idx in foreign_keys:
        idx.setup(schema)

    # start processing: collect all lines which have errors
    for line in input:
        rec_num = input.lineno()
        if rec_num % freq == 0:
            show("processed %d records ... " % (rec_num))
            for idx in primary_keys:
                idx.flush()
            for idx in foreign_keys:
                idx.flush()

        if BLANK_LINE.match(line):
            continue

        try:
            data = csv.parse(line)

            # check number of fields
            if len(data) != num_fields:
                errors.append( (rec_num, LINE_ERROR,
                                'incorrect number of fields') )
                continue

            # check for well-formed fields
            fields_ok = True
            for i in range(num_fields):
                if not schema[i].validate(data[i]):
                    errors.append( (rec_num, FIELD_ERROR, i) )
                    fields_ok = False
                    break

            # check the PKs
            for i in pk_length:
                if fields_ok and not primary_keys[i].valid(rec_num, data):
                    errors.append( (rec_num, PK_ERROR, i) )
                    break

            # check the FKs
            for i in fk_length:
                if fields_ok and not foreign_keys[i].valid(rec_num, data):
                    #print 'here ---%s, rec_num : %d'%(data,rec_num)
                    errors.append( (rec_num, FK_ERROR, i) )
                    break

            # perform record-level checks
            for i in rc_length:
                if fields_ok and not record_checks[i](schema, data):
                    errors.append( (rec_num, REC_ERROR, i) )
                    break

        except fastcsv.Error, err:
            errors.append( (rec_num, LINE_ERROR, err.__str__()) )

    # finalize the indexes to check for any more errors
    for i in pk_length:
        error_list = primary_keys[i].finalize()
        primary_keys[i].save()
        if error_list:
            errors.extend( [ (rec_num, PK_ERROR, i)
                             for rec_num in error_list ] )

    for i in fk_length:
        error_list = foreign_keys[i].finalize()
        if error_list:
            errors.extend( [ (rec_num, FK_ERROR, i)
                             for rec_num in error_list ] )

    # sort the list of errors by the cumulative line number
    errors.sort( lambda l, r: cmp(l[0], r[0]) )

    show("saving output ... ")

    # reopen input and sort it into either the output file or error log file
    input = fileinput.FileInput(all_files,
                                bufsize = buffer_size or DEFAULT_BUFFER)
    error_list = iter(errors)
    count = input.lineno
    filename = input.filename
    line_no = input.filelineno
    try:
        line_num, reason, i = error_list.next()
    except StopIteration:
        line_num = -1
    for line in input:
        line = line + '\r\n'
        #print '%d,%d'%(line_num,count())
        if line_num == count():

            if reason == FIELD_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_FIELD % (schema[i].name), line))
            elif reason == LINE_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(), i, line))
            elif reason == PK_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_PK % (primary_keys[i].name), line))
            elif reason == FK_ERROR:
                #print 'Test FK %s, rec_num : %d, line : %s'%(foreign_keys[i].name,line_no(),line)
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_FK % (foreign_keys[i].name), line))
            elif reason == REC_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                    INVALID_REC % (record_checks[i].__doc__), line))
            else:
                raise RuntimeError("shouldn't reach here")

            try:
                #print 'CURRENT ITERATION, line_num : %d, line : %s'%(line_num,line)
                line_num1 = line_num
                line_num, reason, i = error_list.next()
                if line_num1 == line_num :
                    line_num, reason, i = error_list.next()

                #print 'FOR NEXT ITERATION, line_num : %d, line : %s'%(line_num,line)

            except StopIteration:
                line_num = -1
            continue

        if not BLANK_LINE.match(line):
            output.write(line)

    output.close()
    for f in logs.values():
        f.close()
-----------------------------------------------------------------------------------------------------------------------------

Now when I open the error log file, it contains the error message for
each erroneous record, along with the original record copied from the
*.acc file.
This record is preceded by a box-like character.

Do you want me to post the complete code, just in case?
It might help you understand my problem better.
Please let me know soon.
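
One possible cause of a stray box at the start of a UTF-8 file, which nothing in this thread confirms, is a byte order mark (BOM): some Windows editors, Notepad among them, write the three bytes EF BB BF at the start of a file saved as UTF-8. A minimal sketch for checking and removing it, written for a current Python 3 interpreter; the helper name and the sample record are hypothetical.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 byte order mark, if any."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical first line of an .acc file that was saved with a BOM.
sample = codecs.BOM_UTF8 + u"A001, тест, 100000\r\n".encode("utf-8")

has_bom = sample.startswith(codecs.BOM_UTF8)
first_line = strip_utf8_bom(sample).decode("utf-8")

# The "utf-8-sig" codec skips a leading BOM while decoding.
same_line = sample.decode("utf-8-sig")
```

If `has_bom` turns out to be false for the real files, the box has some other origin and the record as written to the log would need to be inspected byte by byte.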

Oct 25 '06 #8

sonald

Hi,
Can you please tell me if there is any package or class that I can import
for internationalization, or Unicode support?

This module is just a small part of our application, and we are not
really supposed to alter the code.
We do not have anybody here to help us with Python, and we are
supposed to just try and understand the program. Today I am in a
position where I can fix the bugs arising from the code, but cannot
really try something like internationalization on my own. Can you help?
Do you want me to post the complete code for your reference?
Please let me know asap.

Oct 25 '06 #9
