Bytes IT Community

Unicode support in python

Hi,
I am using python2.4.1

I need to pass russian text into python and validate the same.
Can you please guide me on how to make my existing code support
russian text?

Is there any module that can be used for unicode support in python?

In case of decimal numbers, how do I handle a comma as the decimal
point within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal

Oct 20 '06 #1
8 Replies


"sonald" wrote:
I need to pass russian text into python and validate the same.
Can you please guide me on how to make my existing code support
russian text?

Is there any module that can be used for unicode support in python?
Python has built-in Unicode support (which you would probably have noticed
if you'd looked "Unicode" up in the documentation index). For a list of tutorials
and other documentation, see

http://www.google.com/search?q=python+unicode

</F>

Oct 20 '06 #2

http://www.google.com/search?q=python+unicode

(and before anyone starts screaming about how they hate RTFM replies, look
at the search result)

</F>

Oct 20 '06 #3


Fredrik Lundh wrote:
http://www.google.com/search?q=python+unicode

(and before anyone starts screaming about how they hate RTFM replies, look
at the search result)

</F>
Thanks!! But I have already tried this...
Let me tell you what I am trying now...

I have added the following line in the script

# -*- coding: utf-8 -*-

I have also modified the site.py in ./Python24/Lib as
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8"      # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)    # Needs Python Unicode build !

Now when I try to validate the data in the text file
say abc.txt (saved as with utf-8 encoding) containing either english or
russian text,

some junk character (box like) is added as the first character
what must be the reason for this?
and how do I handle it?

Oct 20 '06 #4

sonald wrote:
Fredrik Lundh wrote:
>> http://www.google.com/search?q=python+unicode
(and before anyone starts screaming about how they hate RTFM replies, look
at the search result)

</F>
Thanks!! but i have already tried this...
Tried it, maybe - but you certainly didn't understand it. So I suggest
that you read it again.
and let me tell you what i am trying now...

I have added the following line in the script

# -*- coding: utf-8 -*-
This will _only_ affect unicode literals inside the script itself -
nothing else! No files read, no files written, and the paths of the
sun, earth and moon are unaffected as well - just in case you wondered.

This is an example of what is affected now:
--------
# -*- coding: utf-8 -*-
# this string is a byte string. it is created as such,
# regardless of the above encoding. instead, only
# what is in the bytes of the file itself is taken into account
some_string = "büchsenböller"

# this is a unicode literal (note the leading u).
# it will be _decoded_ using the above
# mentioned encoding. So make sure, your file is written in the
# proper encoding
some_unicode_object = u"büchsenböller"
---------

I have also modified the site.py in ./Python24/Lib as
def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "utf-8"      # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding)    # Needs Python Unicode build !

Now when I try to validate the data in the text file
say abc.txt (saved as with utf-8 encoding) containing either english or
russian text,

some junk character (box like) is added as the first character
what must be the reason for this?
and how do I handle it?
You shouldn't tamper with the site-wide encoding, as this will at best
mask errors you've made, and at worst produce new ones.

And what do you think it would help you anyway? Python's unicode support
would be stupid, to say the least, if it required the installation to be
changed before dealing with files of different encodings - don't you think?

As you don't show us the code you actually use to read that file, I'm
down to guessing here, but if you just open it as binary file with

content = open("test.txt").read()

there won't be any magic decoding happening.

What you need to do instead is this (if you happen to know that test.txt
is encoded in utf-8):

content = open("test.txt").read().decode("utf-8")

Then you have a unicode object. Now if you need that to be written to a
terminal (or wherever your "boxes" appear - guessing here too, no code,
you remember?), you need to make sure that

- you know the terminal's encoding

- you properly encode the unicode content to that encoding before
printing, as otherwise the default encoding will be used

So, in case your terminal uses utf-8, you do

print content.encode("utf-8")
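Putting the two steps together, here is a minimal, runnable sketch of the explicit decode-on-read / encode-on-write round trip (the file names and the Russian sample line are made up for illustration):

```python
# -*- coding: utf-8 -*-
# Sketch of explicit decoding/encoding; "in.txt"/"out.txt" are
# hypothetical names standing in for your .acc files.

# create a small UTF-8 sample file to read back
open("in.txt", "wb").write(u"A001, тест, 100000\n".encode("utf-8"))

# read raw bytes and *explicitly* decode them into a unicode object
text = open("in.txt", "rb").read().decode("utf-8")

# ... validate / manipulate 'text' as unicode here ...

# *explicitly* encode back to bytes before writing the result out
open("out.txt", "wb").write(text.encode("utf-8"))
```

Note that nothing here depends on the default encoding or on any change to site.py: the encoding is named at each decode/encode step.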
Diez
Oct 20 '06 #5

"sonald" wrote:
I have added the following line in the script

# -*- coding: utf-8 -*-
that's good.
I have also modified the site.py
that's bad, because this means that your code won't work on standard
Python installations.
Now when I try to validate the data in the text file
say abc.txt (saved as with utf-8 encoding) containing either english or
russian text,
what does the word "validate" mean here?
some junk character (box like) is added as the first character
what must be the reason for this?
what did you do to determine that there's a box-like character at the start
of the file?

can you post some code?

</F>

Oct 20 '06 #6


sonald wrote:
Hi,
I am using python2.4.1

I need to pass russian text into python and validate the same.
Can you please guide me on how to make my existing code support
russian text?

Is there any module that can be used for unicode support in python?

In case of decimal numbers, how do I handle a comma as the decimal
point within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal
As both of the other responders have said, the
coding comment at the front only affects source
text; it has absolutely no effect at run time. In
particular, it's not even necessary to use it to
handle non-English languages as long as you
don't want to write literals in those languages.

What seems to be missing is the notion that
external files are _always_ byte files, and have to
be _explicitly_ decoded into unicode strings,
and then encoded back to whatever the external
encoding needs to be, each and every time you
read or write a file, or copy string data from
byte strings to unicode strings and back.
There is no good way of handling this implicitly:
you can't simply say "utf-8" or "iso-8859-whatever"
in one place and expect it to work.

You've got to specify the encoding on each and
every open, or else use the encode and decode
string methods. This is a great motivation for
eliminating duplication and centralizing your
code!
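As a sketch of the "specify the encoding on each open" approach: the codecs module, already in the standard library in 2.4, wraps a file so that decoding and encoding happen automatically, with the encoding stated once at open time (the file name here is made up):

```python
# -*- coding: utf-8 -*-
import codecs

# write: unicode objects in, UTF-8 bytes out,
# encoding named once when the file is opened
f = codecs.open("russian.txt", "w", encoding="utf-8")
f.write(u"счёт: 100000\n")
f.close()

# read: UTF-8 bytes in, unicode objects out - no manual .decode()
f = codecs.open("russian.txt", "r", encoding="utf-8")
line = f.read()
f.close()
```

This keeps the decode/encode in one place instead of scattering it over every read and write.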

For your other question: the general words
are localization and locale. Look up locale in
the index. It's a strange subject which I don't
know much about, but that should get you
started.
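For the decimal-comma part of the original question, one crude but locale-independent sketch (the helper name is made up; locale.atof is the locale-aware alternative) is to normalize the comma before converting:

```python
def parse_decimal(s):
    """Parse a "1,5"-style number with a comma as the decimal point.
    Assumes no thousands separators, which would otherwise make the
    comma ambiguous."""
    return float(s.strip().replace(",", "."))
```

The locale module route would instead call locale.setlocale() for a Russian locale and then locale.atof(), but that only works if that locale is installed on the machine.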

John Roth

Oct 20 '06 #7

Fredrik Lundh wrote:
>
what does the word "validate" mean here?
Let me explain our module.
We receive text files (with comma separated values, as per some
predefined format) from a third party.
For example, an account file comes as "abc.acc" {.acc is the extension
for an account file as per our code}.
It must contain account_code, account_description, account_balance in
that order.

So the text file ("abc.acc") we receive, for 2 or more records, will
look like
A001, test account1, 100000
A002, test account2, 500000

We may have multiple .acc files

Our job is to validate the incoming data on the basis of its datatype,
field number, etc., and copy all the error-free records into acc.txt.

For this, we use a schema as follows:
----------------------------------------------------------------------------------------------------------
if account_flg == 1:
    start = time()

    # the input fields
    acct_schema = {
        0: Text('AccountCode', 50),
        1: Text('AccountDescription', 100),
        2: Text('AccountBalance', 50)
    }

    validate(schema = acct_schema,
             primary_keys = [acct_pk],
             infile = '../data/ACC/*.acc',
             outfile = '../data/acc.txt',
             update_freq = 10000)
----------------------------------------------------------------------------------------------------------
In core.py, we have defined a function validate, which checks the
datatypes and performs other validations.
All the erroneous records are copied to an error log file, and the
correct records are copied to a clean acc.txt file.

The validate function is as given below...
---------------------------------------------------------------------------------------------------------------------------
def validate(infile, outfile, schema, primary_keys=[], foreign_keys=[],
             record_checks=[], buffer_size=0, update_freq=0):

    show("initializing ... ")

    # find matching input files
    all_files = glob.glob(infile)
    if not all_files:
        raise ValueError('No input files were found.')

    # initialize data structures
    freq = update_freq or DEFAULT_UPDATE
    input = fileinput.FileInput(all_files,
                                bufsize=buffer_size or DEFAULT_BUFFER)
    output = open(outfile, 'wb+')
    logs = {}
    for name in all_files:
        logs[name] = open(name + DEFAULT_SUFFIX, 'wb+')
        #logs[name] = open(name + DEFAULT_SUFFIX, 'a+')

    errors = []
    num_fields = len(schema)
    pk_length = range(len(primary_keys))
    fk_length = range(len(foreign_keys))
    rc_length = range(len(record_checks))

    # initialize the PKs and FKs with the given schema
    for idx in primary_keys:
        idx.setup(schema)
    for idx in foreign_keys:
        idx.setup(schema)

    # start processing: collect all lines which have errors
    for line in input:
        rec_num = input.lineno()
        if rec_num % freq == 0:
            show("processed %d records ... " % (rec_num))
            for idx in primary_keys:
                idx.flush()
            for idx in foreign_keys:
                idx.flush()

        if BLANK_LINE.match(line):
            continue

        try:
            data = csv.parse(line)

            # check number of fields
            if len(data) != num_fields:
                errors.append((rec_num, LINE_ERROR,
                               'incorrect number of fields'))
                continue

            # check for well-formed fields
            fields_ok = True
            for i in range(num_fields):
                if not schema[i].validate(data[i]):
                    errors.append((rec_num, FIELD_ERROR, i))
                    fields_ok = False
                    break

            # check the PKs
            for i in pk_length:
                if fields_ok and not primary_keys[i].valid(rec_num, data):
                    errors.append((rec_num, PK_ERROR, i))
                    break

            # check the FKs
            for i in fk_length:
                if fields_ok and not foreign_keys[i].valid(rec_num, data):
                    #print 'here ---%s, rec_num : %d' % (data, rec_num)
                    errors.append((rec_num, FK_ERROR, i))
                    break

            # perform record-level checks
            for i in rc_length:
                if fields_ok and not record_checks[i](schema, data):
                    errors.append((rec_num, REC_ERROR, i))
                    break

        except fastcsv.Error, err:
            errors.append((rec_num, LINE_ERROR, err.__str__()))

    # finalize the indexes to check for any more errors
    for i in pk_length:
        error_list = primary_keys[i].finalize()
        primary_keys[i].save()
        if error_list:
            errors.extend([(rec_num, PK_ERROR, i)
                           for rec_num in error_list])

    for i in fk_length:
        error_list = foreign_keys[i].finalize()
        if error_list:
            errors.extend([(rec_num, FK_ERROR, i)
                           for rec_num in error_list])

    # sort the list of errors by the cumulative line number
    errors.sort(lambda l, r: cmp(l[0], r[0]))

    show("saving output ... ")

    # reopen input and sort it into either the output file
    # or the error log file
    input = fileinput.FileInput(all_files,
                                bufsize=buffer_size or DEFAULT_BUFFER)
    error_list = iter(errors)
    count = input.lineno
    filename = input.filename
    line_no = input.filelineno
    try:
        line_num, reason, i = error_list.next()
    except StopIteration:
        line_num = -1
    for line in input:
        line = line + '\r\n'
        #print '%d,%d' % (line_num, count())
        if line_num == count():

            if reason == FIELD_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                        INVALID_FIELD % (schema[i].name), line))
            elif reason == LINE_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(), i, line))
            elif reason == PK_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                        INVALID_PK % (primary_keys[i].name), line))
            elif reason == FK_ERROR:
                #print 'Test FK %s, rec_num : %d, line : %s' % (
                #        foreign_keys[i].name, line_no(), line)
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                        INVALID_FK % (foreign_keys[i].name), line))
            elif reason == REC_ERROR:
                logs[filename()].write(ERROR_FORMAT % (line_no(),
                        INVALID_REC % (record_checks[i].__doc__), line))
            else:
                raise RuntimeError("shouldn't reach here")

            try:
                #print 'CURRENT ITERATION, line_num : %d, line : %s' % (
                #        line_num, line)
                line_num1 = line_num
                line_num, reason, i = error_list.next()
                if line_num1 == line_num:
                    line_num, reason, i = error_list.next()
                #print 'FOR NEXT ITERATION, line_num : %d, line : %s' % (
                #        line_num, line)
            except StopIteration:
                line_num = -1
            continue

        if not BLANK_LINE.match(line):
            output.write(line)

    output.close()
    for f in logs.values():
        f.close()
-----------------------------------------------------------------------------------------------------------------------------

Now when I open the error log file, it contains the error message for
each erroneous record, along with the original record copied from the
*.acc file.
This record is preceded by a box-like character.
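A plausible cause of that box-like first character (an assumption - the editor used to save the file isn't shown): Windows editors such as Notepad prepend a UTF-8 byte-order mark (the three bytes EF BB BF) when saving as UTF-8, and code that reads the file as plain bytes sees those bytes as junk. A sketch of detecting and stripping it (the file name is made up):

```python
# -*- coding: utf-8 -*-
import codecs

# simulate a UTF-8 file saved with a BOM, as Notepad does
open("abc_sample.acc", "wb").write(
    codecs.BOM_UTF8 + u"A001, test account1, 100000\n".encode("utf-8"))

raw = open("abc_sample.acc", "rb").read()
if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]   # drop the 3 BOM bytes
text = raw.decode("utf-8")
```

If the BOM is the culprit, stripping it once, before the line-by-line validation starts, would keep it out of the error log.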

Do you want me to post the complete code, just in case?
It might help... you might then understand my problem better.
Please let me know soon.

Oct 25 '06 #8

Hi,
Can you please tell me if there is any package or class that I can
import for internationalization, or unicode support?

This module is just a small part of our application, and we are not
really supposed to alter the code.
We do not have anybody here to help us with python, and are supposed
to just try and understand the program. Today I am in a position where
I can fix the bugs arising from the code, but cannot really try
something like internationalization on my own. Can you help?
Do you want me to post the complete code for your reference?
Please let me know as soon as possible.
John Roth wrote:
sonald wrote:
Hi,
I am using python2.4.1

I need to pass russian text into python and validate the same.
Can you please guide me on how to make my existing code support
russian text?

Is there any module that can be used for unicode support in python?

In case of decimal numbers, how do I handle a comma as the decimal
point within a number?

Currently the existing code is working fine for English text.
Please help.

Thanks in advance.

regards
sonal

As both of the other responders have said, the
coding comment at the front only affects source
text; it has absolutely no effect at run time. In
particular, it's not even necessary to use it to
handle non-English languages as long as you
don't want to write literals in those languages.

What seems to be missing is the notion that
external files are _always_ byte files, and have to
be _explicitly_ decoded into unicode strings,
and then encoded back to whatever the external
encoding needs to be, each and every time you
read or write a file, or copy string data from
byte strings to unicode strings and back.
There is no good way of handling this implicitly:
you can't simply say "utf-8" or "iso-8859-whatever"
in one place and expect it to work.

You've got to specify the encoding on each and
every open, or else use the encode and decode
string methods. This is a great motivation for
eliminating duplication and centralizing your
code!

For your other question: the general words
are localization and locale. Look up locale in
the index. It's a strange subject which I don't
know much about, but that should get you
started.

John Roth
Oct 25 '06 #9

This discussion thread is closed. Replies have been disabled for this discussion.