Problem with lower() for Unicode strings in Russian

Alexey Moskvin

Hi!
I have a set of strings (all letters are capitalized) in UTF-8,
in Russian. I need to lowercase them, but
my_string.lower() doesn't work.
See the sample script:
# -*- coding: utf-8 -*-
[skip]
s1 = self.title
s2 = self.title.lower()
print s1 == s2

It prints True.
I have no problems with lower() for English letters, or with
something like this:
u'russian_letters_here'.lower(), but I don't need constants; I need to
modify variables, and nothing changes when I apply lower()
to my strings.

Oct 5 '08 #1

Diez B. Roggisch

Can you give a concrete example? I doubt that there is any
difference between lowercasing a unicode object given as a literal and
one acquired somewhere else. And because my Russian skills equal my
Chinese (a total of zero), I can't create a test myself :)

Oct 5 '08 #2

Martin v. Löwis

> I have a set of strings (all letters are capitalized) in UTF-8,

That's the problem. If these are really utf-8 encoded byte strings,
then .lower likely won't work. It uses the C library's tolower API,
which works on a byte level, i.e. can't work for multi-byte encodings.
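To make the byte-level point concrete, here is a small sketch. It is written for Python 3, which is an assumption (the thread itself uses Python 2); there, bytes plays the role of the old byte string and bytes.lower() only maps ASCII letters, so it shows the same effect. The letter chosen is arbitrary.

```python
# Sketch: why byte-wise case mapping cannot lowercase UTF-8 Cyrillic.
# u"\u041f" is CYRILLIC CAPITAL LETTER PE, u"\u043f" is the small letter.
upper_bytes = u"\u041f".encode("utf-8")
lower_bytes = u"\u043f".encode("utf-8")

# Each letter occupies two bytes, and neither byte is an ASCII letter.
print(len(upper_bytes), len(lower_bytes))
print(repr(upper_bytes), repr(lower_bytes))

# A byte-level lower() finds nothing it can map, so the bytes come back unchanged.
bytewise = upper_bytes.lower()
print(bytewise == upper_bytes)
```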

What you need to do is to operate on Unicode strings. I.e. instead
of

s.lower()

do

s.decode("utf-8").lower()

or (if you need byte strings back)

s.decode("utf-8").lower().encode("utf-8")
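The decode-then-lower chain above can be tried out with a minimal sketch like the following. It assumes Python 3, and raw_title is a hypothetical stand-in for the poster's self.title.

```python
# raw_title is a hypothetical stand-in for self.title: "МОСКВА" as UTF-8 bytes.
raw_title = u"\u041c\u041e\u0421\u041a\u0412\u0410".encode("utf-8")

# Lowercasing the bytes directly leaves the Cyrillic letters untouched.
unchanged = raw_title.lower()

# Decoding first lets lower() work on characters instead of bytes.
lowered_text = raw_title.decode("utf-8").lower()

# Encode again only if byte strings are required downstream.
lowered_bytes = lowered_text.encode("utf-8")

print(unchanged == raw_title)
print(lowered_text)
```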

If you find that you write the latter, I recommend that you redesign
your application. Don't use byte strings to represent text, but use
Unicode strings all the time, except at the system boundary (where
you decode/encode as appropriate).
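One possible shape for that design is sketched below, assuming Python 3. The function names read_titles and write_titles are hypothetical, and io.BytesIO stands in for whatever file or socket actually delivers the bytes.

```python
import io

def read_titles(stream, encoding="utf-8"):
    """Decode once, at the input boundary; return a list of text strings."""
    return stream.read().decode(encoding).splitlines()

def write_titles(stream, titles, encoding="utf-8"):
    """Encode once, at the output boundary."""
    stream.write(u"\n".join(titles).encode(encoding))

# io.BytesIO stands in for a file or socket that delivers raw bytes.
source = io.BytesIO(
    u"\u041c\u041e\u0421\u041a\u0412\u0410\n\u041a\u0418\u0415\u0412".encode("utf-8")
)
sink = io.BytesIO()

titles = read_titles(source)                   # text from here on
lowered = [title.lower() for title in titles]  # no decode/encode in the middle
write_titles(sink, lowered)                    # bytes again only on the way out
```

Everything between the two boundary functions sees only text, so calls such as lower() never need to know which encoding the data arrived in.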

There are some limitations with Unicode .lower also, but I don't
think they apply to Russian (specifically, SpecialCasing.txt is
not considered).
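A quick way to check that the basic Russian alphabet only needs simple one-to-one case mappings is sketched below. It assumes Python 3, where, unlike the Python 2 discussed in this thread, upper() and lower() also apply the multi-character mappings; the German sharp s is included only as a contrast case.

```python
# Basic Russian alphabet: U+0410..U+042F plus U+0401 (the letter Ё).
russian_upper = [chr(cp) for cp in range(0x0410, 0x0430)] + [chr(0x0401)]

# Every capital letter lowercases to exactly one different character...
one_to_one = all(len(ch.lower()) == 1 and ch.lower() != ch for ch in russian_upper)

# ...and uppercasing that character gives the original letter back.
round_trip = all(ch.lower().upper() == ch for ch in russian_upper)

# Contrast: sharp s has a one-to-many uppercase mapping (Python 3 behavior).
sharp_s_upper = u"\u00df".upper()

print(one_to_one, round_trip, sharp_s_upper)
```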

HTH,
Martin

Oct 5 '08 #3

Alexey Moskvin

Martin, thanks for the fast reply, now everything is OK!

Oct 6 '08 #4

konstantin

Alexey,

if your strings are stored in a text file, you can use the "codecs" module:

import codecs
handler = codecs.open('somefile', 'r', 'utf-8')
# ... do the job
handler.close()

This is how I prefer to deal with Russian text in UTF-8.
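A fuller sketch of the same idea follows. Note that it swaps in io.open with an encoding keyword rather than codecs.open, on the assumption that the code should run on current Python 3; the temporary file and its name are hypothetical placeholders for 'somefile'.

```python
import io
import os
import tempfile

# Create a small UTF-8 file so the sketch is self-contained;
# the path is a hypothetical placeholder for 'somefile'.
path = os.path.join(tempfile.mkdtemp(), "somefile.txt")
with io.open(path, "w", encoding="utf-8") as handler:
    handler.write(u"\u041c\u041e\u0421\u041a\u0412\u0410\n")

# Reading through a decoding file object yields text, not bytes,
# so lower() handles the Cyrillic letters.
with io.open(path, "r", encoding="utf-8") as handler:
    lowered_lines = [line.rstrip(u"\n").lower() for line in handler]

print(lowered_lines)
```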

Konstantin.

Oct 6 '08 #5
