Problem with lower() for Unicode strings in Russian

Hi!
I have a set of UTF-8 strings in Russian, with all letters capitalized.
I need to lowercase them, but my_string.lower() doesn't work.
See the sample script:
# -*- coding: utf-8 -*-
[skip]
s1 = self.title
s2 = self.title.lower()
print s1 == s2

This prints True.
I have no problems with lower() for English letters, or with
something like this:
u'russian_letters_here'.lower(), but I don't need constants. I need to
modify variables, and nothing changes when I apply the lower()
method to my strings.
Oct 5 '08 #1
Alexey Moskvin wrote:
[full quote of post #1 snipped]
Can you give a concrete example? I doubt that lowering a unicode object
behaves differently depending on whether it is given as a literal or
acquired somewhere else. And because my Russian skills equal my Chinese
(a total of zero), I can't create a test myself :)
Oct 5 '08 #2
> I have a set of strings (all letters are capitalized) at utf-8,

That's the problem. If these are really utf-8 encoded byte strings,
then .lower likely won't work. It uses the C library's tolower API,
which works on a byte level, i.e. can't work for multi-byte encodings.
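
A minimal sketch of this byte-level behaviour follows. It assumes Python 3, where the bytes type stands in for the byte strings discussed here and its lower() only maps ASCII letters; the sample text is made up for illustration.

```python
# Assumption: Python 3, where bytes.lower() only maps ASCII A-Z.
# The sample text is hypothetical.
raw = "ПРИВЕТ World".encode("utf-8")   # a UTF-8 encoded byte string

lowered_bytes = raw.lower()

# Only the ASCII word changes; the multi-byte Cyrillic letters stay as they were.
cyrillic_unchanged = lowered_bytes.decode("utf-8") == "ПРИВЕТ world"
print(cyrillic_unchanged)
```

Each Cyrillic letter is two bytes in UTF-8, and neither byte is an ASCII capital, so a byte-wise lower() has nothing to convert.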

What you need to do is to operate on Unicode strings. I.e. instead
of

s.lower()

do

s.decode("utf-8").lower()

or (if you need byte strings back)

s.decode("utf-8").lower().encode("utf-8")

If you find that you write the latter, I recommend that you redesign
your application. Don't use byte strings to represent text, but use
Unicode strings all the time, except at the system boundary (where
you decode/encode as appropriate).
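
The decode, lower, encode round trip can be sketched as below. This assumes Python 3 (bytes and str in place of byte strings and unicode objects); lower_utf8 and the sample title are hypothetical names used only for illustration.

```python
def lower_utf8(data: bytes) -> bytes:
    """Hypothetical helper: lowercase UTF-8 bytes by going through Unicode text."""
    text = data.decode("utf-8")    # decode at the input boundary
    text = text.lower()            # work on Unicode text internally
    return text.encode("utf-8")    # encode again at the output boundary


title = "МОСКВА".encode("utf-8")   # hypothetical sample input
result = lower_utf8(title)
print(result.decode("utf-8"))
```

In an application that already keeps its text as Unicode, only the decode and encode steps at the edges remain, and a helper like this should not be needed.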

Unicode .lower has some limitations as well (specifically,
SpecialCasing.txt is not considered), but I don't think they
apply to Russian.
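
One way to check the claim for Russian is sketched below. It assumes Python 3, where str is always Unicode, and it only covers the basic Russian alphabet (А to Я plus Ё), not every Cyrillic letter.

```python
# Assumption: Python 3 str (always Unicode).
# Capital А..Я are U+0410..U+042F, small а..я are U+0430..U+044F;
# Ё/ё (U+0401/U+0451) sit outside those ranges and are added by hand.
upper = [chr(cp) for cp in range(0x0410, 0x0430)] + ["\u0401"]
lower = [chr(cp) for cp in range(0x0430, 0x0450)] + ["\u0451"]

# Every capital should map to exactly one small letter.
one_to_one = all(
    u.lower() == l and len(u.lower()) == 1
    for u, l in zip(upper, lower)
)
print(one_to_one)
```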

HTH,
Martin
Oct 5 '08 #3
Martin, thanks for the fast reply, now everything is OK!
Oct 6 '08 #4
Alexey,

If your strings are stored in a text file, you can use the "codecs" module:
import codecs
handler = codecs.open('somefile', 'r', 'utf-8')
# ... do the job: everything read from handler is already a unicode object
handler.close()
I prefer this way of dealing with Russian in UTF-8.
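
A self-contained sketch of the same idea follows. It assumes Python 3; the file name and its contents are made up, and the file is created first only so the example runs on its own.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")  # hypothetical file name
with codecs.open(path, "w", "utf-8") as out:
    out.write("ПЕРВАЯ СТРОКА\nВТОРАЯ СТРОКА\n")

# codecs.open decodes while reading, so each line is already Unicode text
# and lower() works on it directly.
with codecs.open(path, "r", "utf-8") as handler:
    lowered = [line.rstrip("\n").lower() for line in handler]

print(lowered)
```

In Python 3 the built-in open(path, encoding="utf-8") does the same decoding and is the more common choice there.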

Konstantin.
Oct 6 '08 #5
