convert Unicode to lower/uppercase?

Hallvard B Furuseth

Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

What I actually need to do is to compare a number of strings in a
case-insensitive manner, so I assume it's simplest to convert to
lower/upper first.

Possibly all strings will be from the latin-1 character set, so I could
convert to 8-bit latin-1, map to lowercase, and convert back, but that
seems rather cumbersome.
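
For readers on a current Python 3 (an assumption; the thread itself predates it), a minimal sketch of the case-insensitive comparison asked about here could rely on str.casefold(), which applies Unicode case folding rather than plain lowercasing. The helper name equal_ignoring_case is made up for illustration.

```python
def equal_ignoring_case(left, right):
    """Compare two strings caselessly using Unicode case folding."""
    # casefold() is more aggressive than lower(): it also folds "ß" to "ss".
    return left.casefold() == right.casefold()


names = ["Ärger", "ÄRGER", "ärger"]
print(all(equal_ignoring_case(names[0], name) for name in names))
print(equal_ignoring_case("Straße", "STRASSE"))
```

Folding both sides avoids the Latin-1 round trip entirely, and no conversion to byte strings is needed.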

--
Hallvard

Jul 18 '05 #1

Peter Otten

nospam wrote:

Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

Toiled and came up with:

>>> print u"abcäöüß".upper()
ABCÄÖÜß
>>> u"ABCÄÖÜ".lower()
u'abc\xe4\xf6\xfc'

Peter

Jul 18 '05 #2

Hallvard B Furuseth

Thanks!

--
Hallvard

Jul 18 '05 #3

jallan

Peter Otten <__*******@web.de> wrote in message news:<bk*************@news.t-online.com>...

nospam wrote:
Has someone got a Python routine or module which converts Unicode
strings to lowercase (or uppercase)?

Toiled and came up with:
>>> print u"abcäöüß".upper()
ABCÄÖÜß
>>> u"ABCÄÖÜ".lower()
u'abc\xe4\xf6\xfc'

Peter

But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Jim Allan

Jul 18 '05 #4

Martin v. Löwis

jallan wrote:

But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.

Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping nor a lowercase mapping.

Also, in German, the uppercase mapping of ß is of ongoing debate.
For example, the Duden from 1919 says

| Für ß wird in großer Schrift SZ angewandt [...]. Die Verwendung
| _zweier_ Buchstaben für _einen_ Laut ist nur ein Notbehelf, der
| aufhören muß, sobald ein geeigneter Druckbuchstabe für das
| große ß geschaffen ist.

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

Regards,
Martin

Jul 18 '05 #5

Asun Friere

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message news:<bk*************@news.t-online.com>...

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?

Jul 18 '05 #6

Gerhard Häring

Asun Friere wrote:

"Martin v. Löwis" <ma****@v.loewis.de> wrote in message news:<bk*************@news.t-online.com>...
The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?

ß (sz) has not been completely eliminated. After *short* vowels it has
been replaced with ss (Kuß => Kuss, Fluß => Fluss). But after *long*
vowels, it is still used (Maß, Gruß, ...).

-- Gerhard

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like elimination of
capitalization of the noun.

Jul 18 '05 #7

Peter Otten

"Martin v. Löwis" wrote:

jallan wrote:
But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.
Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDAT...ialCasing.txt:

[...]
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.
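
As a rough sketch of how the record quoted above can be read (assuming Python 3, and handling only unconditional records without a condition list), the fields can be split on semicolons and each hexadecimal code point turned into a character. The function name parse_special_casing_line is hypothetical.

```python
def parse_special_casing_line(line):
    """Parse one unconditional SpecialCasing.txt record into strings."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [field.strip() for field in data.split(";")]
    code, lower, title, upper = fields[:4]

    def to_text(hex_codes):
        # "0053 0053" -> "SS"
        return "".join(chr(int(h, 16)) for h in hex_codes.split())

    return {
        "char": to_text(code),
        "lower": to_text(lower),
        "title": to_text(title),
        "upper": to_text(upper),
    }


record = parse_special_casing_line(
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
print(record["upper"], record["title"])
```

The uppercase field holds two code points, which is exactly what the single-character columns of UnicodeData.txt cannot express.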
Also, in German, the uppercase mapping of ß is of ongoing debate.

My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.
For the "official" rule see http://www.ids-mannheim.de/reform/a2-3.html.

Peter

Jul 18 '05 #8

jallan

Peter Otten <__*******@web.de> wrote in message news:<bk*************@news.t-online.com>...

"Martin v. Löwis" wrote:
jallan wrote:
But that really doesn't work properly. According to Unicode specs and
German usage the uppercase of "ß" is actually "SS", that is the single
character "ß" should uppercase to two characters.
Can you cite exact chapter and verse of the Unicode specs that say so?
According to the Unicode database,

http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

ß has neither an uppercase mapping nor a lowercase mapping.

It seems like UnicodeData.txt does not give the full story. Quoting from
http://www.unicode.org/Public/UNIDAT...ialCasing.txt:

[...]

# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and does not have locale-specific mappings.)
[...]
# <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
[...]
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

Yes.

Also the Unicode main charts in the annotation for 00DF state:

uppercase is "SS"

See http://www.unicode.org/charts/PDF/U0080.pdf

This note on the character first appeared in Unicode 1.0 (published in
1991) and has been in every revision.

Unicode 1.0, Volume One also lists this in the lower case to upper
case casing tables on page 453.

There is nothing new about this casing requirement.

A further mention occurs in the Unicode 4.0 specifications in Table
4-1 in section 4.2 Case--Normative. See
http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf

This contains the warning:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone. The
single-character mappings are insufficient for languages such as
German. >>

So is Python just another shit legacy implementation?

Jim Allan

Jul 18 '05 #9

Martin v. Löwis

af*****@yahoo.co.uk (Asun Friere) writes:

The usage of SZ has only been eliminated in the recent change of
the amtliche Rechtschreibung.

And replaced with what? ie. is there now a single capital for SZ?

Unfortunately, I don't have a current Duden here, but I *think* you
now have to write double-S. There is, of course, the old MASSE vs
MASZE issue - I don't know whether this is considered relevant, as
capitalization is rare, anyway, and ambiguities can be clarified from
the context.

Regards,
Martin

Jul 18 '05 #10

Martin v. Löwis

Peter Otten <__*******@web.de> writes:

# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.
No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.
My personal impression is that, even before the orthography reform in 1998,
the SZ variant was seldom used.

There is, of course, the famous "MASSE oder MASZE" example, in particular
in the form "WIR TRINKEN BIER IN MASSEN".

Regards,
Martin

Jul 18 '05 #11

Martin v. Löwis

ja****@smrtytrek.com (jallan) writes:

So is Python just another shit legacy implementation?

Yes :-)

Regards,
Martin

Jul 18 '05 #12

Asun Friere

Gerhard Häring <gh@ghaering.de> wrote in message news:<ma**********************************@python. org>...

PS: I was quite disappointed with the reform of German orthography. I'd
have favoured much more radical steps, like elimination of
capitalization of the noun.

As an English speaker, who occasionally finds himself trying to
decipher German text, let me tell you that little flags like that
--"pick me! I'm a noun!" --are actually quite useful.

Jul 18 '05 #13

jallan

ma****@v.loewis.de (Martin v. Löwis) wrote in message news:<m3************@mira.informatik.hu-berlin.de>...

Peter Otten <__*******@web.de> writes:
# The German es-zed is special--the normal mapping is to SS.
# Note: the titlecase should never occur in practice. It is equal to
# titlecase(uppercase(<es-zed>))

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
[...]

Thus, to comply with the standard, "ß".upper() --> "SS" is required.

No. It would be required if .upper would claim to implement
SpecialCasing - but it makes no such claim.

Of course not. From http://www.python.org/doc/current/li....html#l2h-203:

<<
*upper( )*
Return a copy of the string converted to uppercase.
>>

This makes no claim about how the magic is done. But there is
certainly an implied claim that it is done correctly.

Unicode specifications are easily available at
http://www.unicode.org/versions/Unicode4.0.0/.

At 3.13 is indicated:

<< The full case mappings for Unicode characters are obtained by using
the mappings from SpecialCasing.txt _plus_ the mappings from
UnicodeData.txt, excluding any latter mappings that would conflict. >>

Case mappings for Unicode require use of SpecialCasing otherwise the
results are not in accord with the Unicode standard.
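
The rule quoted from 3.13 can be sketched as a two-level lookup: consult the special (1:many) table first and fall back to the simple 1:1 table only when no special mapping exists. This is an illustration in Python 3 with tiny hand-made excerpts; SIMPLE_UPPER, SPECIAL_UPPER and full_upper are hypothetical names, not the real data files or any Python API.

```python
# Hypothetical excerpt of 1:1 mappings, in the spirit of UnicodeData.txt.
SIMPLE_UPPER = {chr(c): chr(c - 32) for c in range(ord("a"), ord("z") + 1)}
SIMPLE_UPPER["\u00e4"] = "\u00c4"  # ä -> Ä

# Hypothetical excerpt of 1:many mappings, in the spirit of SpecialCasing.txt.
SPECIAL_UPPER = {"\u00df": "SS", "\ufb02": "FL"}  # ß, fl ligature


def full_upper(text, special=SPECIAL_UPPER, simple=SIMPLE_UPPER):
    """Uppercase text, letting special mappings override simple ones."""
    pieces = []
    for ch in text:
        if ch in special:
            pieces.append(special[ch])  # may be longer than one character
        else:
            pieces.append(simple.get(ch, ch))  # unmapped characters pass through
    return "".join(pieces)


print(full_upper("stra\u00dfe"))
```

Collecting the pieces in a list and joining them at the end means the result is free to be longer than the input.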

At 4.2 is found:

<< Only legacy implementations that cannot handle case mappings that
increase string lengths should use UnicodeData case mappings alone.
The single-character mappings are insufficient for languages such as
German >>

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Unicode again warns that using UnicodeData.txt alone is not
sufficient.

The text continues on "SpecialCasing.txt":

<< Contains additional case mappings that map to more than one
character, such as "ß" to "SS". >>

Section 5.18 Case Mappings goes into further detail about casing
issues and specifically mentions:

<< Case mappings may produce strings of different length than the
original. For example the German character U+00DF ß LATIN SMALL LETTER
SHARP S expands when uppercased to the sequence of two characters "SS".
This also occurs where there is no precomposed character corresponding
to a case mapping, such as with U+0149 ŉ LATIN SMALL LETTER N
PRECEDED BY APOSTROPHE. >>
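
For comparison, and assuming Python 3.3 or later rather than the Python version under discussion in this thread, the built-in str.upper() does produce results longer than its input for the two characters named in that passage. A quick check:

```python
samples = {
    "\u00df": "LATIN SMALL LETTER SHARP S",
    "\u0149": "LATIN SMALL LETTER N PRECEDED BY APOSTROPHE",
}

for ch, name in samples.items():
    upper = ch.upper()
    # Show the input length, the output length and the output code points.
    print(name, len(ch), len(upper), [hex(ord(c)) for c in upper])
```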

See also http://www.unicode.org/faq/casemap_charprop-old.html for the
Unicode FAQ which contains:

<<
Q: Why is there no upper-case SHARP S (ß)?

A: There are 139 lower-case letters in Unicode 2.1 that have no direct
uppercase equivalent. Should there be introduced new bogus characters
for all of them, so that when you see an "fl" ligature you can
uppercase it to "FL" without expanding anything? Of course not.

Note that case conversion is inherently language-sensitive, notably in
the case of IPA, which needs to be left strictly alone even when
embedded in another language which is being case converted. The best
you can get is an approximate fit. [JC]

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]
>>

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

The implied combined claim is that Python supports Unicode and
supports proper casing in Unicode.

This implied claim is false.

Truly accurate documentation for upper() should say that it uppercases
a string except for those characters where uppercasing would expand a
character to more than one character in which circumstance that
character is not uppercased or uppercased with loss of data.

Python specifications need not say how casing is done, whether by
using Unicode tables directly or by using its own methods that
accomplish the same results.

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Jim Allan

Jul 18 '05 #14

Peter Otten

jallan wrote:

I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".

Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
char *s = PyString_AS_STRING(self), *s_new;
int i, n = PyString_GET_SIZE(self);
PyObject *new;

new = PyString_FromStringAndSize(NULL, n);
if (new == NULL)
return NULL;
s_new = PyString_AsString(new);
for (i = 0; i < n; i++) {
int c = Py_CHARMASK(*s++);
if (islower(c)) {
*s_new = toupper(c);
} else
*s_new = c;
s_new++;
}
return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...

Personally, I think it's a long way to go for a little s, sharp as it may be
:-)

Peter

Jul 18 '05 #15

Martin v. Löwis

ja****@smrtytrek.com (jallan) writes:

A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?
Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Things are more difficult than they appear to be.

Regards,
Martin

Jul 18 '05 #16

Martin v. Löwis

Peter Otten <__*******@web.de> writes:

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

Regards,
Martin

Jul 18 '05 #17

Peter Otten

Martin v. Löwis wrote:

Peter Otten <__*******@web.de> writes:
Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

You are mistaken. The implementation in unicodeobject.c is
fundamentally different. The byte string implementation uses the C
library, the Unicode implementation uses the Unicode character
database. So the former cannot be changed, whereas the latter could,
in theory, be extended to use additional data.

I followed the code to fixupper() which operates on a preallocated unicode
object and thus cannot cope with a string that expands while being
transformed. I didn't actually resolve the macros.

While we are at it, would it be viable to "abuse" the encoding/decoding
mechanism to do case conversions?

Peter

Jul 18 '05 #18

jallan

Peter Otten <__*******@web.de> wrote in message news:<bk*************@news.t-online.com>...

jallan wrote:
I don't see any particular reason why Python "cannot handle case
mappings that increase string lengths".
Now that's a long post. I think it essentially boils down to the above
statement.

Looking into stringobject.c (judging from a first impression,
unicodeobject.c has essentially the same algorithm, but with a few
indirections):

static PyObject *
string_upper(PyStringObject *self)
{
char *s = PyString_AS_STRING(self), *s_new;
int i, n = PyString_GET_SIZE(self);
PyObject *new;

new = PyString_FromStringAndSize(NULL, n);
if (new == NULL)
return NULL;
s_new = PyString_AsString(new);
for (i = 0; i < n; i++) {
int c = Py_CHARMASK(*s++);
if (islower(c)) {
*s_new = toupper(c);
} else
*s_new = c;
s_new++;
}
return new;
}

The whole routine builds on the assumption that len(s) == len(s.upper()) and
nothing short of a complete rewrite will fix that. But if you volunteer...

I would love to if I had the time. Sigh! Maybe in some months.
Personally, I think it's a long way to go for a little s, sharp as it may be
:-)

If it were just ß one could throw in a quick conversion of any ß to
ss at the beginning.

But there are over a hundred other characters that expand when
uppercased in http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt,
most of them Greek. Greek is a horror. See
http://www.tlg.uci.edu/~opoudjis/uni..._adscript.html for the
sad tale.

Unfortunately language and orthography are messy and inconsistent and
illogical and sometimes just silly. But handling orthography properly
involves dealing with these complex rules and subrules and exceptions
to rules rather than ignoring them.

Unicode gives us great power, but with great power comes great
responsibility and lots of niggling code. :-(

Fortunately only the Latin, Greek, Coptic, Cyrillic and Armenian
scripts have such a thing as casing and the Unicode people have
provided data files and algorithms that supposedly handle casing for
these languages acceptably.

From the Conformance requirements for Unicode at
http://www.unicode.org/versions/Unic...h03.pdf#G29484 (C20):

<< An implementation that purports to support the default casing
operations of case conversion, case detection, and caseless mapping
shall do so in accordance with the definitions and specifications in
Section 3.13, Default Case Operations. >>

This involves even more messy fussing about with context specification
for casing and with what values should be returned from a case
querying function, e.g. "A2" is true as both uppercase and titlecase
but not as lowercase. "3" is true as lowercase, uppercase and title
case.
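
The case detection described here can be approximated by testing whether a conversion leaves the string unchanged. The sketch below assumes Python 3, skips the normalization step that the Unicode definitions apply first, and uses str.title() as a stand-in for the standard's titlecase conversion; the three function names are invented for illustration.

```python
def is_lowercase(text):
    """True if lowercasing leaves the text unchanged."""
    return text.lower() == text


def is_uppercase(text):
    """True if uppercasing leaves the text unchanged."""
    return text.upper() == text


def is_titlecase(text):
    """True if titlecasing leaves the text unchanged (str.title() approximation)."""
    return text.title() == text


for sample in ("A2", "3"):
    print(sample, is_lowercase(sample), is_uppercase(sample), is_titlecase(sample))
```

Note that this differs from Python's own str.isupper() and str.islower(), which return False for a string such as "3" that contains no cased characters.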

Python or any application or language either does or doesn't conform.

I doubt that there is currently any application that can yet honestly
purport to support Unicode default casing operations of case
conversion, case detection and caseless mapping.

Jim Allan

Jul 18 '05 #19

Martin v. Löwis

Peter Otten <__*******@web.de> writes:

While we are at it, would it be viable to "abuse" the
encoding/decoding mechanism to do case conversions?

It might be viable, but I would consider it abuse: for one thing, I'm
not in favour of codecs which do Unicode->Unicode conversions - IMO, a
codec should convert between Unicode and byte strings. Furthermore, a
codec IMO should represent a proper "encoding", which case conversions
would not do.

Instead, it would be much better to provide such functions in a
library, e.g. by wrapping ICU. Then, case conversions should be done
locale-dependent, instead of being general (as .upper currently is).
The locale-dependent way would best operate on explicit locale
objects, so you would spell

locale_object = load_locale("German", "Plattdeutsch")
up_string = locale_object.to_upper(lower_string)

In that case, the upper-case function would stop being a string
method, and be a locale method instead, taking a string argument.
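
A minimal sketch of that idea, assuming Python 3 and covering only the Turkish and Azeri dotted/dotless i rule, might look like the following. CasingLocale and to_upper are hypothetical names chosen to mirror the spelling above; they are not part of Python or ICU.

```python
class CasingLocale:
    """Hypothetical locale object that owns the case conversion."""

    _TURKIC = {"tr", "az"}  # languages that distinguish dotted and dotless i

    def __init__(self, language):
        self.language = language

    def to_upper(self, text):
        if self.language in self._TURKIC:
            # i -> İ (U+0130) and ı (U+0131) -> I, applied before the
            # language-neutral mapping handles everything else.
            text = text.replace("i", "\u0130").replace("\u0131", "I")
        return text.upper()


print(CasingLocale("en").to_upper("indigo"))
print(CasingLocale("tr").to_upper("indigo"))
```

Keeping the language on the object rather than reading a process-wide locale means the same call gives the same answer on every machine.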

Regards,
Martin

Jul 18 '05 #20

jallan

ma****@v.loewis.de (Martin v. Löwis) wrote in message news:<m3************@mira.informatik.hu-berlin.de>...

ja****@smrtytrek.com (jallan) writes:
A: No. The UnicodeData.txt file includes all of the 1:1 case mappings,
but doesn't include 1:many mappings such as the one needed for
uppercasing ß. Since many parsers now expect this file to have at most
single characters in the case mapping fields, an additional file
(SpecialCasing.txt) was added to provide the 1:many mappings. For more
information, see UTR #21- Case Mappings [MD]

Python specifications make an implied claim of full support for
Unicode and an implied claim that the function upper() uppercases a
string properly.

This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

One should fix the contradiction by either changing the behavior of
.upper() so that it will properly case all strings or documenting
clearly that .upper() does not handle particular kinds of casing. Of
course users often don't read the documentation. :-(

Users should not have to know such details. They may wish to know
where a particular function does not do what might be expected of it.

Things are more difficult than they appear to be.

Yes.

Again and again one thinks one has a solution for a problem and then
exceptions turn up.

Again and again one finds things that one's code doesn't handle, often
from failure to analyze fully in the initial stages and adopting
algorithms that prove insufficient to handle the data found in
reality.

Jim Allan

Jul 18 '05 #21

Neil Hodgson

jallan:

ma****@v.loewis.de (Martin v. Löwis) wrote
...
This is a contradiction: SpecialCasing contains 1:n mappings, whereas
.upper() can only return a single result. So how do you think
SpecialCasing should be considered in the implementation of .upper()?

I am not aware that it is philosophically a *necessary* feature of
.upper() that a single character not be replaced by a string of two or
more characters.

That is not the issue. The issue is that .upper would have to return a
list or map of results (for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}), which would be
difficult for the caller to make use of without performing some additional
work, finding the correct result for its locale. It is simpler for the
caller to provide a locale argument in the .upper call or in its context.

Neil

Jul 18 '05 #22

Neil Hodgson

Me:

for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),

For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

Neil

Jul 18 '05 #23

jallan

"Neil Hodgson" <nh******@bigpond.net.au> wrote in message news:<wm*******************@news-server.bigpond.net.au>...

Me:
for an illustrative but incorrect example
"ca~non".upper() -> {'portugal':'CANON','spain':'CA~NON'}),

For a real example from the Microsoft web site, uppercasing "indigo"
(u'\u0069\u006e\u0064\u0069\u0067\u006f') gives "INDIGO"
(u'\u0049\u004e\u0044\u0049\u0047\u004f') for English-US and similar but
with dots above the 'I's for Turkish:
(u'\u0130\u004e\u0044\u0130\u0047\u004f').

The file http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
purportedly contains *all* casings for all scripts for all languages
where the casings are not one-to-one or are otherwise not
straightforward.

The *only* locale oddities there are for Lithuanian and the two
languages Turkish and Azeri and concern only dot/no-dot variants of
the letters _i_, _I_, _j_, _J_ and no others.

There are *no* other locale-based oddities. The mess is thankfully
*very* limited in scope.

In my opinion, if the full Unicode casing specification is to be
followed, the most useful solution would be a parameter allowing the
user to choose among (1) normal Latin casing, (2) Turkish/Azeri or (3)
Lithuanian as the casing model for treatment of these letters.

The default for the parameter would either be based on current locale
or be normal Latin casing. I think the latter far better as it is
dangerous to have functions in a language differ from machine to
machine according to the current locale.

Also, in case someone brings it up, it was formerly standard to
generally omit diacritics on capital letters in Portuguese and in
French (in France but not in Quebec!)

This is no longer the norm for either language. See
http://www.academie-francaise.fr/lan...l#accentuation
and http://www.press.uchicago.edu/Misc/C...haracters.html.

I have seen academic style sheets with a silly rule that diacritics
should be placed on capital letters as on lowercase letters except for
the word "A". See http://www.alphaacademic.co.uk/fcs.htm and
http://www.sagepub.com/journalManusc...pid=9669&sc=1:

<< We use accents on capital letters, but capital A does not take a
grave accent. >>

It would not hurt to make a casing table customizable for such unusual
styles. But that is beyond Unicode's specifications.

A programmer who wishes odd customization beyond the norms of a
language and Unicode specifications can do it through transformations
outside of normal casing.

Jim Allan

Jul 18 '05 #24
