
Sorting strings containing special characters (German 'Umlaute')

Hi !

I know that this topic has been discussed in the past, but I could not
find a working solution for my problem: sorting (lists of) strings
containing special characters like "ä", "ü",... (German umlauts).
Consider the following list:
l = ["Aber", "Beere", "Ärger"]

For sorting, the letter "Ä" is supposed to be treated like "Ae",
so sorting this list should yield
l = ["Aber", "Ärger", "Beere"]

I know about the locale module and its function strcoll(string1,
string2), but currently this does not work correctly for me. Consider
>>> locale.strcoll("Ärger", "Beere")
1

Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
Can someone help?

Btw: I'm using WinXP (German) and
>>> locale.getdefaultlocale()
prints
('de_DE', 'cp1252')

TIA.

Dierk

Mar 2 '07 #1
8 Replies


We tried this in a JavaScript version and it seems to work; apologies for a
possibly bad translation to Python.
# coding: cp1252
def _deSpell(a):
    # Decode the cp1252 byte string, then spell out each umlaut as two letters.
    u = a.decode('cp1252')
    return (u.replace(u'\u00C4', u'Ae').replace(u'\u00e4', u'ae')
             .replace(u'\u00D6', u'Oe').replace(u'\u00f6', u'oe')
             .replace(u'\u00DC', u'Ue').replace(u'\u00fc', u'ue')
             .replace(u'\u00C5', u'Ao').replace(u'\u00e5', u'ao'))

def deSort(a, b):
    # Python 2 comparison function: compare the spelled-out forms.
    return cmp(_deSpell(a), _deSpell(b))

l = ["Aber", "Ärger", "Beere"]
l.sort(deSort)
print l

--
Robin Becker

Mar 2 '07 #2

Di**********@mail.com wrote:
> I know that this topic has been discussed in the past, but I could not
> find a working solution for my problem: sorting (lists of) strings
> containing special characters like "ä", "ü",... (German umlauts).
> Consider the following list:
> l = ["Aber", "Beere", "Ärger"]
>
> For sorting the letter "Ä" is supposed to be treated like "Ae",

I don't think so:
>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']
>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']

> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
>
> I know about the module locale and its method strcoll(string1,
> string2), but currently this does not work correctly for me. Consider
> >>> locale.strcoll("Ärger", "Beere")
> 1
>
> Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
> Can someone help?
>
> Btw: I'm using WinXP (German) and
> >>> locale.getdefaultlocale()
> prints
> ('de_DE', 'cp1252')

The default locale is not used by default; you have to set it explicitly:
>>> import locale
>>> locale.strcoll("Ärger", "Beere")
1
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> locale.strcoll("Ärger", "Beere")
-1
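
For readers on Python 3 (the session above is Python 2), here is a minimal sketch of the same point: the collation locale is set explicitly before `locale.strcoll` is called, and restored afterwards. The helper name `compare_in_locale` is hypothetical; only `setlocale` and `strcoll` come from the `locale` module. Which locales are installed, and how good their collation tables are, differs between platforms, so no particular result is promised.

```python
import locale

def compare_in_locale(a, b, locale_name=""):
    """Compare a and b with locale.strcoll under an explicitly set locale.

    locale_name="" asks for the user's default locale.
    (Hypothetical helper, written for illustration only.)
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed: compare in the current one.
        pass
    try:
        return locale.strcoll(a, b)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # undo the change

# Negative: "Ärger" sorts first. Positive: "Beere" sorts first.
print(compare_in_locale("Ärger", "Beere"))
```

Restoring the previous setting keeps the change local to the call, which matters because the locale is a process-wide setting.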

By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.

Finally, for efficient sorting, a key function is preferable over a cmp
function:
>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']
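
To illustrate why a key function tends to be cheaper, here is a small Python 3 sketch that counts how often each approach calls into `locale`. The counting wrappers `counting_key` and `counting_cmp` are hypothetical names added for illustration; the sort runs in whatever collation locale is currently active, so the resulting order itself is not the point here.

```python
import functools
import locale

words = ["Ast", "Ärger", "Ara", "Beere", "Aber"]
calls = {"key": 0, "cmp": 0}

def counting_key(s):
    # Called once per element; the transformed string is reused afterwards.
    calls["key"] += 1
    return locale.strxfrm(s)

def counting_cmp(a, b):
    # Called once per comparison the sort performs.
    calls["cmp"] += 1
    return locale.strcoll(a, b)

by_key = sorted(words, key=counting_key)
by_cmp = sorted(words, key=functools.cmp_to_key(counting_cmp))

print(by_key == by_cmp)  # both approaches give the same order
print(calls)             # the key count equals len(words)
```

For short lists the difference is small; it grows with the list, since the number of comparisons grows faster than the number of elements.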

Peter

(*) German for "trouble"
Mar 2 '07 #3

Di**********@mail.com writes:
> For sorting the letter "Ä" is supposed to be treated like "Ae",
> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]

Are you sure? Maybe I'm thinking of another language, but I thought Ä should
be sorted together with A, but after A if the words are otherwise equal.
E.g. Antwort, Ärger, Beere. A proper strcoll handles that by
translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
then it can sort first by the un-accentified name and then by the rest.
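
The two-level idea described above can be sketched in Python 3 roughly as follows. This is not how any particular C library implements strcoll; it is a simplified illustration that uses Unicode decomposition (`unicodedata`) to split off the accents, and the function name `two_level_key` is made up for the example.

```python
import unicodedata

def two_level_key(word):
    """Primary key: the letters without accents. Secondary key: the accents."""
    decomposed = unicodedata.normalize("NFD", word)  # "Ä" -> "A" + U+0308
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Tuples compare element by element, so the decomposed form only
    # matters when the accent-free forms are equal.
    return (base.casefold(), decomposed)

words = ["Beere", "Ärger", "Arger", "Antwort"]
print(sorted(words, key=two_level_key))
# → ['Antwort', 'Arger', 'Ärger', 'Beere']
```

"Arger" is not a real word; it is included only to show that the unaccented form wins the tie.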

--
Hallvard
Mar 2 '07 #4

Hallvard B Furuseth wrote:
> Di**********@mail.com writes:
>> For sorting the letter "Ä" is supposed to be treated like "Ae",
>> therefore sorting this list should yield
>> l = ["Aber", "Ärger", "Beere"]
>
> Are you sure? Maybe I'm thinking of another language, I thought Ä
> should be sorted together with A, but after A if the words are
> otherwise equal.

In German, there are some different forms:

- the classic sorting for e.g. word lists: umlauts and plain vowels
are of same value (like you mentioned): ä = a

- name list sorting for e.g. phone books: umlauts have the same
value as their substitutes (like Dierk described): ä = ae

There are others, too, but those are the most widely used.
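
As a rough Python 3 illustration of how the two conventions differ, the sketch below builds one key function per convention. The names `word_list_key` and `phone_book_key` are invented for this example, the tables cover only ä/ö/ü, and precomposed characters are assumed; real collation has more rules than this.

```python
# Convention 1 (word lists): an umlaut counts as its plain vowel.
UMLAUT_TO_VOWEL = str.maketrans(
    {"ä": "a", "ö": "o", "ü": "u", "Ä": "A", "Ö": "O", "Ü": "U"})
# Convention 2 (phone books): an umlaut counts as vowel + e.
UMLAUT_TO_DIGRAPH = str.maketrans(
    {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

def word_list_key(word):
    return word.translate(UMLAUT_TO_VOWEL).casefold()

def phone_book_key(word):
    return word.translate(UMLAUT_TO_DIGRAPH).casefold()

words = ["Afrika", "Ärger", "Adler"]
print(sorted(words, key=word_list_key))   # → ['Adler', 'Afrika', 'Ärger']
print(sorted(words, key=phone_book_key))  # → ['Adler', 'Ärger', 'Afrika']
```

With the phone-book key, the list from the original question comes out as Aber, Ärger, Beere.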

Regards,
Björn

--
BOFH excuse #277:

Your Flux Capacitor has gone bad.

Mar 2 '07 #5

On 2 Mar., 15:25, Peter Otten <__pete...@web.de> wrote:
> DierkErdm...@mail.com wrote:
>> For sorting the letter "Ä" is supposed to be treated like "Ae",

There are several ways of defining the sorting order. The variant "ä
equals ae" follows DIN 5007 (according to Wikipedia); defining "a
equals ä" complies with DIN 5007-1. Therefore both options are
possible.

> The default locale is not used by default; you have to set it explicitly
> >>> import locale
> >>> locale.strcoll("Ärger", "Beere")
> 1
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> locale.strcoll("Ärger", "Beere")
> -1

On my machine
>>> locale.setlocale(locale.LC_ALL, "")
gives
'German_Germany.1252'

But this does not affect the sorting order as it does on your
computer.
>>> locale.strcoll("Ärger", "Beere")
yields 1 in both cases.

Thank you for your hint about using unicode from the beginning; see the
difference:
>>> s1 = unicode("Ärger", "latin-1")
>>> s2 = unicode("Beere", "latin-1")
>>> locale.strcoll(s1, s2)
1
>>> locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
>>> locale.strcoll(s1, s2)
-1

compared to
>>> s1 = "Ärger"
>>> s2 = "Beere"
>>> locale.strcoll(s1, s2)
1
>>> locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
>>> locale.strcoll(s1, s2)
1

Thanks for your help.

Dierk

Mar 2 '07 #6

Bjoern Schliessmann wrote:
> Hallvard B Furuseth wrote:
>> Di**********@mail.com writes:
> ........
>
> In German, there are some different forms:
>
> - the classic sorting for e.g. word lists: umlauts and plain vowels
> are of same value (like you mentioned): ä = a
>
> - name list sorting for e.g. phone books: umlauts have the same
> value as their substitutes (like Dierk described): ä = ae
>
> There are others, too, but those are the most widely used.

Björn, in one of our projects we sort in JavaScript in several languages
(English, German, the Scandinavian languages, Japanese). From somewhere (I
cannot actually remember where) we got this sort-spelling function for the
Scandinavian languages:

a
 .replace(/\u00C4/g,'A~') //A umlaut
 .replace(/\u00e4/g,'a~') //a umlaut
 .replace(/\u00D6/g,'O~') //O umlaut
 .replace(/\u00f6/g,'o~') //o umlaut
 .replace(/\u00DC/g,'U~') //U umlaut
 .replace(/\u00fc/g,'u~') //u umlaut
 .replace(/\u00C5/g,'A~~') //A ring
 .replace(/\u00e5/g,'a~~'); //a ring

Does this actually make sense?
--
Robin Becker

Mar 2 '07 #7

Robin Becker wrote:
> Björn, in one of our projects we are sorting in javascript in
> several languages English, German, Scandinavian languages,
> Japanese; from somewhere (I cannot actually remember) we got this
> sort spelling function for scandic languages
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> does this actually make sense?

If I'm not mistaken, this would sort all umlauts after the "pure"
vowels. This is, according to
<http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.
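
The effect can be checked with a small Python 3 transcription of the replacements quoted above (a sketch only; `tilde_key` is a name made up here). The tilde has a higher code point than every ASCII letter, so "A~" lands after "Az".

```python
# Same substitutions as the quoted JavaScript, as a translation table.
TILDE_MAP = str.maketrans({
    "Ä": "A~", "ä": "a~", "Ö": "O~", "ö": "o~", "Ü": "U~", "ü": "u~",
    "Å": "A~~", "å": "a~~",
})

def tilde_key(word):
    return word.translate(TILDE_MAP)

print(sorted(["Ärger", "Azur", "Ast"], key=tilde_key))
# → ['Ast', 'Azur', 'Ärger']
```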

If you can't read German, the phrases used in the rules given there in
section "Einsortierungsregeln" (roughly: ordering rules) translate
as follows:

"X und Y sind gleich": "X equals Y"
"X kommt nach Y": "X comes after Y"

Regards&HTH,
Björn

--
BOFH excuse #146:

Communications satellite used by the military for star wars.

Mar 2 '07 #8

Robin Becker wrote:
> Björn, in one of our projects we are sorting in javascript in several
> languages English, German, Scandinavian languages, Japanese; from
> somewhere (I cannot actually remember) we got this sort spelling
> function for scandic languages
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> does this actually make sense?

I think this order is not correct for Finnish, which is one of the
Scandinavian languages. The Finnish alphabet in alphabetical order is:

a-z, å, ä, ö

If I understand correctly your replacements cause the order of the last
3 characters to be

ä, å, ö

which is wrong.
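
A short Python 3 sketch of both orders, for comparison. The first part transcribes only the ä/å/ö entries of the quoted replacements; the second ranks letters by their position in the alphabet as listed above. `finnish_key` is a hypothetical helper and a deliberate simplification, since it ignores any further rules of real Finnish collation.

```python
# Order produced by the tilde replacements ("a~" sorts before "a~~").
TILDE_MAP = str.maketrans({"ä": "a~", "å": "a~~", "ö": "o~"})
tilde_order = sorted(["å", "ä", "ö"], key=lambda s: s.translate(TILDE_MAP))
print(tilde_order)  # → ['ä', 'å', 'ö']

# Order taken directly from the alphabet a-z, å, ä, ö.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    # Characters outside the alphabet are ranked after all letters.
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

print(sorted(["ö", "ä", "z", "å", "a"], key=finnish_key))
# → ['a', 'z', 'å', 'ä', 'ö']
```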

HTH,
Jussi
Mar 4 '07 #9
