Sorting strings containing special characters (german 'Umlaute')

Hi !

I know that this topic has been discussed in the past, but I could not
find a working solution for my problem: sorting (lists of) strings
containing special characters like "ä", "ü", ... (German umlauts).
Consider the following list:
l = ["Aber", "Beere", "Ärger"]

For sorting, the letter "Ä" is supposed to be treated like "Ae",
therefore sorting this list should yield
l = ["Aber", "Ärger", "Beere"]

I know about the locale module and its function strcoll(string1,
string2), but currently this does not work correctly for me. Consider
>>> locale.strcoll("Ärger", "Beere")
1

Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
Can someone help?

Btw: I'm using WinXP (German) and
>>> locale.getdefaultlocale()
prints
('de_DE', 'cp1252')

TIA.

Dierk

Mar 2 '07 #1
We tried this in a JavaScript version and it seems to work. Sorry for the long lines
and a possibly bad translation to Python:
#coding: cp1252
def _deSpell(a):
    # Decode the cp1252 byte string, then spell out each umlaut
    # as its two-letter substitute so plain comparison works.
    u = a.decode('cp1252')
    return (u.replace(u'\u00C4', 'Ae').replace(u'\u00e4', 'ae')
             .replace(u'\u00D6', 'Oe').replace(u'\u00f6', 'oe')
             .replace(u'\u00DC', 'Ue').replace(u'\u00fc', 'ue')
             .replace(u'\u00C5', 'Ao').replace(u'\u00e5', 'ao'))

def deSort(a, b):
    # cmp-style comparison on the spelled-out forms
    return cmp(_deSpell(a), _deSpell(b))

l = ["Aber", "Ärger", "Beere"]
l.sort(deSort)
print l

--
Robin Becker

Mar 2 '07 #2
Di**********@mail.com wrote:
> I know that this topic has been discussed in the past, but I could not
> find a working solution for my problem: sorting (lists of) strings
> containing special characters like "ä", "ü", ... (German umlauts).
> Consider the following list:
> l = ["Aber", "Beere", "Ärger"]
>
> For sorting, the letter "Ä" is supposed to be treated like "Ae",

I don't think so:
>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']
>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']

> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
>
> I know about the locale module and its function strcoll(string1,
> string2), but currently this does not work correctly for me. Consider
> >>> locale.strcoll("Ärger", "Beere")
> 1
>
> Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
> Can someone help?
>
> Btw: I'm using WinXP (German) and
> >>> locale.getdefaultlocale()
> prints
> ('de_DE', 'cp1252')

The default locale is not used by default; you have to set it explicitly:
>>> import locale
>>> locale.strcoll("Ärger", "Beere")
1
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> locale.strcoll("Ärger", "Beere")
-1

By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.

Finally, for efficient sorting, a key function is preferable over a cmp
function:
>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']

Peter

(*) German for "trouble"
Mar 2 '07 #3
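
For readers on Python 3, the key-function approach from the post above might look roughly like this. It is only a sketch: the helper name sort_with_locale is made up for illustration, and the resulting order depends entirely on which locales the operating system provides, so no particular output is promised.

```python
import locale

def sort_with_locale(words, locale_name=""):
    """Sort words with the collation rules of the requested locale.

    locale_name="" asks for the user's default locale. If the platform
    cannot provide the requested locale, the current one is kept.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            pass  # locale not available; fall back to the current one
        # strxfrm turns each string into a key that compares like strcoll
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

print(sort_with_locale(["Ast", "Ärger", "Ara"]))
```

Restoring the previous setting matters because setlocale changes process-wide state, which can surprise other code running in the same program.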
Di**********@mail.com writes:
> For sorting, the letter "Ä" is supposed to be treated like "Ae",
> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]

Are you sure? Maybe I'm thinking of another language, but I thought Ä should
be sorted together with A, but after A if the words are otherwise equal.
E.g. Antwort, Ärger, Beere. A proper strcoll handles that by
translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
then it can sort first by the unaccented name and then by the rest.

--
Hallvard
Mar 2 '07 #4
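
The two-level idea Hallvard describes (compare the unaccented form first, use the accents only as a tie-breaker) can be imitated in Python 3 with a tuple key. This is a rough sketch and not what a real strcoll does internally; the function name two_level_key and the word "Arger" are invented for the example.

```python
import unicodedata

def two_level_key(word):
    """Primary key: the word without accents. Secondary key: the word itself."""
    # NFD splits "Ä" into "A" plus a combining diaeresis, which is then dropped
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)

words = ["Beere", "Ärger", "Arger", "Antwort"]
print(sorted(words, key=two_level_key))
# → ['Antwort', 'Arger', 'Ärger', 'Beere']
```

"Arger" and "Ärger" share the primary key "arger", so the second tuple element decides and the plain "A" comes first.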
Hallvard B Furuseth wrote:
> Di**********@mail.com writes:
>> For sorting, the letter "Ä" is supposed to be treated like "Ae",
>> therefore sorting this list should yield
>> l = ["Aber", "Ärger", "Beere"]
>
> Are you sure? Maybe I'm thinking of another language, but I thought Ä
> should be sorted together with A, but after A if the words are
> otherwise equal.

In German, there are some different forms:

- the classic sorting for e.g. word lists: umlauts and plain vowels
are of same value (like you mentioned): ä = a

- name list sorting for e.g. phone books: umlauts have the same
value as their substitutes (like Dierk described): ä = ae

There are others, too, but those are the most widely used.

Regards,
Björn

--
BOFH excuse #277:

Your Flux Capacitor has gone bad.

Mar 2 '07 #5
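
The two conventions Björn lists can be sketched in Python 3 as two key functions. The names word_list_key and phone_book_key are made up for this illustration, and only the three umlauts are handled; anything else (ß, other accents) is left as it is.

```python
# Translation tables for the two conventions described above
WORD_LIST = str.maketrans({"ä": "a", "ö": "o", "ü": "u",
                           "Ä": "A", "Ö": "O", "Ü": "U"})
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"})

def word_list_key(word):
    """Umlauts count as their plain vowels (ä = a)."""
    return word.translate(WORD_LIST).casefold()

def phone_book_key(word):
    """Umlauts count as their two-letter substitutes (ä = ae)."""
    return word.translate(PHONE_BOOK).casefold()

words = ["Ast", "Ärger", "Ara"]
print(sorted(words, key=word_list_key))   # → ['Ara', 'Ärger', 'Ast']
print(sorted(words, key=phone_book_key))  # → ['Ärger', 'Ara', 'Ast']
```

With "ä = a" the word "Ärger" compares as "arger" and lands between "Ara" and "Ast"; with "ä = ae" it compares as "aerger" and moves to the front.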
On 2 Mar., 15:25, Peter Otten <__pete...@web.de> wrote:
> DierkErdm...@mail.com wrote:
>> For sorting, the letter "Ä" is supposed to be treated like "Ae",

There are several ways of defining the sorting order. The variant "ä
equals ae" follows DIN 5007 (according to Wikipedia); defining "a
equals ä" complies with DIN 5007-1. Therefore both options are
possible.

> The default locale is not used by default; you have to set it explicitly
> >>> import locale
> >>> locale.strcoll("Ärger", "Beere")
> 1
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> locale.strcoll("Ärger", "Beere")
> -1

On my machine
>>> locale.setlocale(locale.LC_ALL, "")
gives
'German_Germany.1252'

But this does not affect the sorting order as it does on your
computer.
>>> locale.strcoll("Ärger", "Beere")
yields 1 in both cases.

Thank you for your hint about using unicode from the beginning; see the
difference:
>>> s1 = unicode("Ärger", "latin-1")
>>> s2 = unicode("Beere", "latin-1")
>>> locale.strcoll(s1, s2)
1
>>> locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
>>> locale.strcoll(s1, s2)
-1

compared to
>>> s1 = "Ärger"
>>> s2 = "Beere"
>>> locale.strcoll(s1, s2)
1
>>> locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
>>> locale.strcoll(s1, s2)
1

Thanks for your help.

Dierk


Mar 2 '07 #6
Bjoern Schliessmann wrote:
> Hallvard B Furuseth wrote:
>> Di**********@mail.com writes:
> ........
>
> In German, there are some different forms:
>
> - the classic sorting for e.g. word lists: umlauts and plain vowels
> are of same value (like you mentioned): ä = a
>
> - name list sorting for e.g. phone books: umlauts have the same
> value as their substitutes (like Dierk described): ä = ae
>
> There are others, too, but those are the most widely used.

Björn, in one of our projects we are sorting in JavaScript in several languages:
English, German, Scandinavian languages, Japanese. From somewhere (I cannot
actually remember where) we got this sort spelling function for Scandinavian languages:

a
.replace(/\u00C4/g,'A~') //A umlaut
.replace(/\u00e4/g,'a~') //a umlaut
.replace(/\u00D6/g,'O~') //O umlaut
.replace(/\u00f6/g,'o~') //o umlaut
.replace(/\u00DC/g,'U~') //U umlaut
.replace(/\u00fc/g,'u~') //u umlaut
.replace(/\u00C5/g,'A~~') //A ring
.replace(/\u00e5/g,'a~~'); //a ring

Does this actually make sense?
--
Robin Becker

Mar 2 '07 #7
Robin Becker wrote:
> Björn, in one of our projects we are sorting in JavaScript in
> several languages: English, German, Scandinavian languages,
> Japanese. From somewhere (I cannot actually remember where) we got this
> sort spelling function for Scandinavian languages:
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> Does this actually make sense?

If I'm not mistaken, this would sort all umlauts after the "pure"
vowels. This is, according to
<http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.

If you can't read German, the rules given there in the
section "Einsortierungsregeln" (roughly: ordering rules) translate
as follows:

"X und Y sind gleich": "X equals Y"
"X kommt nach Y": "X comes after Y"

Regards&HTH,
Björn

--
BOFH excuse #146:

Communications satellite used by the military for star wars.

Mar 2 '07 #8
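
To see why the tilde replacement puts umlauts after the "pure" vowels, here is a rough Python 3 translation of Robin's JavaScript. The name tilde_key is made up for this sketch. It relies on "~" having a higher code point than every ASCII letter, so "A~" compares greater than "Az".

```python
# Same substitutions as the JavaScript above, expressed as a translation table
TILDE = str.maketrans({"Ä": "A~", "ä": "a~", "Ö": "O~", "ö": "o~",
                       "Ü": "U~", "ü": "u~", "Å": "A~~", "å": "a~~"})

def tilde_key(word):
    """Replace each special letter by its base vowel plus one or two tildes."""
    return word.translate(TILDE)

print(sorted(["Ärger", "Azur", "Beere", "Aber"], key=tilde_key))
# → ['Aber', 'Azur', 'Ärger', 'Beere']
```

The comparison is still case-sensitive, so this only behaves as intended when the words use consistent capitalisation.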
Robin Becker wrote:
> Björn, in one of our projects we are sorting in JavaScript in several
> languages: English, German, Scandinavian languages, Japanese. From
> somewhere (I cannot actually remember where) we got this sort spelling
> function for Scandinavian languages:
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> Does this actually make sense?

I think this order is not correct for Finnish, which is one of the
Scandinavian languages. The Finnish alphabet in alphabetical order is:

a-z, å, ä, ö

If I understand correctly, your replacements cause the order of the last
3 characters to be

ä, å, ö

which is wrong.

HTH,
Jussi
Mar 4 '07 #9
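
One way to get the order Jussi gives (a-z, å, ä, ö) is to rank each letter by its position in an explicit alphabet string. The following Python 3 sketch assumes exactly that alphabet and ignores any finer points of real Finnish collation; the names FINNISH_ALPHABET, RANK and finnish_key, as well as the sample words, are chosen for illustration only.

```python
# Alphabet order as stated in the post above: a-z, then å, ä, ö
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    """Map each letter to its alphabet position; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]

print(sorted(["öljy", "äiti", "åbo", "zeppelin"], key=finnish_key))
# → ['zeppelin', 'åbo', 'äiti', 'öljy']
```

Because the order lives in a single string, adapting the sketch to another alphabet only means changing that string.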
