Sorting strings containing special characters (German 'Umlaute')

DierkErdmann

Hi !

I know that this topic has been discussed in the past, but I could not
find a working solution for my problem: sorting (lists of) strings
containing special characters like "ä", "ü", ... (German umlauts).
Consider the following list:
l = ["Aber", "Beere", "Ärger"]

For sorting, the letter "Ä" is supposed to be treated like "Ae",
so sorting this list should yield
l = ["Aber", "Ärger", "Beere"]

I know about the locale module and its function strcoll(string1,
string2), but currently this does not work correctly for me. Consider

>>> locale.strcoll("Ärger", "Beere")
1

Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
Can someone help?

Btw: I'm using WinXP (German) and

>>> locale.getdefaultlocale()

prints
('de_DE', 'cp1252')

TIA.

Dierk

Mar 2 '07 #1


Robin Becker

We tried this in a JavaScript version and it seems to work; sorry for the long line
and the possibly bad translation to Python.

#coding: cp1252
def _deSpell(a):
    # a is a cp1252 byte string (Python 2); decode it, then spell out the umlauts
    u = a.decode('cp1252')
    return u.replace(u'\u00C4','Ae').replace(u'\u00e4','ae').replace(u'\u00D6','Oe').replace(u'\u00f6','oe').replace(u'\u00DC','Ue').replace(u'\u00fc','ue').replace(u'\u00C5','Ao').replace(u'\u00e5','ao')

def deSort(a,b):
    # cmp-style comparison of the spelled-out forms
    return cmp(_deSpell(a),_deSpell(b))

l = ["Aber", "Ärger", "Beere"]
l.sort(deSort)
print l
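
For readers on Python 3, where cmp and the cmp argument to list.sort no longer exist, the same idea can be sketched as a key function. This is only a sketch: the names PHONE_BOOK_MAP and phone_book_key are made up for illustration, the input is assumed to be str already (so there is no decode step), and the ß mapping and the casefold step are additions that are not in the code above.

```python
# Hypothetical Python 3 sketch of the "ä = ae" (phone-book) ordering.
# Assumption: the input is already str (Unicode), so no decode step is needed.

PHONE_BOOK_MAP = str.maketrans({
    "Ä": "Ae", "ä": "ae",
    "Ö": "Oe", "ö": "oe",
    "Ü": "Ue", "ü": "ue",
    "ß": "ss",  # assumption: not part of the code above
})

def phone_book_key(word):
    """Return a sort key in which umlauts are spelled out (ä -> ae)."""
    # casefold() makes the comparison case-insensitive (also an assumption).
    return word.translate(PHONE_BOOK_MAP).casefold()

words = ["Beere", "Ärger", "Aber"]
result = sorted(words, key=phone_book_key)
print(result)  # ['Aber', 'Ärger', 'Beere']
```

A key function is called once per element, whereas a cmp function is called once per comparison, so the key form does less work on long lists.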

--
Robin Becker

Mar 2 '07 #2

Peter Otten

Di**********@mail.com wrote:

> I know that this topic has been discussed in the past, but I could not
> find a working solution for my problem: sorting (lists of) strings
> containing special characters like "ä", "ü", ... (German umlauts).
> Consider the following list:
> l = ["Aber", "Beere", "Ärger"]
>
> For sorting the letter "Ä" is supposed to be treated like "Ae",

I don't think so:

>>> sorted(["Ast", "Ärger", "Ara"], locale.strcoll)
['Ara', '\xc3\x84rger', 'Ast']

>>> sorted(["Ast", "Aerger", "Ara"])
['Aerger', 'Ara', 'Ast']

> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]
>
> I know about the module locale and its method strcoll(string1,
> string2), but currently this does not work correctly for me. Consider
>
> >>> locale.strcoll("Ärger", "Beere")
> 1
>
> Therefore "Ärger" is sorted after "Beere", which is not correct IMO.
> Can someone help?
>
> Btw: I'm using WinXP (German) and
>
> >>> locale.getdefaultlocale()
>
> prints
> ('de_DE', 'cp1252')

The default locale is not used by default; you have to set it explicitly:

>>> import locale
>>> locale.strcoll("Ärger", "Beere")
1
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> locale.strcoll("Ärger", "Beere")
-1

By the way, you will avoid a lot of "Ärger"* if you use unicode right from
the start.

Finally, for efficient sorting, a key function is preferable to a cmp
function:

>>> sorted(["Ast", "Ärger", "Ara"], key=locale.strxfrm)
['Ara', '\xc3\x84rger', 'Ast']
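
In Python 3, sorted no longer accepts a cmp function, so the key form is the natural one there. A minimal sketch, assuming the helper name sort_for_locale (hypothetical) and that the requested locale is installed; locale names differ between systems (for example 'de_DE.UTF-8' on many Linux systems and 'German_Germany.1252' on Windows, as seen in this thread), so the setlocale call may raise locale.Error.

```python
import locale

def sort_for_locale(words, locale_name=""):
    """Sort words by the collation rules of locale_name.

    Hypothetical helper. "" selects the user's default locale; names such as
    "de_DE.UTF-8" only work where that locale is installed.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)  # may raise locale.Error
        # strxfrm turns each string into a key that compares like strcoll.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore the old setting

# The "C" locale exists everywhere, so it is used for this demonstration.
print(sort_for_locale(["Beere", "Aber", "Ast"], "C"))
```

Restoring the previous setting matters because setlocale changes process-wide state; whether the result matches German expectations depends entirely on the collation tables of the platform.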

Peter

(*) German for "trouble"

Mar 2 '07 #3

Hallvard B Furuseth

Di**********@mail.com writes:

> For sorting the letter "Ä" is supposed to be treated like "Ae",
> therefore sorting this list should yield
> l = ["Aber", "Ärger", "Beere"]

Are you sure? Maybe I'm thinking of another language, but I thought Ä should
be sorted together with A, and after A only if the words are otherwise equal.
E.g. Antwort, Ärger, Beere. A proper strcoll handles that by
translating "Ärger" to e.g. ["Arger", <something like "E\0\0\0\0">],
so that it can sort first by the unaccented name and then by the rest.
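
The two-level idea can be sketched in Python 3 without the locale module. This is an illustration, not what any particular C library's strcoll actually does: two_level_key is a hypothetical name, and stripping accents through Unicode NFD decomposition is an assumption that suits German umlauts but not languages in which å, ä and ö count as separate letters.

```python
import unicodedata

def two_level_key(word):
    """Key that compares base letters first, accents only as a tie-breaker."""
    # NFD splits "Ä" into "A" plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFD", word)
    # Level 1: the word with all combining marks removed ("Ärger" -> "Arger").
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Level 2: the decomposed word itself, so "Arger" sorts before "Ärger".
    return (base.casefold(), decomposed)

words = ["Beere", "Ärger", "Antwort", "Arger"]
result = sorted(words, key=two_level_key)
print(result)  # ['Antwort', 'Arger', 'Ärger', 'Beere']
```

Because tuples compare element by element, the second component is only consulted when the unaccented forms are identical.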

--
Hallvard

Mar 2 '07 #4

Bjoern Schliessmann

Hallvard B Furuseth wrote:

> Di**********@mail.com writes:
>
>> For sorting the letter "Ä" is supposed to be treated like "Ae",
>> therefore sorting this list should yield
>> l = ["Aber", "Ärger", "Beere"]
>
> Are you sure? Maybe I'm thinking of another language, I thought Ä
> should be sorted together with A, but after A if the words are
> otherwise equal.

In German, there are some different forms:

- the classic sorting for e.g. word lists: umlauts and plain vowels
are of same value (like you mentioned): ä = a

- name list sorting for e.g. phone books: umlauts have the same
value as their substitutes (like Dierk described): ä = ae

There are others, too, but those are the most widely used.

Regards,
Björn

--
BOFH excuse #277:

Your Flux Capacitor has gone bad.

Mar 2 '07 #5

DierkErdmann

On 2 Mar., 15:25, Peter Otten <__pete...@web.de> wrote:

> DierkErdm...@mail.com wrote:
>> For sorting the letter "Ä" is supposed to be treated like "Ae",

There are several ways of defining the sorting order. The variant "ä
equals ae" follows DIN 5007 (according to Wikipedia); defining "a
equals ä" complies with DIN 5007-1. Therefore both options are
possible.

> The default locale is not used by default; you have to set it explicitly
>
> >>> import locale
> >>> locale.strcoll("Ärger", "Beere")
> 1
> >>> locale.setlocale(locale.LC_ALL, "")
> 'de_DE.UTF-8'
> >>> locale.strcoll("Ärger", "Beere")
> -1

On my machine

>>> locale.setlocale(locale.LC_ALL, "")

gives
'German_Germany.1252'

But this does not affect the sorting order as it does on your
computer.

>>> locale.strcoll("Ärger", "Beere")

yields 1 in both cases.

Thank you for the hint about using unicode from the beginning; see the
difference:

>>> s1 = unicode("Ärger", "latin-1")
>>> s2 = unicode("Beere", "latin-1")
>>> locale.strcoll(s1, s2)

>>> locale.setlocale(locale.LC_ALL, "")

-1

compared to

>>> s1 = "Ärger"
>>> s2 = "Beere"
>>> locale.strcoll(s1, s2)
1
>>> locale.setlocale(locale.LC_ALL, "")
'German_Germany.1252'
>>> locale.strcoll(s1, s2)
1
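
The thread does not establish why the byte-string version fails on this machine. What can be shown is how differently the same word is stored as text and as bytes, which is one reason why working with unicode from the start removes a variable. A small Python 3 sketch (Python 3 str corresponds to the unicode type used above; the variable names are chosen for illustration):

```python
# Python 3 sketch: the same word as text and as bytes in two encodings.
word = "Ärger"

utf8_bytes = word.encode("utf-8")      # two bytes for "Ä": 0xC3 0x84
cp1252_bytes = word.encode("cp1252")   # one byte for "Ä": 0xC4

print(utf8_bytes)    # b'\xc3\x84rger'
print(cp1252_bytes)  # b'\xc4rger'

# Without collation rules, comparison uses raw byte or code-point values,
# and 0xC3, 0xC4 and U+00C4 are all larger than "B" (0x42).
print(utf8_bytes > b"Beere")    # True
print(cp1252_bytes > b"Beere")  # True
print(word > "Beere")           # True
```

The '\xc3\x84rger' in the output quoted earlier in the thread is the UTF-8 form; a cp1252 console produces the single-byte form instead.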

Thanks for your help.

Dierk

Mar 2 '07 #6

Robin Becker

Bjoern Schliessmann wrote:

> Hallvard B Furuseth wrote:
>> Di**********@mail.com writes:

........

> In German, there are some different forms:
>
> - the classic sorting for e.g. word lists: umlauts and plain vowels
>   are of same value (like you mentioned): ä = a
>
> - name list sorting for e.g. phone books: umlauts have the same
>   value as their substitutes (like Dierk described): ä = ae
>
> There are others, too, but those are the most widely used.

Björn, in one of our projects we are sorting in JavaScript in several languages:
English, German, the Scandinavian languages and Japanese. From somewhere (I cannot
actually remember where) we got this sort-spelling function for the Scandinavian languages:

a
    .replace(/\u00C4/g,'A~') //A umlaut
    .replace(/\u00e4/g,'a~') //a umlaut
    .replace(/\u00D6/g,'O~') //O umlaut
    .replace(/\u00f6/g,'o~') //o umlaut
    .replace(/\u00DC/g,'U~') //U umlaut
    .replace(/\u00fc/g,'u~') //u umlaut
    .replace(/\u00C5/g,'A~~') //A ring
    .replace(/\u00e5/g,'a~~'); //a ring

Does this actually make sense?
--
Robin Becker

Mar 2 '07 #7

Bjoern Schliessmann

Robin Becker wrote:

> Björn, in one of our projects we are sorting in javascript in
> several languages English, German, Scandinavian languages,
> Japanese; from somewhere (I cannot actually remember) we got this
> sort spelling function for scandic languages
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> does this actually make sense?

If I'm not mistaken, this would sort all umlauts after the "pure"
vowels. This is, according to
<http://de.wikipedia.org/wiki/Alphabetische_Sortierung>, used in Austria.
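
The effect of the tilde replacements can be checked with a small Python 3 sketch. TILDE_MAP and tilde_key are hypothetical names; the mapping mirrors the JavaScript quoted above, and the plain code-point comparison of the resulting strings is an assumption about how the JavaScript result is then sorted. Since "~" has a higher code point than any ASCII letter, "A~" sorts after every combination of "A" plus a letter.

```python
# Hypothetical Python 3 port of the JavaScript replacements quoted above.
TILDE_MAP = str.maketrans({
    "\u00C4": "A~", "\u00E4": "a~",    # A/a umlaut
    "\u00D6": "O~", "\u00F6": "o~",    # O/o umlaut
    "\u00DC": "U~", "\u00FC": "u~",    # U/u umlaut
    "\u00C5": "A~~", "\u00E5": "a~~",  # A/a ring
})

def tilde_key(word):
    """Spell accented vowels as vowel + '~' so they sort after the plain vowel."""
    return word.translate(TILDE_MAP)

words = ["Ärger", "Beere", "Azur", "Aber"]
result = sorted(words, key=tilde_key)
print(result)  # ['Aber', 'Azur', 'Ärger', 'Beere']
```

In this ordering "Ärger" lands after "Azur" but before "Beere", so the umlaut behaves like a separate letter placed directly after its base vowel.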

If you can't read German, the rules given on that Wikipedia page, in the
section "Einsortierungsregeln" (roughly: ordering rules), translate
as follows:

"X und Y sind gleich": "X equals Y"
"X kommt nach Y": "X comes after Y"

Regards&HTH,
Björn

--
BOFH excuse #146:

Communications satellite used by the military for star wars.

Mar 2 '07 #8

Jussi Salmela

Robin Becker wrote:

> Björn, in one of our projects we are sorting in javascript in several
> languages English, German, Scandinavian languages, Japanese; from
> somewhere (I cannot actually remember) we got this sort spelling
> function for scandic languages
>
> a
> .replace(/\u00C4/g,'A~') //A umlaut
> .replace(/\u00e4/g,'a~') //a umlaut
> .replace(/\u00D6/g,'O~') //O umlaut
> .replace(/\u00f6/g,'o~') //o umlaut
> .replace(/\u00DC/g,'U~') //U umlaut
> .replace(/\u00fc/g,'u~') //u umlaut
> .replace(/\u00C5/g,'A~~') //A ring
> .replace(/\u00e5/g,'a~~'); //a ring
>
> does this actually make sense?

I think this order is not correct for Finnish, which is one of the
Scandinavian languages. The Finnish alphabet in alphabetical order is:

a-z, å, ä, ö

If I understand correctly your replacements cause the order of the last
3 characters to be

ä, å, ö

which is wrong.
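
Both points can be checked in Python 3. The sketch below first reproduces the order that the tilde replacements give for the three letters, then sorts by an explicit alphabet string instead. FINNISH_ALPHABET, RANK and finnish_key are names made up for illustration, and real Finnish collation may have further rules that this ignores.

```python
# Order produced by the tilde replacements (ä -> a~, å -> a~~, ö -> o~).
tilde = {"ä": "a~", "å": "a~~", "ö": "o~"}
tilde_order = sorted(["å", "ä", "ö"], key=lambda ch: tilde[ch])
print(tilde_order)  # ['ä', 'å', 'ö']

# Alternative: rank each letter by its position in the alphabet.
FINNISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(FINNISH_ALPHABET)}

def finnish_key(word):
    """Map each letter to its alphabet position (lower-cased first)."""
    # Characters outside the alphabet sort last, by code point (an assumption).
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]

finnish_order = sorted(["ö", "ä", "z", "å", "a"], key=finnish_key)
print(finnish_order)  # ['a', 'z', 'å', 'ä', 'ö']
```

An explicit rank table keeps the order independent of code points, at the cost of having to maintain one table per language.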

HTH,
Jussi

Mar 4 '07 #9
