Normalize a Polish L

Peter Bengtsson

In Unicode, \u0141 is a capital L with a little stroke through it, as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ASCII L before
in other programs such as Thunderbird.

I also tried the other forms ('NFC', 'NFKC', and 'NFD'), but
none of them helped.

What am I doing wrong?

Oct 15 '07 #1

Thorsten Kampe

* Peter Bengtsson (Mon, 15 Oct 2007 16:33:26 -0000)

In Unicode, \u0141 is a capital L with a little stroke through it, as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ASCII L before
in other programs such as Thunderbird.

The 'L' is actually pronounced like the English "w"...

I also tried the other forms ('NFC', 'NFKC', and 'NFD'), but
none of them helped.

>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}')
'0043 0327'
>>> unicodedata.normalize('NFKD', u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}').encode('ascii','ignore')
'C'
>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER L WITH STROKE}')
''

Oct 15 '07 #2

Bjoern Schliessmann

Thorsten Kampe wrote:

The 'L' is actually pronounced like the English "w"...

'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is ?ukas (short Wh-).

Regards,
Björn

--
BOFH excuse #126:

it has Intel Inside

Oct 15 '07 #3

Rob Wolfe

Peter Bengtsson <pe*****@gmail.com> writes:

In Unicode, \u0141 is a capital L with a little stroke through it, as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ASCII L before
in other programs such as Thunderbird.

I also tried the other forms ('NFC', 'NFKC', and 'NFD'), but
none of them helped.

What am I doing wrong?

I had the same problem, and a little research revealed that it
is caused by the Unicode standard itself. I don't know why,
but characters with a stroke don't have a canonical decomposition.
I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt

and compared two entries:

1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt>

2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>

In the second entry the sixth field holds the canonical decomposition
(0061 0328), but in the first it is empty. I don't know what the
justification is, but there probably is one. ;)
Regards,
Rob
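
The same comparison can be made from Python without reading UnicodeData.txt by hand. This is only a minimal sketch, assuming Python 3 (where encode() returns bytes, not the Python 2 str shown elsewhere in the thread); the helper name describe is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (name, decomposition field, ASCII bytes left after NFKD)."""
    ascii_part = unicodedata.normalize('NFKD', ch).encode('ascii', 'ignore')
    return unicodedata.name(ch), unicodedata.decomposition(ch), ascii_part

# a with ogonek, then capital and small L with stroke
for ch in '\u0105\u0141\u0142':
    print(describe(ch))
```

The a with ogonek reports '0061 0328' and survives as b'a'; both stroked L characters report an empty decomposition, so nothing is left once the non-ASCII code point is ignored.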

Oct 15 '07 #4

Thorsten Kampe

* Bjoern Schliessmann (Mon, 15 Oct 2007 21:51:54 +0200)

Thorsten Kampe wrote:
The 'L' is actually pronounced like the English "w"...

'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

There are lots of possible transcriptions for "LATIN CAPITAL LETTER L
WITH STROKE". Transcription is language dependent so the English and
German transcriptions of Polish names are different.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is ?ukas (short Wh-).

Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.
Thorsten

Oct 15 '07 #5

John Machin

On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:

In Unicode, \u0141 is a capital L with a little stroke through it, as can be
seen in this image: http://static.peterbe.com/lukasz.png

I tried this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ASCII L before
in other programs such as Thunderbird.

I also tried the other forms ('NFC', 'NFKC', and 'NFD'), but
none of them helped.

What am I doing wrong?

The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.
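
One way such a look-up table might be combined with normalization is sketched below. It assumes Python 3; the table EXTRA and the function asciify are hypothetical names, and the replacements chosen are illustrative, not any official transliteration.

```python
import unicodedata

# Hand-built entries for letters that have no Unicode decomposition.
# Illustrative only; extend or change them to suit your own data.
EXTRA = {
    '\u0141': 'L',   # LATIN CAPITAL LETTER L WITH STROKE
    '\u0142': 'l',   # LATIN SMALL LETTER L WITH STROKE
    '\u00d8': 'O',   # LATIN CAPITAL LETTER O WITH STROKE
    '\u00f8': 'o',   # LATIN SMALL LETTER O WITH STROKE
    '\u00df': 'ss',  # LATIN SMALL LETTER SHARP S
}
_TABLE = str.maketrans(EXTRA)

def asciify(text):
    # Apply the hand-built table first, then let NFKD split off ordinary
    # combining accents, which the ASCII encoder then drops.
    decomposed = unicodedata.normalize('NFKD', text.translate(_TABLE))
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(asciify('\u0141ukasz'))            # Lukasz
print(asciify('\u0141\u00f3d\u017a'))    # Lodz
```

Characters that are in neither the table nor decomposable are still silently dropped, so the table has to cover whatever the input is expected to contain.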

Oct 15 '07 #6

Bjoern Schliessmann

Thorsten Kampe wrote:

Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.

I thought KNode was smart enough to switch to UTF-8; obviously, it
isn't.

Regards,
Björn

--
BOFH excuse #121:

halon system went off and killed the operators.

Oct 15 '07 #7

Bjoern Schliessmann

Thorsten Kampe wrote:

The 'L' is actually pronounced like the English "w"...

'Ł' originally comes from "L" (<http://en.wikipedia.org/wiki/Ł>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is Łukas (short Wh-).

Regards,
Björn

--
BOFH excuse #126:

it has Intel Inside

Oct 15 '07 #8

Peter Bengtsson

On Oct 15, 10:57 pm, John Machin <sjmac...@lexicon.net> wrote:

On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:

In Unicode, \u0141 is a capital L with a little stroke through it, as can be
seen in this image: http://static.peterbe.com/lukasz.png

I tried this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ASCII L before
in other programs such as Thunderbird.

I also tried the other forms ('NFC', 'NFKC', and 'NFD'), but
none of them helped.

What am I doing wrong?

The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.

Thank you! That explains it.

Oct 16 '07 #9

Roberto Bonvallet

On Oct 22, 7:50 pm, Mike Orr <sluggos...@gmail.com> wrote:

Well, that gets into official vs unofficial conversions. Does the
Spanish Academy really say 'ü' should be converted to 'u'?

No, but it's the only conversion that makes sense. The only Spanish
letter that doesn't have a standard common conversion by convention
is 'ñ', which is usually ASCIIfied as n, nn, gn, nh, ni, ny, ~n, n~,
or N, with all of them being frequently seen on the Internet.

But whether that should be hardcoded
into a blog URL library is a different matter, and if it is, there should
probably be plugin tables for different preferred standards.

Actually there is a hardcoded conversion, that is dropping all
accented letters altogether, which is IMHO the worst possible
convention. I have a gallery of pictures of Valparaíso and Viña del
Mar whose URL is .../ValparaSoViADelMar. And if I wrote a blog entry
about pingüinos and ñandúes, it would appear probably as .../ping-inos-
and-and-es. Ugly and off-topic :)

--
Roberto Bonvallet
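
The difference between the two conventions can be seen in a small sketch. It assumes Python 3, and both function names are invented for illustration; neither is taken from any particular blog or gallery library.

```python
import re
import unicodedata

def slug_dropping(text):
    # The convention complained about above: anything that is not an
    # ASCII letter or digit, accented letters included, becomes a separator.
    return re.sub(r'[^A-Za-z0-9]+', '-', text).strip('-').lower()

def slug_decomposing(text):
    # Decompose first so that the base letter survives and only the
    # combining accent is thrown away.
    decomposed = unicodedata.normalize('NFKD', text)
    ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
    return re.sub(r'[^A-Za-z0-9]+', '-', ascii_text).strip('-').lower()

title = 'ping\u00fcinos and \u00f1and\u00faes'
print(slug_dropping(title))      # ping-inos-and-and-es
print(slug_decomposing(title))   # pinguinos-and-nandues
```

This only helps for letters that Unicode decomposes; a letter such as the Polish L with stroke would still need a table entry of its own.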

Oct 23 '07 #10
