Normalize a polish L

In UTF8, \u0141 is a capital L with a little dash through it as can be
seen in this image:
http://static.peterbe.com/lukasz.png

I tried this:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''

I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.

I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.

What am I doing wrong?

Oct 15 '07 #1
* Peter Bengtsson (Mon, 15 Oct 2007 16:33:26 -0000)
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.

The 'L' is actually pronounced like the English "w"...

> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.

>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}')
'0043 0327'
>>> unicodedata.normalize('NFKD', u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}').encode('ascii','ignore')
'C'
>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER L WITH STROKE}')
''
Oct 15 '07 #2
Thorsten Kampe wrote:
> The 'L' is actually pronounced like the English "w"...

'?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is ?ukas (short Wh-).

Regards,
Bjrn

--
BOFH excuse #126:

it has Intel Inside

Oct 15 '07 #3
Peter Bengtsson <pe*****@gmail.com> writes:
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
>
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?

I had the same problem, and a little research revealed that it is
caused by the Unicode standard itself. I don't know why,
but characters with a stroke don't have a canonical decomposition.
I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt

and compared two positions:

1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt>

2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>

In the second position the sixth field holds the canonical decomposition
(0061 0328), but in the first that field is empty. I don't know what the
justification is, but there probably is one. ;)
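
The same comparison can be made from Python without reading the file by hand. This is only a minimal sketch in Python 3 syntax; the helper name decomposition_field is made up for illustration. unicodedata.decomposition() returns the contents of that decomposition field as a string, or an empty string when the field is empty:

```python
import unicodedata


def decomposition_field(ch):
    """Return the decomposition mapping of one character ('' if it has none).

    Hypothetical helper name; it only wraps unicodedata.decomposition().
    """
    return unicodedata.decomposition(ch)


# U+0105 LATIN SMALL LETTER A WITH OGONEK has a canonical decomposition.
print(repr(decomposition_field("\u0105")))  # '0061 0328'

# U+0141 and U+0142 (L WITH STROKE) have an empty decomposition field.
print(repr(decomposition_field("\u0141")))  # ''
print(repr(decomposition_field("\u0142")))  # ''
```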
Regards,
Rob
Oct 15 '07 #4
* Bjoern Schliessmann (Mon, 15 Oct 2007 21:51:54 +0200)
> Thorsten Kampe wrote:
> > The 'L' is actually pronounced like the English "w"...
>
> '?' originally comes from "L" (<http://en.wikipedia.org/wiki/?>) and
> is AFAIK transcribed so.

There are lots of possible transcriptions for "LATIN CAPITAL LETTER L
WITH STROKE". Transcription is language dependent so the English and
German transcriptions of Polish names are different.

> Also, a friend of mine writes himself "Lukas" (pronounced L-) even
> though in Polish his name is ?ukas (short Wh-).

Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.

Thorsten
Oct 15 '07 #5
On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image: http://static.peterbe.com/lukasz.png
>
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?

The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.

To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.
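
A minimal sketch of such a look-up table, combined with NFKD for the characters that do decompose. It is written in Python 3 syntax (the session quoted above is Python 2), and both the name asciify and the table entries are illustrative assumptions, not an official transliteration; adjust the table to whatever suits your purpose:

```python
import unicodedata

# Hand-built table for letters that have no Unicode decomposition.
# These mappings are illustrative choices, not a standard.
EXTRA_ASCII = {
    "\u0141": "L",   # LATIN CAPITAL LETTER L WITH STROKE
    "\u0142": "l",   # LATIN SMALL LETTER L WITH STROKE
    "\u00d8": "O",   # LATIN CAPITAL LETTER O WITH STROKE
    "\u00f8": "o",   # LATIN SMALL LETTER O WITH STROKE
    "\u00df": "ss",  # LATIN SMALL LETTER SHARP S
}


def asciify(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # 1. Replace characters covered by the hand-built table.
    mapped = "".join(EXTRA_ASCII.get(ch, ch) for ch in text)
    # 2. Decompose what Unicode can decompose (e.g. C WITH CEDILLA -> C + U+0327).
    decomposed = unicodedata.normalize("NFKD", mapped)
    # 3. Drop everything that is still non-ASCII, such as combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(asciify("\u0141ukasz"))  # Lukasz
print(asciify("\u00c7a"))      # Ca
```

Anything that is neither in the table nor decomposable is still silently dropped by the 'ignore' handler, so the table has to cover every such letter you care about.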
Oct 15 '07 #6
Thorsten Kampe wrote:
> Why do you try to use characters in a character set that does not
> contain these characters? That doesn't make any sense.

I thought KNode was smart enough to switch to UTF-8; obviously, it
isn't.

Regards,
Bjrn

--
BOFH excuse #121:

halon system went off and killed the operators.

Oct 15 '07 #7
Thorsten Kampe wrote:
> The 'L' is actually pronounced like the English "w"...

'Ł' originally comes from "L" (<http://en.wikipedia.org/wiki/Ł>) and
is AFAIK transcribed so.

Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is Łukas (short Wh-).

Regards,
Björn

--
BOFH excuse #126:

it has Intel Inside

Oct 15 '07 #8
On Oct 15, 10:57 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:
> > In UTF8, \u0141 is a capital L with a little dash through it as can be
> > seen in this image: http://static.peterbe.com/lukasz.png
> > I tried this:
> > >>> import unicodedata
> > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> > ''
> > I was hoping it would convert it to 'L' because that's what it
> > visually looks like. And I've seen it becoming a normal ascii L before
> > in other programs such as Thunderbird.
> > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> > none of them helped.
> > What am I doing wrong?
>
> The character in question is NOT composed (in the way that Unicode
> means) of an 'L' and a little slash; hence the concepts of
> "normalization" and "decomposition" don't apply.
>
> To "asciify" such text, you need to build a look-up table that suits
> your purpose. unicodedata.decomposition() is (accidentally) useful in
> providing *some* of the entries for such a table.

Thank you! That explains it.

Oct 16 '07 #9
On Oct 22, 7:50 pm, Mike Orr <sluggos...@gmail.com> wrote:
> Well, that gets into official vs unofficial conversions. Does the
> Spanish Academy really say 'ü' should be converted to 'u'?

No, but it's the only conversion that makes sense. The only Spanish
letter that doesn't have a standard common conversion by convention
is 'ñ', which is usually ASCIIfied as n, nn, gn, nh, ni, ny, ~n, n~,
or N, with all of them being frequently seen on the Internet.

> But whether that should be hardcoded
> into a blog URL library is different matter, and if it is there should
> probably be plugin tables for different preferred standards.

Actually there is a hardcoded conversion, that is dropping all
accented letters altogether, which is IMHO the worst possible
convention. I have a gallery of pictures of Valparaíso and Viña del
Mar whose URL is .../ValparaSoViADelMar. And if I wrote a blog entry
about pingüinos and ñandúes, it would probably appear as
.../ping-inos-and-and-es. Ugly and off-topic :)

--
Roberto Bonvallet

Oct 23 '07 #10
