utf - string translation

Hi,

I'm bringing over a thread that's going on on f.c.l.python.

The point was to get rid of french accents from words.

We noticed that len('à') != len('a') and I found the hack below to fix
the "problem" ... yet I do not understand - especially since 'à' is
included in the extended ASCII table, and thus can be stored in one byte.

Any clue ?

hg

# -*- coding: utf-8 -*-
import string

def convert(mot):
    print len(mot)
    print mot[0]
    print '%x' % ord(mot[1])
    table = string.maketrans('àâäéèêëîïôöùüû',
        '\x00a\x00a\x00a\x00e\x00e\x00e\x00e\x00i\x00i\x00o\x00o\x00u\x00u\x00u')
    return mot.translate(table).replace('\x00','')

c = 'àbôö a '
print convert(c)
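The length mismatch most likely comes from the encoding of the source file rather than from the character itself: with a utf-8 coding line, a plain 'à' literal in Python 2 is a byte string, and UTF-8 stores à as two bytes (0xC3 0xA0), while Latin-1 ("extended ASCII") stores it as the single byte 0xE0. A minimal sketch of the difference, written in Python 3 syntax (an assumption; the code above is Python 2), where text and bytes are separate types:

```python
# 'à' as a single Unicode code point (U+00E0)
word = "\u00e0"

# The same character in two encodings
utf8_bytes = word.encode("utf-8")      # two bytes: 0xC3 0xA0
latin1_bytes = word.encode("latin-1")  # one byte: 0xE0

print(len(word))          # 1 (code points)
print(len(utf8_bytes))    # 2 (bytes in UTF-8)
print(len(latin1_bytes))  # 1 (bytes in Latin-1)
```

This is also why the maketrans hack needs two-byte replacements: every accented letter in the table starts with the same lead byte, which gets mapped to '\x00' and then removed.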

Nov 22 '06


John Machin

Fredrik Lundh wrote:

> John Machin wrote:
>
>> 3. ... and to check for missing maps. The OP may be working only with
>> French text, and may not care about Icelandic and German letters, but
>> other readers who stumble on this (and miss past thread(s) on this
>> topic) may like something done with \xde (capital thorn), \xfe (small
>> thorn) and \xdf (sharp s aka Eszett).
>
> I did post links to code that does this to this thread, several days ago...

Ah yes, I missed that -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)

This code
(http://effbot.python-hosting.com/fil...xt/unaccent.py)
looks generally very good, but I'm left wondering why "AE" and "OE" in
the table, not "Ae" and "Oe":
[snip]
0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
0xd0: u"D", # LATIN CAPITAL LETTER ETH
0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]
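For readers who have not seen the "one character mapped to two" technique: a dictionary keyed by code point can be passed to translate(), and the replacement may be a string of any length. A rough sketch in Python 3 syntax; the first four entries are the ones quoted above, the last two (sharp s and small thorn) are my own illustrative additions and are not taken from unaccent.py, and the name transliterate is hypothetical:

```python
# Code point -> replacement string; replacements may be longer than one character.
no_decomposition = {
    0xC6: "AE",  # LATIN CAPITAL LETTER AE (as quoted above)
    0xD0: "D",   # LATIN CAPITAL LETTER ETH (as quoted above)
    0xD8: "OE",  # LATIN CAPITAL LETTER O WITH STROKE (as quoted above)
    0xDE: "Th",  # LATIN CAPITAL LETTER THORN (as quoted above)
    0xDF: "ss",  # LATIN SMALL LETTER SHARP S (assumed mapping)
    0xFE: "th",  # LATIN SMALL LETTER THORN (assumed mapping)
}

def transliterate(text):
    """Replace every mapped code point; leave all other characters alone."""
    return text.translate(no_decomposition)

print(transliterate("Stra\u00dfe"))  # Strasse
print(transliterate("\u00deor"))     # Thor
```

Characters without an entry pass through unchanged, which is why checking for missing maps matters.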

Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
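The reason an explicit entry is needed can be checked with unicodedata: à decomposes into a base letter plus a combining accent, whereas L WITH STROKE has no decomposition at all. A small sketch in Python 3 syntax; strip_marks is a hypothetical helper for illustration and is not necessarily how unaccented_map() works internally:

```python
import unicodedata

def strip_marks(text):
    """Decompose each character (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(repr(unicodedata.decomposition("\u00e0")))  # '0061 0300'
print(repr(unicodedata.decomposition("\u0141")))  # ''
print(strip_marks("\u00e0b\u00f4\u00f6"))         # aboo
print(ascii(strip_marks("\u0141ukasziewicz")))    # '\u0141ukasziewicz' (unchanged)
```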

It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():

LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
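A rough sketch of what such generation might look like, in Python 3 syntax. The function name, the regular expression and the scanned range (0x80 up to 0x250) are my assumptions, and the output would still need the eyeballing mentioned above:

```python
import re
import unicodedata

# Matches e.g. "LATIN CAPITAL LETTER L WITH STROKE"; captures case and base letter.
NAME_PATTERN = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")

def generate_entries(start=0x80, stop=0x250):
    """Build {code point: base letter} for letters that have no decomposition."""
    entries = {}
    for code in range(start, stop):
        ch = chr(code)
        if unicodedata.decomposition(ch):
            continue  # decomposable characters do not need a table entry
        match = NAME_PATTERN.match(unicodedata.name(ch, ""))
        if match:
            case, letter = match.groups()
            entries[code] = letter if case == "CAPITAL" else letter.lower()
    return entries

generated = generate_entries()
print(generated[0x0141])                         # L
print(generated[0x0142])                         # l
print("\u0141ukasziewicz".translate(generated))  # Lukasziewicz
```

Note that this naive rule yields "O" for LATIN CAPITAL LETTER O WITH STROKE, so hand-written special cases have to override the generated ones.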

Cheers,
John

Nov 29 '06 #21

Fredrik Lundh

John Machin wrote:

> Another point: there are many non-latin1 characters that could be
> mapped to ASCII. For example:
> u"\u0141ukasziewicz".translate(unaccented_map())
> doesn't work unless an entry is added to the no-decomposition table:
> 0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
>
> It looks like generating extra entries like that could be done, with
> the aid of unicodedata.name():
>
> LATIN CAPITAL LETTER X WITH blahblah -> "X"
> LATIN SMALL LETTER X WITH blahblah -> "X".lower()
>
> This would require a fair bit of care -- obviously there are special
> cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
> experts is probably required.

see the comments over at

http://effbot.org/zone/unicode-convert.htm

for an extended table, eyeballed by a regional expert (and since he
makes the same point about OE vs Oe as you do, I'll probably have to
change the code ;-)

</F>

Nov 29 '06 #22

John Machin

Fredrik Lundh wrote:

> John Machin wrote:
>
>> Another point: there are many non-latin1 characters that could be
>> mapped to ASCII. For example:
>> u"\u0141ukasziewicz".translate(unaccented_map())
>> doesn't work unless an entry is added to the no-decomposition table:
>> 0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
>>
>> It looks like generating extra entries like that could be done, with
>> the aid of unicodedata.name():
>>
>> LATIN CAPITAL LETTER X WITH blahblah -> "X"
>> LATIN SMALL LETTER X WITH blahblah -> "X".lower()
>>
>> This would require a fair bit of care -- obviously there are special
>> cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
>> experts is probably required.
>
> see the comments over at
>
> http://effbot.org/zone/unicode-convert.htm

Don't rush me, I was getting to that next :-)

> for an extended table, eyeballed by a regional expert (and since he
> makes the same point about OE vs Oe as you do, I'll probably have to
> change the code ;-)

Slightly extended. My point is that there is a large number of LATIN
(CAPITAL|SMALL) LETTER X WITH twiddly-bits that don't have a
decomposition; the table entries could be generated automatically.

As well as regional experts, Google can be handy: googling for Thord,
Thordh, Thordsson and Thordhsson and noting the number of hits for each
tends to indicate that you and I are right about the treatment of
"eth"; Marcin's "dh" might better indicate how it's pronounced, but "d"
is AFAICT the standard transcription.

Cheers,
John

Nov 29 '06 #23
