utf - string translation - Page 2

Hi,

I'm bringing over a thread that's going on on f.c.l.python.

The point was to get rid of french accents from words.

We noticed that len('à') != len('a') and I found the hack below to fix
the "problem" ... yet I do not understand - especially since 'à' is
included in the extended ASCII table, and thus can be stored in one byte.

Any clue ?

hg

# -*- coding: utf-8 -*-
import string

def convert(mot):
    print len(mot)
    print mot[0]
    print '%x' % ord(mot[1])
    table = string.maketrans('àâäéèêëîïôöùüû','\x00a\x00a\x00a\x00e\x00e\x00e\x00e\x00i\x00i\x00o\x00o\x00u\x00u\x00u')

    return mot.translate(table).replace('\x00','')

c = 'àbôö a '
print convert(c)
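
For readers on Python 3, here is a minimal sketch of where the length difference comes from, assuming (as the coding line above suggests) that the source file is saved as UTF-8. In Python 2 a plain quoted literal is a byte string, so len() counts encoded bytes rather than characters. The variable names are illustrative only and are not part of the original post.

```python
# Python 3 sketch: str counts code points, bytes counts encoded bytes.
accented = "\u00e0"   # a with grave accent, written as an escape to avoid source-encoding issues
plain = "a"

utf8_bytes = accented.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = accented.encode("latin-1")  # one byte in Latin-1 (ISO 8859-1)

print(len(accented), len(plain))           # lengths in characters
print(len(utf8_bytes), utf8_bytes)         # what a Python 2 byte string holds under UTF-8
print(len(latin1_bytes), latin1_bytes)     # the single-byte form the question has in mind
```

So the letter does fit in one byte in Latin-1, but a UTF-8 source file stores it as two bytes, and that is what the Python 2 len() call is counting.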

Nov 22 '06


John Machin

Dan wrote:

> Thank you for your answers.
>
> In fact, I'm getting started with Python.

That was a good decision. Welcome!

> I was looking to transform a text through elementary cryptographic
> processes (Vigenère).

So why do you want to strip off accents? The history of communication
has several examples of significant difference in meaning caused by
minute differences in punctuation or accents including one of which you
may have heard: a will that could be read (in part) as either "a chacun
d'eux million francs" or "a chacun deux million francs" with the
remainder to a 3rd party.

> The initial text is in a file, and my system is under UTF-8 by default
> (Ubuntu)

Your system being "under UTF-8" does give you some clue, I suppose. Do
find the time to locate some data with accents and do print(repr(data))
as I suggested, to *verify* what you've got.
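
A minimal Python 3 sketch of that check, under the assumption that the data arrives as raw bytes (for example from a file opened in binary mode). The sample word and the variable names are made up for illustration.

```python
# The same word encoded two ways; both can look identical once rendered.
word = "prot\u00e9g\u00e9"

as_utf8 = word.encode("utf-8")
as_latin1 = word.encode("latin-1")

# repr() exposes the actual bytes instead of their on-screen rendering.
print(repr(as_utf8))
print(repr(as_latin1))

# Latin-1 accepts any byte sequence, so a wrong guess fails silently.
misread = as_utf8.decode("latin-1")
print(repr(misread))

# UTF-8 is stricter: these Latin-1 bytes are rejected.
try:
    as_latin1.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

A decode that raises no error is therefore weak evidence on its own; looking at the repr of a few accented samples is the more reliable test.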

Don't guess. Different underlying representations can look the same
when rendered on your screen. Don't rely on what sysadmins tell you.
Peculiar things can happen, e.g.

me: How is your data encoded?
them: XYZese [a language]
me: I'll try again; Are you using encoding A or encoding B?
them: We've heard A mentioned; what's an encoding anyway?
[snip long explanation plus investigation of what locales [plural] had
been used when configuring their workstations and servers]
them: OK, so there's more than one way of representing XYZese on a
computer. That might explain why the government regulatory authority
for our industry is very sad [to put it mildly] about not being able to
read our monthly filings!!!

Cheers,
John

Nov 22 '06 #11

David H Wild

In article <11**********************@k70g2000cwa.googlegroups.com>,
John Machin <sj******@lexicon.net> wrote:

> So why do you want to strip off accents? The history of communication
> has several examples of significant difference in meaning caused by
> minute differences in punctuation or accents including one of which you
> may have heard: a will that could be read (in part) as either "a chacun
> d'eux million francs" or "a chacun deux million francs" with the
> remainder to a 3rd party.

The difference there, though, is a punctuation character, not an accent.

--
David Wild using RISC OS on broadband

Nov 22 '06 #12

John Machin

David H Wild wrote:

> In article <11**********************@k70g2000cwa.googlegroups.com>,
> John Machin <sj******@lexicon.net> wrote:
>> So why do you want to strip off accents? The history of communication
>> has several examples of significant difference in meaning caused by
>> minute differences in punctuation or accents including one of which you
>> may have heard: a will that could be read (in part) as either "a chacun
>> d'eux million francs" or "a chacun deux million francs" with the
>> remainder to a 3rd party.
>
> The difference there, though, is a punctuation character, not an accent.

I did say "differences in punctuation or accents". Yes, the only
example I could recall OTTOMH was a difference in punctuation --
according to legend, a fly-spot IIRC :-)

Nov 22 '06 #13

Klaas

David H Wild wrote:

> In article <11**********************@k70g2000cwa.googlegroups.com>,
> John Machin <sj******@lexicon.net> wrote:
>> So why do you want to strip off accents? The history of communication
>> has several examples of significant difference in meaning caused by
>> minute differences in punctuation or accents including one of which you
>> may have heard: a will that could be read (in part) as either "a chacun
>> d'eux million francs" or "a chacun deux million francs" with the
>> remainder to a 3rd party.
>
> The difference there, though, is a punctuation character, not an accent.

It's not too hard to imagine an accentual difference, eg:

Le soldat protège avec le fusil --the soldier protects with the gun
Le soldat protégé avec le fusil --the soldier who is protected by
the gun (perhaps a cannon)

Contrived example, I realize, but there are scads of such instances.
(Caveat: my French is also very rusty).

-Mike

Nov 22 '06 #14

Fredrik Lundh

Klaas wrote:

> It's not too hard to imagine an accentual difference, eg:

especially in languages where certain combinations really are distinct
letters, not just letters with accents or silly marks.

I have a Swedish children's book somewhere, in which some characters are
harassed by a big ugly monster who carries a sign around his neck that
says "Monster".

the protagonist ends up adding two dots to that sign, turning it into
"Mönster" (meaning "model", in the "model citizen" sense), and all ends
well.

just imagine that story in reverse.

</F>

Nov 23 '06 #15

Eric Brunel

On Wed, 22 Nov 2006 22:59:01 +0100, John Machin <sj******@lexicon.net>
wrote:
[snip]

> So why do you want to strip off accents? The history of communication
> has several examples of significant difference in meaning caused by
> minute differences in punctuation or accents including one of which you
> may have heard: a will that could be read (in part) as either "a chacun
> d'eux million francs" or "a chacun deux million francs" with the
> remainder to a 3rd party.

It may not be to store or even use the actual text. I stumbled on a
problem like this some time ago: I had some code building an index for a
document and wanted the entries starting with "e", "é", "è" or "ê" to be
in the same section...
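
One way such an index could be grouped, as a Python 3 sketch rather than the code actually used: it assumes the section is chosen from the first letter after Unicode decomposition (NFD), and the function names `section_letter` and `build_index` are hypothetical.

```python
import unicodedata
from collections import defaultdict

def section_letter(entry):
    """Return the unaccented, upper-cased first letter of a non-empty entry."""
    # NFD splits an accented letter into its base letter plus combining marks,
    # so the first code point of the decomposed entry is the base letter.
    base = unicodedata.normalize("NFD", entry)[0]
    return base.upper()

def build_index(entries):
    """Group entries by section letter; the entries themselves keep their accents."""
    sections = defaultdict(list)
    for entry in entries:
        sections[section_letter(entry)].append(entry)
    return dict(sections)

# Sample entries starting with e, e-acute, e-grave and e-circumflex, plus one control.
entries = ["eau", "\u00e9t\u00e9", "\u00e8re", "\u00eatre", "arbre"]
print(build_index(entries))
```

Only the grouping key is folded here; the displayed text is untouched, which sidesteps the loss of meaning discussed earlier in the thread.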
--
python -c "print ''.join([chr(154 - ord(c)) for c in
'U(17zX(%,5.zmz5(17l8(%,5.Z*(93-965$l7+-'])"

Nov 23 '06 #16

Dan

On 22 nov, 22:59, "John Machin" <sjmac...@lexicon.net> wrote:

>> processes (Vigenère)
> So why do you want to strip off accents? The history of communication
> has several examples of significant difference in meaning caused by
> minute differences in punctuation or accents including one of which you
> may have heard: a will that could be read (in part) as either "a chacun
> d'eux million francs" or "a chacun deux million francs" with the
> remainder to a 3rd party.

of course.
My purpose is not to do anything realistic from a cryptographic point of view.
It's for learning the rudiments of programming.
In fact, character encoding is a kind of cryptography, I mean: sometimes
friends can't read an email because of the characters used...

I wanted to strip off accents because I use the frequencies of the
characters. If I only have 26 characters, it's easier to analyse (the
text can be shorter, for example).
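
A short Python 3 sketch of that preprocessing step, assuming the goal is simply to count the 26 unaccented letters. The helper names are hypothetical, and anything that does not decompose to a plain a-z letter (digits, punctuation, ligatures such as the French oe) is dropped rather than mapped.

```python
import unicodedata
from collections import Counter

def fold_to_ascii_letters(text):
    """Lower-case the text, split off accents, and keep only the letters a-z."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed.lower() if "a" <= ch <= "z")

def letter_frequencies(text):
    """Count how often each of the 26 letters occurs after folding."""
    return Counter(fold_to_ascii_letters(text))

sample = "Vigen\u00e8re, \u00e0 b\u00f4\u00f6"
print(fold_to_ascii_letters(sample))
print(letter_frequencies(sample).most_common(3))
```

Because the combining accents fall outside a-z, the same filter that removes spaces and punctuation also removes them.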

Nov 26 '06 #17

Frederic Rentsch

Dan wrote:

> On 22 nov, 22:59, "John Machin" <sjmac...@lexicon.net> wrote:
>
>>> processes (Vigenère)
>>
>> So why do you want to strip off accents? The history of communication
>> has several examples of significant difference in meaning caused by
>> minute differences in punctuation or accents including one of which you
>> may have heard: a will that could be read (in part) as either "a chacun
>> d'eux million francs" or "a chacun deux million francs" with the
>> remainder to a 3rd party.
>
> of course.
> My purpose is not to do anything realistic from a cryptographic point of view.
> It's for learning the rudiments of programming.
> In fact, character encoding is a kind of cryptography, I mean: sometimes
> friends can't read an email because of the characters used...
>
> I wanted to strip off accents because I use the frequencies of the
> characters. If I only have 26 characters, it's easier to analyse (the
> text can be shorter, for example).

Try this:

import string

from_characters = (
    '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf'
    '\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd'
    '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff'
    '\xe7\xe8\xe9\xea\xeb'
)
to_characters = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
translation_table = string.maketrans(from_characters, to_characters)
translated_string = string.translate(original_string, translation_table)
Frederic
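
For anyone trying this on Python 3, where string.maketrans no longer exists, a rough equivalent is sketched below. It assumes the byte values above are meant as Latin-1 and decodes them to text first; `strip_accents` is a hypothetical helper name, and the table is copied from the post without changes.

```python
# The byte values from the post, interpreted as Latin-1 (an assumption).
FROM_BYTES = (
    b"\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf"
    b"\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd"
    b"\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef"
    b"\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff"
    b"\xe7\xe8\xe9\xea\xeb"
)
TO_CHARS = (
    "AAAAAAACEEEEIIII"
    "DNOOOOOOUUUUY"
    "aaaaaaaiiii"
    "onoooooouuuuyy"
    "ceeee"
)

FROM_CHARS = FROM_BYTES.decode("latin-1")
assert len(FROM_CHARS) == len(TO_CHARS)

# In Python 3 the table maps code points, so it applies to str, not bytes.
TABLE = str.maketrans(FROM_CHARS, TO_CHARS)

def strip_accents(text):
    """Replace each mapped accented letter with its unaccented counterpart."""
    return text.translate(TABLE)

print(strip_accents("Vigen\u00e8re \u00e0 b\u00f4\u00f6"))
```

Since the table works on decoded text, the input has to be decoded with the right codec before translating, whatever encoding the file actually uses.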

Nov 29 '06 #18

John Machin

Frederic Rentsch wrote:

> Try this:
>
> from_characters = (
>     '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf'
>     '\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd'
>     '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff'
>     '\xe7\xe8\xe9\xea\xeb'
> )
> to_characters = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
> translation_table = string.maketrans(from_characters, to_characters)
> translated_string = string.translate(original_string, translation_table)

A few observations on the above:

1. This assumes that "original_string" is a str object, and the text is
encoded in latin1 or similar (e.g. cp1252).

2. Presentation of the map could be improved greatly, along the lines
of:

import pprint
import unicodedata
fromc = \
[snip]
toc = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
assert len(fromc) == len(toc)
tups = list(zip(unicode(fromc, 'latin1'), toc))
tups.sort()
tupsu = [(x[1], x[0], unicodedata.name(x[0], '** no name **')) for x in tups]
pprint.pprint(tupsu)

which produces:

[('A', u'\xc0', 'LATIN CAPITAL LETTER A WITH GRAVE'),
('A', u'\xc1', 'LATIN CAPITAL LETTER A WITH ACUTE'),
[snip]
('D', u'\xd0', 'LATIN CAPITAL LETTER ETH'),
[snip]
('Y', u'\xdd', 'LATIN CAPITAL LETTER Y WITH ACUTE'),
('a', u'\xe0', 'LATIN SMALL LETTER A WITH GRAVE'),
[snip]
('o', u'\xf0', 'LATIN SMALL LETTER ETH'),
[snip]
('y', u'\xfd', 'LATIN SMALL LETTER Y WITH ACUTE'),
('y', u'\xff', 'LATIN SMALL LETTER Y WITH DIAERESIS')]

This makes it a lot easier to see what is going on, and check for
weirdness, like the inconsistent treatment of \xd0 and \xf0.

3. ... and to check for missing maps. The OP may be working only with
French text, and may not care about Icelandic and German letters, but
other readers who stumble on this (and miss past thread(s) on this
topic) may like something done with \xde (capital thorn), \xfe (small
thorn) and \xdf (sharp s aka Eszett).
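
As a rough Python 3 illustration of points 2 and 3 together, the sketch below lists a small mapping with Unicode names and covers the three letters mentioned. The replacement values ("Th", "th", "ss") are assumptions chosen for the example, not something the thread prescribes, and `describe` is a hypothetical helper.

```python
import unicodedata

# Hypothetical replacements for the letters missing from the table above.
EXTRA_MAP = {
    "\u00de": "Th",  # capital thorn
    "\u00fe": "th",  # small thorn
    "\u00df": "ss",  # sharp s (Eszett)
}

def describe(mapping):
    """Return (replacement, character, Unicode name) rows, sorted by character."""
    return [
        (dst, src, unicodedata.name(src, "** no name **"))
        for src, dst in sorted(mapping.items())
    ]

# A dict passed to str.maketrans may map one character to several.
EXTRA_TABLE = str.maketrans(EXTRA_MAP)

for row in describe(EXTRA_MAP):
    print(row)
print("Stra\u00dfe".translate(EXTRA_TABLE))
```

The dict form is what allows one-to-many replacements; the two-string form of maketrans can only map one character to exactly one character.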

Cheers,
John

Nov 29 '06 #19

Fredrik Lundh

John Machin wrote:

> 3. ... and to check for missing maps. The OP may be working only with
> French text, and may not care about Icelandic and German letters, but
> other readers who stumble on this (and miss past thread(s) on this
> topic) may like something done with \xde (capital thorn), \xfe (small
> thorn) and \xdf (sharp s aka Eszett).

I did post links to code that does this to this thread, several days ago...

</F>

Nov 29 '06 #20
