utf - string translation

hg
Hi,

I'm bringing over a thread that's going on on f.c.l.python.

The point was to get rid of french accents from words.

We noticed that len('à') != len('a') and I found the hack below to fix
the "problem" ... yet I do not understand - especially since 'à' is
included in the extended ASCII table, and thus can be stored in one byte.

Any clue ?

hg

# -*- coding: utf-8 -*-
import string

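def convert(mot):
    print len(mot)
    print mot[0]
    print '%x' % ord(mot[1])
    # Each accented letter is two bytes in UTF-8 here, so each one is mapped
    # to a two-byte pair: '\x00' followed by the plain letter.
    table = string.maketrans('àâäéèêëîïôöùüû', '\x00a\x00a\x00a\x00e\x00e\x00e\x00e\x00i\x00i\x00o\x00o\x00u\x00u\x00u')

    # Drop the '\x00' filler bytes left behind by the translation.
    return mot.translate(table).replace('\x00', '')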
c = 'àbôö a '
print convert(c)
Nov 22 '06 #1
hg wrote:
We noticed that len('à') != len('a')
sounds odd.
>>> len('à') == len('a')
True

are you perhaps using a UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
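
For anyone reading this on Python 3, where plain string literals are already Unicode text, the same difference shows up between str and bytes. A minimal sketch, assuming Python 3 (not the Python 2 used in the rest of this thread); the variable names are only for illustration:

```python
# Python 3 sketch: one character counted as text and as encoded bytes.
text = '\u00e0'                   # 'à', a single Unicode code point
utf8 = text.encode('utf-8')       # two bytes: 0xC3 0xA0
latin1 = text.encode('latin-1')   # one byte: 0xE0

print(len(text))    # 1
print(len(utf8))    # 2
print(len(latin1))  # 1
```

So a length of 2 for 'à' suggests the program is looking at UTF-8 encoded bytes rather than at characters.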

Nov 22 '06 #2
hg
Fredrik Lundh wrote:
hg wrote:
>We noticed that len('à') != len('a')

sounds odd.
>>> len('à') == len('a')
True

are you perhaps using a UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('à')

returns 1 then 2

and string.maketrans(str1, str2) requires that len(str1) == len(str2)

hg

Nov 22 '06 #3
hg
hg wrote:
Fredrik Lundh wrote:
>hg wrote:
>>We noticed that len('à') != len('a')
sounds odd.
>>> len('à') == len('a')
True

are you perhaps using a UTF-8 editor?

to keep your sanity, no matter what editor you're using, I recommend
adding a coding directive to the source file, and using *only* Unicode
string literals for non-ASCII text.

or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('à')

returns 1 then 2

and string.maketrans(str1, str2) requires that len(str1) == len(str2)

hg


PS: I'm running this under Idle
Nov 22 '06 #4
hg <hg@nospam.com> wrote:
>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('à')

returns 1 then 2
And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'à')

then you get:

1
1
Nov 22 '06 #5
hg
Duncan Booth wrote:
hg <hg@nospam.com> wrote:
>>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>
Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('à')

returns 1 then 2

And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'à')

then you get:

1
1
OK,

How would you handle the string.maketrans then ?

hg

Nov 22 '06 #6
hg wrote:
How would you handle the string.maketrans then ?
maketrans works on bytes, not characters. what makes you think that you
can use maketrans if you haven't gotten the slightest idea what's in the
string?

if you want to get rid of accents in a Unicode string, you can use the
approaches described here

http://www.peterbe.com/plog/unicode-to-ascii

or here

http://effbot.org/zone/unicode-convert.htm

which both work on any Unicode string.

</F>
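
A commonly used way to do this kind of accent stripping is to decompose the text with the standard unicodedata module and drop the combining marks. The sketch below assumes Python 3 and is not claimed to match the linked pages line for line; the helper name strip_accents is made up for illustration:

```python
import unicodedata

def strip_accents(text):
    """Return text with combining marks removed (hypothetical helper name)."""
    # NFKD splits 'à' into 'a' followed by a combining grave accent (U+0300).
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep every character that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# 'àbôö a' written with escapes so the source encoding does not matter.
print(strip_accents('\u00e0b\u00f4\u00f6 a'))  # aboo a
```

Because it works on characters instead of bytes, it needs no fixed-length translation table. Letters that have no decomposition (for example 'ø') pass through unchanged.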

Nov 22 '06 #7
hg
Fredrik Lundh wrote:
hg wrote:
>How would you handle the string.maketrans then ?

maketrans works on bytes, not characters. what makes you think that you
can use maketrans if you haven't gotten the slightest idea what's in the
string?

if you want to get rid of accents in a Unicode string, you can use the
approaches described here

http://www.peterbe.com/plog/unicode-to-ascii

or here

http://effbot.org/zone/unicode-convert.htm

which both work on any Unicode string.

</F>
Thanks
Nov 22 '06 #8
hg wrote:
Duncan Booth wrote:
hg <hg@nospam.com> wrote:
>or in other words, put this at the top of your file (where "utf-8" is
whatever your editor/system is using):

# -*- coding: utf-8 -*-

and use

u'<text>'

for all non-ASCII literals.

</F>

Hi,

The problem is that:

# -*- coding: utf-8 -*-
import string
print len('a')
print len('à')

returns 1 then 2
And if you do what was suggested and write:

# -*- coding: utf-8 -*-
import string
print len(u'a')
print len(u'à')

then you get:

1
1
Some general comments:

1. There has been at least one thread on the subject of ripping accents
off Latin1 characters in the last 3 or 4 months. Try Google.

2. About your earlier problem, when len(thing1) != len(thing2):
In that and similar situations, it can be *very* useful to use this
technique:
print repr(thing1), type(thing1)
print repr(thing2), type(thing2)
Go back now and try it out!
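
As a rough illustration of what that check reveals, here is the same idea in Python 3 syntax (an assumption; the thread itself uses Python 2 print statements). The values chosen for thing1 and thing2 are only examples:

```python
# Python 3 sketch of the repr()/type() check on text versus encoded bytes.
thing1 = 'a'
thing2 = '\u00e0'.encode('utf-8')   # 'à' as UTF-8 bytes

print(repr(thing1), type(thing1))
print(repr(thing2), type(thing2))
```

The second line of output shows a bytes object holding two bytes, \xc3 and \xa0, which is why its length differs from that of 'a'.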
OK,

How would you handle the string.maketrans then ?
I suggest that you first read the documentation on the str and unicode
"translate" methods.
You can obtain this quickly at the interactive prompt by doing
help(''.translate)
and
help(u''.translate)
respectively.

Next steps:

Is your *real* data (not the examples you were hard-coding earlier)
encoded (latin1, utf8) in str objects or is it in unicode objects?
After reading previous posts my head is spinning & I'm not going to
guess; you determine it yourself.

[pseudocode -- blend of Pythonic & Knuthian styles]
if latin1: (A) you can use string.maketrans and str.translate
immediately.

elif unicode: (B) either (1) encode to latin1; goto (A) or (2) use
unicode.translate with do-it-yourself mapping

elif utf8: decode to unicode; goto (B)

else: ???
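
The decision tree above can be sketched roughly as follows. This assumes Python 3, where str plays the role of the old unicode type, and it sends every case through text (branch B, option 2) instead of translating Latin-1 bytes directly. ACCENT_MAP and remove_accents are hypothetical names, and the mapping only covers the letters from the first post:

```python
# Do-it-yourself mapping: accented letter -> plain letter.
# The letters are 'àâäéèêëîïôöùüû', written as escapes for safety.
ACCENT_MAP = str.maketrans(
    '\u00e0\u00e2\u00e4\u00e9\u00e8\u00ea\u00eb'
    '\u00ee\u00ef\u00f4\u00f6\u00f9\u00fc\u00fb',
    'aaaeeeeiioouuu',
)

def remove_accents(data, encoding=None):
    """Hypothetical helper: decode bytes if needed, then translate."""
    if isinstance(data, bytes):
        # utf8 / latin1 branches: the caller must say which encoding it is.
        if encoding is None:
            raise ValueError('encoding is required for bytes input')
        data = data.decode(encoding)
    # unicode branch: translate character by character.
    return data.translate(ACCENT_MAP)

print(remove_accents('\u00e0b\u00f4\u00f6 a'.encode('utf-8'), 'utf-8'))  # aboo a
print(remove_accents('\u00e9t\u00e9'.encode('latin-1'), 'latin-1'))      # ete
```

The "else: ???" case stays unresolved here too: if the encoding of the bytes is unknown, the helper raises an error instead of guessing.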

HTH,
John

Nov 22 '06 #9
Dan
Thank you for your answers.

In fact, I'm just getting started with Python.

I was looking to transform a text through elementary cryptographic
processes (Vigenère).
The initial text is in a file, and my system uses UTF-8 by default
(Ubuntu).

Nov 22 '06 #10
