
Re: convert unicode characters to visibly similar ascii characters

Peter Bulychev wrote:
> Hello.
>
> I want to convert Unicode characters into ASCII ones.
> The method ".encode('ASCII')" can convert only those Unicode
> characters that fit into the 0..127 range.
>
> But there are still lots of characters beyond this range, which can be
> manually converted to some visibly similar ASCII characters. For
> instance, there are several quotation marks in Unicode, which can be
> converted into the ASCII quotation mark.

Please be more specific. There is no general solution. Unicode can
handle Latin, Cyrillic (Russian), Chinese, Japanese and Arabic characters
in the same string. There are thousands of possible non-ASCII characters
and many of them are not similar to any ASCII character.

If you only want this to work for a subset, please define that subset.

Laszlo

Jul 1 '08 #1
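
To make the limitation from the original question concrete, here is a minimal sketch in Python 3 (the thread predates it, so `str` here plays the role of the old `unicode` type). The sample string is made up for illustration; the point is only that the built-in error handlers avoid the exception but never pick a look-alike.

```python
# Hypothetical sample: curly double quotes and a non-breaking hyphen around ASCII words.
text = (
    "\N{LEFT DOUBLE QUOTATION MARK}well"
    "\N{NON-BREAKING HYPHEN}known"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)

# Strict encoding (the default) rejects any code point above 127.
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "replace" substitutes a question mark; "ignore" drops the character entirely.
replaced = text.encode("ascii", errors="replace").decode("ascii")
ignored = text.encode("ascii", errors="ignore").decode("ascii")

print(strict_ok)
print(replaced)
print(ignored)
```

Neither handler yields `"well-known"` with straight quotes, which is why the replies below turn to hand-made tables.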
Jim
Peter Bulychev wrote:
> I want to convert Unicode characters into ASCII ones.

You have to make some arbitrary choices of what to translate. Based
on some materials on effbot's site, and a recipe, I made
ftp://alan.smcvt.edu/hefferon/unicode2ascii.py
which has at least some of what you are looking for.
$ grep HYPHEN unicode2ascii.py
u'\N{SOFT HYPHEN}':u'-',
u'\N{HYPHEN}':u'-',
u'\N{NON-BREAKING HYPHEN}':u'-',
u'\N{SOFT HYPHEN}': '-',
No doubt I have some terrible gaffes and some things missing.
Corrections appreciated.

Jim
Jul 2 '08 #2
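
A hand-made table of this kind can be applied with `str.translate`. The following is only a sketch in Python 3, not the contents of Jim's `unicode2ascii.py`: the name `LOOKALIKES`, the helper `asciify`, and the particular entries are assumptions chosen for illustration.

```python
# Hypothetical excerpt of a hand-made table; which entries to include is an arbitrary choice.
LOOKALIKES = {
    "\N{LEFT SINGLE QUOTATION MARK}": "'",
    "\N{RIGHT SINGLE QUOTATION MARK}": "'",
    "\N{LEFT DOUBLE QUOTATION MARK}": '"',
    "\N{RIGHT DOUBLE QUOTATION MARK}": '"',
    "\N{HYPHEN}": "-",
    "\N{NON-BREAKING HYPHEN}": "-",
}

# str.translate wants a mapping keyed by code point; str.maketrans builds one
# from a dict whose keys are single characters.
TABLE = str.maketrans(LOOKALIKES)


def asciify(text):
    """Replace the characters listed in LOOKALIKES and leave everything else alone."""
    return text.translate(TABLE)


sample = "\N{LEFT DOUBLE QUOTATION MARK}co\N{HYPHEN}op\N{RIGHT DOUBLE QUOTATION MARK}"
result = asciify(sample)
print(result)
```

Characters that are not in the table pass through unchanged, so a final `.encode('ascii')` can still fail. Whether to raise, drop, or replace those leftovers is a separate decision.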
On Jul 2, 9:55 am, Jim <jim.heffe...@gmail.com> wrote:
> Peter Bulychev wrote:
>> I want to convert unicode character into ascii one.
>
> You have to make some arbitrary choices of what to translate. Based
> on some materials on effbot's site, and a recipe, I made
> ftp://alan.smcvt.edu/hefferon/unicode2ascii.py
> which has at least some of what you are looking for.
> $ grep HYPHEN unicode2ascii.py
> u'\N{SOFT HYPHEN}':u'-',
> u'\N{HYPHEN}':u'-',
> u'\N{NON-BREAKING HYPHEN}':u'-',
> u'\N{SOFT HYPHEN}': '-',
> No doubt I have some terrible gaffes and some things missing.
> Corrections appreciated.

Comments on the above grep output:
1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'
2. The idea of a soft hyphen is as a hint to a hyphenator about where
to insert a hyphen if one is necessary and the hyphenator is suspected
of acting cluelessly without the hint. IMHO, asciification should
substitute u'', not u'-'.
3. Read PEP 8. s/:/: /

Cheers,
John
Jul 2 '08 #4
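
John's first two points can be checked with a short sketch in Python 3. The table and the sample word are invented for the demonstration and are not taken from Jim's file.

```python
SOFT_HYPHEN = "\N{SOFT HYPHEN}"

# A dict literal with a repeated key raises no error: the last value silently wins.
table = {
    "\N{SOFT HYPHEN}": "-",
    "\N{HYPHEN}": "-",
    "\N{SOFT HYPHEN}": "",
}
print(len(table))

# A word carrying soft hyphens as invisible break hints.
word = "hy" + SOFT_HYPHEN + "phen" + SOFT_HYPHEN + "ation"

# Mapping the hint to the empty string keeps the word intact ...
dropped = word.translate(str.maketrans({SOFT_HYPHEN: ""}))
# ... while mapping it to "-" makes every hint visible.
dashed = word.translate(str.maketrans({SOFT_HYPHEN: "-"}))

print(dropped)
print(dashed)
```

With the empty-string mapping the word comes out as `hyphenation`; with `'-'` it becomes `hy-phen-ation`, which is usually not what the author of the text meant.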
Jim
On Jul 1, 8:29 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Jul 2, 9:55 am, Jim <jim.heffe...@gmail.com> wrote:
>
> Comments on the above grep output:
> 1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'

Hmph. I'll correct that. Thanks.

> 2. The idea of a soft hyphen is as a hint to a hyphenator about where
> to insert a hyphen if one is necessary and the hyphenator is suspected
> of acting cluelessly without the hint. IMHO, asciification should
> substitute u'', not u'-'.

Thanks also here. I'll think about it.

> 3. Read PEP 8. s/:/: /

I don't like the spacing in 8, personally.

Thanks,
Jim
Jul 2 '08 #5
Jim <ji**********@gmail.com> writes:
> I don't like the spacing in [PEP 8], personally.

Nevertheless, your Python code will take much less effort for others
(and for you, in future) to read if it conforms to PEP 8.

Writing all your Python code to conform with that standard is the
simplest step you can take to ensure that your code won't cause other
Python programmers undue reading effort.

--
\ “There's no excuse to be bored. Sad, yes. Angry, yes. |
`\ Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
_o__) ever.” —Viggo Mortensen |
Ben Finney
Jul 2 '08 #7
Jim
On Jul 1, 8:42 pm, Jim <jim.heffe...@gmail.com> wrote:
> On Jul 1, 8:29 pm, John Machin <sjmac...@lexicon.net> wrote:
>> Comments on the above grep output:
>> 1. You have SOFT HYPHEN twice, mapping it to u'-' and '-'
>
> Hmph. I'll correct that. Thanks.

Well, maybe not. I forgot that I got the by-hand conversions from
three different sources and that's why that character appears in two
different places. (I thought that listing all cases for each source
was less confusing. Arguable, for sure.)

>> 2. The idea of a soft hyphen is as a hint to a hyphenator about where
>> to insert a hyphen if one is necessary and the hyphenator is suspected
>> of acting cluelessly without the hint. IMHO, asciification should
>> substitute u'', not u'-'.
>
> Thanks also here. I'll think about it.

Googling "soft hyphen" showed me that the question is not perfectly
clear -- some people seem to have very elaborate opinions on the
topic -- but I've gone with your suggestion. Thank you.

Again, I'd appreciate additional corrections. Not only do I speak only
ASCII :-( but I admit to entering the data while watching a basketball
game, so no doubt there are some real blunders.

Thanks,
Jim
Jul 2 '08 #8
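
If the per-source listing is kept, one way to stop repeated characters from silently overriding each other is to merge the sources explicitly and report disagreements. This is a sketch in Python 3 under assumed names (`merge_sources`, `source_a` and so on are hypothetical), not how `unicode2ascii.py` is actually organised.

```python
def merge_sources(sources):
    """Merge (name, table) pairs and return (merged, conflicts).

    The first source to mention a character wins; later sources that
    disagree are recorded rather than silently overwriting the entry.
    """
    merged = {}
    origin = {}
    conflicts = []
    for name, table in sources:
        for char, replacement in table.items():
            if char not in merged:
                merged[char] = replacement
                origin[char] = name
            elif merged[char] != replacement:
                conflicts.append((char, origin[char], name))
    return merged, conflicts


# Three made-up sources, two of which disagree about SOFT HYPHEN.
sources = [
    ("source_a", {"\N{SOFT HYPHEN}": "", "\N{HYPHEN}": "-"}),
    ("source_b", {"\N{NON-BREAKING HYPHEN}": "-"}),
    ("source_c", {"\N{SOFT HYPHEN}": "-"}),
]
merged, conflicts = merge_sources(sources)

for char, first, later in conflicts:
    print("U+%04X: %s and %s disagree" % (ord(char), first, later))
```

Here the only conflict reported is for U+00AD, and the merged table keeps the empty-string replacement from the first source.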
