accessing individual characters in unicode strings

Dear list
I am at my wits' end on what seemed a very simple task:
I have some Greek text, nicely encoded in utf8, going in and out of an
XML database, being passed over and beautifully displayed on the web.
For example: the most common Greek word of all, 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

Should be simple, yes?
turns out to be near impossible. I tried using a simple index
character routine such as ustr[0]..ustr[1]... and this gives rubbish.
So I use len() to find out how long my simple greek string is, and of
course it is NOT three characters long.

A day of intensive searching around the lists tells me that unicode
and python is a moving target: so many fixes are suggested for similar
problems, none apparently working with mine.

Here is the best I can do, so far
I convert the utf8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
for kai this gives the following:

u'\u039e\u038a\u039e\xb1\u039e\u0389'

so now things should be simple, yes? just go through this and identify
each character...

Not so simple at all.
k, kappa: turns out to be TWO \u strings, not one: thus \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1

looking elsewhere in the record,

my particular favourite is the midpoint character: this comes out as
\u03b1\x90\xa7 !
and in the middle of all this, there are some non-unicode characters:
\u039e\u038fc is o followed by c!

Well, I don't have many characters to deal with, and I could cope with
this mess by tedious matching, character by character.
But surely, there is a better way...
help please
Peter Robinson: pe***@sd-editions.com
Jun 27 '08 #1
On Apr 12, 3:45 pm, Peter Robinson <pe...@sd-editions.com> wrote:
> Dear list
> I am at my wits end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of a
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
>
> Should be simple, yes?
> turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple greek string is, and of
> course it is NOT three characters long.

The utf8-encoded incarnation is three characters long but six bytes
long, and len() on a byte string counts bytes. UTF-8 is not Unicode; it
is one way of encoding Unicode text as bytes.
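
To make the bytes-versus-characters distinction concrete, here is a minimal sketch. It assumes Python 3 rather than the Python 2 used elsewhere in this thread: a bytes object plays the role of the Python 2 byte string, and .decode() takes the place of unicode(). The variable names are illustrative only.

```python
# Assumption: Python 3. The bytes literal stands in for the UTF-8 data
# coming out of the database.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'   # UTF-8 encoding of the Greek word 'kai'

# Decoding turns the bytes into a Unicode string.
text = utf8_bytes.decode('utf-8')

byte_count = len(utf8_bytes)   # len() on bytes counts bytes
char_count = len(text)         # len() on a decoded string counts code points

print(byte_count, char_count)
```

Each of these three Greek letters takes two bytes in UTF-8, which is why the two lengths differ.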
>
> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))

Don't do that. If you have a utf8 string, convert it to unicode like
this:

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')

Then inspect it like this:
print repr(ustr)
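
As an illustration of why the choice of codec matters, the sketch below decodes the same six bytes both ways. It assumes Python 3 (bytes.decode() and ascii() instead of unicode() and repr()), and the names wrong and right are made up for the example. Decoding with 'iso-8859-7' treats every byte of the UTF-8 data as a separate character, which reproduces the six-character result quoted in the question.

```python
# Assumption: Python 3. Same UTF-8 bytes for 'kai' as in the question.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'

# Wrong codec: each byte is read as one ISO-8859-7 character.
wrong = utf8_bytes.decode('iso-8859-7')

# Right codec: each two-byte sequence becomes one Greek letter.
right = utf8_bytes.decode('utf-8')

# ascii() shows the escaped form, much like repr() on a Python 2 unicode string.
print(ascii(wrong), len(wrong))
print(ascii(right), len(right))
```

The leading \u039e in every pair comes from the byte 0xCE, which starts the UTF-8 encoding of each of these letters.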

Here's a sample interactive session:
>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
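
Once the text is decoded, stepping through it one character at a time is an ordinary loop. The following is only a sketch, assuming Python 3; the WIDTHS table, its values, and DEFAULT_WIDTH are hypothetical stand-ins for whatever width attribute needs to be set per character.

```python
import unicodedata

# Assumption: Python 3. Same UTF-8 bytes for 'kai' as in the session above.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'
text = utf8_bytes.decode('utf-8')

# Hypothetical per-character widths; the numbers are invented for illustration.
WIDTHS = {'\u03ba': 6, '\u03b1': 6, '\u03b9': 3}
DEFAULT_WIDTH = 5  # hypothetical fallback for characters not in the table

# Iterating over a decoded string yields one character (code point) at a time.
widths = [(ch, unicodedata.name(ch), WIDTHS.get(ch, DEFAULT_WIDTH)) for ch in text]

for ch, name, width in widths:
    print(ascii(ch), name, width)
```

Indexing works the same way: text[0] is kappa, text[1] is alpha, and text[2] is iota.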

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John
Jun 27 '08 #2
