
accessing individual characters in unicode strings

Dear list
I am at my wits end on what seemed a very simple task:
I have some greek text, nicely encoded in utf8, going in and out of a
xml database, being passed over and beautifully displayed on the web.
For example: the most common greek word of all 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

Should be simple, yes?
turns out to be near impossible. I tried using a simple index
character routine such as ustr[0]..ustr[1]... and this gives rubbish.
So I use len() to find out how long my simple greek string is, and of
course it is NOT three characters long.

A day of intensive searching around the lists tells me that unicode
and python is a moving target: so many fixes are suggested for similar
problems, none apparently working with mine.

Here is the best I can do, so far
I convert the utf8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
for kai this gives the following:

u'\u039e\u038a\u039e\xb1\u039e\u0389'

so now things should be simple, yes? just go through this and identify
each character...

Not so simple at all.
k, kappa: turns out to be TWO \u strings, not one: thus \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1

looking elsewhere in the record,

my particular favourite is the midpoint character: this comes out as
\u03b1\x90\xa7 !
and in the middle of all this, there are some non-unicode characters:
\u039e\u038fc is o followed by c!

well, I don't have many characters to deal with and I could cope with
this mess by tedious matching character by character.
But surely, there is a better way...
help please
Peter Robinson: pe***@sd-editions.com
Jun 27 '08 #1


On Apr 12, 3:45 pm, Peter Robinson <pe...@sd-editions.com> wrote:
> Dear list
> I am at my wits end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of a
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
>
> Should be simple, yes?
> turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple greek string is, and of
> course it is NOT three characters long.

The utf8-encoded incarnation represents three characters, but it is six
bytes long, and len() on a byte string counts bytes. utf-8 is an encoding
of unicode, not unicode itself.
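
To make the byte/character distinction concrete, here is a minimal sketch. It assumes Python 3, where `bytes` plays the role of the Python 2 byte string used in this thread and `str` is already Unicode; the variable names are illustrative only.

```python
# Assumption: Python 3. The thread itself uses Python 2, where a plain
# str holds bytes and unicode() does the decoding.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'  # UTF-8 encoding of the Greek word "kai"

# len() on the encoded data counts bytes, not characters.
byte_count = len(utf8_bytes)

# Decoding yields a Unicode string; len() now counts code points.
text = utf8_bytes.decode('utf-8')
char_count = len(text)

print(byte_count, char_count)
```

Each of the three Greek letters takes two bytes in UTF-8, which is why indexing the undecoded string returns half a character at a time.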
>
> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))

Don't do that. If you have a utf8 string, convert it to unicode like
this:

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')
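
The following sketch shows why naming the wrong codec produces the six-character result quoted in the question. It assumes Python 3, with `bytes.decode()` standing in for the Python 2 `unicode(s, encoding)` call; `wrong` and `right` are illustrative names.

```python
# Assumption: Python 3; bytes.decode() stands in for Python 2's unicode(s, enc).
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'  # UTF-8 bytes for the Greek word "kai"

# ISO-8859-7 is a single-byte encoding, so every byte becomes its own character.
wrong = utf8_bytes.decode('iso-8859-7')

# UTF-8 combines each two-byte sequence into one Greek letter.
right = utf8_bytes.decode('utf-8')

print(len(wrong), ascii(wrong))
print(len(right), ascii(right))
```

The byte 0xCE that starts each UTF-8 sequence maps to U+039E in ISO-8859-7, which accounts for the repeated `\u039e` in the original output.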

Then inspect it like this:
print repr(ustr)

Here's a sample interactive session:
>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
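
Once the string is decoded, stepping through it one character at a time (the original goal) is an ordinary loop. The sketch below assumes Python 3; `WIDTHS`, `DEFAULT_WIDTH`, and `char_widths` are hypothetical names with made-up values, since the thread does not say where the width attribute actually lives.

```python
import unicodedata

# Hypothetical width table, purely for illustration; real values would come
# from wherever the width attribute is defined.
WIDTHS = {'GREEK SMALL LETTER IOTA': 3}
DEFAULT_WIDTH = 6


def char_widths(utf8_bytes):
    """Decode UTF-8 bytes and return a (character, name, width) tuple per character."""
    text = utf8_bytes.decode('utf-8')
    result = []
    for ch in text:
        # Fall back to 'UNKNOWN' for code points that have no Unicode name.
        name = unicodedata.name(ch, 'UNKNOWN')
        result.append((ch, name, WIDTHS.get(name, DEFAULT_WIDTH)))
    return result


rows = char_widths(b'\xce\xba\xce\xb1\xce\xb9')
for ch, name, width in rows:
    # ascii() keeps the output printable on terminals that cannot show Greek.
    print(ascii(ch), name, width)
```

This loop visits code points, so text that stores accents as separate combining marks would yield more than one item per visible letter; normalizing first with `unicodedata.normalize('NFC', text)` may help in that case.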

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John
Jun 27 '08 #2
