accessing individual characters in unicode strings

Peter Robinson

Dear list
I am at my wits' end over what seemed a very simple task:
I have some Greek text, nicely encoded in utf8, going in and out of an
xml database, being passed over and beautifully displayed on the web.
For example: the most common Greek word of all, 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

Should be simple, yes?
It turns out to be near impossible. I tried using a simple
character-indexing routine such as ustr[0], ustr[1], ... and this gives rubbish.
So I used len() to find out how long my simple Greek string is, and of
course it is NOT three characters long.

A day of intensive searching around the lists tells me that unicode
and python are a moving target: so many fixes are suggested for similar
problems, none apparently working with mine.

Here is the best I can do, so far
I convert the utf8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
for kai this gives the following:

u'\u039e\u038a\u039e\xb1\u039e\u0389'

so now things should be simple, yes? just go through this and identify
each character...

Not so simple at all.
k, kappa: turns out to be TWO \u strings, not one: thus \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1

looking elsewhere in the record,

my particular favourite is the midpoint character: this comes out as
\u03b1\x90\xa7 !
and in the middle of all this, there are some non-unicode characters:
\u039e\u038fc is o followed by c!

Well, I don't have many characters to deal with, and I could cope with
this mess by tedious matching, character by character.
But surely there is a better way...
Help, please.
Peter Robinson: pe***@sd-editions.com

Jun 27 '08 #1


John Machin

On Apr 12, 3:45 pm, Peter Robinson <pe...@sd-editions.com> wrote:

> Dear list
> I am at my wits end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of a
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
>
> Should be simple, yes?
> turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple greek string is, and of
> course it is NOT three characters long.

The utf8-encoded incarnation is three characters long, but it's six
bytes long. utf-8 is not unicode; it is one way of encoding unicode as bytes.
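To make the bytes-versus-characters point concrete, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2, where the same step is spelled unicode(the_string, 'utf8')); the variable names are illustrative only.

```python
# Python 3 sketch: the same word measured as bytes and as text.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'   # UTF-8 encoding of the word 'kai'

# Decoding turns the six bytes into three Unicode code points.
text = utf8_bytes.decode('utf-8')

byte_count = len(utf8_bytes)   # len() on bytes counts bytes
char_count = len(text)         # len() on str counts code points

print(byte_count, char_count)
```

Each of the three Greek letters takes two bytes in UTF-8, which is why indexing the undecoded string returns half a character at a time.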

> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))

Don't do that. If you have a utf8 string, convert it to unicode like
this:

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')

Then inspect it like this:
print repr(ustr)
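As a rough illustration of why the codec has to match the data, the sketch below decodes the same six bytes twice, once as utf-8 and once as iso-8859-7. It assumes Python 3, where bytes.decode() replaces the Python 2 unicode() call and ascii() gives a repr-like escaped view; the names right and wrong are made up for this example.

```python
# Python 3 sketch: one byte string, two decodings.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'   # UTF-8 bytes for the word 'kai'

right = utf8_bytes.decode('utf-8')        # matches how the data was encoded
wrong = utf8_bytes.decode('iso-8859-7')   # treats every byte as one character

# ascii() escapes non-ASCII characters, so the result is safe to print anywhere.
print(ascii(right))
print(ascii(wrong))
print(len(right), len(wrong))
```

The wrong decoding yields six characters, the same sequence quoted in the original post, because every UTF-8 byte is mapped to its own ISO-8859-7 character.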

Here's a sample interactive session:

>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
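Once the string is decoded, stepping through it a character at a time is an ordinary loop. The sketch below is one possible shape for the per-character width task from the original question, written for Python 3. The WIDTHS table, DEFAULT_WIDTH and char_widths() are hypothetical, and the width values are placeholders rather than real measurements.

```python
import unicodedata

# Hypothetical per-character widths; real values would come from
# wherever the width attribute is actually defined.
WIDTHS = {'\u03ba': 0.55, '\u03b1': 0.60, '\u03b9': 0.30}
DEFAULT_WIDTH = 0.50   # assumed fallback for characters not in the table

def char_widths(utf8_bytes):
    """Decode UTF-8 bytes and return (character, name, width) per code point."""
    text = utf8_bytes.decode('utf-8')
    return [(ch, unicodedata.name(ch, 'UNKNOWN'), WIDTHS.get(ch, DEFAULT_WIDTH))
            for ch in text]

rows = char_widths(b'\xce\xba\xce\xb1\xce\xb9')
for ch, name, width in rows:
    # Print the code point number rather than the raw character,
    # so the output does not depend on the terminal encoding.
    print('U+%04X' % ord(ch), name, width)
```

This iterates over code points, so text containing combining accents would yield the base letter and the accent as separate items.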

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John

Jun 27 '08 #2
