accessing individual characters in unicode strings

Dear list
I am at my wits end on what seemed a very simple task:
I have some greek text, nicely encoded in utf8, going in and out of an
xml database, being passed over and beautifully displayed on the web.
For example: the most common greek word of all 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

Should be simple, yes?
turns out to be near impossible. I tried using a simple index
character routine such as ustr[0]..ustr[1]... and this gives rubbish.
So I use len() to find out how long my simple greek string is, and of
course it is NOT three characters long.

A day of intensive searching around the lists tells me that unicode
and python is a moving target: so many fixes are suggested for similar
problems, none apparently working with mine.

Here is the best I can do, so far
I convert the utf8 string using
ustr = repr(unicode(thisword, 'iso-8859-7'))
for kai this gives the following:

u'\u039e\u038a\u039e\xb1\u039e\u0389'

so now things should be simple, yes? just go through this and identify
each character...

Not so simple at all.
k, kappa: turns out to be TWO \u strings, not one: thus \u039e\u038a
similarly, iota is also two \u strings: \u039e\u0389
alpha is a \u string followed by a \x string: \u039e\xb1

looking elsewhere in the record,

my particular favourite is the midpoint character: this comes out as
\u03b1\x90\xa7 !
and in the middle of all this, there are some non-unicode characters:
\u039e\u038fc is o followed by c!

well, I don't have many characters to deal with and I could cope with
this mess by tedious matching character by character.
But surely, there is a better way...
help please
Peter Robinson: pe***@sd-editions.com
Jun 27 '08 #1
On Apr 12, 3:45 pm, Peter Robinson <pe...@sd-editions.com> wrote:
> Dear list
> I am at my wits end on what seemed a very simple task:
> I have some greek text, nicely encoded in utf8, going in and out of a
> xml database, being passed over and beautifully displayed on the web.
> For example: the most common greek word of all 'kai' (or και if your
> mailer can see utf8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)
>
> Should be simple, yes?
> turns out to be near impossible. I tried using a simple index
> character routine such as ustr[0]..ustr[1]... and this gives rubbish.
> So I use len() to find out how long my simple greek string is, and of
> course it is NOT three characters long.
The utf8-encoded incarnation is three characters long and it's six
bytes long. utf-8 is not unicode.
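
To make the bytes-versus-characters point concrete, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part: the thread itself uses Python 2, where a plain string holds bytes and unicode() produces text. The variable names are only illustrative.

```python
# UTF-8 encoded bytes for the Greek word 'kai' (kappa, alpha, iota).
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'

# len() on the encoded form counts bytes.
byte_count = len(utf8_bytes)

# Decoding yields a text string; len() now counts characters (code points).
text = utf8_bytes.decode('utf-8')
char_count = len(text)

print(byte_count, char_count)
```

Each of these three Greek letters takes two bytes in UTF-8, which is why the undecoded string looks twice as long as expected.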
> A day of intensive searching around the lists tells me that unicode
> and python is a moving target: so many fixes are suggested for similar
> problems, none apparently working with mine.
>
> Here is the best I can do, so far
> I convert the utf8 string using
> ustr = repr(unicode(thisword, 'iso-8859-7'))
Don't do that. If you have a utf8 string, convert it to unicode like
this:

ustr = unicode(the_utf8_string, 'utf8')

If you have a string encoded in iso-8859-7, convert it to unicode like
this:

ustr = unicode(the_iso_8859_7_string, 'iso-8859-7')

Then inspect it like this:
print repr(ustr)
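
For anyone wondering where the odd output in the original post came from, the sketch below decodes the same six bytes with both codecs. It is in Python 3 syntax (an assumption; the thread uses Python 2), and `wrong` and `right` are just illustrative names.

```python
# UTF-8 encoded bytes for 'kai' (kappa, alpha, iota).
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'

# Wrong codec: ISO-8859-7 is a single-byte encoding, so each of the
# six bytes is turned into a separate (unrelated) character.
wrong = utf8_bytes.decode('iso-8859-7')

# Right codec: each two-byte UTF-8 sequence becomes one Greek letter.
right = utf8_bytes.decode('utf-8')

# ascii() shows the escaped form, comparable to repr() in Python 2.
print(len(wrong), ascii(wrong))
print(len(right), ascii(right))
```

The wrong decoding gives six characters, two per letter, which matches the "TWO \u strings" pattern described in the question.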

Here's a sample interactive session:
>>> thisword = '\xce\xba\xce\xb1\xce\xb9'
>>> ustr = unicode(thisword, 'utf8')
>>> len(ustr)
3
>>> print repr(ustr)
u'\u03ba\u03b1\u03b9'
>>> import unicodedata
>>> [unicodedata.name(x) for x in ustr]
['GREEK SMALL LETTER KAPPA', 'GREEK SMALL LETTER ALPHA', 'GREEK SMALL LETTER IOTA']
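
Once the string is decoded, stepping through it one character at a time (the original goal) is an ordinary loop. The following is only a sketch in Python 3 syntax (an assumption; the session above is Python 2). The `widths` table, its values and `DEFAULT_WIDTH` are invented for illustration and stand in for whatever width attribute is actually being set.

```python
import unicodedata

# UTF-8 encoded bytes for 'kai', decoded to text first.
utf8_bytes = b'\xce\xba\xce\xb1\xce\xb9'
text = utf8_bytes.decode('utf-8')

# Hypothetical width table keyed by character; the numbers are made up.
widths = {'\u03ba': 6, '\u03b1': 6, '\u03b9': 3}
DEFAULT_WIDTH = 5  # hypothetical fallback for characters not in the table

# Iterating over a decoded string yields one character per step.
per_char = [(ch, unicodedata.name(ch), widths.get(ch, DEFAULT_WIDTH))
            for ch in text]

for ch, name, width in per_char:
    # Print the code point rather than the glyph, so the output does not
    # depend on the terminal's encoding.
    print('U+%04X' % ord(ch), name, width)
```

Indexing works the same way: after decoding, text[0] is kappa, text[1] is alpha and text[2] is iota.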

Suggested reading: the Python Unicode HOWTO at http://www.amk.ca/python/howto/unicode

This may be handy: http://unicode.org/charts/PDF/U0370.pdf

HTH,
John
Jun 27 '08 #2
