Re: accessing individual characters in unicode strings

Peter Robinson wrote:
> Dear list
> I am at my wits' end on what seemed a very simple task:
> I have some Greek text, nicely encoded in UTF-8, going in and out of an
> XML database, being passed over and beautifully displayed on the web.
> For example: the most common Greek word of all, 'kai' (or και if your
> mailer can see UTF-8)
> So all I want to do is:
> step through this string a character at a time, and do something for
> each character (actually set a width attribute somewhere else for each
> character)

As John already said: UTF-8 ain't Unicode. UTF-8 is an encoding, like
ASCII or Latin-1, but a variable-width one: a single character may be
encoded by up to 6 bytes.
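
A minimal sketch of what this means for the original question, assuming Python 3 (where bytes holds the encoded form and str is a sequence of code points): decode the UTF-8 bytes once, then step through the result. The name iter_chars is made up for illustration.

```python
def iter_chars(utf8_bytes):
    """Yield (character, number of UTF-8 bytes) for each code point."""
    text = utf8_bytes.decode("utf-8")  # bytes -> str, one item per code point
    for ch in text:
        yield ch, len(ch.encode("utf-8"))


# The Greek word from the question ('και'), as UTF-8 bytes.
raw = "\u03ba\u03b1\u03b9".encode("utf-8")

print(len(raw))  # length in bytes, not in characters
for ch, nbytes in iter_chars(raw):
    # Print the code point number rather than the glyph itself.
    print("U+%04X takes %d byte(s)" % (ord(ch), nbytes))
```

Iterating over the raw bytes instead would visit six items for this three-letter word, which is probably the confusion behind the question.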

I highly recommend Joel's article on Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Christian

Jun 27 '08 #1
On Apr 12, 9:48 am, Christian Heimes <li...@cheimes.de> wrote:
> Peter Robinson wrote:
>> Dear list
>> I am at my wits' end on what seemed a very simple task:
>> I have some Greek text, nicely encoded in UTF-8, going in and out of an
>> XML database, being passed over and beautifully displayed on the web.
>> For example: the most common Greek word of all, 'kai' (or και if your
>> mailer can see UTF-8)
>> So all I want to do is:
>> step through this string a character at a time, and do something for
>> each character (actually set a width attribute somewhere else for each
>> character)
>
> As John already said: UTF-8 ain't Unicode. UTF-8 is an encoding, like
> ASCII or Latin-1, but a variable-width one: a single character may be
> encoded by up to 6 bytes.

Up to 4 bytes under the current definition of UTF-8 (RFC 3629). The
largest code point is U+10FFFF, which is encoded as 0xF4 0x8F 0xBF 0xBF.
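
A quick way to check those numbers, sketched in Python 3 (utf8_len is an illustrative helper, not a library function):

```python
def utf8_len(code_point):
    """Number of bytes UTF-8 uses to encode a single code point."""
    return len(chr(code_point).encode("utf-8"))


# One sample from each length class: 'A', Greek kappa, the euro sign,
# and the largest code point.
for cp in (0x41, 0x3BA, 0x20AC, 0x10FFFF):
    print("U+%04X -> %d byte(s)" % (cp, utf8_len(cp)))

print(chr(0x10FFFF).encode("utf-8"))  # b'\xf4\x8f\xbf\xbf'
```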

I believe that getting a correct character count for Greek text requires
normalizing it first, so that a letter followed by a combining accent is
composed into a single code point where possible:

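    from unicodedata import normalize

    def greek_text_length(utf8_string):
        # Decode the UTF-8 bytes (a Python 2 str or a Python 3 bytes object),
        # then compose base letters with their combining accents.
        u = utf8_string.decode('utf-8')
        u = normalize('NFC', u)
        return len(u)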

If there are pairs of characters that should count as one but have no
precomposed form, NFC cannot merge them and the count will still be too
high.
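
To illustrate both cases, here is a sketch in Python 3. It shows NFC merging a decomposed accented iota into one code point, and a fallback that counts a base character together with any combining marks that follow it. count_user_chars is a hypothetical helper: it only skips combining marks (using unicodedata.combining) and is not a full grapheme-cluster segmentation.

```python
from unicodedata import combining, normalize


def count_user_chars(text):
    """Count code points that are not combining marks, after NFC."""
    return sum(1 for ch in normalize("NFC", text) if combining(ch) == 0)


decomposed = "\u03b9\u0301"                  # iota + combining acute accent
print(len(decomposed))                       # 2
print(len(normalize("NFC", decomposed)))     # 1, composed to U+03AF

no_precomposed = "\u03ba\u0301"              # kappa + combining acute accent
print(len(normalize("NFC", no_precomposed)))  # 2, nothing to compose into
print(count_user_chars(no_precomposed))       # 1
```

Whether one count per base letter is the right unit for setting widths depends on how the text is rendered, so treat this as a starting point only.
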
> I highly recommend Joel's article on Unicode:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
> http://www.joelonsoftware.com/articles/Unicode.html
>
> Christian
Jun 27 '08 #2
