Re: accessing individual characters in unicode strings

Peter Robinson wrote:

Dear list
I am at my wits' end over what seemed a very simple task:
I have some Greek text, nicely encoded in UTF-8, going in and out of an
XML database, being passed over and beautifully displayed on the web.
For example: the most common greek word of all 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

As John already said: UTF-8 ain't unicode. UTF-8 is an encoding similar
to ASCII or Latin-1 but different in its inner workings. A single
character may be encoded by up to 6 bytes.

I highly recommend Joel's article on unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Christian

Jun 27 '08 #1


hdante

On Apr 12, 9:48 am, Christian Heimes <li...@cheimes.de> wrote:

Peter Robinson wrote:

Dear list
I am at my wits' end over what seemed a very simple task:
I have some Greek text, nicely encoded in UTF-8, going in and out of an
XML database, being passed over and beautifully displayed on the web.
For example: the most common greek word of all 'kai' (or και if your
mailer can see utf8)
So all I want to do is:
step through this string a character at a time, and do something for
each character (actually set a width attribute somewhere else for each
character)

As John already said: UTF-8 ain't unicode. UTF-8 is an encoding similar
to ASCII or Latin-1 but different in its inner workings. A single
character may be encoded by up to 6 bytes.

Up to 4 bytes under the current definition of UTF-8 (RFC 3629); the
older 5- and 6-byte forms are no longer valid. The largest code point is
U+10FFFF, which is represented by 0xF4 0x8F 0xBF 0xBF.
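
To make the byte counts concrete, here is a small sketch, assuming
Python 3 (where str is already Unicode). The sample characters are my
own picks for illustration and are not from the original question.

```python
# Encode a few code points and see how many UTF-8 bytes each one needs.
# The samples are illustrative: ASCII 'a', Greek kappa, the euro sign,
# and U+10FFFF, the highest code point Unicode defines.
samples = ["a", "\u03ba", "\u20ac", "\U0010FFFF"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s): %s" % (ord(ch), len(encoded), encoded.hex()))

# Number of bytes per sample, in the same order as the list above.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Greek letters such as kappa take two bytes each, which is why indexing
the raw UTF-8 byte string one position at a time does not give you
characters.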

I believe the proper way for returning the number of characters for
Greek would require a normalization first:

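from unicodedata import normalize

def greek_text_length(utf8_string):
    # Decode the UTF-8 bytes (Python 2 str or Python 3 bytes) to Unicode.
    u = utf8_string.decode('utf-8')
    # Compose letter + combining accent into one code point where possible.
    u = normalize('NFC', u)
    # Count code points, not bytes.
    return len(u)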

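If there are sequences of code points that still count as one character
after normalization (a base letter plus a combining mark that has no
precomposed form), len() will overcount and things get worse.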
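
For the original goal of stepping through the string one character at a
time, something like the following might do. It is only a sketch,
assuming Python 3; iter_characters is a hypothetical helper name, and
attaching combining marks to the preceding base letter is a
simplification, not the full Unicode grapheme cluster rules.

```python
import unicodedata

def iter_characters(utf8_bytes):
    """Yield one string per visible character: a base code point
    followed by any combining marks attached to it.
    (Hypothetical helper; a simplification of grapheme clustering.)"""
    # Decode first, then compose whatever can be composed.
    text = unicodedata.normalize("NFC", utf8_bytes.decode("utf-8"))
    cluster = ""
    for cp in text:
        # combining() returns 0 for base characters, so a 0 here
        # means a new character starts and the previous one is done.
        if cluster and unicodedata.combining(cp) == 0:
            yield cluster
            cluster = ""
        cluster += cp
    if cluster:
        yield cluster

# 'kai' as UTF-8 bytes: kappa, alpha, iota.
kai = "\u03ba\u03b1\u03b9".encode("utf-8")
chars = list(iter_characters(kai))
for ch in chars:
    print(ch, len(ch.encode("utf-8")))
```

Each item yielded is a short Unicode string, so the per-character work
(setting a width, in the original question) can go inside the loop.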

I highly recommend Joel's article on unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html

Christian

Jun 27 '08 #2
