Counting unicode graphemes in python

Srinath Avadhanula

Hello,

I am wondering if there is a way of counting graphemes (or
glyphs) in python. For example, in the following string:

u'\u0915\u093e\u0915'
(
or equivalently,
u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
)

the first two "code points" represent a single character on the screen.
In my application, the GUI seems to handle that part (i.e., combining
characters). However, I need to handle cursor movement myself. The GUI
can only be told to move forward by a specified number of bytes.
Therefore, to make the cursor keys move over graphemes or glyphs rather than
code points, I need to figure out a way to calculate grapheme boundaries
in Python. I searched the web for a long time and came up with a
few results, the most relevant of which seems to be:

http://www.unicode.org/reports/tr29/tr29-2.html

This page contains rules for calculating grapheme boundaries for Hangul
characters or something of that sort. However, I did not find any
information about more general algorithms.
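The gap between code points, bytes, and what appears on screen can be made concrete with a small sketch. It assumes Python 3 (where every str is Unicode; the u'' literals in this thread are Python 2 style) and it assumes the GUI counts bytes in UTF-8 or UTF-16, which the thread does not state. The function name sizes is made up for illustration.

```python
def sizes(s):
    """Return how long a string is in code points and in two byte encodings."""
    return {
        "code_points": len(s),                      # Python 3 counts code points
        "utf8_bytes": len(s.encode("utf-8")),       # assumption: GUI uses UTF-8
        "utf16_bytes": len(s.encode("utf-16-le")),  # assumption: GUI uses UTF-16
    }

text = "\u0915\u093e\u0915"  # KA, VOWEL SIGN AA, KA
print(sizes(text))
```

None of these numbers matches the two characters seen on screen, which is why the boundaries have to be computed separately.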

I also took a look at the unicodedata module in Python, which has a
function called unicodedata.category. This function seems to
return the strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
have been unable to find a reference for what these strings signify.
Where should I look for them? (I am hoping for something more specific
than "Look at www.unicode.org".) Is this information relevant at all for
counting graphemes?

Thanks,
Srinath

Jul 18 '05 #1

vincent wehren

"Srinath Avadhanula" <sr*************@yahoo.com> wrote in message
news:Pi**************************************@albi noni.EECS.Berkeley.EDU...
| Hello,
|
| I am wondering if there is a way of counting graphemes (or
| glyphs) in python. For example, in the following string:
|
| u'\u0915\u093e\u0915'
| (
| or equivalently,
| u"\N{DEVANAGARI LETTER KA}\N{DEVANAGARI VOWEL SIGN AA}\N{DEVANAGARI LETTER KA}"
| )
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can't do that unless you *know* exactly which codepoints
form ligatures. In DEVANAGARI these are, e.g., the so-called dependent vowels
in the range 093e - 094c, wherein 093f stands "left of the consonant" when
rendered. (My knowledge of Indic languages is limited, at best, so there may
be more to it.)
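The range mentioned above can be inspected directly with unicodedata. This is only a sketch, assuming Python 3; the function name dependent_vowels and its default bounds are illustrative, taken from the 093e - 094c range quoted in this reply. The categories printed depend on the Unicode database bundled with the interpreter.

```python
import unicodedata

def dependent_vowels(start=0x093E, end=0x094C):
    """List (code point, category, name) for each code point in the range."""
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        # name() takes a default so an unassigned code point does not raise
        rows.append((f"U+{cp:04X}", unicodedata.category(ch),
                     unicodedata.name(ch, "<unnamed>")))
    return rows

for row in dependent_vowels():
    print(*row)
```

Every entry in that range should come back with a category starting with 'M', which is what makes a category-based test attractive.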

| In my application, the GUI seems to handle that part (i.e., combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?

| Therefore, to make cursor keys move over graphemes or glyphs rather than
| code-points, I need to figure out a way to calculate grapheme boundaries
| in python. I searched the web for a long long time and came up with a
| few results, the most relevant of which seems to be:
|
| http://www.unicode.org/reports/tr29/tr29-2.html
|
| This page contains rules for calculating grapheme boundaries for Hangul
| characters or something of that sort. However, I did not find any
| information about more general algorithms.

Some systems such as the X Server on IndiX seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/do...i-HOWTO-5.html

|
| I also took a look at the unicodedata module in python and that seems to
| have a function called unicodedata.category. This function seems to
| return the strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
| have been unable to find a reference for what these strings signify.
| Where should I look for them? (I am hoping for something more specific
| than "Look at www.unicode.org")

Would "Look at
http://www.unicode.org/Public/UNIDAT...ategory_Values " do?
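In short, the two-letter strings are Unicode General Category values: the first letter gives the major class (L for letter, M for mark) and the second refines it. A minimal sketch, assuming Python 3; the table below is hand-written, deliberately partial, and covers only the codes that come up in this thread, and the name explain_category is illustrative rather than part of unicodedata.

```python
import unicodedata

# Partial, hand-written table: only the categories relevant to this thread.
CATEGORY_NAMES = {
    "Lo": "Letter, other",
    "Mn": "Mark, nonspacing",
    "Mc": "Mark, spacing combining",
    "Me": "Mark, enclosing",
}

def explain_category(ch):
    """Return (code, description) for a single character."""
    code = unicodedata.category(ch)
    return code, CATEGORY_NAMES.get(code, "not in this partial table")

for ch in "\u0915\u093e\u094d":  # KA, VOWEL SIGN AA, VIRAMA
    print(f"U+{ord(ch):04X}", *explain_category(ch))
```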

HTH,
Vincent Wehren

|
| Thanks,
| Srinath
|

Jul 18 '05 #2

Srinath Avadhanula

On Fri, 24 Oct 2003, vincent wehren wrote:

| | the first two "code points" represent a single character on the screen.
|
| My GUESS is that you can't do that unless you *know* exactly which codepoints
| form ligatures. In DEVANAGARI these are, e.g., the so-called dependent vowels
| in the range 093e - 094c, wherein 093f stands "left of the consonant" when
| rendered. (My knowledge of Indic languages is limited, at best, so there may
| be more to it.)

After a sleepless night, I finally found out that calculating grapheme
boundaries for Devanagari is not so hard after all. It seems to work
reasonably well if I use just three simple rules:

To detect whether, in the code point sequence 'ab', the junction between
'a' and 'b' is a glyph boundary:

1. If 'b' is some kind of mark (i.e., unicodedata.category(b) starts
with 'M'), then the 'ab' junction is not a glyph boundary.

2. If 'b' is not a mark but is a Devanagari letter (i.e., category 'Lo')
AND 'a' is a VIRAMA character (i.e., 'VIRAMA' in unicodedata.name(a)),
then the 'ab' junction is not a glyph boundary.

3. In every other situation, the 'ab' junction is a glyph boundary.

I don't really know if this is completely correct, but it performs pretty
well on quite a big Sanskrit text I have. It handles things like

NA + HALANT + DHA + HALANT + YA + AA

and reports it (correctly) as a single glyph.
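The three rules above can be sketched roughly as follows. This is an illustration, not the code actually used in the application: it assumes Python 3, the function names is_boundary, split_clusters and cursor_steps are made up here, and the encoding parameter is an assumption because the thread does not say which byte encoding the GUI counts in. Like the rules themselves, it is a heuristic and not the full algorithm from the Unicode report.

```python
import unicodedata

def is_boundary(a, b):
    """Return True if the junction between code points a and b is a glyph boundary."""
    # Rule 1: a mark attaches to whatever precedes it.
    if unicodedata.category(b).startswith("M"):
        return False
    # Rule 2: a letter right after a VIRAMA continues the conjunct.
    if unicodedata.category(b) == "Lo" and "VIRAMA" in unicodedata.name(a, ""):
        return False
    # Rule 3: everything else is a boundary.
    return True

def split_clusters(text):
    """Group code points into clusters using the three rules."""
    clusters = []
    for ch in text:
        if clusters and not is_boundary(clusters[-1][-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def cursor_steps(text, encoding="utf-8"):
    """Bytes to advance per cluster (the encoding is an assumption)."""
    return [len(c.encode(encoding)) for c in split_clusters(text)]

# NA + HALANT + DHA + HALANT + YA + AA
conjunct = "\u0928\u094d\u0927\u094d\u092f\u093e"
print(len(split_clusters(conjunct)))
print(cursor_steps("\u0915\u093e\u0915"))
```

Counting graphemes is then len(split_clusters(text)), and each cursor key press advances by the next entry of cursor_steps(text).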
| | In my application, the GUI seems to handle that part (i.e., combining
| | characters). However, I need to handle cursor movement myself. The GUI
| | can only be told to move forward by a specified number of bytes.
|
| What GUI are you working with?

I am using wxPython on Windows XP. There are two text display widgets,
wxTextCtrl and wxStyledTextCtrl. The former is pretty basic but the
caret positioning is pretty robust. The latter is very fancy, handles
syntax highlighting etc., but has some serious problems with combining
characters.

| Some systems such as the X Server on IndiX seem to dig into the GPOS and
| GSUB tables in the OpenType font. See:
|
| http://rohini.ncst.ernet.in/indix/do...i-HOWTO-5.html

Thanks for the link!

| Would "Look at
| http://www.unicode.org/Public/UNIDAT...ategory_Values " do?

It does indeed. Notice my new-found fluency with unicodedata.category?
:)

Thanks,
Srinath

Jul 18 '05 #3
