473,785 Members | 2,938 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Counting unicode graphemes in python

Hello,

I am wondering if there is a way of counting graphemes (or
glyphs) in python. For example, in the following string:

u'\u0915\u093e\ u0915'
(
or equivalently,
u"\N{DEVANAG ARI LETTER KA}\N{DEVANAGAR I VOWEL SIGN AA}\N{DEVANAGAR I LETTER KA}"
)

the first two "code points" represent a single character on the screen.
In my application, the GUI seems to handle that part (i.e combining
characters). However, I need to handle cursor movement myself. The GUI
can only be told to move forward by a specified number of bytes.
Therefore, to make cursor keys move over graphemes or glyps rather than
code-points, I need to figure out a way to calculate grapheme boundaries
in python. I searched the web for a long long time and came up with a
few results, the most relevant of which seems to be:

http://www.unicode.org/reports/tr29/tr29-2.html

This page contains rules for calculating grapheme boundaries for Hangul
characters or something of that sort. However, I did not find any
information about more general algorithms.

I also took a look at the unicodedata module in python and that seems to
have a function called unicodedata.cat egory. This function seems to
returns strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
have been unable to find a reference for what these strings signify.
Where should I look for them? (I am hoping for something more specific
than "Look at www.unicode.org ") Is this information relevant at all for
counting graphemes?

Thanks,
Srinath

Jul 18 '05 #1
2 3047
"Srinath Avadhanula" <sr************ *@yahoo.com> schrieb im Newsbeitrag
news:Pi******** *************** *************** @albinoni.EECS. Berkeley.EDU...
| Hello,
|
| I am wondering if there is a way of counting graphemes (or
| glyphs) in python. For example, in the following string:
|
| u'\u0915\u093e\ u0915'
| (
| or equivalently,
| u"\N{DEVANAG ARI LETTER KA}\N{DEVANAGAR I VOWEL SIGN AA}\N{DEVANAGAR I LETTER
KA}"
| )
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can do that unless you *know* exactly which codepoints
form ligatures. In DEVANAGARI this are e.g. the so-called dependent vowels
in range 093e - 094c, wherin 093f stands "left of the consonant" when
rendered. (My knowledge of Indic languages is limited, at best, so there may
be mor to it..)

| In my application, the GUI seems to handle that part (i.e combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?

| Therefore, to make cursor keys move over graphemes or glyps rather than
| code-points, I need to figure out a way to calculate grapheme boundaries
| in python. I searched the web for a long long time and came up with a
| few results, the most relevant of which seems to be:
|
| http://www.unicode.org/reports/tr29/tr29-2.html
|
| This page contains rules for calculating grapheme boundaries for Hangul
| characters or something of that sort. However, I did not find any
| information about more general algorithms.
Some systems such as the X Server on IndiX seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/do...i-HOWTO-5.html

|
| I also took a look at the unicodedata module in python and that seems to
| have a function called unicodedata.cat egory. This function seems to
| returns strings 'Mn' for u'\u093f' and 'Lo' for u'\u093e'. However, I
| have been unable to find a reference for what these strings signify.
| Where should I look for them? (I am hoping for something more specific
| than "Look at www.unicode.org ")

Would "Look at
http://www.unicode.org/Public/UNIDAT...ategory_Values " do?

HTH,
Vincent Wehren

|
| Thanks,
| Srinath
|
Jul 18 '05 #2
On Fri, 24 Oct 2003, vincent wehren wrote:
|
| the first two "code points" represent a single character on the screen.

My GUESS is that you can do that unless you *know* exactly which codepoints
form ligatures. In DEVANAGARI this are e.g. the so-called dependent vowels
in range 093e - 094c, wherin 093f stands "left of the consonant" when
rendered. (My knowledge of Indic languages is limited, at best, so there may
be mor to it..)
After a sleepless night, I finally found out that calculating grapheme
boundaries for devanagari is not so hard after all. It seems to work
reasonably well if I use just three simple rules:

To detect whether in the code point sequence 'ab', the junction between
'a' and 'b' is a glyph boundary.

1. If 'b' is some kind of a mark (i.e unicodedata.cat egory(b) starts
with 'M'), then the 'ab' junction is not a glyph boundary.

2. If 'b' is not a Mark, but is a devanagari letter (i.e category 'Lo')
AND 'a' is a VIRAMA character i.e, 'VIRAMA' in unicodedata.nam e(a),
then the 'ab' junction is not a glyph boundary.

3. In every other situation, the 'ab' junction is a glyph boundary.

Dont really know if this is completely correct, but it performs pretty
well on quite a big sanskrit text I have... Handles things like

NA + HALANT + DHA + HALANT + YA + AA

and reports it (correctly) as a single glyph.
| In my application, the GUI seems to handle that part (i.e combining
| characters). However, I need to handle cursor movement myself. The GUI
| can only be told to move forward by a specified number of bytes.

What GUI are you working with?
I am using wxPython on windows XP. There are two text display widgets,
wxTextCtrl and wxStyledTextCtr l. The former is pretty basic but the
caret positioning is pretty robust. The latter is very fancy, hanles
syntax highlighting etc, but has some serious problems with combining
characters.
Some systems such as the X Server on IndiX seem to dig into the GPOS and
GSUB tables in the OpenType font. See:

http://rohini.ncst.ernet.in/indix/do...i-HOWTO-5.html
Thanks for the link!
Would "Look at
http://www.unicode.org/Public/UNIDAT...ategory_Values " do?


It does indeed. Notice my new-found fluency with unicodedata.cat egory?
:)

Thanks,
Srinath

Jul 18 '05 #3

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
7087
by: sebastien.hugues | last post by:
Hi I would like to retrieve the application data directory path of the logged user on windows XP. To achieve this goal i use the environment variable APPDATA. The logged user has this name: sébastien. The second character is not an ascii one and when i try to encode the path that contains this name in utf-8,
8
5277
by: Bill Eldridge | last post by:
I'm trying to grab a document off the Web and toss it into a MySQL database, but I keep running into the various encoding problems with Unicode (that aren't a problem for me with GB2312, BIG 5, etc.) What I'd like is something as simple as: CREATE TABLE junk (junklet VARCHAR(2500) CHARACTER SET UTF8)); import MySQLdb, re,urllib
6
2185
by: Elbert Lev | last post by:
Please correct me if I'm wrong. Python (as I understand) uses reference counting to determine when to delete the object. As soon as the object goes out of the scope it is deleted. Python does not use garbage collection (as Java does). So if the script runs a loop: for i in range(100): f = Obj(i)
8
3665
by: Francis Girard | last post by:
Hi, For the first time in my programmer life, I have to take care of character encoding. I have a question about the BOM marks. If I understand well, into the UTF-8 unicode binary representation, some systems add at the beginning of the file a BOM mark (Windows?), some don't. (Linux?). Therefore, the exact same text encoded in the same UTF-8 will result in two different binary files, and of a slightly different length. Right ?
12
4116
by: Chris Mullins | last post by:
I'm implementing RFC 3491 in .NET, and running into a strange issue. Step 1 of RFC 3491 is performing a set of mappings dicated by tables B.1 and B.2. I'm having trouble with the following mappings though, and it seems like a shortcoming of the .NET framework: When I see Unicode value 0x10400, I'm supposed to map it to value 0x10428. This list goes on (the left colulmn is the existing value, the right column
2
2632
by: Neil Schemenauer | last post by:
python-dev@python.org.] The PEP has been rewritten based on a suggestion by Guido to change str() rather than adding a new built-in function. Based on my testing, I believe the idea is feasible. It would be helpful if people could test the patched Python with their own applications and report any incompatibilities. PEP: 349
10
2227
by: Larry Hastings | last post by:
I'm an indie shareware Windows game developer. In indie shareware game development, download size is terribly important; conventional wisdom holds that--even today--your download should be 5MB or less. I'd like to use Python in my games. However, python24.dll is 1.86MB, and zips down to 877k. I can't afford to devote 1/6 of my download to just the scripting interpreter; I've got music, and textures, and my own crappy code to ship. ...
17
4533
by: Adam Olsen | last post by:
As was seen in another thread, there's a great deal of confusion with regard to surrogates. Most programmers assume Python's unicode type exposes only complete characters. Even CPython's own functions do this on occasion. This leads to different behaviour across platforms and makes it unnecessarily difficult to properly support all languages. To solve this I propose Python's unicode type using UTF-16 should have gaps in its index,...
3
2788
by: majna | last post by:
I have character counter for textarea wich counting the characters. Special character needs same place as two normal characters because of 16-bit encoding. Counter is counting -2 when special character is added like some language specific char. How to count specials like 1 char? tnx
0
9645
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9480
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10330
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
0
10153
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10093
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9952
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
8976
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5511
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
3
2880
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.