Unicode "table of characters" implementation in Python

Nicolas Pontoizeau

Hi,

I am handling a mixed-language text file encoded in UTF-8. It contains
mainly French, English and Asian languages. I need to detect every
Asian character in order to enclose it in a special tag for LaTeX.
Does anybody know if there is a Unicode "table of characters"
implementation in Python? I mean, I give it a character and Python replies
with the language in which the character occurs.

Thanks in advance

--
http://www.nicolas.pontoizeau.org/
Nicolas Pontoizeau - Promotion EFREI 2005

Aug 22 '06 #1

Brian Beck

Nicolas Pontoizeau wrote:

> I am handling a mixed-language text file encoded in UTF-8. It contains
> mainly French, English and Asian languages. I need to detect every
> Asian character in order to enclose it in a special tag for LaTeX.
> Does anybody know if there is a Unicode "table of characters"
> implementation in Python? I mean, I give it a character and Python replies
> with the language in which the character occurs.

Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
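
As a rough first pass, Unicode character names usually begin with the
script name, so the first word of unicodedata.name() can serve as a hint.
This is only a sketch: script_hint is a hypothetical helper name, and the
first word of a name is a naming convention, not a real script or language
property.

```python
import unicodedata

def script_hint(ch):
    """Return the first word of the character's Unicode name, or None.

    Assumption: the first word of the name (LATIN, CJK, HIRAGANA, ...)
    is a usable hint for the script. Unnamed characters such as control
    codes yield None.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # unicodedata.name() raises ValueError when no name is defined
        return None

for ch in "é中あ":
    print(repr(ch), script_hint(ch))
```

A lookup against the block table is likely more reliable than this name
heuristic, since names are not guaranteed to start with a script.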

--
Brian Beck
Adventurer of the First Order

Aug 22 '06 #2

Nicolas Pontoizeau

2006/8/22, Brian Beck <ex****@gmail.com>:

> Nicolas, check out the unicodedata module:
> http://docs.python.org/lib/module-unicodedata.html
>
> Find "import unicodedata" on this page for how to use it:
> http://www.amk.ca/python/howto/unicode
>
> I'm not sure if it has built-in support for finding which language block a
> character is in, but a table like this might help you:
> http://www.unicode.org/Public/UNIDATA/Blocks.txt

As usual, Python has a solution that goes beyond my needs!
Thanks for the links; I will dive into them.

Nicolas

--
http://www.nicolas.pontoizeau.org/
Nicolas Pontoizeau - Promotion EFREI 2005

Aug 22 '06 #3

Martin v. Löwis

Nicolas Pontoizeau wrote:

> I am handling a mixed-language text file encoded in UTF-8. It contains
> mainly French, English and Asian languages. I need to detect every
> Asian character in order to enclose it in a special tag for LaTeX.
> Does anybody know if there is a Unicode "table of characters"
> implementation in Python? I mean, I give it a character and Python replies
> with the language in which the character occurs.

This is a bit unspecific, so likely, nothing that already exists will
be completely correct for your needs. If you need to escape characters
for latex, I would expect that there is a more precise specification
of what you need to escape - I doubt the fact that a character is used
primarily in Asia matters much to latex.

In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement

Notice that some scripts are used both in Asia and elsewhere,
e.g. Latin and Cyrillic. Arabic probably doesn't belong in
this list, either, being used both in Asia and elsewhere
as the script of the official language.
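
A range table like the one above could be used to wrap runs of matching
characters in a tag. The following is a minimal sketch, not a complete
solution: only four of the ranges are copied in, and the \asian{...}
macro, ASIAN_RANGES, is_asian and tag_asian are hypothetical names chosen
for illustration, since the thread does not say which LaTeX tag is needed.

```python
from itertools import groupby

# A few (start, end) ranges copied from the list above; extend as needed.
ASIAN_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]

def is_asian(ch):
    """True if the code point of ch falls inside one of ASIAN_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ASIAN_RANGES)

def tag_asian(text, open_tag=r"\asian{", close_tag="}"):
    """Wrap each maximal run of matching characters in open_tag/close_tag.

    The default tag is a placeholder macro, not a real LaTeX command.
    """
    parts = []
    for flagged, run in groupby(text, key=is_asian):
        chunk = "".join(run)
        parts.append(open_tag + chunk + close_tag if flagged else chunk)
    return "".join(parts)

print(tag_asian("Bonjour 日本語 hello"))
```

Grouping by runs rather than tagging each character separately keeps the
output shorter; whether that is what LaTeX needs depends on the package
in use.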

Regards,
Martin

Aug 28 '06 #4

Tim Roberts

"Martin v. Löwis" <ma****@v.loewis.de> wrote:

> In any case, somebody pointed you to the Unicode code blocks. I think
> these are Asian scripts (I may have missed some):
>
> 0530..058F; Armenian
> 0590..05FF; Hebrew
> ...

This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

Aug 30 '06 #5

Martin v. Löwis

Tim Roberts wrote:

> > 0530..058F; Armenian
> > 0590..05FF; Hebrew
> > ...
>
> This is a fabulously useful list, Martin. Did you get this from a web
> page? Can you tell me where?

It's part of the Unicode Consortium's database (UCD, Unicode Character
Database). This specific table is called "code blocks":

http://www.unicode.org/Public/UNIDATA/Blocks.txt

Python does not currently have this table compiled in, but it should be
trivial to convert it into a pure-Python table (either a dictionary or
a list of triples).
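
A minimal sketch of the list-of-triples approach follows. It parses a
small inline sample written in the Blocks.txt line format rather than
downloading the real file; SAMPLE, parse_blocks and block_of are
hypothetical names, and the lookup strategy (bisect on the start values)
is just one possible choice.

```python
import bisect

# Short sample in the Blocks.txt format ("start..end; name", '#' comments).
SAMPLE = """\
# sample data, not the full table
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
3040..309F; Hiragana
4E00..9FFF; CJK Unified Ideographs
"""

def parse_blocks(text):
    """Return a sorted list of (start, end, name) triples."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        span, name = line.split(";", 1)
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    blocks.sort()
    return blocks

def block_of(ch, blocks):
    """Return the block name for ch, or None if no block covers it."""
    cp = ord(ch)
    starts = [b[0] for b in blocks]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and blocks[i][0] <= cp <= blocks[i][1]:
        return blocks[i][2]
    return None

blocks = parse_blocks(SAMPLE)
for ch in "Aéあ中":
    print(repr(ch), block_of(ch, blocks))
```

With the full file, the text passed to parse_blocks would simply be the
contents of Blocks.txt; the list of start values could be built once and
reused if many lookups are needed.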

Regards,
Martin

Sep 9 '06 #6
