473,473 Members | 1,556 Online
Bytes | Software Development & Data Engineering Community
Create Post

Home Posts Topics Members FAQ

unicode "table of character" implementation in python

Hi,

I am handling a mixed languages text file encoded in UTF-8. Theres is
mainly French, English and Asian languages. I need to detect every
asian characters in order to enclose it by a special tag for latex.
Does anybody know if there is a unicode "table of character"
implementation in python? I mean, I give a character and python replys
me with the language in which the character occurs.

Thanks in advance

--
http://www.nicolas.pontoizeau.org/
Nicolas Pontoizeau - Promotion EFREI 2005
Aug 22 '06 #1
5 4067
Nicolas Pontoizeau wrote:
I am handling a mixed languages text file encoded in UTF-8. Theres is
mainly French, English and Asian languages. I need to detect every
asian characters in order to enclose it by a special tag for latex.
Does anybody know if there is a unicode "table of character"
implementation in python? I mean, I give a character and python replys
me with the language in which the character occurs.
Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt

--
Brian Beck
Adventurer of the First Order
Aug 22 '06 #2
2006/8/22, Brian Beck <ex****@gmail.com>:
Nicolas, check out the unicodedata module:
http://docs.python.org/lib/module-unicodedata.html

Find "import unicodedata" on this page for how to use it:
http://www.amk.ca/python/howto/unicode

I'm not sure if it has built-in support for finding which language block a
character is in, but a table like this might help you:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
As usual, Python has a solution that goes beyond my needs!
Thanks for the links I will dive into it.

Nicolas

--
http://www.nicolas.pontoizeau.org/
Nicolas Pontoizeau - Promotion EFREI 2005
Aug 22 '06 #3
Nicolas Pontoizeau schrieb:
I am handling a mixed languages text file encoded in UTF-8. Theres is
mainly French, English and Asian languages. I need to detect every
asian characters in order to enclose it by a special tag for latex.
Does anybody know if there is a unicode "table of character"
implementation in python? I mean, I give a character and python replys
me with the language in which the character occurs.
This is a bit unspecific, so likely, nothing that already exists will
be completely correct for your needs. If you need to escape characters
for latex, I would expect that there is a more precise specification
of what you need to escape - I doubt the fact that a character is used
primarily in Asia matters much to latex.

In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
0600..06FF; Arabic
0700..074F; Syriac
0750..077F; Arabic Supplement
0900..097F; Devanagari
0980..09FF; Bengali
0A00..0A7F; Gurmukhi
0A80..0AFF; Gujarati
0B00..0B7F; Oriya
0B80..0BFF; Tamil
0C00..0C7F; Telugu
0D00..0D7F; Malayalam
0D80..0DFF; Sinhala
0E00..0E7F; Thai
0E80..0EFF; Lao
0F00..0FFF; Tibetan
1000..109F; Myanmar
10A0..10FF; Georgian
1100..11FF; Hangul Jamo
1780..17FF; Khmer
1800..18AF; Mongolian
1900..194F; Limbu
1950..197F; Tai Le
1980..19DF; New Tai Lue
19E0..19FF; Khmer Symbols
2D00..2D2F; Georgian Supplement
2E80..2EFF; CJK Radicals Supplement
2F00..2FDF; Kangxi Radicals
2FF0..2FFF; Ideographic Description Characters
3000..303F; CJK Symbols and Punctuation
3040..309F; Hiragana
30A0..30FF; Katakana
3100..312F; Bopomofo
3130..318F; Hangul Compatibility Jamo
3190..319F; Kanbun
31A0..31BF; Bopomofo Extended
31C0..31EF; CJK Strokes
31F0..31FF; Katakana Phonetic Extensions
3200..32FF; Enclosed CJK Letters and Months
3300..33FF; CJK Compatibility
3400..4DBF; CJK Unified Ideographs Extension A
4DC0..4DFF; Yijing Hexagram Symbols
4E00..9FFF; CJK Unified Ideographs
A000..A48F; Yi Syllables
A490..A4CF; Yi Radicals
AC00..D7AF; Hangul Syllables
F900..FAFF; CJK Compatibility Ideographs
FB50..FDFF; Arabic Presentation Forms-A
FE30..FE4F; CJK Compatibility Forms
FE70..FEFF; Arabic Presentation Forms-B
20000..2A6DF; CJK Unified Ideographs Extension B
2F800..2FA1F; CJK Compatibility Ideographs Supplement

Notice that some scripts are used both in Asia and elsewhere,
e.g. Latin and Cyrillic. Arabic probably doesn't belong in
this list, either, being used both in Asia and elsewhere
as the script of the official language.

Regards,
Martin
Aug 28 '06 #4
"Martin v. Löwis" <ma****@v.loewis.dewrote:
>
In any case, somebody pointed you to the Unicode code blocks. I think
these are Asian scripts (I may have missed some):

0530..058F; Armenian
0590..05FF; Hebrew
...
This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.
Aug 30 '06 #5
Tim Roberts schrieb:
>0530..058F; Armenian
0590..05FF; Hebrew
...

This is a fabulously useful list, Martin. Did you get this from a web
page? Can you tell me where?
It's part of the Unicode Consortium's database (UCD, Unicode Character
Database). This specific table is called "code blocks":

http://www.unicode.org/Public/UNIDATA/Blocks.txt

Python currently has this table not compiled in, but it should be
trivial to compile this into a pure-Python table (either as a
dictionary, or a list of triples).

Regards,
Martin
Sep 9 '06 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Daewon YOON | last post by:
I learned from http://www.jorendorff.com/articles/unicode/python.html that one can specify a unicode character by u'\N {name of the character}'. Is there any method that I do the reverse of this...
5
by: Michael | last post by:
I mean how to use _tprintf().
17
by: black tractor | last post by:
HI there.. l was just wondering, if l place a "table" in the "editable region" of my template, will the text, graphics placed inside the this "table" MOVE BY ITSELF?? l mean, recently l had a...
6
by: Lucvdv | last post by:
In a sorted list, certain characters are grouped together, for example "çb" (cedille b) comes after "ca" but before "cc". Is there a built-in way to find out what 'base character', such as C for...
15
by: wizardyhnr | last post by:
i want to try ANSI C99's unicode fuctions. so i write a test program. the function is simple, but i cannot compile it with dev c++ 4.9.9.2 under windows xp sp2, since the compiler always think that...
1
by: Kenneth McDonald | last post by:
Given a Python unicode character (string of length one), how would I find out the \uXXXX escape sequence for it? This isn't obvious from the docs I've been looking through. Thanks, Ken
2
by: Diilb | last post by:
I am using DOM to create an rss feed. The problem I am running into is "special characters" such as é è ç. If I try adding them to the XML as character data (CData), DOM chokes and throws out...
12
by: JBJ | last post by:
Hi, I'am very newbie in Python. For the moment I'am trying to convert an unicode character to his uppercase unaccented character. By example with locale fr_FR: a,A,à,À should return A o,O,ô,Ô...
6
by: gita ziabari | last post by:
Hello All, The following code does not work for unicode characters: keyword = dict() kw = 'ÇÅÎÓËÉÈ' keyword.setdefault(key, ).append (kw) It works fine for inserting ASCII character. Any...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
1
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...
0
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
0
muto222
php
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.