Identifying unicode punctuation characters with Python regex

Hello,
I'm trying to build a regex in Python to identify punctuation
characters in all languages. Some regex implementations support an
extended syntax, \p{P}, that does just that. As far as I know, Python's
re module doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John
Nov 14 '08 #1
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".

Regards,
Martin
Nov 14 '08 #2
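
A minimal sketch of this suggestion, assuming Python 3.3 or later. The names punctuation_class and PUNCT_RE are made up for illustration. Writing each code point as a \UXXXXXXXX escape and merging neighbours into ranges is a choice made here so that no character in the class needs special escaping; it is not something the reply above prescribes.

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Build a regex character class covering every code point whose
    general category starts with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith("P"):
            if start is None:
                start = cp          # open a new run of punctuation
            prev = cp
        elif start is not None:
            ranges.append((start, prev))   # close the current run
            start = None
    if start is not None:
        ranges.append((start, prev))

    # Emit each run as \UXXXXXXXX or \UXXXXXXXX-\UXXXXXXXX.
    body = "".join(
        "\\U%08X" % lo if lo == hi else "\\U%08X-\\U%08X" % (lo, hi)
        for lo, hi in ranges
    )
    return "[" + body + "]"


PUNCT_RE = re.compile(punctuation_class())

print(PUNCT_RE.findall("Hello, world! ¿Qué tal? 我是美國人。"))
# [',', '!', '¿', '?', '。']
```

Which characters end up in the class depends on the Unicode database version bundled with the interpreter, so the class can differ slightly between Python releases.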
Thanks Martin. I'll do this.
Nov 14 '08 #3

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po = ''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r = re.compile('[' + Po + ']')
>>> x = '我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Nov 14 '08 #4
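
The pattern above keeps only the category 'Po' ("other punctuation"), whereas \p{P} covers every category that starts with "P". A small sketch of the difference, with sample characters picked here purely for illustration:

```python
import unicodedata

# A handful of punctuation characters from different "P" subcategories.
samples = ["!", "。", "(", ")", "-", "_", "«", "»"]
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
# '!' Po
# '。' Po
# '(' Ps
# ')' Pe
# '-' Pd
# '_' Pc
# '«' Pi
# '»' Pf

# An exact test for "Po" keeps only the first two; a prefix test keeps all.
only_po = [ch for ch, cat in categories.items() if cat == "Po"]
any_p = [ch for ch, cat in categories.items() if cat.startswith("P")]
print(only_po)  # ['!', '。']
print(any_p == samples)  # True
```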

This was an interesting problem. Both \ and ] need to be escaped to find
all the punctuation correctly. It turns out those two characters are
adjacent in the Unicode character set (U+005C and U+005D), so in my first
attempt the unescaped \ coincidentally escaped the ] that follows it.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A = ''.join(chr(i) for i in range(65536))
>>> P = ''.join(chr(i) for i in range(65536) if u.category(chr(i))[0] == 'P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('[' + P + ']', A))  # ] was naturally escaped
490
>>> set(P) - set(re.findall('[' + P + ']', A))  # so only \ is missing
{'\\'}
>>> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them
>>> len(re.findall('[' + P + ']', A))
491

-Mark

Nov 14 '08 #5
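
The effect described in this post can be reproduced in isolation with a class built from just the two characters \ and ]. This is a reduced sketch; the variable names are made up for illustration.

```python
import re

text = "a\\b]c"   # the five characters: a \ b ] c
chars = "\\]"     # the two characters we want to match: \ and ]

# Unescaped, the pattern is [\]] : the backslash escapes the bracket,
# so the class contains only "]" and the backslash is never matched.
naive = re.compile("[" + chars + "]")
print(naive.findall(text))   # [']']

# Escaped by hand as in the session above, the pattern is [\\\]] :
# a literal backslash followed by a literal bracket.
escaped = chars.replace("\\", "\\\\").replace("]", "\\]")
fixed = re.compile("[" + escaped + "]")
print(fixed.findall(text))   # ['\\', ']']
```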
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
Nov 14 '08 #6
> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them.

re.escape() does this without your code having to make any assumptions
about the regex implementation.
Nov 19 '08 #7
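
A sketch of the same check using re.escape() instead of the two manual replace() calls. It reuses the names A and P from the earlier session; the counts are not shown because they depend on the Unicode database version of the interpreter.

```python
import re
import unicodedata

# All BMP characters, and the subset in any punctuation category.
A = "".join(chr(i) for i in range(65536))
P = "".join(c for c in A if unicodedata.category(c).startswith("P"))

# re.escape() backslash-escapes whatever the re module treats as special,
# so \ ] ^ - and friends are all taken literally inside the class.
pattern = re.compile("[" + re.escape(P) + "]")

matched = pattern.findall(A)
print(len(P) == len(matched))   # True
print(set(P) - set(matched))    # set()
```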
