Identifying unicode punctuation characters with Python regex

Hello,
I'm trying to build a regex in Python to identify punctuation
characters in all languages. Some regex implementations support an
extended syntax, \p{P}, that does just that. As far as I know, Python's re
module doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John
Nov 14 '08 #1
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
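A minimal sketch of that suggestion, assuming Python 3.3 or later (for the \U escapes in patterns). The function name punctuation_class, the limit parameter, and the choice to scan up to sys.maxunicode are illustrative assumptions, not something from this thread:

```python
import re
import sys
import unicodedata


def punctuation_class(limit=sys.maxunicode + 1):
    """Build a regex character class for all code points below `limit`
    whose Unicode general category starts with "P".

    Hypothetical helper for illustration only."""
    runs = []  # [first, last] pairs of consecutive punctuation code points
    for cp in range(limit):
        if unicodedata.category(chr(cp)).startswith("P"):
            if runs and runs[-1][1] == cp - 1:
                runs[-1][1] = cp          # extend the current run
            else:
                runs.append([cp, cp])     # start a new run
    parts = []
    for first, last in runs:
        # Write every code point as a \UXXXXXXXX escape so that characters
        # such as \, ], ^ and - never need special handling in the class.
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"


PUNCT = re.compile(punctuation_class())
print(PUNCT.findall("Hello, world! 我是美國人。"))
```

Collapsing consecutive code points into ranges keeps the pattern short, and the exact set of characters matched depends on the Unicode database shipped with your Python version.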

Regards,
Martin
Nov 14 '08 #2
Thanks Martin. I'll do this.
Nov 14 '08 #3
You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po = ''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r = re.compile('[' + Po + ']')
>>> x = '我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Nov 14 '08 #4
This was an interesting problem. You need to escape \ and ] to find all the
punctuation correctly. It turns out those two characters are adjacent in
the Unicode character set (U+005C and U+005D), so ] was coincidentally
escaped by the \ right before it in my first attempt.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A = ''.join(chr(i) for i in range(65536))
>>> P = ''.join(chr(i) for i in range(65536) if u.category(chr(i))[0] == 'P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('[' + P + ']', A))  # ] was naturally escaped
490
>>> set(P) - set(re.findall('[' + P + ']', A))  # so only \ is missing
{'\\'}
>>> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them
>>> len(re.findall('[' + P + ']', A))
491
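
To see the accidental escape in isolation, here is a small sketch that reduces the character class to just those two neighbouring characters. The variable names and the sample string are made up for illustration:

```python
import re

# U+005C (\) is immediately followed by U+005D (]) in code point order,
# so a class built by concatenation contains the two-character text \]
chars = "\\]"

# Pattern text is [\]] : the backslash escapes the bracket and is itself lost.
naive = re.compile("[" + chars + "]")
print(naive.findall("a\\b]c"))

# Escaping both characters explicitly gives the pattern text [\\\]]
fixed_chars = chars.replace("\\", "\\\\").replace("]", "\\]")
escaped = re.compile("[" + fixed_chars + "]")
print(escaped.findall("a\\b]c"))
```

The first pattern finds only the bracket, while the second finds both the backslash and the bracket.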

-Mark

Nov 14 '08 #5
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
Nov 14 '08 #6
> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them.

re.escape() does this without your code making any assumptions about the
regex implementation.
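
A rough sketch of what that could look like, reusing the range(65536) limit from the earlier posts. The names A, P and punct_re are only illustrative, and the number of characters found will vary with the Unicode database of your Python version:

```python
import re
import unicodedata

# Every code point in the Basic Multilingual Plane, as in the earlier posts.
A = ''.join(chr(i) for i in range(65536))

# All characters whose general category starts with "P" (punctuation).
P = ''.join(c for c in A if unicodedata.category(c).startswith('P'))

# re.escape() backslash-escapes whatever the regex syntax treats as special
# (including \, ], [, ^ and -), so no hand-written replace() calls are needed.
punct_re = re.compile('[' + re.escape(P) + ']')

found = punct_re.findall(A)
print(len(found) == len(P))   # every punctuation character is matched once
```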
Nov 19 '08 #7
