Identifying unicode punctuation characters with Python regex

Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John
Nov 14 '08 #1
6 Replies


> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".

Regards,
Martin
Nov 14 '08 #2
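
One possible reading of the suggestion above, as a rough sketch rather than anyone's actual code: walk every code point, keep those whose general category starts with "P", merge consecutive code points into ranges, and write the class with \UXXXXXXXX escapes so that no character has to be escaped by hand. The names punctuation_class and PUNCT are made up for this illustration, and the exact set of characters depends on the Unicode version bundled with your Python.

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Return a regex character class covering every code point whose
    Unicode general category starts with "P" (hypothetical helper)."""
    ranges = []  # list of [first, last] runs of consecutive code points
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith("P"):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp       # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    # Numeric \UXXXXXXXX escapes mean ], \ and - never appear literally.
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"


PUNCT = re.compile(punctuation_class())
print(PUNCT.findall("Hello, world! 我是美國人。"))
```

Building the class scans about 1.1 million code points, so it is best done once at import time and the compiled pattern reused.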

On Nov 14, 11:27 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> You should use character classes. You can generate them automatically
> from the unicodedata module: check whether unicodedata.category(c)
> starts with "P".

Thanks Martin. I'll do this.
Nov 14 '08 #3

"Shiao" <mu*******@gmail.com> wrote in message
news:3a**********************************@l33g2000pri.googlegroups.com...
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Nov 14 '08 #4

"Mark Tolonen" <M8********@mailinator.com> wrote in message
news:xs******************************@comcast.com...
> You can always build your own pattern. Something like (Python 3.0rc2):
> >>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
> >>> r=re.compile('['+Po+']')

This was an interesting problem. Both \ and ] have to be escaped inside the
character class to find all the punctuation correctly. It turns out those two
characters are adjacent in Unicode (U+005C and U+005D), so the unescaped \
coincidentally escaped the ] in my first attempt.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491

-Mark

Nov 14 '08 #5
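
The effect described in the post above can be reproduced on a much smaller scale. This is only an illustrative sketch: the three-character string and the names naive and fixed are invented here, not taken from the thread. It shows an unescaped backslash swallowing the ] that follows it, and the same manual replace() fix putting both characters back into the class.

```python
import re

chars = "@\\]"  # three characters: @, backslash, ]

# Unescaped: the backslash escapes the ] that follows it, so the class
# contains @ and ] but not the backslash itself.
naive = re.compile("[" + chars + "]")
print(naive.findall(chars))

# Escaped by hand, backslash first so the new backslashes are not doubled.
fixed_chars = chars.replace("\\", "\\\\").replace("]", "\\]")
fixed = re.compile("[" + fixed_chars + "]")
print(fixed.findall(chars))
```

The order of the two replace() calls matters: escaping ] first would add a backslash that the second call then doubles, leaving the ] unescaped again.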

On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:
> This was an interesting problem. Need to escape \ and ] to find all the
> punctuation correctly, and it turns out those characters are sequential in
> the Unicode character set, so ] was coincidentally escaped in my first
> attempt.

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
Nov 14 '08 #6

> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.

re.escape() does this without your code having to make any assumptions
about the regex implementation.
Nov 19 '08 #7
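
For comparison, here is a minimal sketch of the re.escape() variant, assuming the same BMP-only range (65536 code points) used earlier in the thread. The names P, punct and matched are just for this example, and len(P) will differ from the 491 reported above on newer Python versions that ship newer Unicode data.

```python
import re
import unicodedata

# All BMP characters whose general category starts with "P", as in the
# earlier posts.
P = "".join(chr(i) for i in range(65536)
            if unicodedata.category(chr(i)).startswith("P"))

# re.escape() backslash-escapes whatever the re module treats as special,
# so \, ] and - need no special handling here.
punct = re.compile("[" + re.escape(P) + "]")

matched = set(punct.findall(P))
print(set(P) - matched)  # characters the class failed to match
```

An empty set in the last line means every collected punctuation character is matched by the class.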
