Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?
Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.
Thanks in advance for any suggestions.
John
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
Regards,
Martin
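A minimal sketch of the approach Martin describes, assuming a current Python 3. The helper name build_punct_class is hypothetical and not from the thread. It scans every code point, keeps those whose general category starts with "P", and runs each one through re.escape() so that characters such as \ and ] are safe inside the brackets.

```python
import re
import sys
import unicodedata

def build_punct_class():
    """Return a regex character class matching any Unicode punctuation."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    punct = [c for c in chars if unicodedata.category(c).startswith("P")]
    # re.escape() protects characters that are special inside [...],
    # such as \ ] ^ and -
    return "[" + "".join(re.escape(c) for c in punct) + "]"

PUNCT_RE = re.compile(build_punct_class())
print(PUNCT_RE.findall("Hello, 世界。¿Qué?"))  # [',', '。', '¿', '?']
```

Building the class takes a moment because it walks the whole code space, so it is worth compiling once and reusing the pattern.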
On Nov 14, 11:27 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> > I'm trying to build a regex in python to identify punctuation
> > characters in all the languages. Some regex implementations support an
> > extended syntax \p{P} that does just that. As far as I know, python re
> > doesn't. Any idea of a possible alternative?
> You should use character classes. You can generate them automatically
> from the unicodedata module: check whether unicodedata.category(c)
> starts with "P".
> Regards,
> Martin
Thanks Martin. I'll do this.
"Shiao" <mu*******@gmail.com> wrote in message
news:3a**********************************@l33g2000pri.googlegroups.com...
> Hello,
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?
> Apart from manually including the punctuation character range for each
> and every language, I don't see how this can be done.
> Thanks in advance for any suggestions.
> John
You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po = ''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r = re.compile('[' + Po + ']')
>>> x = '我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']
-Mark
"Mark Tolonen" <M8********@mailinator.com> wrote in message
news:xs******************************@comcast.com...
> You can always build your own pattern. Something like (Python 3.0rc2):
> >>> import unicodedata
> >>> Po = ''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
> >>> import re
> >>> r = re.compile('[' + Po + ']')
> >>> x = '我是美國人。'
> >>> x
> '我是美國人。'
> >>> r.findall(x)
> ['。']
> -Mark
This was an interesting problem. You need to escape \ and ] to find all the
punctuation correctly. It turns out those two characters are adjacent in
the Unicode character set, so the \ sat directly in front of the ] and
coincidentally escaped it in my first attempt.
IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A = ''.join(chr(i) for i in range(65536))
>>> P = ''.join(chr(i) for i in range(65536) if u.category(chr(i))[0] == 'P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('[' + P + ']', A))  # ] was naturally escaped
490
>>> set(P) - set(re.findall('[' + P + ']', A))  # so only missing \
{'\\'}
>>> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them
>>> len(re.findall('[' + P + ']', A))
491
-Mark
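The coincidence Mark mentions can be reproduced on a four-character string. This is only an illustrative sketch. The sample strings are made up and are not from the thread.

```python
import re

# '\' is U+005C and ']' is U+005D, so a string ordered by code point
# puts the backslash directly in front of the closing bracket.
assert ord("]") - ord("\\") == 1

unescaped = "!\\]?"                  # the four characters ! \ ] ?
naive = re.compile("[" + unescaped + "]")
# The regex engine reads [!\]?] : the backslash escapes ']' and is
# consumed, so the class matches ! ] ? but not the backslash itself.
print(naive.findall("a!b\\c]d?"))    # ['!', ']', '?']
```

That is why only the backslash was missing from the 490 matches above.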
On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:
> This was an interesting problem. Need to escape \ and ] to find all the
> punctuation correctly, and it turns out those characters are sequential in
> the Unicode character set, so ] was coincidentally escaped in my first
> attempt.
> IDLE 3.0rc2
> >>> import re
> >>> import unicodedata as u
> >>> A = ''.join(chr(i) for i in range(65536))
> >>> P = ''.join(chr(i) for i in range(65536) if u.category(chr(i))[0] == 'P')
> >>> len(A)
> 65536
> >>> len(P)
> 491
> >>> len(re.findall('[' + P + ']', A))  # ] was naturally escaped
> 490
> >>> set(P) - set(re.findall('[' + P + ']', A))  # so only missing \
> {'\\'}
> >>> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them
> >>> len(re.findall('[' + P + ']', A))
> 491
> -Mark
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
> P = P.replace('\\', '\\\\').replace(']', '\\]')  # escape both of them.
re.escape() does this without your code making any assumptions about the
regex implementation.
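As a rough check of the re.escape() suggestion, the sketch below repeats Mark's experiment over the same 65536 code points, with re.escape() replacing the manual replace() calls. The exact number of punctuation characters depends on the Unicode data shipped with the interpreter, so the sketch compares against the list itself instead of the 491 reported above.

```python
import re
import unicodedata

# re.escape() backslash-escapes the two characters Mark handled by hand.
print(re.escape("\\]"))          # prints \\\]

A = "".join(chr(i) for i in range(65536))
P = "".join(c for c in A if unicodedata.category(c)[0] == "P")
hits = re.findall("[" + re.escape(P) + "]", A)
print(hits == list(P))           # True: every punctuation character is found
```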