Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?
Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.
Thanks in advance for any suggestions.
John
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
Regards,
Martin
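A minimal sketch of the approach Martin describes, assuming Python 3.3 or later (so that re accepts \U escapes in patterns). The names punctuation_ranges, punctuation_class and PUNCT_RE are made up for illustration and do not come from the thread. Writing every code point as a \U escape sidesteps the question of which literal characters are special inside [...].

```python
import re
import sys
import unicodedata

def punctuation_ranges():
    """Yield (first, last) code point pairs for runs of consecutive
    characters whose general category starts with 'P'."""
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if not unicodedata.category(chr(cp)).startswith("P"):
            continue
        if prev is not None and cp == prev + 1:
            prev = cp          # extend the current run
            continue
        if start is not None:
            yield start, prev  # close the previous run
        start = prev = cp
    if start is not None:
        yield start, prev

def punctuation_class():
    """Build a character class such as [\\U00000021-\\U00000023...]."""
    parts = []
    for first, last in punctuation_ranges():
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"

PUNCT_RE = re.compile(punctuation_class())
print(PUNCT_RE.findall("Hello, world! 我是美國人。"))  # [',', '!', '。']
```

The exact set of characters matched depends on the Unicode database shipped with the Python version in use.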
On Nov 14, 11:27 am, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> You should use character classes. You can generate them automatically
> from the unicodedata module: check whether unicodedata.category(c)
> starts with "P".
Thanks Martin. I'll do this.
"Shiao" <mu*******@gmail.com> wrote in message
news:3a**********************************@l33g2000pri.googlegroups.com...
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?
You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']
-Mark
"Mark Tolonen" <M8********@mailinator.com> wrote in message
news:xs******************************@comcast.com...
> You can always build your own pattern. Something like (Python 3.0rc2):
> [...]
This was an interesting problem. Both \ and ] need to be escaped to find all
the punctuation correctly. It turns out those two characters are adjacent in
the Unicode character set (U+005C and U+005D), so when the class is built
without escaping, the \ coincidentally escapes the ] and only the \ itself
goes unmatched.
IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491
-Mark
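The coincidental escape can be reproduced in isolation. This is only an illustrative sketch; the names raw, unescaped and escaped are not from the thread. Inside a character class, the two-character sequence \] is read as an escaped ], so a class built from the raw characters \ and ] matches ] but not \.

```python
import re

raw = "\\]"  # two characters: a backslash followed by a right bracket

# Pattern text is [\]] : the backslash escapes the bracket, so only ']' matches.
unescaped = re.compile("[" + raw + "]")
print(unescaped.findall("a\\b]c"))  # [']']

# Escape the backslash first, then the bracket, as in the session above.
# Pattern text is [\\\]] : a literal backslash and a literal bracket.
fixed = raw.replace("\\", "\\\\").replace("]", "\\]")
escaped = re.compile("[" + fixed + "]")
print(escaped.findall("a\\b]c"))    # ['\\', ']']
```

The order of the two replace() calls matters: escaping ] first would add a backslash that the second call then doubles.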
On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:
> This was an interesting problem. Need to escape \ and ] to find all the
> punctuation correctly [...]
> >>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
> >>> len(re.findall('['+P+']',A))
> 491
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
re.escape() does this without your code making any assumptions about the
regex implementation.
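The re.escape() suggestion might look like this, reusing the P and A names from Mark's session. The names punct_re and found are hypothetical, and the character counts will differ between Python versions because the bundled Unicode database changes.

```python
import re
import unicodedata

# All BMP characters whose general category starts with 'P', as in the thread.
P = "".join(chr(i) for i in range(65536)
            if unicodedata.category(chr(i)).startswith("P"))

# re.escape() backslash-escapes whatever the re module treats as special,
# so the code does not need to know that list by hand.
punct_re = re.compile("[" + re.escape(P) + "]")

A = "".join(chr(i) for i in range(65536))
found = punct_re.findall(A)
print(len(found) == len(P))  # True: every punctuation character is matched
```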