Identifying unicode punctuation characters with Python regex

Hello,
I'm trying to build a regex in Python to identify punctuation
characters in all languages. Some regex implementations support an
extended syntax, \p{P}, that does just that. As far as I know, Python's re
module doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John
Nov 14 '08 #1
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
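
A minimal sketch of that suggestion, not necessarily how it was meant to be done. The function name punctuation_class is made up for illustration, and scanning up to sys.maxunicode (rather than only the first 65536 code points) is an assumption. The class is written with \UXXXXXXXX escapes, which re accepts in str patterns on Python 3.3 and later, so no character in it needs regex escaping.

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Build a regex character class for every code point whose
    Unicode general category starts with "P" (hypothetical helper)."""
    ranges = []  # list of [first, last] code point pairs
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith("P"):
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp       # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    # Emit code point escapes so that \, ], - and friends are never
    # written literally into the pattern.
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append("\\U%08x" % first)
        else:
            parts.append("\\U%08x-\\U%08x" % (first, last))
    return "[" + "".join(parts) + "]"


PUNCT = re.compile(punctuation_class())
print(PUNCT.findall("Hello, world! 我是美國人。"))  # [',', '!', '。']
```

Collapsing consecutive code points into ranges keeps the pattern much shorter than listing every character.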

Regards,
Martin
Nov 14 '08 #2
Thanks Martin. I'll do this.
Nov 14 '08 #3

You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Nov 14 '08 #4

This was an interesting problem. Both \ and ] need to be escaped to match all
the punctuation correctly. It turns out those two characters are adjacent in
the Unicode character set, so ] was coincidentally escaped by the \ before it
in my first attempt.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491
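
The same effect can be reproduced on a much smaller string. This is only an illustrative sketch: the three-character string and the variable names are made up here, chosen so that \ is immediately followed by ] just as in code point order.

```python
import re

chars = "!\\]"  # three characters: ! \ ]

# Unescaped: the pattern text is [!\]] and the backslash escapes the ],
# so the class matches ! and ] but not the backslash itself.
naive = re.compile("[" + chars + "]")
print(naive.findall(chars))  # ['!', ']']

# Escaping both characters by hand gives the pattern text [!\\\]]
escaped = chars.replace("\\", "\\\\").replace("]", "\\]")
fixed = re.compile("[" + escaped + "]")
print(fixed.findall(chars))  # ['!', '\\', ']']
```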

-Mark

Nov 14 '08 #5
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
Nov 14 '08 #6
> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.

re.escape() does this without your code making any assumptions about the
regex implementation.
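
A short sketch of what that could look like, reusing the range(65536) scan from the earlier session. The names P, punct and missing are just for illustration.

```python
import re
import unicodedata

# All punctuation in the first 65536 code points, as in the earlier session.
P = "".join(chr(i) for i in range(65536)
            if unicodedata.category(chr(i)).startswith("P"))

# re.escape() backslash-escapes the characters that are special to re,
# which covers \ and ] (and - as well) inside a character class.
punct = re.compile("[" + re.escape(P) + "]")

# Check that no punctuation character is left unmatched.
missing = set(P) - set(punct.findall(P))
print(missing)
```

If missing comes back empty, every character in P is matched, including the backslash.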
Nov 19 '08 #7
