Identifying Unicode punctuation characters with Python regex

Shiao

Hello,
I'm trying to build a regex in Python to identify punctuation
characters in all languages. Some regex implementations support an
extended syntax, \p{P}, that does just that. As far as I know, Python's
re module doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John

Nov 14 '08 #1

Martin v. Löwis

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
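
A minimal sketch of that suggestion, assuming Python 3.3 or later (for \UXXXXXXXX escapes in patterns) and the full code point range rather than only the BMP. The names punctuation_class and PUNCT_RE are illustrative and not from the thread:

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Build a character class from every code point whose Unicode
    general category starts with 'P' (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    parts = []
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith("P"):
            # Spell each character as a \UXXXXXXXX escape so that none of
            # them can be misread as regex syntax inside [...].
            parts.append("\\U%08x" % cp)
    return "[" + "".join(parts) + "]"


PUNCT_RE = re.compile(punctuation_class())

print(PUNCT_RE.findall("Hello, world! 我是美國人。"))  # [',', '!', '。']
```

Writing the members as numeric escapes is one way to avoid thinking about which punctuation characters are special inside a class; the resulting pattern is longer but only has to be compiled once.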

Regards,
Martin

Nov 14 '08 #2

Shiao

Thanks Martin. I'll do this.

Nov 14 '08 #3

Mark Tolonen

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Nov 14 '08 #4

Mark Tolonen

This was an interesting problem. Both \ and ] need to be escaped to find
all the punctuation correctly, and it turns out those characters are
adjacent in the Unicode character set, so ] was coincidentally escaped by
the \ before it in my first attempt.

IDLE 3.0rc2

>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A))  # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A))  # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]')  # escape both of them.
>>> len(re.findall('['+P+']',A))
491
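
The same effect can be reproduced on a much smaller input. This is only a sketch, restricted to ASCII so the pattern is short enough to read; the names ascii_chars, naive, missed and fixed are illustrative:

```python
import re
import unicodedata

ascii_chars = "".join(chr(i) for i in range(128))
P = "".join(c for c in ascii_chars
            if unicodedata.category(c).startswith("P"))

# P contains the run [\] because U+005B, U+005C and U+005D are adjacent.
# Inside the class, \] is read as an escaped ], so the class still closes
# correctly but the backslash itself is never matched.
naive = re.compile("[" + P + "]")
missed = set(P) - set(naive.findall(ascii_chars))
print(missed)  # {'\\'}

# Escaping \ first and then ] restores the missing character.
fixed = re.compile("[" + P.replace("\\", "\\\\").replace("]", "\\]") + "]")
print(len(fixed.findall(ascii_chars)) == len(P))  # True
```

Note that the hyphen survives unescaped only because it sits between , and . in code point order, so ,-. happens to form a range that includes it.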

-Mark

Nov 14 '08 #5

Shiao

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)

Nov 14 '08 #6

jhermann

> P=P.replace('\\','\\\\').replace(']','\\]')  # escape both of them.

re.escape() does this w/o any assumptions by your code about the regex
implementation.
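
A short sketch of that approach, assuming the same range(65536) as the session quoted above; the variable names are illustrative:

```python
import re
import unicodedata

# BMP only, matching the range(65536) used earlier in the thread.
A = "".join(chr(i) for i in range(65536))
P = "".join(c for c in A if unicodedata.category(c).startswith("P"))

# re.escape() decides which characters need a backslash, so the code does
# not have to know that \ and ] (or - and ^) are special inside [...].
pattern = re.compile("[" + re.escape(P) + "]")

print(len(pattern.findall(A)) == len(P))  # True
```

The exact set of characters that re.escape() escapes differs between Python versions, but in each case the escaped string matches the original characters literally.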

Nov 19 '08 #7
