  #1  
September 4th, 2008, 03:45 PM
phasma
Python and Cyrillic characters in regular expression

Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile(r'(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

Please help.
  #2  
September 4th, 2008, 06:59 PM
Fredrik Lundh
Re: Python and Cyrillic characters in regular expression

can you provide a few sample strings that show this behaviour?

</F>

  #3  
September 5th, 2008, 12:35 PM
phasma
Re: Python and Cyrillic characters in regular expression

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)

  #4  
September 5th, 2008, 03:35 PM
MRAB
Re: Python and Cyrillic characters in regular expression

On Sep 5, 12:28 pm, phasma <xpa...@gmail.com> wrote:
Quote:
string = u"Привет"

All the characters are letters.

Quote:
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)
string = u"Hi.Привет"

The third character isn't a letter and isn't whitespace.

Quote:
(u'Hi',)
  #5  
September 5th, 2008, 06:55 PM
Fredrik Lundh
Re: Python and Cyrillic characters in regular expression

phasma wrote:
Quote:
string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)
string = u"Hi.Привет"
(u'Hi',)

the [\w\s] pattern you used matches letters, numbers, underscore, and
whitespace. "." doesn't fall into that category, so the "match" method
stops when it gets to that character.
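
To make that concrete, here is a minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default, so the (?u) flag from the question is not needed). The first two sample strings come from this thread; the third is made up for illustration.

```python
import re

# Same character class as in the question: word characters plus whitespace.
reg = re.compile(r"([\w\s]+)")

samples = [
    "Привет",      # from the thread: all letters
    "Hi.Привет",   # from the thread: "." is neither \w nor \s
    "Hi Привет",   # hypothetical extra sample: the space is matched by \s
]

# match() only tries at position 0 and the run ends at the first
# character that is outside the class.
results = {s: reg.match(s).group(1) for s in samples}

for s, matched in results.items():
    print(repr(s), "->", repr(matched))
```

So the script of the first character is not what matters here; the run simply ends at the first character outside [\w\s].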

maybe you could use re.sub or re.findall?
>>> # replace all non-alphanumerics with the empty string
>>> re.sub("(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

>>> # find runs of alphanumeric characters
>>> re.findall("(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']

>>> "".join(re.findall("(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to
skip, while "findall" expects you to specify what you want to keep.)

</F>
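
For readers on Python 3, the two approaches from the post above might be sketched as below. The helper names and the sample string are hypothetical, chosen only for this illustration. The third helper is an addition not found in the thread: since \w also matches digits and the underscore, the class [^\W\d_] is one common way to keep letters only, which is closer to what the original question asked for.

```python
import re

def keep_word_chars(text):
    # "sub" style: name what to skip (runs of non-word characters).
    return re.sub(r"\W+", "", text)

def word_runs(text):
    # "findall" style: name what to keep (runs of word characters).
    return re.findall(r"\w+", text)

def letters_only(text):
    # Assumption beyond the thread: [^\W\d_] means "a word character
    # that is neither a digit nor an underscore", i.e. a letter.
    return "".join(re.findall(r"[^\W\d_]+", text))

sample = "Hi.Привет_42"  # hypothetical sample with a digit run and underscore

print(keep_word_chars(sample))
print(word_runs(sample))
print(letters_only(sample))
```

Which variant fits depends on whether digits and underscores should count as part of the text to keep.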

 
