re.split() not keeping matched text

Hello,

Given the following program:

--------------

import re

x = "The dog ran. The cat eats! The bird flies? Done."
l = re.split("[.?!]", x)

for s in l:
    print(s.strip())
# for
---------------

I am getting the following output:

The dog ran
The cat eats
The bird flies
Done

As you can see, the end-of-sentence punctuation marks are being removed. Yet
the docs for re.split() say that the matched text is supposed to be
returned. I want to keep the punctuation marks.

Where am I going wrong here?

Thanks,
--
Robert
Jul 18 '05 #1
5 Replies


Hi Robert,

Robert Oschler wrote:
l = re.split("[.?!]", x)

I want to keep the punctuation marks.


The docs say: If _capturing parentheses_ are used in pattern, then the text
of all groups in the pattern are also returned as part of the resulting
list.

So:

l = re.split("([.?!])", x)

will work as wanted.
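
As a minimal sketch of what the capturing group changes, using the sample string from the original question (the variable name `parts` is only illustrative): each matched punctuation mark comes back as its own list element between the text pieces, and a trailing empty string appears because the input ends with a delimiter.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# Capturing parentheses make re.split() return the delimiters too,
# each as a separate list element between the surrounding text pieces.
parts = re.split("([.?!])", x)

print(parts)
# ['The dog ran', '.', ' The cat eats', '!', ' The bird flies', '?', ' Done', '.', '']
```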

Bye,
Kai
Jul 18 '05 #2

Kai,

That works. Unfortunately the punctuation marks (matched text) are returned
as separate list entries. Is there any way to avoid having to walk the list
by steps of 2, and rejoin the "n" and "n+1" elements, to get back the
original sentence(s)? I'm trying to save some processing time if possible.

Thanks,
--
Robert
Jul 18 '05 #3
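
For reference, the "steps of 2" rejoin described in the question above can be sketched with slicing and zip(). This is only one possible way to do it, the names `texts`, `marks` and `sentences` are illustrative, and no claim is made about its speed relative to the other approaches in this thread.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."
parts = re.split("([.?!])", x)

# Even indexes hold the text pieces, odd indexes hold the captured punctuation.
texts = parts[0::2]
marks = parts[1::2]

# zip() stops at the shorter list, so the unpaired final text piece
# (here the empty string after the last period) is dropped.
sentences = [(text + mark).strip() for text, mark in zip(texts, marks)]

print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']
```

Note that any trailing text without closing punctuation would also be dropped by the zip() pairing, so it would need separate handling if it matters.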

On Sun, 25 Jul 2004, Robert Oschler wrote:
Given the following program:

--------------

import re

x = "The dog ran. The cat eats! The bird flies? Done."
l = re.split("[.?!]", x)

for s in l:
    print(s.strip())
# for
---------------

I want to keep the punctuation marks.

Where am I going wrong here?


What you need is some magic with the (?<=...), or 'look-behind assertion'
operator:

re.split(r'(?<=[.?!])\s*', x)

What this regex is saying is "match a string of spaces that follows one of
[.?!]". This way, it will not consume the punctuation, but will consume
the spaces (thus killing two birds with one stone by obviating the need
for the subsequent s.strip()).

Unfortunately, there is a slight bug: if the punctuation is not followed by
whitespace, re.split won't split there, because the regex matches a
zero-length string and re.split ignores empty matches. There is a patch to
fix this (SF #988761, see the end of the message for a link), but until
then, you can avoid the problem by using:

re.split(r'(?<=[.?!])\s+', x)

This won't split at end-of-sentence marks that are not followed by
whitespace, but that may be preferable behaviour anyway (e.g. if you're
parsing Python documentation).
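
A short sketch of the `\s+` form applied to the sample string from the question, plus a second made-up input (the `os.path.join` sentence is an assumption added for illustration) showing that dots inside a dotted name are left alone. Python 3.7 and later do split on empty matches, so the `\s*` form behaves differently there; the `\s+` form shown here should behave the same across versions.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."

# (?<=[.?!]) requires a punctuation mark just before the match without
# consuming it; \s+ then consumes the whitespace, which is the only text removed.
sentences = re.split(r"(?<=[.?!])\s+", x)
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# Punctuation that is not followed by whitespace is not a split point.
dotted = re.split(r"(?<=[.?!])\s+", "Call os.path.join(a, b). Done.")
print(dotted)
# ['Call os.path.join(a, b).', 'Done.']
```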

Hope this helps.

Patch #988761:
http://sourceforge.net/tracker/index...70&atid=305470

Jul 18 '05 #4

I don't know if this will save you any processing time, but you can just
replace the split with a findall like this:
l = re.findall("[^.?!]+[?!.]+", x)

This should handle your example, plus it handles multiple occurrences of
the punctuation at the end of the sentence.
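
A small sketch of the findall approach on the sample string, with a second made-up input ("Really?! Yes...") added here only to illustrate the repeated-punctuation case. The matches keep their leading spaces, so a strip() is still needed if those are unwanted.

```python
import re

x = "The dog ran. The cat eats! The bird flies? Done."
pattern = "[^.?!]+[?!.]+"

# Each match is a run of non-punctuation followed by one or more marks.
sentences = [s.strip() for s in re.findall(pattern, x)]
print(sentences)
# ['The dog ran.', 'The cat eats!', 'The bird flies?', 'Done.']

# Runs of punctuation stay attached to their sentence.
repeated = re.findall(pattern, "Really?! Yes...")
print(repeated)
# ['Really?!', ' Yes...']
```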

Jul 18 '05 #5

ma**@wutka.com wrote:
I don't know if this will save you any processing time, but you can just
replace the split with a findall like this:
l = re.findall("[^.?!]+[?!.]+", x)

This should handle your example, plus it handles multiple occurrences of
the punctuation at the end of the sentence.


One caveat: the invariant

"".join(re.findall("[^?!.]+[?!.]+", s)) == s

will no longer hold as you will lose leading punctuation and trailing
non-punctuation:
>>> re.findall("[^?!.]+[?!.]+", "!so what! you're done? yes done")
['so what!', " you're done?"]
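
If the join invariant matters, one possible variant (not from this thread, and only a sketch) is to let the text part be empty and to add a fallback alternative for a trailing run without punctuation. The pattern name `lossless` is hypothetical.

```python
import re

s = "!so what! you're done? yes done"

original = "[^?!.]+[?!.]+"
# Hypothetical variant: the text before the punctuation may be empty, and a
# bare run of non-punctuation is accepted when no punctuation follows it.
lossless = "[^?!.]*[?!.]+|[^?!.]+"

lost = "".join(re.findall(original, s))
pieces = re.findall(lossless, s)

print(lost == s)             # False
print(pieces)
# ['!', 'so what!', " you're done?", ' yes done']
print("".join(pieces) == s)  # True
```

The cost is that the result can now contain pieces that are only punctuation or only text, so callers may need to filter those.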


Peter

Jul 18 '05 #6
