
split on blank lines

Hi everyone,

can somebody tell me why (using Python 2.3.2)
>>> import re
>>> re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']

? Being used to Perl semantics, I expect

['foo\n', 'bar\n', 'baz']

or something equivalent without the '\n' characters in the result
strings. I have found that
>>> re.compile(r"^\n", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n', 'bar\n', 'baz']

I prefer the first version, however, because it states my intent more
clearly. Could this be a bug in sre.py? (I looked at the code for a
good two minutes, but then my head started hurting.)

Thanks for your help,

Jan
Jul 18 '05 #1
4 Replies


Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.

Of course, if you really want to state your intentions, you could just use:
>>> "foo\n\nbar\n\nbaz".split('\n\n')
['foo', 'bar', 'baz']

as you aren't doing anything here that obviously benefits from regex
obfuscation.
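
To see exactly where the pattern matches, a small sketch along these lines may help. It uses finditer to report the span of each match; the variable names are only for illustration. Both matches are zero-width and sit at the start of an empty line, which is the case the regex split did not act on in the Python version discussed here.

```python
import re

text = "foo\n\nbar\n\nbaz"
blank = re.compile(r"^$", re.MULTILINE)

# Each match is zero-width (start == end) and sits at an empty line.
spans = [m.span() for m in blank.finditer(text)]
print(spans)  # [(4, 4), (9, 9)]

# The plain string split consumes the two newlines instead.
print(text.split("\n\n"))  # ['foo', 'bar', 'baz']
```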

--
Duncan Booth du****@rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
Jul 18 '05 #2

Duncan Booth wrote:
Given that re.compile("^$", re.MULTILINE).findall("foo\n\nbar\n\nbaz")
returns ['', ''] I would agree this looks like a bug. You could submit a
bug report on Sourceforge.


I may be wrong, but I would think that the behavior is correct. "^$" matches an
empty line. This is exactly what findall returns... two empty lines.

--
Hans (ha**@zephyrfalcon.org)
http://zephyrfalcon.org/

Jul 18 '05 #3

Hans Nowak <ha**@zephyrfalcon.org> wrote in
news:ma*************************************@python.org:
Duncan Booth wrote:
Given that re.compile("^$",
re.MULTILINE).findall("foo\n\nbar\n\nbaz") returns ['', ''] I would
agree this looks like a bug. You could submit a bug report on
Sourceforge.
I may be wrong, but I would think that the behavior is correct. "^$"
matches an empty line. This is exactly what findall returns... two
empty lines.

Perhaps you trimmed too much of the original context, but you have
misunderstood the original poster's intent.

The original post said:
can somebody tell me why (using Python 2.3.2)
>>> import re
>>> re.compile(r"^$", re.MULTILINE).split("foo\n\nbar\n\nbaz")
['foo\n\nbar\n\nbaz']


Notice that the string they are splitting contains two empty lines. I
pointed out that re.findall correctly spots the two empty lines, and
therefore you would expect that the split should correctly split the string
there, but it doesn't.

For the avoidance of doubt: there is an inconsistency of behaviour between
re.findall and re.split. It looks to me like a bug in the re.split method.
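
One way to get a split that agrees with what findall reports is to slice the string at the match positions by hand. The following is only a sketch: split_at_matches is a hypothetical helper, not something the re module provides. As far as I know, re.split itself does split on empty matches from Python 3.7 onwards, so this is mainly of interest for older versions.

```python
import re

def split_at_matches(pattern, text):
    """Split text at every match of pattern, including zero-width ones.

    Hypothetical helper for illustration; not part of the re module.
    """
    pieces = []
    last = 0
    for m in pattern.finditer(text):
        pieces.append(text[last:m.start()])  # text before this match
        last = m.end()                       # resume after the match
    pieces.append(text[last:])               # remainder after last match
    return pieces

blank = re.compile(r"^$", re.MULTILINE)
print(split_at_matches(blank, "foo\n\nbar\n\nbaz"))
# ['foo\n', '\nbar\n', '\nbaz']
```

Because the match is zero-width, nothing is consumed: the newline that ends each empty line stays at the front of the following piece.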

--
Duncan Booth du****@rcp.co.uk
Jul 18 '05 #4

Thank you, Duncan, for your input. You're right, I will post a bug
report on Sourceforge. Why, you ask, do I split on "^$" and not simply
on "\n\n"? Because I'm dealing with an idiotic file format (not my
own, mind you) and what I really want to split on is "^\t*$" (I agree
with you that it's a rather arbitrary definition of a blank line; once
again, not mine). When that didn't work, I spent a long time
questioning my understanding of regular expressions until I had
reduced my code to the smallest example that still showed the problem.
Sometimes I wish that Python contained more elements from AWK
("RS" in particular).
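
For the tabs-only case, a line-based approach avoids splitting on an empty match altogether. This is a minimal sketch, assuming a "blank" line is one containing nothing but tab characters; records and BLANK are made-up names for illustration.

```python
import re

# Assumed definition of a "blank" line: nothing but tab characters.
BLANK = re.compile(r"^\t*$")

def records(lines):
    """Yield lists of lines, separated by blank (tabs-only) lines."""
    current = []
    for line in lines:
        if BLANK.match(line.rstrip("\n")):
            if current:          # close the record in progress
                yield current
                current = []
        else:
            current.append(line)
    if current:                  # last record has no trailing blank line
        yield current

sample = "foo\n\t\t\nbar\n\nbaz".splitlines(True)
print(list(records(sample)))
# [['foo\n'], ['bar\n'], ['baz']]
```

Consecutive blank lines collapse into a single separator here, which is roughly how I recall AWK behaving when RS is set to the empty string. The generator also works directly on an open file object, since it only needs an iterable of lines.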

Cheers,

Jan

--
Being an actuary is a lot harder than being a mathematician: it is
enough for a mathematician to prove that he or she is right.
Jul 18 '05 #5
