
Help with splitting

I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

Thanks in advance.

R.

Jul 18 '05 #1
7 Replies


On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
> I'm trying to split a string into pieces on whitespace, but I want to
> save the whitespace characters rather than discarding them.
>
> For example, I want to split the string '1 2' into ['1',' ','2'].
> I was certain that there was a way to do this using the standard string
> functions, but I just spent some time poring over the documentation
> without finding anything.


Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

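Note the null strings at the beginning and end of the first result. They
appear when the split RE (which here matches the words, not the
whitespace) matches at the very beginning or end of the string: split()
also returns the text before the first match and after the last one, and
there that text is empty. In the second invocation the string begins and
ends with whitespace, so those outer pieces are not empty.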
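
A minimal sketch of where the empty strings come from, written for Python 3 rather than the 2.3.5 session above and using raw strings for the patterns. The names word_splitter, space_splitter and drop_empty are made up for illustration; only re.compile and split come from the re module.

```python
import re

# Split on runs of word characters; the group keeps the matched text.
word_splitter = re.compile(r"(\w+)")
# Split on runs of whitespace instead; the group keeps the whitespace.
space_splitter = re.compile(r"(\s+)")

no_edges = "1 2 3 \t\n5"      # starts and ends with a word character
with_edges = " 1 2 3 \t\n5 "  # starts and ends with whitespace

# The pattern matches at both ends, so the outer pieces are empty.
a = word_splitter.split(no_edges)
# The pattern never matches at either end, so there are no empty pieces.
b = space_splitter.split(no_edges)
# The pattern matches at both ends again, so empty pieces reappear.
c = space_splitter.split(with_edges)


def drop_empty(parts):
    """Hypothetical helper: discard the empty strings split() can produce."""
    return [p for p in parts if p]


print(a)
print(b)
print(drop_empty(c))
```

Whichever pattern is used, filtering out the empty pieces afterwards gives the same list, so the choice between "(\w+)" and "(\s+)" mostly decides which inputs produce empty strings at the edges.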
Jul 18 '05 #2

RickMuller wrote:
> There's a chance I was instead thinking of something in the re module,
> but I also spent some time there without luck. Could someone point me
> to the right function, if it exists?


The re solution Jeremy Bowers posted is what you want. Here's another
(probably much slower) way for fun (with no surrounding empty strings):

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

--
Brian Beck
Adventurer of the First Order
Jul 18 '05 #3

[Brian Beck]
> py> from itertools import groupby
> py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
> [' ', 'test', ' ', 'ing', ' ']

Brilliant solution!

That leads to a better understanding of groupby as a tool for identifying
transitions without consuming them.
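
To make "identifying transitions without consuming them" concrete, here is a small sketch that keeps the key next to each run. It assumes Python 3, and the sample string and the name runs are made up for illustration.

```python
from itertools import groupby

sample = "ab  cd e"  # made-up input with runs of different lengths

# groupby starts a new group each time the key function's result changes,
# so every character ends up in exactly one run and nothing is thrown away.
runs = [
    (is_space, "".join(chars))
    for is_space, chars in groupby(sample, lambda ch: ch.isspace())
]

for is_space, text in runs:
    print(is_space, repr(text))
```

Because no character is dropped, joining the runs back together reproduces the input, which is what the original question asked for.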

> I tried replacing the lambda thing with an attrgetter, but apparently my
> understanding of that isn't perfect... it groups by the identity of the
> bound method instead of calling it...


Right.
attrgetter gets but does not call.

If unicode isn't an issue, then the lambda can be removed:
>>> [''.join(g) for k, g in groupby(' test ing ', str.isspace)]
[' ', 'test', ' ', 'ing', ' ']
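
For readers on a newer Python: operator.methodcaller, which was added in Python 2.6 and so was not available when this thread was written, is the "get and call" counterpart of attrgetter. A minimal sketch, assuming Python 3; the function name split_keep_whitespace is made up for illustration.

```python
from itertools import groupby
from operator import methodcaller


def split_keep_whitespace(text):
    """Split text into alternating runs of whitespace and non-whitespace."""
    # methodcaller("isspace") builds a callable that does ch.isspace(),
    # i.e. it looks the method up AND calls it, unlike attrgetter.
    is_space = methodcaller("isspace")
    return ["".join(group) for _, group in groupby(text, is_space)]


pieces = split_keep_whitespace(" test ing ")
print(pieces)
```

Since the method is looked up on each character, this version does not depend on the characters being a particular string type, which was the caveat with passing str.isspace directly in Python 2.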

Raymond Hettinger
Jul 18 '05 #4

On Fri, 01 Apr 2005 18:01:49 -0500, Brian Beck wrote:
> py> from itertools import groupby
> py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
> [' ', 'test', ' ', 'ing', ' ']
>
> I tried replacing the lambda thing with an attrgetter, but apparently my
> understanding of that isn't perfect... it groups by the identity of the
> bound method instead of calling it...


Unfortunately, as you pointed out, it is slower:

python timeit.py -s \
  "import re; x = 'a ab c' * 1000; whitespaceSplitter = re.compile('(\w+)')" \
  "whitespaceSplitter.split(x)"
100 loops, best of 3: 9.47 msec per loop

python timeit.py -s \
  "from itertools import groupby; x = 'a ab c' * 1000;" \
  "[''.join(g) for k, g in groupby(x, lambda y: y.isspace())]"
10 loops, best of 3: 65.8 msec per loop

(tried to break it up to be easier to read)

But I like yours much better theoretically. It's also a pretty good demo
of "groupby".
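
The same comparison can be run from inside Python with the timeit module instead of the command line. This is only a sketch, assuming Python 3; the function names and the loop counts are arbitrary choices, and the timings will not match the figures above.

```python
import re
import timeit
from itertools import groupby

x = "a ab c" * 1000
splitter = re.compile(r"(\w+)")


def with_re():
    # Regex approach: split on words, keep them via the capturing group.
    return splitter.split(x)


def with_groupby():
    # groupby approach: group consecutive characters by "is whitespace?".
    return ["".join(g) for _, g in groupby(x, lambda ch: ch.isspace())]


# Best of 3 repeats, 10 calls each, mirroring timeit's "best of 3" report.
re_time = min(timeit.repeat(with_re, number=10, repeat=3))
groupby_time = min(timeit.repeat(with_groupby, number=10, repeat=3))

print("re.split: %.3f msec per loop" % (re_time / 10 * 1000))
print("groupby:  %.3f msec per loop" % (groupby_time / 10 * 1000))
```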
Jul 18 '05 #5

Thanks to everyone who responded!! I guess I have to study my regular
expressions a little more closely.

Jul 18 '05 #6

Jeremy Bowers wrote:
> On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
> > I'm trying to split a string into pieces on whitespace, but I want to
> > save the whitespace characters rather than discarding them.
> >
> > For example, I want to split the string '1 2' into ['1',' ','2'].
> > I was certain that there was a way to do this using the standard string
> > functions, but I just spent some time poring over the documentation
> > without finding anything.
>
> Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
> [GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> whitespaceSplitter = re.compile("(\w+)")
> >>> whitespaceSplitter.split("1 2 3 \t\n5")
> ['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
> >>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
> [' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']
>
> Note the null strings at the beginning and end of the first result. They
> appear when the split RE (which here matches the words, not the
> whitespace) matches at the very beginning or end of the string: split()
> also returns the text before the first match and after the last one, and
> there that text is empty. In the second invocation the string begins and
> ends with whitespace, so those outer pieces are not empty.


If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

George

Jul 18 '05 #7

George Sakkis wrote:
> If you don't want any null strings at the beginning or the end, an
> equivalent regexp is:
>
> >>> whitespaceSplitter_2 = re.compile("\w+|\s+")
> >>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
> ['1', ' ', '2', ' ', '3', ' \t\n', '5']
> >>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
> [' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']


Perhaps you may want to use "\s+|\S+" if you have non-alphanumeric
characters in the string.
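
A quick sketch of the difference, assuming Python 3; the sample text and variable names are made up for illustration. With "\w+|\s+" any character that is neither a word character nor whitespace (punctuation, for instance) is silently skipped by findall, whereas "\s+|\S+" covers every character.

```python
import re

text = "Hello, world! 1 2"  # made-up input containing punctuation

# Punctuation matches neither \w nor \s, so findall skips over it.
word_or_space = re.findall(r"\w+|\s+", text)

# Every character is either whitespace or non-whitespace, so nothing is lost.
space_or_nonspace = re.findall(r"\s+|\S+", text)

print(word_or_space)
print(space_or_nonspace)
```

A handy check for this kind of splitter is that joining the pieces gives back the original string; that holds for the second pattern but not for the first on this input.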

Reinhold
Jul 18 '05 #8
