
Help with splitting

I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

Thanks in advance.

R.

Jul 18 '05 #1
On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.


Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end of the first result. The
pattern matches the runs of word characters, so they appear whenever the
string begins or ends with a word rather than with whitespace. Pondering
the second invocation should show why they are there, though darned if I
can think of a good way to put it into words.
Jul 18 '05 #2
RickMuller wrote:
There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?


The re solution Jeremy Bowers posted is what you want. Here's another (probably
much slower) way for fun (with no surrounding empty strings):

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

--
Brian Beck
Adventurer of the First Order
Jul 18 '05 #3
[Brian Beck]>
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
Brilliant solution!

That leads to a better understanding of groupby as a tool for identifying
transitions without consuming them.

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...


Right.
attrgetter gets but does not call.

If unicode isn't an issue, then the lambda can be removed:
>>> [''.join(g) for k, g in groupby(' test ing ', str.isspace)]
[' ', 'test', ' ', 'ing', ' ']

Raymond Hettinger
Jul 18 '05 #4
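
The "gets but does not call" point can be sketched with `operator.methodcaller`, which postdates this thread (it was added in Python 2.6) and does call the named method on each item. This is an illustrative alternative, not something the posters used; `group_runs` is a hypothetical name.

```python
from itertools import groupby
from operator import methodcaller

# methodcaller("isspace") builds a callable equivalent to lambda s: s.isspace(),
# so each character is asked whether it is whitespace.
is_space = methodcaller("isspace")

def group_runs(text):
    """Return alternating runs of whitespace and non-whitespace characters."""
    return ["".join(run) for _, run in groupby(text, key=is_space)]

print(group_runs(" test ing "))  # [' ', 'test', ' ', 'ing', ' ']
```

Because the method is looked up on each item at call time, this key works for any object that has an `isspace` method, which is the case the `str.isspace` shortcut did not cover for unicode strings in Python 2.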
On Fri, 01 Apr 2005 18:01:49 -0500, Brian Beck wrote:
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...


Unfortunately, as you pointed out, it is slower:

python timeit.py -s
"import re; x = 'a ab c' * 1000; whitespaceSplitter = re.compile('(\w+)')"
"whitespaceSplitter.split(x)"

100 loops, best of 3: 9.47 msec per loop

python timeit.py -s
"from itertools import groupby; x = 'a ab c' * 1000;"
"[''.join(g) for k, g in groupby(x, lambda y: y.isspace())]"

10 loops, best of 3: 65.8 msec per loop

(tried to break it up to be easier to read)

But I like yours much better theoretically. It's also a pretty good demo
of "groupby".
Jul 18 '05 #5
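
A rough way to repeat this comparison from inside Python 3, using the `timeit` module rather than the command line, might look like the sketch below. The function names and the repeat counts are assumptions chosen for illustration, and the figures will differ from the 2005 numbers above.

```python
import re
import timeit
from itertools import groupby

SAMPLE = "a ab c" * 1000            # same test string as in the post above
WORDS = re.compile(r"(\w+)")        # same pattern as in the post above

def with_regex(text=SAMPLE):
    """Split with the capturing-group regex."""
    return WORDS.split(text)

def with_groupby(text=SAMPLE):
    """Split by grouping consecutive characters on str.isspace."""
    return ["".join(g) for _, g in groupby(text, str.isspace)]

if __name__ == "__main__":
    for func in (with_regex, with_groupby):
        # Best of 3 repeats of 10 calls each, reported per call.
        best = min(timeit.repeat(func, number=10, repeat=3)) / 10
        print(f"{func.__name__}: {best * 1e3:.2f} msec per call")
```

The two functions return the same pieces apart from the empty strings that the regex version adds at the ends.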
Thanks to everyone who responded!! I guess I have to study my regular
expressions a little more closely.

Jul 18 '05 #6
Jeremy Bowers wrote:
On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end of the first result. The
pattern matches the runs of word characters, so they appear whenever the
string begins or ends with a word rather than with whitespace. Pondering
the second invocation should show why they are there, though darned if I
can think of a good way to put it into words.


If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

George

Jul 18 '05 #7
George Sakkis wrote:
If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']


Perhaps you may want to use "\s+|\S+" if you have non-alphanumeric
characters in the string.

Reinhold
Jul 18 '05 #8
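
The difference between the two `findall` patterns shows up as soon as the input contains punctuation. A small sketch in Python 3 follows; the sample string is made up for illustration.

```python
import re

text = "foo-bar  baz!"

# \w+|\s+ silently skips characters that are neither word nor whitespace.
print(re.findall(r"\w+|\s+", text))   # ['foo', 'bar', '  ', 'baz']

# \s+|\S+ covers every character, so joining the pieces restores the input.
pieces = re.findall(r"\s+|\S+", text)
print(pieces)                         # ['foo-bar', '  ', 'baz!']
assert "".join(pieces) == text
```

With `\w+|\s+` the hyphen and the exclamation mark are dropped from the result, so only the `\s+|\S+` form is a lossless split.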
