I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.
For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.
There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?
Thanks in advance.
R.
On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
I'm trying to split a string into pieces on whitespace, but I want to save the whitespace characters rather than discarding them.
For example, I want to split the string '1 2' into ['1',' ','2']. I was certain that there was a way to do this using the standard string functions, but I just spent some time poring over the documentation without finding anything.
Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']
Note the null strings at the beginning and end when the string starts or
ends with an instance of the split RE. Pondering the second invocation
should show why they are there, though darned if I can think of a good way
to put it into words.
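One way to put it into words, as a minimal sketch rather than anything definitive: `re.split` with a capturing group returns the text between separators interleaved with the separators themselves, so a string that starts or ends with a separator gets an empty piece on that side. The helper name `split_keep` below is made up for illustration and is not part of the `re` module.

```python
import re

# Same pattern as in the session above, written as a raw string so that
# "\w" is not interpreted as a string escape.
word_splitter = re.compile(r"(\w+)")

def split_keep(text):
    """Hypothetical helper: split on runs of word characters, keep both
    the words and the whitespace between them, and drop empty strings."""
    return [piece for piece in word_splitter.split(text) if piece]

# '1 2' starts and ends with a separator (a word), so split() reports an
# empty string before the first one and after the last one.
raw = word_splitter.split("1 2")
print(raw)                # ['', '1', ' ', '2', '']
print(split_keep("1 2"))  # ['1', ' ', '2']
```

Filtering out the empty pieces afterwards is one option; whether it is preferable to a different pattern depends on the input.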
RickMuller wrote:
There's a chance I was instead thinking of something in the re module, but I also spent some time there without luck. Could someone point me to the right function, if it exists?
The re solution Jeremy Bowers posted is what you want. Here's another (probably
much slower) way for fun (with no surrounding empty strings):
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...
--
Brian Beck
Adventurer of the First Order
[Brian Beck]
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
Brilliant solution!
That leads to a better understanding of groupby as a tool for identifying
transitions without consuming them.
[Brian Beck]
I tried replacing the lambda thing with an attrgetter, but apparently my understanding of that isn't perfect... it groups by the identity of the bound method instead of calling it...
Right.
attrgetter gets but does not call.
If unicode isn't an issue, then the lambda can be removed:
>>> [''.join(g) for k, g in groupby(' test ing ', str.isspace)]
[' ', 'test', ' ', 'ing', ' ']
Raymond Hettinger
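The "gets but does not call" point can be sketched as follows. This assumes a Python that has `operator.methodcaller` (added in 2.6, so later than the interpreters used in this thread); it is one possible way to drop the lambda, not the only one.

```python
from itertools import groupby
from operator import attrgetter, methodcaller

text = ' test ing '

# attrgetter only fetches the attribute: the key it produces is a bound
# method object, not the True/False result of calling that method.
key = attrgetter('isspace')('x')
print(callable(key))   # True

# methodcaller fetches the method and calls it, so the key is a bool and
# groupby starts a new group each time "is whitespace" flips.
pieces = [''.join(g) for k, g in groupby(text, methodcaller('isspace'))]
print(pieces)          # [' ', 'test', ' ', 'ing', ' ']
```

Unlike `str.isspace`, `methodcaller('isspace')` looks the method up on each item, so it does not depend on the concrete string type.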
On Fri, 01 Apr 2005 18:01:49 -0500, Brian Beck wrote:
py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']
I tried replacing the lambda thing with an attrgetter, but apparently my understanding of that isn't perfect... it groups by the identity of the bound method instead of calling it...
Unfortunately, as you pointed out, it is slower:
python timeit.py -s
"import re; x = 'a ab c' * 1000; whitespaceSplitter = re.compile('(\w+)')"
"whitespaceSplitter.split(x)"
100 loops, best of 3: 9.47 msec per loop
python timeit.py -s
"from itertools import groupby; x = 'a ab c' * 1000;"
"[''.join(g) for k, g in groupby(x, lambda y: y.isspace())]"
10 loops, best of 3: 65.8 msec per loop
(tried to break it up to be easier to read)
But I like yours much better theoretically. It's also a pretty good demo
of "groupby".
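For anyone who wants to repeat the comparison from inside a script, here is a rough sketch using the `timeit` module directly. The function names are made up for illustration, and the absolute numbers will differ from the ones quoted above depending on machine and Python version.

```python
import re
import timeit
from itertools import groupby

x = 'a ab c' * 1000
word_splitter = re.compile(r'(\w+)')

def with_re():
    # Split on runs of word characters, keeping them via the capture group.
    return word_splitter.split(x)

def with_groupby():
    # Group consecutive characters by whether they are whitespace.
    return [''.join(g) for k, g in groupby(x, str.isspace)]

# The two results differ only in the empty strings re.split adds at the ends.
assert [p for p in with_re() if p] == with_groupby()

re_time = timeit.timeit(with_re, number=100)
groupby_time = timeit.timeit(with_groupby, number=100)
print("re.split: %.4f s for 100 runs" % re_time)
print("groupby:  %.4f s for 100 runs" % groupby_time)
```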
Thanks to everyone who responded!! I guess I have to study my regular
expressions a little more closely.
Jeremy Bowers wrote:
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
If you don't want any null strings at the beginning or the end, an
equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']
George
George Sakkis wrote:
If you don't want any null strings at the beginning or the end, an equivalent regexp is:
>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']
Perhaps you may want to use "\s+|\S+" if you have non-alphanumeric
characters in the string.
Reinhold
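A small sketch of the difference between the two findall patterns; the sample string is invented for illustration. With "\w+|\s+", characters that are neither word characters nor whitespace match nothing and silently disappear, whereas "\s+|\S+" accounts for every character of the input.

```python
import re

text = "foo, bar!"

words_or_space = re.compile(r"\w+|\s+")
space_or_nonspace = re.compile(r"\s+|\S+")

# The comma and the exclamation mark match neither alternative, so they
# are dropped from the result.
print(words_or_space.findall(text))     # ['foo', ' ', 'bar']

# Every character is either whitespace or non-whitespace, so joining the
# pieces reproduces the original string.
print(space_or_nonspace.findall(text))  # ['foo,', ' ', 'bar!']
```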