Help with splitting

RickMuller

I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

Thanks in advance.

R.

Jul 18 '05 #1

Jeremy Bowers

On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:

I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end when the string starts or
ends with an instance of the split RE. Pondering the second invocation
should show why they are there, though darned if I can think of a good way
to put it into words.

Jul 18 '05 #2

Brian Beck

RickMuller wrote:

There's a chance I was instead thinking of something in the re module,
but I also spent some time there without luck. Could someone point me
to the right function, if it exists?

The re solution Jeremy Bowers posted is what you want. Here's another
(probably much slower) way for fun (with no surrounding empty strings):

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

--
Brian Beck
Adventurer of the First Order

Jul 18 '05 #3

Raymond Hettinger

[Brian Beck]

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

Brilliant solution!

That leads to a better understanding of groupby as a tool for identifying
transitions without consuming them.

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

Right.
attrgetter gets but does not call.

If unicode isn't an issue, then the lambda can be removed:

>>> [''.join(g) for k, g in groupby(' test ing ', str.isspace)]
[' ', 'test', ' ', 'ing', ' ']
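
The difference between getting and calling can be sketched as follows. This is an illustration rather than something posted in the thread, and it assumes Python 2.6 or later, where operator.methodcaller is available; the variable names are made up.

```python
from itertools import groupby
from operator import attrgetter, methodcaller

text = ' test ing '

# attrgetter('isspace') only fetches the bound method from each character;
# it never calls it, so the keys are method objects rather than booleans.
key = attrgetter('isspace')
print(callable(key('a')))   # a method object, not the result of calling it

# methodcaller('isspace') fetches the method and calls it, so it behaves
# like lambda x: x.isspace().
is_space = methodcaller('isspace')
pieces = [''.join(group) for _, group in groupby(text, is_space)]
print(pieces)

# The keys mark the transitions: each group is one run of equal key values.
runs = [(k, ''.join(g)) for k, g in groupby(text, str.isspace)]
print(runs)
```

Printing the keys next to the groups shows groupby starting a new group each time the key changes, without dropping any characters.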

Raymond Hettinger

Jul 18 '05 #4

Jeremy Bowers

On Fri, 01 Apr 2005 18:01:49 -0500, Brian Beck wrote:

py> from itertools import groupby
py> [''.join(g) for k, g in groupby(' test ing ', lambda x: x.isspace())]
[' ', 'test', ' ', 'ing', ' ']

I tried replacing the lambda thing with an attrgetter, but apparently my
understanding of that isn't perfect... it groups by the identity of the
bound method instead of calling it...

Unfortunately, as you pointed out, it is slower:

python timeit.py -s \
  "import re; x = 'a ab c' * 1000; whitespaceSplitter = re.compile('(\w+)')" \
  "whitespaceSplitter.split(x)"
100 loops, best of 3: 9.47 msec per loop

python timeit.py -s \
  "from itertools import groupby; x = 'a ab c' * 1000;" \
  "[''.join(g) for k, g in groupby(x, lambda y: y.isspace())]"
10 loops, best of 3: 65.8 msec per loop

(tried to break it up to be easier to read)

But I like yours much better theoretically. It's also a pretty good demo
of "groupby".
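
For anyone who wants to repeat the comparison from inside Python instead of the command line, here is a rough sketch using the timeit module. The function names with_re and with_groupby are made up for illustration, str.isspace stands in for the lambda, and the timings will differ by machine and Python version, so none are shown.

```python
import re
import timeit
from itertools import groupby

x = 'a ab c' * 1000
splitter = re.compile(r'(\w+)')

def with_re():
    return splitter.split(x)

def with_groupby():
    return [''.join(g) for _, g in groupby(x, str.isspace)]

# Sanity check: the two approaches agree once the empty strings that
# re.split puts at the ends are dropped.
assert [p for p in with_re() if p] == with_groupby()

for func in (with_re, with_groupby):
    # Best of 3 repeats of 10 calls each.
    best = min(timeit.repeat(func, number=10, repeat=3))
    print('%s: %.3f msec per call' % (func.__name__, best / 10 * 1000))
```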

Jul 18 '05 #5

RickMuller

Thanks to everyone who responded!! I guess I have to study my regular
expressions a little more closely.

Jul 18 '05 #6

George Sakkis

Jeremy Bowers wrote:

On Fri, 01 Apr 2005 14:20:51 -0800, RickMuller wrote:
I'm trying to split a string into pieces on whitespace, but I want to
save the whitespace characters rather than discarding them.

For example, I want to split the string '1 2' into ['1',' ','2'].
I was certain that there was a way to do this using the standard string
functions, but I just spent some time poring over the documentation
without finding anything.

Python 2.3.5 (#1, Mar 3 2005, 17:32:12)
[GCC 3.4.3 (Gentoo Linux 3.4.3, ssp-3.4.3-0, pie-8.7.6.6)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> whitespaceSplitter = re.compile("(\w+)")
>>> whitespaceSplitter.split("1 2 3 \t\n5")
['', '1', ' ', '2', ' ', '3', ' \t\n', '5', '']
>>> whitespaceSplitter.split(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Note the null strings at the beginning and end when the string starts or
ends with an instance of the split RE. Pondering the second invocation
should show why they are there, though darned if I can think of a good way
to put it into words.

If you don't want any null strings at the beginning or the end, an
equivalent regexp is:

>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

George

Jul 18 '05 #7

Reinhold Birkenfeld

George Sakkis wrote:

If you don't want any null strings at the beginning or the end, an
equivalent regexp is:

>>> whitespaceSplitter_2 = re.compile("\w+|\s+")
>>> whitespaceSplitter_2.findall("1 2 3 \t\n5")
['1', ' ', '2', ' ', '3', ' \t\n', '5']
>>> whitespaceSplitter_2.findall(" 1 2 3 \t\n5 ")
[' ', '1', ' ', '2', ' ', '3', ' \t\n', '5', ' ']

Perhaps you may want to use "\s+|\S+" if you have non-alphanumeric
characters in the string.
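
A small sketch of why that matters, using a made-up sample string with punctuation in it (the variable names are illustrative only):

```python
import re

words_or_space = re.compile(r"\w+|\s+")      # the pattern George posted
space_or_nonspace = re.compile(r"\s+|\S+")   # the pattern suggested here

sample = "a-b, c!"   # made-up input containing punctuation

# findall only returns what the pattern matches, so characters that are
# neither \w nor \s silently disappear with the first pattern.
lossy = words_or_space.findall(sample)
lossless = space_or_nonspace.findall(sample)
print(lossy)
print(lossless)

# Only the second result can be joined back into the original string.
print(''.join(lossless) == sample)
```

Since every character is either whitespace or non-whitespace, the second pattern always covers the whole string.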

Reinhold

Jul 18 '05 #8
