newby question: Splitting a string - separator

Thomas Liesner

Hi all,

I have a text file which contains a single string with names.
I want to split this string into its records and put them into a list.
In "normal" cases I would do something like:

#!/usr/bin/python
inp = open("file")
data = inp.read()
names = data.split()
inp.close()

The problem is that the names contain spaces and the records are also
just separated by spaces. The only thing I can rely on is that the
record separator is always more than a single whitespace.

I thought of something like defining the separator for split() by using
a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would I use for "more than one"? \s+?

TIA,
Tom

Dec 8 '05 #1

Michael Spencer

Thomas Liesner wrote:

I thought of something like defining the separator for split() by using
a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would i use for "more than one"? \s+?

TIA,
Tom

\s+ matches one or more; you need \s{2,} for two or more:

>>> import re
>>> re.split(r"\s{2,}", "Guido van Rossum  Tim Peters  Thomas Liesner")
['Guido van Rossum', 'Tim Peters', 'Thomas Liesner']
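A minimal sketch of how this could be applied to text read from a file. The helper name `split_records` and the sample string are hypothetical, and stripping the input first is an added assumption, so that a single trailing newline from the file does not stay attached to the last name.

```python
import re

# Two or more consecutive whitespace characters mark a record boundary.
RECORD_SEPARATOR = re.compile(r"\s{2,}")


def split_records(data):
    """Split data on runs of two or more whitespace characters."""
    stripped = data.strip()
    if not stripped:
        # Nothing but whitespace: no records at all.
        return []
    return RECORD_SEPARATOR.split(stripped)


# Hypothetical sample: records separated by two or three spaces,
# followed by the newline a text file typically ends with.
sample = "Guido van Rossum  Tim Peters   Thomas Liesner\n"
names = split_records(sample)
print(names)
```

With a real file, the same helper could be called as `split_records(inp.read())` inside a `with open("file") as inp:` block.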

Michael

Dec 8 '05 #2

Noah

Thomas Liesner wrote:

...
The only thing I can rely on is that the
record separator is always more than a single whitespace.

I thought of something like defining the separator for split() by using
a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would I use for "more than one"? \s+?

For your split regex you could say
"\s\s+"
or
"\s{2,}"

This should work for you:
YOUR_SPLIT_LIST = re.split(r"\s{2,}", YOUR_STRING)
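A short sketch of the two spellings side by side, with plain \s+ added for contrast. The sample string is made up for illustration; it uses two spaces for one separator and a space-tab-space run for the other.

```python
import re

# Hypothetical sample: one two-space separator, one space-tab-space separator.
text = "Guido van Rossum  Tim Peters \t Thomas Liesner"

# Both patterns require at least two whitespace characters in a row.
with_plus = re.split(r"\s\s+", text)
with_braces = re.split(r"\s{2,}", text)

# \s+ alone also matches the single spaces inside each name.
too_greedy = re.split(r"\s+", text)

print(with_plus)
print(with_braces)
print(too_greedy)
```

The first two results should be identical; the third breaks every name into single words.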

Yours,
Noah

Dec 8 '05 #3

Jim

Hi Tom,

a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would i use for "more than one"? \s+?

For more than one, I'd use

\s\s+

-Jim

Dec 8 '05 #4

James Stroud

Thomas Liesner wrote:

I thought of something like defining the separator for split() by using
a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would i use for "more than one"? \s+?

TIA,
Tom

The one I like best goes like this:

py> data = "Guido van Rossum  Tim Peters  Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using regexes.

James

Dec 10 '05 #5

Kent Johnson

James Stroud wrote:

The one I like best goes like this:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using regexes.

Unfortunately it gives the wrong result.

Kent

Dec 10 '05 #6

Tim Peters

[James Stroud]

The one I like best goes like this:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using regexes.

[Kent Johnson] Unfortunately it gives the wrong result.

Still, it gets extra points for being such a pleasing example ;-)

Dec 10 '05 #7

bonono

Thomas Liesner wrote:

I thought of something like defining the separator for split() by using
a regex for "more than one whitespace". RegEx for whitespace is \s, but
what would i use for "more than one"? \s+?

Can I just use two spaces as the separator?

[x.strip() for x in data.split("  ")]

Dec 10 '05 #8

James Stroud

Kent Johnson wrote:

James Stroud wrote:
The one I like best goes like this:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using
regexes.

Unfortunately it gives the wrong result.

Kent

Just an example. Here is the "correct version":

names = [n for n in data.split("  ") if n]

James

Dec 10 '05 #9

Michael Spencer

bo****@gmail.com wrote:

Can I just use two spaces as the separator?

[x.strip() for x in data.split("  ")]

If you like, but it will create dummy entries if there are more than two spaces:

>>> data = "Guido van Rossum  Tim Peters    Thomas Liesner"
>>> [x.strip() for x in data.split("  ")]
['Guido van Rossum', 'Tim Peters', '', 'Thomas Liesner']

You could add a condition to the listcomp:

>>> [name.strip() for name in data.split("  ") if name]
['Guido van Rossum', 'Tim Peters', 'Thomas Liesner']

But what if there is some other whitespace character?

>>> data = "Guido van Rossum  Tim Peters  \t  Thomas Liesner"
>>> [name.strip() for name in data.split("  ") if name]
['Guido van Rossum', 'Tim Peters', '', 'Thomas Liesner']

Perhaps a smarter condition?

>>> [name.strip() for name in data.split("  ") if name.strip(" \t")]
['Guido van Rossum', 'Tim Peters', 'Thomas Liesner']

But this is beginning to feel like hard work.
I think this is a case where it's not worth the effort to try to avoid the regexp:

>>> import re
>>> re.split(r"\s{2,}", data)
['Guido van Rossum', 'Tim Peters', 'Thomas Liesner']

Michael

Dec 10 '05 #10

Steven D'Aprano

On Fri, 09 Dec 2005 18:02:02 -0800, James Stroud wrote:

The one I like best goes like this:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using regexes.

Yes, but the correct result would be:

['Guido van Rossum', 'Tim Peters', 'Thomas Liesner']

Your code is short, elegant but wrong.

It could also be shorter and more elegant:

# your version
py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> [n for n in data.split() if n]
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

# my version
py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> data.split()
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

The "if n" in the list comp is superfluous, and without that, the whole
list comp is unnecessary.
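
A small sketch of why the filter changes nothing here. The sample string is hypothetical. With no argument, split() already collapses runs of whitespace, so it never produces empty strings; only an explicit separator does.

```python
# Hypothetical sample with a two-space record separator.
data = "Guido van Rossum  Tim Peters"

# With no argument, split() treats any run of whitespace as one separator
# and never returns empty strings.
no_arg = data.split()

# With an explicit single space, every space is a separator, so the
# double space produces an empty string between 'Rossum' and 'Tim'.
single_space = data.split(" ")

# Filtering the no-argument result therefore removes nothing.
filtered = [n for n in data.split() if n]

print(no_arg)
print(single_space)
print(filtered == no_arg)
```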

--
Steven.

Dec 10 '05 #11

James Stroud

Steven D'Aprano wrote:

The "if n" in the list comp is superfluous, and without that, the whole
list comp is unnecessary.

See my post from one hour before this one.

Dec 10 '05 #12

Tim Roberts

James Stroud <js*****@mbi.ucla.edu> wrote:

The one I like best goes like this:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using regexes.

But it is slower than this, which produces EXACTLY the same (incorrect)
result:

data = "Guido van Rossum Tim Peters Thomas Liesner"
names = data.split()
--
- Tim Roberts, ti**@probo.com
Providenza & Boekelheide, Inc.

Dec 10 '05 #13

Fredrik Lundh

James Stroud wrote:

py> data = "Guido van Rossum Tim Peters Thomas Liesner"
py> names = [n for n in data.split() if n]
py> names
['Guido', 'van', 'Rossum', 'Tim', 'Peters', 'Thomas', 'Liesner']

I think it is theoretically faster (and more pythonic) than using
regexes.

Unfortunately it gives the wrong result.

Just an example. Here is the "correct version":

names = [n for n in data.split("  ") if n]

where "correct" is "still wrong", and "theoretically faster" means "slightly
slower" (at least if you fix your version, and precompile the pattern).
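
A rough sketch of how the comparison could be checked, not a benchmark result. The function names, the sample string and the iteration count are made up for illustration. The `with_str_split` variant is one possible way to "fix" the two-space split, and it assumes the separators consist of spaces only.

```python
import re
import timeit

# Hypothetical sample: one two-space and one three-space separator.
data = "Guido van Rossum  Tim Peters   Thomas Liesner"

# Precompile once so the pattern is not looked up again on every call.
two_or_more = re.compile(r"\s{2,}")


def with_regex():
    return two_or_more.split(data)


def with_str_split():
    # Split on two spaces, then strip leftovers from odd-length runs
    # and drop pieces that are empty after stripping.
    return [n.strip() for n in data.split("  ") if n.strip()]


# Time both variants; the numbers depend on the machine and Python version.
regex_seconds = timeit.timeit(with_regex, number=10000)
str_seconds = timeit.timeit(with_str_split, number=10000)
print(regex_seconds, str_seconds)
```

Both functions should return the same three names for this sample, so any timing difference is between equivalent results.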

</F>

Dec 10 '05 #14
