
split() and string.whitespace

P: n/a
I am unable to figure out why the first two statements work as I
expect them to and the next two do not. Namely, the first two split the
sentence into its component words, while the latter two return the
whole sentence intact.

import string
from string import whitespace
mytext = "The quick brown fox jumped over the lazy dog.\n"

print mytext.split()
print mytext.split(' ')
print mytext.split(whitespace)
print string.split(mytext, sep=whitespace)
Oct 31 '08 #1
13 Replies


P: n/a
On Fri, 31 Oct 2008 11:53:30 -0700, Chaim Krause wrote:
> I am unable to figure out why the first two statements work as I expect
> them to and the next two do not. Namely, the first two split the sentence
> into its component words, while the latter two return the whole sentence
> intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')

This splits at the string ' '.

> print mytext.split(whitespace)

This splits at the string '\t\n\x0b\x0c\r ' which doesn't occur in
`mytext`. The argument is a string, not a set of characters.

> print string.split(mytext, sep=whitespace)

Same here.
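
A small sketch of the point above, written in Python 3 syntax (the thread itself uses Python 2 print statements). The variable names `unsplit` and `joined` are only illustrative. It shows that the whole of `string.whitespace` is used as one literal separator:

```python
import string

mytext = "The quick brown fox jumped over the lazy dog.\n"

# The whole of string.whitespace is treated as ONE literal separator.
# That exact character sequence never occurs in mytext, so nothing is split.
unsplit = mytext.split(string.whitespace)
print(unsplit)

# A text that does contain the full sequence is split at that point.
joined = "left" + string.whitespace + "right"
print(joined.split(string.whitespace))
```

The first call returns a one-element list holding the original sentence, which matches what the original poster observed.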

Ciao,
Marc 'BlackJack' Rintsch

Oct 31 '08 #2

P: n/a
> I am unable to figure out why the first two statements work as I
> expect them to and the next two do not. Namely, the first two split the
> sentence into its component words, while the latter two return the
> whole sentence intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')
> print mytext.split(whitespace)
> print string.split(mytext, sep=whitespace)

split() treats its separator as a literal string. Only when no separator
is specified does it split on runs of arbitrary whitespace.

For an example, try

s = "abcdefgbcdefgh"
s.split("c") # ['ab', 'defgb', 'defgh']
s.split("fgb") # ['abcde', 'cdefgh']
string.whitespace is a string, so split() tries to split on that
literal sequence of whitespace characters, not on any one character
from the set.

-tkc

Oct 31 '08 #3

P: n/a
On Fri, Oct 31, 2008 at 11:53 AM, Chaim Krause <ch***@chaim.com> wrote:
> I am unable to figure out why the first two statements work as I
> expect them to and the next two do not. Namely, the first two split the
> sentence into its component words, while the latter two return the
> whole sentence intact.
>
> import string
> from string import whitespace
> mytext = "The quick brown fox jumped over the lazy dog.\n"
>
> print mytext.split()
> print mytext.split(' ')
> print mytext.split(whitespace)
> print string.split(mytext, sep=whitespace)

Also note that a plain 'mytext.split()' with no arguments will split
on any whitespace character like you're trying to do here.
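
To illustrate the difference between a bare split() and split(' '), here is a short sketch in Python 3 syntax. The sample text and the names `words_default` and `words_space` are made up for this example:

```python
# Note the double space after "The" and the trailing newline.
mytext = "The  quick brown fox.\n"

# No argument: a run of any whitespace counts as one separator, and
# leading or trailing whitespace produces no empty strings.
words_default = mytext.split()

# Explicit ' ': every single space is a separator; other whitespace
# (here the newline) stays in the result.
words_space = mytext.split(" ")

print(words_default)
print(words_space)
```

With an explicit ' ' the double space yields an empty string and the last word keeps its newline, so the two calls are not interchangeable.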

Cheers,
Chris
--
Follow the path of the Iguana...
http://rebertia.com
Oct 31 '08 #4

P: n/a
The documentation I am referencing states...

The sep argument may consist of multiple characters (for example, "'1,
2, 3'.split(', ')" returns "['1', '2', '3']").

So why don't the latter two split on *any* whitespace character, and
instead look for the sep string as a whole?
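
For what it's worth, the documentation example can be checked directly. This is a sketch in Python 3 syntax; the second input string and both variable names are my own, chosen so that it contains a comma and a space but never the two together:

```python
# The separator ', ' is the two-character string "comma then space".
doc_example = "1, 2, 3".split(", ")

# This text contains a comma and a space, but never ", " as a unit,
# so no split happens at all.
mixed = "1,2 3".split(", ")

print(doc_example)
print(mixed)
```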
Oct 31 '08 #5

P: n/a
I have arrived here while attempting to break down a larger problem. I
got to this question when attempting to split a line on any whitespace
character so that I could then add several other characters like ';'
and ':'. Ultimately splitting a line on any char in a union of
string.whitespace and some pre-designated chars.

I am now beginning to think that I have outgrown split() and must move
up to regular expressions. If that is the case, I will go off and RTFM
on RegEx.
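
One possible regular-expression approach to the problem described above, as a sketch in Python 3 syntax. The names `extra`, `pattern`, `line` and `fields` and the sample input are assumptions for illustration, not something from the thread:

```python
import re
import string

# Extra separator characters on top of string.whitespace (illustrative choice).
extra = ";:"

# Build a character class from the union; re.escape guards against
# characters that are special inside a regular expression.
pattern = re.compile("[" + re.escape(string.whitespace + extra) + "]+")

line = "alpha; beta:gamma\tdelta \n"

# re.split can leave empty strings at the ends, so filter them out.
fields = [f for f in pattern.split(line) if f]
print(fields)
```

The trailing `+` makes a run of separators count as one, which mirrors what a bare split() does for whitespace.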
Oct 31 '08 #6

P: n/a
On Oct 31, 2:12 pm, Chaim Krause <ch...@chaim.com> wrote:
> The documentation I am referencing states...
>
> The sep argument may consist of multiple characters (for example, "'1,
> 2, 3'.split(', ')" returns "['1', '2', '3']").
>
> So why don't the latter two split on *any* whitespace character, and
> instead look for the sep string as a whole?

Now, rereading the documentation in light of the replies to my
original posting, I see that I misinterpreted the example as using
"comma OR space" when it was actually "comma followed by space". I am
now properly enlightened.

Thank you all for your help.
Oct 31 '08 #7

P: n/a
On Oct 31, 6:57 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Fri, 31 Oct 2008 11:53:30 -0700, Chaim Krause wrote:
> > I am unable to figure out why the first two statements work as I expect
> > them to and the next two do not. Namely, the first two split the sentence
> > into its component words, while the latter two return the whole sentence
> > intact.
> > import string
> > from string import whitespace
> > mytext = "The quick brown fox jumped over the lazy dog.\n"
> > print mytext.split()
> > print mytext.split(' ')
>
> This splits at the string ' '.
> > print mytext.split(whitespace)
>
> This splits at the string '\t\n\x0b\x0c\r ' which doesn't occur in
> `mytext`. The argument is a string not a set of characters.
> > print string.split(mytext, sep=whitespace)
>
> Same here.

<muse>
It's interesting, if you think about it, that here we have someone who
wants to split on a set of characters but 'split' splits on a string,
and others sometimes want to strip off a string but 'strip' strips on
a set of characters (passed as a string). You could imagine that if
Python had had (character) sets from the start then 'split' and
'strip' could have accepted a string or a set depending on whether you
wanted to split on or strip off a string or a set.
</muse>
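
The asymmetry described here can be seen in a few lines. This is a sketch in Python 3 syntax; `remove_prefix` is a hypothetical helper written for this example, not a standard function:

```python
# split() treats its argument as one literal separator string.
pieces = "a--b-c".split("--")

# strip() treats its argument as a set of characters to remove from both ends.
stripped = "www.example.com".strip("w.moc")


def remove_prefix(text, prefix):
    """Hypothetical helper: drop a literal prefix string if it is present."""
    return text[len(prefix):] if text.startswith(prefix) else text


print(pieces)
print(stripped)
print(remove_prefix("www.example.com", "www."))
```

Note that strip("w.moc") also eats the trailing ".com", because every one of those characters is in the set.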
Oct 31 '08 #8

P: n/a
On Fri, 31 Oct 2008 12:18:32 -0700, Chaim Krause wrote:
> I have arrived here while attempting to break down a larger problem. I
> got to this question when attempting to split a line on any whitespace
> character so that I could then add several other characters like ';' and
> ':'. Ultimately splitting a line on any char in a union of
> string.whitespace and some pre-designated chars.
>
> I am now beginning to think that I have outgrown split() and must move
> up to regular expressions. If that is the case, I will go off and RTFM
> on RegEx.

Or just do this:

s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"
s = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
s.split(' ')
or even simpler:

s.split()
--
Steven
Nov 1 '08 #9

P: n/a
Steven D'Aprano wrote:
> On Fri, 31 Oct 2008 12:18:32 -0700, Chaim Krause wrote:
> > I have arrived here while attempting to break down a larger problem. I
> > got to this question when attempting to split a line on any whitespace
> > character so that I could then add several other characters like ';' and
> > ':'. Ultimately splitting a line on any char in a union of
> > string.whitespace and some pre-designated chars.
> >
> > I am now beginning to think that I have outgrown split() and must move
> > up to regular expressions. If that is the case, I will go off and RTFM
> > on RegEx.
>
> Or just do this:
> s = "the quick brown\tdog\njumps over\r\n\t the lazy dog"
> s = s.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
> s.split(' ')
> or even simpler:
> s.split()

Or, for faster per-repetition (blending in to your use-case):

import string
SEP = string.maketrans('abc \t', '     ')  # both arguments must be the same length (5 chars)
...
parts = 'whatever, abalone dudes'.translate(SEP).split()
print parts

['wh', 'tever,', 'lone', 'dudes']
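
A sketch of the same idea in Python 3 syntax, where the translation table is built with `str.maketrans`. The separator set ";:" and the names `SEP`, `line` and `parts` are assumptions for illustration:

```python
# Map each extra separator character to a space (illustrative choice of ';' and ':').
SEP = str.maketrans(dict.fromkeys(";:", " "))

line = "alpha; beta:gamma\tdelta"

# After translation every separator is whitespace, so a bare split() suffices.
parts = line.translate(SEP).split()
print(parts)
```

The table is built once and can be reused for every line, which is the "faster per-repetition" point made above.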
--Scott David Daniels
Sc***********@Acm.Org
Nov 3 '08 #10

P: n/a
That is a very creative solution! Thank you Scott.
> Or, for faster per-repetition (blending in to your use-case):
>
>     import string
>     SEP = string.maketrans('abc \t', '     ')
>     ...
>     parts = 'whatever, abalone dudes'.translate(SEP).split()
>     print parts
>
> ['wh', 'tever,', 'lone', 'dudes']
Nov 4 '08 #11

P: n/a
MRAB:
> It's interesting, if you think about it, that here we have someone who
> wants to split on a set of characters but 'split' splits on a string,
> and others sometimes want to strip off a string but 'strip' strips on
> a set of characters (passed as a string).

That can be seen as a little inconsistency in the language. But with
some practice you learn it.

> You could imagine that if
> Python had had (character) sets from the start then 'split' and
> 'strip' could have accepted a string or a set depending on whether you
> wanted to split on or strip off a string or a set.

Too bad you didn't suggest this when they were designing Python
3 :-)
This may be suggested for Python 3.1.

Bye,
bearophile
Nov 4 '08 #12

P: n/a
On Nov 4, 8:00 pm, bearophileH...@lycos.com wrote:
> MRAB:
> > It's interesting, if you think about it, that here we have someone who
> > wants to split on a set of characters but 'split' splits on a string,
> > and others sometimes want to strip off a string but 'strip' strips on
> > a set of characters (passed as a string).
>
> That can be seen as a little inconsistency in the language. But with
> some practice you learn it.
> > You could imagine that if
> > Python had had (character) sets from the start then 'split' and
> > 'strip' could have accepted a string or a set depending on whether you
> > wanted to split on or stripping off a string or a set.
>
> Too bad you haven't suggested this when they were designing Python
> 3 :-)
> This may be suggested for Python 3.1.

I might also add that str.startswith can accept a tuple of strings;
shouldn't that have been a set? :-)
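
A quick sketch of the startswith behaviour mentioned here, in Python 3 syntax. The file name and variable names are invented for the example, and the TypeError for a set is what I would expect from current CPython rather than something stated in this thread:

```python
filename = "report_2008.txt"

# A tuple of prefixes is accepted: True if any one of them matches.
matches_tuple = filename.startswith(("report", "summary"))

# A set of prefixes is rejected with a TypeError.
try:
    filename.startswith({"report", "summary"})
    set_accepted = True
except TypeError:
    set_accepted = False

print(matches_tuple, set_accepted)
```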

I also had the thought that the backtick (`), which is not used in
Python 3, could be used to form character set literals (`aeiou` =>
set("aeiou")), although that might only be worth while if character
sets were introduced as a specialised form of set.
Nov 4 '08 #13

P: n/a
MRAB:
> I also had the thought that the backtick (`), which is not used in
> Python 3, could be used to form character set literals (`aeiou` =>
> set("aeiou")), although that might only be worth while if character
> sets were introduced as a specialised form of set.

Python developers have removed it from the syntax mostly because a lot
of keyboards (probably most in the world) don't have "`" on them.

Bye,
bearophile
Nov 4 '08 #14
