
best split tokens?

Jay
Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Sep 8 '06 #1
12 Replies


8 Sep 2006 13:41:48 -0700, Jay <ja*******@gmail.com>:
your_string.split()
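For example (with no arguments, split() cuts at runs of whitespace, so punctuation stays attached to the words):

```python
# str.split() with no arguments splits on any run of whitespace
s = "Let's   split\tthis string, quickly"
print(s.split())  # ["Let's", 'split', 'this', 'string,', 'quickly']
```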

--
Felipe.
Sep 8 '06 #2

"Jay" <ja*******@gmail.com> writes:
I'd just use alphabetic strings:

import re

textbox = 'Apple pie and penguins'
words = re.findall('[a-z]+', textbox, re.I)
Sep 8 '06 #3


Do you mean like

for word in string_from_text_box.split():
    spell_check(word)

Every string has a split() method which, by default, splits the
string at runs of whitespace.
>>> help("".split)
will give you more info.

-tkc

Sep 8 '06 #4

Jay wrote:
I'm sure this is not perfect, but it gives one the general idea.

>>> import re
>>> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
>>> print astr

Four score and seven years ago, our
forefathers, who art in heaven (hallowed be their names),
did forthwith declare that all men are created
to shed their mortal coils and to be given daily
bread, even in the best of times and the worst of times.

With liberty and justice for all.

-William Shakespear

>>> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 8 '06 #5

James Stroud wrote:
> >>> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> >>> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', ... 'William', 'Shakespear']
This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers being included in your words (in the
event you have things like "fatal1ty", "thing2", or "pdf2txt"),
which is often the case...they should be considered part of the
word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...ideally Python regexps would support
Posix character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')
or something of the like...however, that fails on my python2.4 here.
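A quick sketch of how the two working patterns differ on the examples above (digits are the distinguishing case):

```python
import re

s = "pdf2txt and fatal1ty"

# \W+ treats digits as word characters, so they stay inside words...
print(re.split(r'\W+', s))         # ['pdf2txt', 'and', 'fatal1ty']

# ...while [^a-zA-Z]+ splits the words apart at every digit
print(re.split(r'[^a-zA-Z]+', s))  # ['pdf', 'txt', 'and', 'fatal', 'ty']
```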

-tkc
Sep 8 '06 #6

Tim Chase wrote:
> This is a bit Euro-centric...
I'd call it half-asscii :-)
Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorth...07/500807.aspx
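To see the failure concretely, here is that string run through two of the earlier suggestions (a quick illustrative check, not a fix):

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# plain split() leaves "alarmed/amused" glued together as one token
print(textbox.split())

# [a-z]+ handles the slash, but breaks "won't" into two pieces
print(re.findall('[a-z]+', textbox, re.I))
```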

Cheers,
John

Sep 8 '06 #7

John Machin wrote:
Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"
Not perfect, but would work for many cases:

import re

s = "He was wont to be alarmed/amused by answers that won't work"
# the dash goes last so it isn't taken as a range
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s-]+'
l = [x for x in re.split(r, s) if x]

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html

Sep 9 '06 #8

> This is a bit Euro-centric...
>
> I'd call it half-asscii :-)
groan... :)

Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.
> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
which parses your example the way I would want it to be parsed,
and handles the strange string I came up with to try similar
examples the way I would expect that it would be broken down by
"words"...

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)

Any more crazy examples? :)

-tkc

Sep 9 '06 #9

It depends on the language, as was suggested, and it also depends
on how a token is defined. Can it have dashes, underscores, numbers
and stuff? This will also determine what the whitespace will be.
The two main methods of doing the splitting are either to cut based
on whitespace (specify the whitespace explicitly) or to pick out
only valid token symbols uninterrupted by any whitespace (specify
the valid symbols explicitly).
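A rough side-by-side sketch of the two methods (the character classes here are just illustrative choices):

```python
import re

s = "one-two, three"

# method 1: cut at explicitly-listed separators (whitespace, commas)
print(re.split(r'[\s,]+', s))     # ['one-two', 'three']

# method 2: pick out runs of explicitly-listed valid token symbols
print(re.findall(r"[\w'-]+", s))  # ['one-two', 'three']
```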

Nick V.
Sep 9 '06 #10

Tim Chase wrote:
> I had a hard time comin' up with any words I'd want to call
> "words" where the additional non-word glyph (apostrophe, dash,
> etc) wasn't 'round the middle of the word. :)
>
> Any more crazy examples? :)
'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?

Sep 9 '06 #11

P: n/a
>Any more crazy examples? :)
>
'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?
I said "crazy"...not "pathological" :)

If one really wants such a case, one has to omit the standard
practice of nesting quotes:

John replied "Dad told me 'you can't go' but let Judy"

However, if you don't have such situations and want to make
'enry and 'orace 'appy, you can change the regexp to

>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> s3 = "'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"
>>> r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")

It will also choke on double-dashes:

>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced',
'liar--a', 'real', "joker--can't", 'tell', 'the', 'truth'],
["'ey", "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])

Or you could combine them to allow only infix dashes, but allow
apostrophes anywhere in the word, including the front or back:

>>> r = re.compile("(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'], ["'ey",
"'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])
Now your spell-checker has to have the "dropped initial or
terminal letter" locale... :)

-tkc


Sep 9 '06 #12

Tim Chase wrote:
> Now your spell-checker has to have the "dropped initial or
> terminal letter" locale... :)
Too complicated for string.bleedin'_split(), innit?
Cheers,
John

Sep 9 '06 #13
