best split tokens?

Jay
Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Sep 8 '06 #1
8 Sep 2006 13:41:48 -0700, Jay <ja*******@gmail.com>:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

your_string.split()

--
Felipe.
Sep 8 '06 #2
"Jay" <ja*******@gmail.com> writes:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

I'd just use alphabetic strings:

import re
textbox = 'Apple pie and penguins'
words = re.findall('[a-z]+', textbox, re.I)
Sep 8 '06 #3
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

Do you mean like

for word in string_from_text_box.split():
    spell_check(word)

Every string has a split() method which, by default, splits the
string at runs of whitespace.
>>> help("".split)
will give you more info.
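
A minimal sketch of that default behaviour, written for Python 3 (the sample sentence is made up for illustration and is not from the thread). With no arguments, split() cuts only at runs of whitespace, so punctuation stays attached to the neighbouring word; the strip() call is one simple, assumed way to peel it off afterwards.

```python
# Sample text is an assumption for illustration only.
text = "Hello,  world!\tIs this\nwhat you meant?"

# No arguments: split at runs of spaces, tabs, and newlines.
words = text.split()
print(words)

# Punctuation is not a separator, so trim it from both ends of each word.
stripped = [w.strip('.,!?;:"()') for w in words]
print(stripped)
```

Note that strip() only touches the ends of each word, so something like "alarmed/amused" would still come through as one token.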

-tkc

Sep 8 '06 #4
Jay wrote:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

I'm sure this is not perfect, but it gives one the general idea.

py> import re
py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
py> print astr

Four score and seven years ago, our
forefathers, who art in heaven (hallowed be their names),
did forthwith declare that all men are created
to shed their mortal coils and to be given daily
bread, even in the best of times and the worst of times.

With liberty and justice for all.

-William Shakespear

py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 8 '06 #5
> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
> 'all', 'William', 'Shakespear']

This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers being included in your text (in the event you
have things like "fatal1ty", "thing2", or "pdf2txt"), which is
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...ideally Python regexps would support
Posix character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')
or something of the like...however, that fails on my python2.4 here.
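
Python's re module does not implement POSIX bracket classes such as [:alpha:]. One commonly used workaround, sketched here for Python 3 (where str patterns are Unicode-aware by default), is the double negative [^\W\d_]: a word character that is neither a digit nor an underscore. The sample string and the name `letters` are made up for illustration.

```python
import re

# [^\W\d_] = "not (non-word, digit, or underscore)", i.e. roughly a letter.
letters = re.compile(r'[^\W\d_]+')

# Illustrative text, not from the thread: "naive" with a diaeresis, "cafe"
# with an acute accent, written as escapes to avoid encoding surprises.
sample = "na\u00efve caf\u00e9 pdf2txt"
print(letters.findall(sample))
```

This keeps accented letters inside words while still cutting at digits, so "pdf2txt" comes back as two pieces.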

-tkc
Sep 8 '06 #6

Tim Chase wrote:
> If that's a problem, you should be able to use
>
> rgx = re.compile('[^a-zA-Z]+')
>
> This is a bit Euro-centric...

I'd call it half-asscii :-)

> ideally Python regexps would support
> Posix character classes, so one could use
>
> rgx = re.compile('[^[:alpha:]]+')
> or something of the like...however, that fails on my python2.4 here.
>
> -tkc

Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorth...07/500807.aspx

Cheers,
John
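
A quick sketch of that experiment, written for Python 3: it runs three of the approaches suggested earlier in the thread against the test sentence. The dictionary keys are just labels chosen here.

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# Each entry is one of the approaches suggested earlier in the thread.
candidates = {
    "str.split": textbox.split(),
    "letters": re.findall(r'[a-z]+', textbox, re.I),
    "non_word": [w for w in re.split(r'\W+', textbox) if w],
}

for name, words in candidates.items():
    print(name, "->", words)
```

Plain split() leaves "alarmed/amused" as a single token, while both regular expressions break "won't" into "won" and "t".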

Sep 8 '06 #7
John Machin wrote:
> Not picking on Tim in particular; try the following with *all*
> suggestions so far:
>
> textbox = "He was wont to be alarmed/amused by answers that won't work"

Not perfect, but would work for many cases:

import re
s = "He was wont to be alarmed/amused by answers that won't work"
# '-' sits last in the class so it is a literal, not the range '!' to ':'
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s\b-]+'
l = [x for x in re.split(r, s) if x]

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html

Sep 9 '06 #8
> > rgx = re.compile(r'\W+')
> >
> > if you don't mind numbers being included in your text (in the event you
> > have things like "fatal1ty", "thing2", or "pdf2txt"), which is
> > often the case...they should be considered part of the word.
> >
> > If that's a problem, you should be able to use
> >
> > rgx = re.compile('[^a-zA-Z]+')
> >
> > This is a bit Euro-centric...
>
> I'd call it half-asscii :-)

groan... :)

Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.

> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> s
"He was wont to be alarmed/amused by answers that won't work"
>>> s2
"The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
which parses your example the way I would want it to be parsed,
and handles the strange string I came up with to try similar
examples the way I would expect that it would be broken down by
"words"...

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)

Any more crazy examples? :)

-tkc

Sep 9 '06 #9
As was suggested, it depends on the language, and it also depends on
how a token is defined. Can it have dashes, underscores, numbers and
stuff? This will also determine what counts as whitespace. Then the
two main methods of doing the splitting are either to cut based on
whitespace (specify whitespace explicitly) or to pick out only valid token
symbols uninterrupted by any whitespace (specify valid symbols
explicitly).

Nick V.
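
A small sketch of those two methods side by side, written for Python 3. The sample text and the particular choices of "whitespace" and "valid symbols" are assumptions made here for illustration.

```python
import re

text = "pdf2txt costs $5, two-faced!"   # made-up sample

# Method 1: specify the separators (here: whitespace) and cut on them.
cut_on_whitespace = re.split(r'\s+', text)
print(cut_on_whitespace)

# Method 2: specify the valid token characters (here: ASCII letters only).
only_letters = re.findall(r'[A-Za-z]+', text)
print(only_letters)
```

With the first method, anything not named as a separator stays in the token (digits, "$", trailing punctuation); with the second, anything not named as valid is dropped, which also breaks "pdf2txt" and "two-faced" apart.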
Sep 9 '06 #10
Tim Chase wrote:
> I had a hard time comin' up with any words I'd want to call
> "words" where the additional non-word glyph (apostrophe, dash,
> etc) wasn't 'round the middle of the word. :)
>
> Any more crazy examples? :)

'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?

Sep 9 '06 #11
> > Any more crazy examples? :)
>
> 'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?

I said "crazy"...not "pathological" :)

If one really wants such a case, one has to omit the standard
practice of nesting quotes:

John replied "Dad told me 'you can't go' but let Judy"

However, if you don't have such situations and want to make
'enry and 'orace 'appy, you can change the regexp to

>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> s3 = "'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"
>>> r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")

It will also choke using double-dashes:

>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced',
'liar--a', 'real', "joker--can't", 'tell', 'the', 'truth'],
["'ey", "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])

Or, to allow only infix dashes but allow apostrophes anywhere in
the word, including the front or back, one could combine them:

>>> r = re.compile("(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'], ["'ey",
"'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])
Now your spell-checker has to have the "dropped initial or
terminal letter" locale... :)
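
The last pattern is dense enough that a commented layout may help. What follows is only a restatement of it in re.VERBOSE form, written for Python 3; the name WORD is made up here, and the alternatives are kept in the same order as in the one-line version.

```python
import re

WORD = re.compile(r"""
    (?:
        [a-zA-Z]'            # a letter followed by an apostrophe ("n'" in "won't")
      | '[a-zA-Z]            # an apostrophe followed by a letter ("'e" in "'ey")
      | [a-zA-Z]-[a-zA-Z]    # one dash between two letters ("o-f" in "two-faced")
      | [a-zA-Z]             # a plain letter
    )+
""", re.VERBOSE)

s2 = "The two-faced liar--a real joker--can't tell the truth"
s3 = "'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"

print(WORD.findall(s2))
print(WORD.findall(s3))
```

Because a dash only matches when it has a letter on both sides, the double dashes in s2 act as separators, while the leading apostrophes in s3 stay attached to their words.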

-tkc


Sep 9 '06 #12

Too complicated for string.bleedin'_split(), innit?
Cheers,
John

Sep 9 '06 #13
