
best split tokens?

Jay
Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Sep 8 '06 #1
12 Replies


8 Sep 2006 13:41:48 -0700, Jay <ja*******@gmail.com>:
your_string.split()
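For example (with no arguments, split() cuts at runs of whitespace, so punctuation stays attached to the words):

```python
# str.split() with no arguments splits on any run of whitespace
s = "Let's   split\tthis string, quickly"
print(s.split())  # ["Let's", 'split', 'this', 'string,', 'quickly']
```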

--
Felipe.
Sep 8 '06 #2

"Jay" <ja*******@gmail.com> writes:
I'd just use alphabetic strings:

import re

textbox = 'Apple pie and penguins'
words = re.findall('[a-z]+', textbox, re.I)
Sep 8 '06 #3


Do you mean like

for word in string_from_text_box.split():
    spell_check(word)

Every string has a split() method which, by default, splits the
string at runs of whitespace.
>>> help("".split)
will give you more info.

-tkc

Sep 8 '06 #4

Jay wrote:
I'm sure this is not perfect, but it gives one the general idea.

>>> import re
>>> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
>>> print astr

Four score and seven years ago, our
forefathers, who art in heaven (hallowed be their names),
did forthwith declare that all men are created
to shed their mortal coils and to be given daily
bread, even in the best of times and the worst of times.

With liberty and justice for all.

-William Shakespear

>>> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 8 '06 #5

James Stroud wrote:
> >>> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> >>> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', ... 'William', 'Shakespear']
This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers being included in your words (in the
event you have things like "fatal1ty", "thing2", or "pdf2txt"),
which is often the case...they should be considered part of the
word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...ideally Python regexps would support
Posix character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')
or something of the like...however, that fails on my python2.4 here.
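A quick sketch of how the two working patterns differ on the examples above (digits are the distinguishing case):

```python
import re

s = "pdf2txt and fatal1ty"

# \W+ treats digits as word characters, so they stay inside words...
print(re.split(r'\W+', s))         # ['pdf2txt', 'and', 'fatal1ty']

# ...while [^a-zA-Z]+ splits the words apart at every digit
print(re.split(r'[^a-zA-Z]+', s))  # ['pdf', 'txt', 'and', 'fatal', 'ty']
```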

-tkc
Sep 8 '06 #6

Tim Chase wrote:
> This is a bit Euro-centric...
I'd call it half-asscii :-)
Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorth...07/500807.aspx
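To see the failure concretely, here is that string run through two of the earlier suggestions (a quick illustrative check, not a fix):

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# plain split() leaves "alarmed/amused" glued together as one token
print(textbox.split())

# [a-z]+ handles the slash, but breaks "won't" into two pieces
print(re.findall('[a-z]+', textbox, re.I))
```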

Cheers,
John

Sep 8 '06 #7

John Machin wrote:
Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"
Not perfect, but would work for many cases:

import re

s = "He was wont to be alarmed/amused by answers that won't work"
# the dash goes last so it isn't taken as a range
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s-]+'
l = [x for x in re.split(r, s) if x]

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html

Sep 9 '06 #8

> This is a bit Euro-centric...
>
> I'd call it half-asscii :-)
groan... :)

Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.
> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
which parses your example the way I would want it to be parsed,
and handles the strange string I came up with to try similar
examples the way I would expect that it would be broken down by
"words"...

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)

Any more crazy examples? :)

-tkc

Sep 9 '06 #9

It depends on the language, as was suggested, and it also depends
on how a token is defined. Can it have dashes, underscores, numbers
and stuff? This will also determine what the whitespace will be.
The two main methods of doing the splitting are either to cut based
on whitespace (specify the whitespace explicitly) or to pick out
only valid token symbols uninterrupted by any whitespace (specify
the valid symbols explicitly).
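A rough side-by-side sketch of the two methods (the character classes here are just illustrative choices):

```python
import re

s = "one-two, three"

# method 1: cut at explicitly-listed separators (whitespace, commas)
print(re.split(r'[\s,]+', s))     # ['one-two', 'three']

# method 2: pick out runs of explicitly-listed valid token symbols
print(re.findall(r"[\w'-]+", s))  # ['one-two', 'three']
```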

Nick V.
Sep 9 '06 #10

Tim Chase wrote:
> I had a hard time comin' up with any words I'd want to call
> "words" where the additional non-word glyph (apostrophe, dash,
> etc) wasn't 'round the middle of the word. :)
>
> Any more crazy examples? :)
'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?

Sep 9 '06 #11

P: n/a
>Any more crazy examples? :)
>
'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?
I said "crazy"...not "pathological" :)

If one really wants such a case, one has to omit the standard
practice of nesting quotes:

John replied "Dad told me 'you can't go' but let Judy"

However, if you don't have such situations and want to make
'enry and 'orace 'appy, you can change the regexp to

>>> s = "He was wont to be alarmed/amused by answers that won't work"
>>> s2 = "The two-faced liar--a real joker--can't tell the truth"
>>> s3 = "'ey, 'alf a mo, wot about when 'enry 'n' 'orace drop their aitches?"
>>> r = re.compile("(?:(?:[a-zA-Z][-'])|(?:[-'][a-zA-Z])|[a-zA-Z])+")

It will also choke on double-dashes:

>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced',
'liar--a', 'real', "joker--can't", 'tell', 'the', 'truth'],
["'ey", "'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])

Or you could combine them to allow only infix dashes, but allow
apostrophes anywhere in the word, including the front or back:

>>> r = re.compile("(?:(?:[a-zA-Z]')|(?:'[a-zA-Z])|(?:[a-zA-Z]-[a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2), r.findall(s3)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'], ["'ey",
"'alf", 'a', 'mo', 'wot', 'about', 'when', "'enry", "'n",
"'orace", 'drop', 'their', 'aitches'])
Now your spell-checker has to have the "dropped initial or
terminal letter" locale... :)

-tkc


Sep 9 '06 #12

Tim Chase wrote:
> Now your spell-checker has to have the "dropped initial or
> terminal letter" locale... :)
Too complicated for string.bleedin'_split(), innit?
Cheers,
John

Sep 9 '06 #13
