Bytes | Software Development & Data Engineering Community

best split tokens?

Jay
Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Sep 8 '06 #1
8 Sep 2006 13:41:48 -0700, Jay <ja*******@gmail.com>:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

your_string.split()

--
Felipe.
Sep 8 '06 #2
"Jay" <ja*******@gmail.com> writes:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

I'd just use alphabetic strings:

import re
textbox = 'Apple pie and penguins'
words = re.findall('[a-z]+', textbox, re.I)
Sep 8 '06 #3
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

Do you mean like

for word in string_from_text_box.split():
    spell_check(word)

Every string has a split() method which, by default, splits the
string at runs of whitespace.
>>> help("".split)
will give you more info.
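
A minimal sketch of that default behaviour, assuming Python 3 and a made-up sample sentence. It also shows the catch for a spell checker: punctuation stays glued to the neighbouring word, so the second step trims it with str.strip(), which is only one possible approach.

```python
import string

# Made-up sample text; any string from a text box would do.
text = "Hello,  world!\nThis is\ta test."

# With no arguments, split() breaks on runs of whitespace
# (spaces, tabs, newlines) and never yields empty strings.
words = text.split()
print(words)

# Punctuation is still attached to the words; stripping ASCII
# punctuation from both ends is one simple way to trim it.
cleaned = [w.strip(string.punctuation) for w in words]
print(cleaned)
```

Note that strip() only touches the ends of each word, so something like "alarmed/amused" would still come through as a single token.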

-tkc

Sep 8 '06 #4
Jay wrote:
> Let's say, for instance, that one was programming a spell checker or
> some other function where the contents of a string from a text-editor's
> text box needed to be split so that the resulting array has each word
> as an element. Is there a shortcut to do this and, if not, what's the
> best and most efficient token group for the split function to achieve
> this?

I'm sure this is not perfect, but it gives one the general idea.

py> import re
py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
py> print(astr)

Four score and seven years ago, our
forefathers, who art in heaven (hallowed be their names),
did forthwith declare that all men are created
to shed their mortal coils and to be given daily
bread, even in the best of times and the worst of times.

With liberty and justice for all.

-William Shakespear

py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/
Sep 8 '06 #5
> py> import re
> py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
> py> [s for s in rgx.split(astr) if s]
> ['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
> 'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
> 'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
> 'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
> 'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
> 'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
> 'all', 'William', 'Shakespear']

This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers being included in your text (in the event you
have things like "fatal1ty", "thing2", or "pdf2txt"), which is
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...ideally Python regexps would support
POSIX character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')
or something of the like...however, that fails on my python2.4 here.
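
For what it's worth, here is a sketch of one way to approximate [:alpha:] without POSIX classes. It assumes Python 3, where \w on str patterns is Unicode-aware by default, so the class [^\W\d_] (word characters minus digits and underscore) matches letters beyond a-z. The sample text and the name `letters` are invented for the example.

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit
# nor an underscore", i.e. a letter. On Python 3 str patterns
# this is Unicode-aware, so accented letters are included.
letters = re.compile(r'[^\W\d_]+')

# \u00e9 is e-acute and \u00ef is i-diaeresis (precomposed forms).
sample = "caf\u00e9 na\u00efve thing2 pdf2txt"
print(letters.findall(sample))
```

Digits act as separators here, so "pdf2txt" comes out as two tokens; use \w+ instead if numbers should stay inside words.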

-tkc
Sep 8 '06 #6

Tim Chase wrote:
> This is a bit Euro-centric...

I'd call it half-asscii :-)

> ideally Python regexps would support
> POSIX character classes, so one could use
>
> rgx = re.compile('[^[:alpha:]]+')
> or something of the like...however, that fails on my python2.4 here.
>
> -tkc

Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"
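
For anyone who wants to run that test, a rough sketch under Python 3. It covers only three of the suggestions made earlier in the thread, and the variable names are mine.

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# Three of the suggestions made earlier in the thread.
by_whitespace = textbox.split()
by_letters = re.findall('[a-z]+', textbox, re.I)
by_nonword = [w for w in re.split(r'\W+', textbox) if w]

for name, words in [("split()", by_whitespace),
                    ("findall [a-z]+", by_letters),
                    (r"split on \W+", by_nonword)]:
    print(name, words)
```

Plain split() keeps "won't" intact but leaves "alarmed/amused" as one token; the two regexp versions separate "alarmed" from "amused" but break "won't" into "won" and "t".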

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorth...07/500807.aspx

Cheers,
John

Sep 8 '06 #7
John Machin wrote:
> Not picking on Tim in particular; try the following with *all*
> suggestions so far:
>
> textbox = "He was wont to be alarmed/amused by answers that won't work"

Not perfect, but would work for many cases:

import re
s = "He was wont to be alarmed/amused by answers that won't work"
# '-' sits last so it is literal ('!-:' was a range covering "'" and digits); \b is dropped, as it means backspace inside []
r = r'[()\[\]<>{}.,@#$%^&*?!:;\\/_"\s-]+'
l = [w for w in re.split(r, s) if w]
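
One general point worth checking in any hand-built punctuation class: a hyphen between two characters inside [] defines a range, not three separate characters. A small sketch, assuming Python 3; the probe string and pattern names are made up.

```python
import re

# '!-:' inside [] is the range U+0021..U+003A, which covers the
# digits, the apostrophe and the hyphen as well as '!' and ':'.
as_range = re.compile(r"[!-:]")

# A trailing '-' is literal, so only '!', ':' and '-' match here.
as_literal = re.compile(r"[!:-]")

probe = "won't 42 a-b"
print(as_range.findall(probe))
print(as_literal.findall(probe))
```

Putting the hyphen first or last in the class (or escaping it) keeps apostrophes and digits out of the separator set.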

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html

Sep 9 '06 #8
>> This is a bit Euro-centric...
>
> I'd call it half-asscii :-)

groan... :)

Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.

> textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> s
"He was wont to be alarmed/amused by answers that won't work"
>>> s2
"The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)
(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])

which parses your example the way I would want it to be parsed,
and handles the strange string I came up with to try similar
examples the way I would expect that it would be broken down by
"words"...

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)
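
A sketch of a slightly different way to write the same idea, assuming Python 3: letters, optionally followed by groups of one apostrophe or hyphen plus more letters. This is a variant, not an exact equivalent of the pattern above (it can differ on inputs such as "a-b-c"), and WORD and words() are names made up for the example.

```python
import re

# Letters, optionally followed by groups of one apostrophe or
# hyphen plus more letters, so the joiner must sit mid-word.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def words(text):
    """Return the word-like tokens found in text."""
    return WORD.findall(text)

print(words("The two-faced liar--a real joker--can't tell the truth"))

# Leading or trailing apostrophes are dropped, not kept.
print(words("comin' 'round"))
```

As the last line shows, clipped forms like "comin'" and "'round" lose their apostrophes, which is the case the paragraph above is pointing at.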

Any more crazy examples? :)

-tkc

Sep 9 '06 #9
As was suggested, it depends on the language, and it also depends on
how a token is defined. Can it have dashes, underscores, numbers and
so on? That also determines what counts as whitespace (that is, as a
separator). The two main methods of splitting are then either to cut on
whitespace (specify the separators explicitly) or to pick out only runs
of valid token symbols uninterrupted by any whitespace (specify the
valid symbols explicitly).
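
The two methods can be sketched side by side as follows, assuming Python 3 and a token definition of letters, digits and underscore that I picked purely for illustration.

```python
import re

text = "pdf2txt, snake_case and two-faced"

# Assumed token definition for this sketch: letters, digits, underscore.
token_chars = "A-Za-z0-9_"

# Method 1: cut on everything that is NOT a token character,
# then drop any empty strings left at the edges.
cut = [t for t in re.split(f"[^{token_chars}]+", text) if t]

# Method 2: pick out runs of token characters directly.
picked = re.findall(f"[{token_chars}]+", text)

print(cut)
print(picked)
```

With the same character set the two give the same tokens; the choice is mostly about whether the separators or the valid symbols are easier to enumerate for the language at hand.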

Nick V.
Sep 9 '06 #10
