best split tokens?

Jay

Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Sep 8 '06 #1


Felipe Almeida Lessa

8 Sep 2006 13:41:48 -0700, Jay <ja*******@gmail.com>:

Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

your_string.split()

--
Felipe.

Sep 8 '06 #2

Paul Rubin

"Jay" <ja*******@gmail.com> writes:

Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

I'd just use alphabetic strings:

import re
textbox = 'Apple pie and penguins'
words = re.findall('[a-z]+', textbox, re.I)

Sep 8 '06 #3

Tim Chase

Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

Do you mean like

for word in string_from_text_box.split():
    spell_check(word)

Every string has a split() method which, by default, splits the
string at runs of whitespace.

>>> help("".split)

will give you more info.
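
For reference, a minimal sketch of what the default split() does with a text-box string. The sample text is made up for illustration. Runs of spaces, tabs and newlines are all treated as one separator, but punctuation stays glued to the neighbouring word, which matters for a spell checker.

```python
# Hypothetical text-box contents, used only for illustration.
text = "Apple pie,  and\tpenguins.\nThe end"

# With no arguments, split() cuts at runs of any whitespace
# and never produces empty strings.
words = text.split()
print(words)
```

Note that "pie," and "penguins." keep their punctuation, so some extra cleanup would still be needed before looking words up in a dictionary.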

-tkc

Sep 8 '06 #4

James Stroud

Jay wrote:

Let's say, for instance, that one was programming a spell checker or
some other function where the contents of a string from a text-editor's
text box needed to be split so that the resulting array has each word
as an element. Is there a shortcut to do this and, if not, what's the
best and most efficient token group for the split function to achieve
this?

I'm sure this is not perfect, but it gives one the general idea.

py> import re
py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
py> print astr

Four score and seven years ago, our
forefathers, who art in heaven (hallowed be their names),
did forthwith declare that all men are created
to shed their mortal coils and to be given daily
bread, even in the best of times and the worst of times.

With liberty and justice for all.

-William Shakespear

py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']
James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com/

Sep 8 '06 #5

Tim Chase

py> import re
py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']

This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers included in your text (in the event you
have things like "fatal1ty", "thing2", or "pdf2txt") which is
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...ideally Python regexps would support
Posix character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')

or something of the like...however, that fails on my python2.4 here.
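
To make the difference between these patterns concrete, here is a small sketch. It assumes Python 3, where \w is Unicode-aware for str patterns (unlike the default in the python2.4 mentioned above), so the class [^\W\d_] can stand in for the missing [:alpha:]. The sample string and variable names are made up.

```python
import re

# Made-up sample mixing digits, an accented letter (e-acute) and punctuation.
text = "fatal1ty uses pdf2txt at the caf\u00e9, daily"

# \W+ splits on anything that is not a letter, digit or underscore.
with_digits = [w for w in re.split(r'\W+', text) if w]

# [^a-zA-Z]+ splits on anything outside ASCII letters,
# so digits and accented letters both act as separators.
ascii_only = [w for w in re.split(r'[^a-zA-Z]+', text) if w]

# [^\W\d_]+ matches runs of letters only (word characters minus
# digits and underscore); in Python 3 this includes the accented letter.
letters = re.findall(r'[^\W\d_]+', text)

print(with_digits)
print(ascii_only)
print(letters)
```

The ASCII-only pattern truncates the accented word to "caf", which is the limitation being discussed here.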

-tkc

Sep 8 '06 #6

John Machin

Tim Chase wrote:

py> import re
py> rgx = re.compile(r'(?:\s+)|[()\[\].,?;-]+')
py> [s for s in rgx.split(astr) if s]
['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'forefathers',
'who', 'art', 'in', 'heaven', 'hallowed', 'be', 'their', 'names', 'did',
'forthwith', 'declare', 'that', 'all', 'men', 'are', 'created', 'to',
'shed', 'their', 'mortal', 'coils', 'and', 'to', 'be', 'given', 'daily',
'bread', 'even', 'in', 'the', 'best', 'of', 'times', 'and', 'the',
'worst', 'of', 'times', 'With', 'liberty', 'and', 'justice', 'for',
'all', 'William', 'Shakespear']

This regexp could be shortened to just

rgx = re.compile(r'\W+')

if you don't mind numbers included in your text (in the event you
have things like "fatal1ty", "thing2", or "pdf2txt") which is
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...

I'd call it half-asscii :-)

ideally Python regexps would support
Posix character classes, so one could use

rgx = re.compile('[^[:alpha:]]+')
or something of the like...however, that fails on my python2.4 here.

-tkc

Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"
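
One way to run that test is sketched below, in Python 3 syntax. The variable names other than textbox are invented for illustration, and only three of the suggestions from this thread are shown.

```python
import re

textbox = "He was wont to be alarmed/amused by answers that won't work"

# Plain whitespace split: "alarmed/amused" stays as a single token.
by_whitespace = textbox.split()

# Runs of ASCII letters: "won't" is broken into "won" and "t".
by_letters = re.findall('[a-z]+', textbox, re.I)

# Splitting on non-word characters gives the same tokens for this sentence.
by_nonword = [w for w in re.split(r'\W+', textbox) if w]

print(by_whitespace)
print(by_letters)
print(by_nonword)
```

None of the three separates "alarmed" from "amused" while also keeping "won't" distinct from "wont".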

The short answer to the OP's question is that there is no short answer.
This blog note (and the papers it cites) may help ...

http://blogs.msdn.com/correcteurorth...07/500807.aspx

Cheers,
John

Sep 8 '06 #7

MonkeeSage

John Machin wrote:

Not picking on Tim in particular; try the following with *all*
suggestions so far:

textbox = "He was wont to be alarmed/amused by answers that won't work"

Not perfect, but would work for many cases:

import re
s = "He was wont to be alarmed/amused by answers that won't work"
# The hyphen is escaped so that '!-:' is not read as a character range.
r = r'[()\[\]<>{}.,@#$%^&*?!\-:;\\/_"\s\b]+'
l = [x for x in re.split(r, s) if x]

Check out this short paper from the Natural Language Toolkit folks on
some problems / strategies for tokenization:
http://nltk.sourceforge.net/lite/doc/en/tokenize.html

Sep 9 '06 #8

Tim Chase

rgx = re.compile(r'\W+')

if you don't mind numbers included in your text (in the event you
have things like "fatal1ty", "thing2", or "pdf2txt") which is
often the case...they should be considered part of the word.

If that's a problem, you should be able to use

rgx = re.compile('[^a-zA-Z]+')

This is a bit Euro-centric...

I'd call it half-asscii :-)

groan... :)

Given the link you provided, I correct my statement to
"Anglo-centric", as there are clearly oddball cases in languages
such as French.

textbox = "He was wont to be alarmed/amused by answers that won't work"

Well, one could do something like

>>> s
"He was wont to be alarmed/amused by answers that won't work"
>>> s2
"The two-faced liar--a real joker--can't tell the truth"
>>> r = re.compile("(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")
>>> r.findall(s), r.findall(s2)

(['He', 'was', 'wont', 'to', 'be', 'alarmed', 'amused', 'by',
'answers', 'that', "won't", 'work'], ['The', 'two-faced', 'liar',
'a', 'real', 'joker', "can't", 'tell', 'the', 'truth'])
which parses your example the way I would want it to be parsed,
and handles the strange string I came up with to try similar
examples the way I would expect that it would be broken down by
"words"...

I had a hard time comin' up with any words I'd want to call
"words" where the additional non-word glyph (apostrophe, dash,
etc) wasn't 'round the middle of the word. :)
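
A self-contained sketch of the same pattern, wrapped in a helper whose name is purely illustrative, shows what happens when the apostrophe is not in the middle of the word. The two edge-case words are made-up examples, not from the thread.

```python
import re

# The pattern from the session above: letters, optionally joined by a
# single hyphen or apostrophe that has a letter on both sides.
WORD = re.compile(r"(?:(?:[a-zA-Z][-'][a-zA-Z])|[a-zA-Z])+")

def words(text):
    """Return the word-like tokens found in text (illustrative helper)."""
    return WORD.findall(text)

s = "He was wont to be alarmed/amused by answers that won't work"
s2 = "The two-faced liar--a real joker--can't tell the truth"
print(words(s))
print(words(s2))

# An apostrophe at the edge of a word is dropped rather than kept.
edge_cases = words("'twas the dogs' night")
print(edge_cases)
```

Leading and trailing apostrophes are simply discarded, which may or may not be acceptable depending on the dictionary the words are checked against.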

Any more crazy examples? :)

-tkc

Sep 9 '06 #9

Nick Vatamaniuc

It depends on the language, as was suggested, and it also depends on
how a token is defined. Can it have dashes, underscores, numbers and
stuff? This will also determine what the whitespace will be. Then the
two main methods of doing the splitting are to either cut based on
whitespace (specify whitespace explicitly) or pick out only valid token
symbols uninterrupted by any whitespace (specify valid symbols
explicitly).
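
The two methods can be sketched side by side as follows. The sample text and both character sets are assumptions chosen for illustration; a real application would pick them according to its own definition of a token.

```python
import re

# Made-up sample text.
text = "well-known snake_case names, 2 dashes"

# Method 1: name the separators explicitly and cut on them.
# Whitespace and commas are assumed to be the only separators here.
split_on_separators = [t for t in re.split(r'[\s,]+', text) if t]

# Method 2: name the valid token characters explicitly and pick out runs.
# A token is assumed here to be letters plus hyphen and underscore.
pick_valid_runs = re.findall(r'[A-Za-z_-]+', text)

print(split_on_separators)
print(pick_valid_runs)
```

The two results differ only in the "2": the first method keeps anything that is not a separator, while the second drops anything that is not a declared token character.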

Nick V.

Sep 9 '06 #10
