Split text file into words

The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...
Jul 18 '05 #1
On Tuesday 08 March 2005 14:43, qwweeeit wrote:
The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...


Then try again... ;) No, seriously, re.split() can do what you want. Just
think about what the word delimiters are.

Say, you want to split on all whitespace, and ",", ".", and "?", then you'd
use something like:

heiko@heiko ~ $ python
Python 2.3.5 (#1, Feb 27 2005, 22:40:59)
[GCC 3.4.3 20050110 (Gentoo Linux 3.4.3.20050110, ssp-3.4.3.20050110-0,
pie-8.7 on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> teststr = "Hello qwweeeit, how are you? I am fine, today, actually."
>>> re.split(r"[\s\.,\?]+",teststr)
['Hello', 'qwweeeit', 'how', 'are', 'you', 'I', 'am', 'fine', 'today',
'actually', '']

Extending with other word separators shouldn't be hard... Just have a look at

http://docs.python.org/lib/re-syntax.html
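
A minimal sketch of such an extension, assuming the extra separators are the math signs and brackets mentioned in the question. The name `split_words` and the exact delimiter set are illustrative choices, not something from this thread. The hyphen is placed last in the set so that it is read literally rather than as a range, and the empty strings that re.split() can return at the edges are filtered out.

```python
import re

# Delimiters: whitespace, common punctuation, math signs, and brackets.
# The hyphen comes last so it is a literal, not a range operator.
DELIMITERS = re.compile(r"[\s.,?!;:+*/=(){}\[\]-]+")

def split_words(text):
    """Split text on runs of delimiters and drop empty strings."""
    return [word for word in DELIMITERS.split(text) if word]

print(split_words("Hello qwweeeit, how are you? x = (a+b)*c/2 - d"))
```

Compiling the pattern once is only a convenience here; passing the same pattern string to re.split() on each call would behave the same way.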

HTH!

--
--- Heiko.


Jul 18 '05 #2
qwweeeit wrote:
The standard split() can use only one delimiter. To split a text file
into words you need multiple delimiters such as blanks, punctuation, math
signs (+-*/), parentheses and so on.

I didn't succeed in using re.split()...


Would you care to elaborate on how you tried to use re.split and failed? We
aren't mind readers here. An example of your non-working code along with
the expected result and the actual result would be useful.

This is the first example given in the documentation for re.split:
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']

Does it do what you want? If not, what do you want?
Jul 18 '05 #3
Thank you for your help.
I have already used re.split successfully, but in this case...
I didn't explain in more detail because I don't want someone else to do my
homework.

I want to implement a cross-reference tool for variables and commands.
To do that I must strip every comment and string literal from the Python
source.
In the cleaned source file I must isolate all the words (keeping together
the words connected by '.').
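
As an aside, the word-isolation step described above could also be approached by matching the words rather than splitting on everything else. This is only a rough sketch under stated assumptions: the names `DOTTED_NAME` and `dotted_words` are hypothetical, the input is assumed to be already free of comments and strings, and numeric literals are not handled specially.

```python
import re

# An identifier, optionally followed by more identifiers joined by dots.
DOTTED_NAME = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")

def dotted_words(line):
    """Return the words in a line, keeping dot-connected names whole."""
    return DOTTED_NAME.findall(line)

print(dotted_words("ll = re.split(pattern, line) + os.path.join(a, b)"))
```

With findall() there is no delimiter list to maintain, which avoids the character-set escaping questions entirely.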

My wrong code (ignore the line number in the traceback; this is an
extract):

import re

# input text file w/o strings & comments

f=open('file.txt')
lInput=f.readlines()
f.close()

fOut=open('words.txt','w')

for i in lInput:
    ll=re.split(r"[\s,{}[]()+=-/*]",i)
    fOut.write(' '.join(ll)+'\n')

fOut.close()

Traceback (most recent call last):
  File "./GetWords.py", line 70, in ?
    ll=re.split(r"[\s,{}[]()+=-/*]",i)
  File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
RuntimeError: maximum recursion limit exceeded

... and if I use:
ll=re.split(r"\s,{}[]()+=-/*",i)

Traceback (most recent call last):
  File "./GetWords.py", line 70, in ?
    ll=re.split(r"\s,{}[]()+=-/*",i)
  File "/usr/lib/python2.3/sre.py", line 156, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "/usr/lib/python2.3/sre.py", line 230, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

I thought it was my mistake in the use of re.split...

I am using:
Python 2.3.4 (#2, Aug 19 2004, 15:49:40)
[GCC 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)] on linux2
Jul 18 '05 #4
qwweeeit wrote:
ll=re.split(r"[\s,{}[]()+=-/*]",i)


The stack overflow occurs because the ()+ tries to match an empty string as
many times as possible.

This regular expression contains a character set '\s,{}[' followed by the
expression '()+=-/*]'. You can see that the parentheses aren't part of a
character set if you reverse their order, which gives you an error when the
expression is compiled instead of a failure when trying to match:
>>> ll=re.split(r"[\s,{}[])(+=-/*]",i)
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in -toplevel-
    ll=re.split(r"[\s,{}[])(+=-/*]",i)
  File "C:\Python24\Lib\sre.py", line 157, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "C:\Python24\Lib\sre.py", line 227, in _compile
    raise error, v # invalid expression
error: unbalanced parenthesis
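
The way the pattern is divided can also be checked directly. The following is a small sketch, assuming a current Python 3 interpreter; the variable names are illustrative. It cuts the original pattern at the first "]" that closes the set and then tests the set on its own.

```python
import re

pattern = r"[\s,{}[]()+=-/*]"

# The character set ends at the first "]" that is not the set's first character.
set_end = pattern.index("]", 2)
char_set, remainder = pattern[:set_end + 1], pattern[set_end + 1:]
print(char_set)    # the character set on its own
print(remainder)   # the rest, parsed as an ordinary expression

# The set matches "[" but not "(", so the parentheses are outside it.
print(bool(re.fullmatch(char_set, "[")))
print(bool(re.fullmatch(char_set, "(")))

# The parentheses were parsed as one capturing group.
print(re.compile(pattern).groups)
```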


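I suspect you actually meant the character set to include the other
punctuation characters, in which case you need to escape the closing square
bracket or make it the first character. The hyphen also needs escaping (or
moving to the end of the set); otherwise '=-/' is read as a character range,
and because '=' sorts after '/' it fails with "bad character range".

Try:

ll=re.split(r"[\s,{}[\]()+=\-/*]",i)

or:

ll=re.split(r"[]\s,{}[()+=\-/*]",i)

instead.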
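
One caveat when trying a character set like this: a hyphen between two characters in a set is read as a range, so it has to be escaped or placed at the end of the set. The sketch below checks that and then splits one sample line. The helper names `compiles` and `split_line` are hypothetical, and both brackets are escaped here purely for readability.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False on a regex syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# "=-/" inside a set is a range whose start sorts after its end.
print(compiles(r"[+=-/*]"))

# Escaping the hyphen makes it an ordinary member of the set.
SEPARATORS = re.compile(r"[\s,{}\[\]()+=\-/*]+")

def split_line(line):
    """Split one source line on runs of separators, dropping empty strings."""
    return [word for word in SEPARATORS.split(line) if word]

print(split_line("total = (a + b[i]) * c / d - e\n"))
```

The trailing "+" on the set treats a run of separators as a single split point, so fewer empty strings appear in the result.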
Jul 18 '05 #5