Looking for very simple general purpose tokenizer

Hi group,

I need to parse various text files in Python. I was wondering if there was a
general purpose tokenizer available. I know about split(), but this
(otherwise very handy) method does not allow me to specify a list of
splitting characters, only one at a time, and it removes my splitting
operators (fine for spaces and \n's, but not for =, /, etc.). Furthermore, I
tried tokenize, but that module is specific to Python source and is way too
heavy for me. I am looking for something like this:
splitchars = [' ', '\n', '=', '/', ....]
tokenlist = tokenize(rawfile, splitchars)

Is there something like this available inside Python or did anyone already
make this? Thank you in advance

Maarten
--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #1
Maarten van Reeuwijk wrote:
I need to parse various text files in python. I was wondering if there was a
general purpose tokenizer available. [...] Is there something like this
available inside Python or did anyone already make this?


You may use re.findall for that:
import re
s = "a = b+c; z = 34;"
pat = " |=|;|[^ =;]+"
re.findall(pat, s)

['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';']

The pattern basically says: match either a space, a '=', a ';', or a sequence
of one or more characters that are none of these (using '+' rather than '*'
avoids a spurious empty match at the end of the string). You may have to take
care beforehand about special characters like \n or \ (very special in
regular expressions)
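A pattern along these lines can be wrapped into the tokenize(rawtext, splitchars) helper Maarten asked for. A minimal sketch (the helper name and signature are his wish, not an existing stdlib API; re.escape takes care of the special characters, and it's written for Python 3):

```python
import re

def tokenize(rawtext, splitchars):
    # Build one character class from the split characters; re.escape keeps
    # regex-special characters like '\\' or '*' safe inside the class.
    cls = ''.join(re.escape(c) for c in splitchars)
    # Match either a single split character or a run of non-split characters,
    # so the splitting operators are kept as tokens instead of being dropped.
    pat = "[%s]|[^%s]+" % (cls, cls)
    return re.findall(pat, rawtext)

print(tokenize("a=b/c d", [' ', '\n', '=', '/']))
# -> ['a', '=', 'b', '/', 'c', ' ', 'd']
```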

HTH
--
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com

Jul 18 '05 #2
"Maarten van Reeuwijk" <maarten@remove_this_ws.tn.tudelft.nl> wrote in
message news:bu**********@news.tudelft.nl...
I need to parse various text files in python. I was wondering if there was a
general purpose tokenizer available. [...] Is there something like this
available inside Python or did anyone already make this?

Maarten -
Please give my pyparsing module a try. You can download it from SourceForge
at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
allows you to define your own parsing patterns for any text data file, and
the tokenized results are returned in a dictionary or list, as you prefer.
The download also includes several examples - one especially difficult
file-parsing problem is solved in the dictExample.py script. And if you get
stuck, send me a sample of what you are trying to parse, and I can try to
give you some pointers (or even tell you if pyparsing isn't the most
appropriate tool for your job - it happens sometimes!).

-- Paul McGuire

Austin, Texas, USA
Jul 18 '05 #3
Maarten van Reeuwijk wrote:
I need to parse various text files in python. I was wondering if
there was a general purpose tokenizer available.


Indeed there is: python comes with batteries included. Try the shlex
module.

http://www.python.org/doc/lib/module-shlex.html

Try the following code: it seems to do what you want. If it doesn't,
then please be more specific about your tokenisation rules.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
    for s in splitters: # resists People's Front of Judea joke ;-D
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
    print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that the use of the iteration based interface in the above code
requires python 2.3. If you need it to run on previous versions,
specify which one.
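On current Pythons the snippet above needs two mechanical changes (a sketch, assuming Python 3: StringIO lives in the io module and print is a function; the shlex logic itself is unchanged):

```python
import io
import shlex

splitchars = [' ', '\n', '=', '/']
source = "andso/shouldthis and=this"

# Prepend each split character to shlex's whitespace string, so the
# tokenizer splits on it (note: as whitespace, the character is discarded).
toker = shlex.shlex(io.StringIO(source))
for s in splitchars:
    if s not in toker.whitespace:
        toker.whitespace = s + toker.whitespace

tokens = list(toker)
print(tokens)  # -> ['andso', 'shouldthis', 'and', 'this']
```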

regards,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #4
Thank you all for your very useful comments. Below I have included my
source. Could you comment on whether there's a more elegant way of
implementing the continuation character '&'?

With the RE implementation I have noticed that the position of the '*' in
spclist is very delicate. This order works, but other orders throw
exceptions. Is this correct, or is it a bug? Lastly, is there more
documentation and examples for the shlex module? Ideally I would like to
see a full-scale example of how this module should be used for parsing.

Maarten

import re
import shlex
import StringIO

def splitf90(source):
    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker.commenters = "!"
    toker.whitespace = " \t\r"
    return processTokens(toker)

def splitf90_re(source):
    spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)',
               '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
    pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
    rawtokens = re.findall(pat, source)
    return processTokens(rawtokens)

def processTokens(rawtokens):
    # substitute characters
    subst1 = []
    prevtoken = None
    for token in rawtokens:
        if token == ';': token = '\n'
        if token == ' ': token = ''
        if token == '\n' and prevtoken == '&': token = ''
        if not token == '':
            subst1.append(token)
            prevtoken = token

    # remove continuation chars
    subst2 = []
    for token in subst1:
        if token == '&': token = ''
        if not token == '':
            subst2.append(token)

    # split into lines
    final = []
    curline = []
    for token in subst2:
        if not token == '\n':
            curline.append(token)
        else:
            if not curline == []:
                final.append(curline)
            curline = []

    return final

# Example session
src = """
MODULE modsize
implicit none

integer, parameter:: &
Nx = 256, &
Ny = 256, &
Nz = 256, &
nt = 1, & ! nr of (passive) scalars
Np = 16 ! nr of processors, should match mpirun -np .. command

END MODULE
"""
print splitf90(src)
print splitf90_re(src)

Output:
[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]
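The order sensitivity in spclist comes from the character class: between two characters inside [^...], '-' denotes a range (e.g. '+-/' means "every character from '+' to '/'"), and some orderings produce a "bad character range" error. Escaping every split character with re.escape makes the order irrelevant; a sketch (not part of the original post, written for Python 3):

```python
import re

# Same split characters as spclist, but unescaped; re.escape does the
# escaping, so '-' can sit anywhere in the list without forming a range.
spclist = ['*', '+', '-', '/', '=', '[', ']', '(', ')',
           '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
cls = ''.join(re.escape(c) for c in spclist)
pat = '[%s]|[^%s]+' % (cls, cls)
result = re.findall(pat, "Nx = 256, & ! comment")
print(result)
# -> ['Nx', ' ', '=', ' ', '256', ',', ' ', '&', ' ', '!', ' ', 'comment']
```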

--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #5
I found a complication with the shlex module. When I execute the following
fragment you'll notice that doubles are split into separate tokens. Is there
any way to avoid this?
source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"
print [tok for tok in toker]

Output:
['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
'.', '08E', '-', '6', '\n']
--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #6
Maarten van Reeuwijk <maarten@remove_this_ws.tn.tudelft.nl> schreef:
I found a complication with the shlex module. When I execute the
following fragment you'll notice that doubles are split into separate
tokens. Is there any way to avoid this?
From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>

wordchars
The string of characters that will accumulate into multi-character
tokens. By default, includes all ASCII alphanumerics and underscore.
source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"
toker.wordchars = toker.wordchars + ".-$" # etc.
print [tok for tok in toker]

Output:

['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

Is this what you want?
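The same wordchars fix on Python 3 (a sketch: StringIO is now io.StringIO and print is a function; everything else carries over):

```python
import io
import shlex

source = "Lz = 0.15\nnu = 1.08E-6\n"
toker = shlex.shlex(io.StringIO(source))
toker.commenters = ""            # don't strip '#'-style comments
toker.whitespace = " \t\r"       # keep '\n' as a token
toker.wordchars += ".-$"         # let '.', '-', '$' accumulate into words
toks = list(toker)
print(toks)  # -> ['Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']
```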

--
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9
Jul 18 '05 #7
