Looking for very simple general purpose tokenizer

Hi group,

I need to parse various text files in Python. I was wondering if there was a
general purpose tokenizer available. I know about split(), but this
(otherwise very handy) method does not allow me to specify a list of
splitting characters, only one at a time, and it removes my splitting
operators (fine for spaces and \n's, but not for =, /, etc.). Furthermore, I
tried tokenize, but that module is specific to Python source and is way too
heavy for me. I am looking for something like this:
splitchars = [' ', '\n', '=', '/', ....]
tokenlist = tokenize(rawfile, splitchars)

Is there something like this available inside Python or did anyone already
make this? Thank you in advance

Maarten
--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #1
Maarten van Reeuwijk wrote:
I need to parse various text files in python. I was wondering if there was a
general purpose tokenizer available. [...] Is there something like this
available inside Python or did anyone already make this?


You may use re.findall for that:
import re
s = "a = b+c; z = 34;"
pat = " |=|;|[^ =;]+"
re.findall(pat, s)

['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';']

The pattern basically says: match either a space, a '=', a ';', or a sequence
of one or more characters that are none of these (using '+' rather than '*'
avoids a spurious empty match at the end of the string). You may have to take
care beforehand about special characters like \n or \ (very special in
regular expressions)
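A pattern along these lines can be wrapped into the tokenize(rawtext, splitchars) helper Maarten asked for. A minimal sketch (the helper name and signature are his wish, not an existing stdlib API; re.escape takes care of the special characters, and it's written for Python 3):

```python
import re

def tokenize(rawtext, splitchars):
    # Build one character class from the split characters; re.escape keeps
    # regex-special characters like '\\' or '*' safe inside the class.
    cls = ''.join(re.escape(c) for c in splitchars)
    # Match either a single split character or a run of non-split characters,
    # so the splitting operators are kept as tokens instead of being dropped.
    pat = "[%s]|[^%s]+" % (cls, cls)
    return re.findall(pat, rawtext)

print(tokenize("a=b/c d", [' ', '\n', '=', '/']))
# -> ['a', '=', 'b', '/', 'c', ' ', 'd']
```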

HTH
--
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com

Jul 18 '05 #2
"Maarten van Reeuwijk" <maarten@remove_this_ws.tn.tudelft.nl> wrote in
message news:bu**********@news.tudelft.nl...
I need to parse various text files in python. I was wondering if there was a
general purpose tokenizer available. [...] Is there something like this
available inside Python or did anyone already make this?

Maarten -
Please give my pyparsing module a try. You can download it from SourceForge
at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
allows you to define your own parsing patterns for any text data file, and
the tokenized results are returned in a dictionary or list, as you prefer.
The download also includes several examples - one especially difficult
file-parsing problem is solved in the dictExample.py script. And if you get
stuck, send me a sample of what you are trying to parse, and I can try to
give you some pointers (or even tell you if pyparsing isn't the most
appropriate tool for your job - it happens sometimes!).

-- Paul McGuire

Austin, Texas, USA
Jul 18 '05 #3
Maarten van Reeuwijk wrote:
I need to parse various text files in python. I was wondering if
there was a general purpose tokenizer available.


Indeed there is: python comes with batteries included. Try the shlex
module.

http://www.python.org/doc/lib/module-shlex.html

Try the following code: it seems to do what you want. If it doesn't,
then please be more specific about your tokenisation rules.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
    for s in splitters: # resists People's Front of Judea joke ;-D
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
    print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that the use of the iteration based interface in the above code
requires python 2.3. If you need it to run on previous versions,
specify which one.
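On current Pythons the snippet above needs two mechanical changes (a sketch, assuming Python 3: StringIO lives in the io module and print is a function; the shlex logic itself is unchanged):

```python
import io
import shlex

splitchars = [' ', '\n', '=', '/']
source = "andso/shouldthis and=this"

# Prepend each split character to shlex's whitespace string, so the
# tokenizer splits on it (note: as whitespace, the character is discarded).
toker = shlex.shlex(io.StringIO(source))
for s in splitchars:
    if s not in toker.whitespace:
        toker.whitespace = s + toker.whitespace

tokens = list(toker)
print(tokens)  # -> ['andso', 'shouldthis', 'and', 'this']
```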

regards,

--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan
Jul 18 '05 #4
Thank you all for your very useful comments. Below I have included my
source. Could you comment on whether there's a more elegant way of
implementing the continuation character '&'?

With the RE implementation I have noticed that the position of the '*' in
spclist is very delicate. This order works, but other orders throw
exceptions. Is this correct, or is it a bug? Lastly, is there more
documentation and examples for the shlex module? Ideally I would like to
see a full-scale example of how this module should be used for parsing.

Maarten

import re
import shlex
import StringIO

def splitf90(source):
    buf = StringIO.StringIO(source)
    toker = shlex.shlex(buf)
    toker.commenters = "!"
    toker.whitespace = " \t\r"
    return processTokens(toker)

def splitf90_re(source):
    spclist = ['\*', '\+', '-', '/', '=', '\[', '\]', '\(', '\)',
               '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
    pat = '|'.join(spclist) + '|[^' + ''.join(spclist) + ']+'
    rawtokens = re.findall(pat, source)
    return processTokens(rawtokens)

def processTokens(rawtokens):
    # substitute characters
    subst1 = []
    prevtoken = None
    for token in rawtokens:
        if token == ';': token = '\n'
        if token == ' ': token = ''
        if token == '\n' and prevtoken == '&': token = ''
        if not token == '':
            subst1.append(token)
            prevtoken = token

    # remove continuation chars
    subst2 = []
    for token in subst1:
        if token == '&': token = ''
        if not token == '':
            subst2.append(token)

    # split into lines
    final = []
    curline = []
    for token in subst2:
        if not token == '\n':
            curline.append(token)
        else:
            if not curline == []:
                final.append(curline)
            curline = []

    return final

# Example session
src = """
MODULE modsize
implicit none

integer, parameter:: &
Nx = 256, &
Ny = 256, &
Nz = 256, &
nt = 1, & ! nr of (passive) scalars
Np = 16 ! nr of processors, should match mpirun -np .. command

END MODULE
"""
print splitf90(src)
print splitf90_re(src)

Output:
[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', 'Np', '=', '16'], ['END', 'MODULE']]

[['MODULE', 'modsize'], ['implicit', 'none'], ['integer', ',', 'parameter',
':', ':', 'Nx', '=', '256', ',', 'Ny', '=', '256', ',', 'Nz', '=', '256',
',', 'nt', '=', '1', ',', '!', 'nr', 'of', '(', 'passive', 'scalars'],
['Np', '=', '16', '!', 'nr', 'of', 'processors', ',', 'should', 'match',
'mpirun', '-', 'np', 'command'], ['END', 'MODULE']]
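The order sensitivity in spclist comes from the character class: between two characters inside [^...], '-' denotes a range (e.g. '+-/' means "every character from '+' to '/'"), and some orderings produce a "bad character range" error. Escaping every split character with re.escape makes the order irrelevant; a sketch (not part of the original post, written for Python 3):

```python
import re

# Same split characters as spclist, but unescaped; re.escape does the
# escaping, so '-' can sit anywhere in the list without forming a range.
spclist = ['*', '+', '-', '/', '=', '[', ']', '(', ')',
           '>', '<', '&', ';', ',', ':', '!', ' ', '\n']
cls = ''.join(re.escape(c) for c in spclist)
pat = '[%s]|[^%s]+' % (cls, cls)
result = re.findall(pat, "Nx = 256, & ! comment")
print(result)
# -> ['Nx', ' ', '=', ' ', '256', ',', ' ', '&', ' ', '!', ' ', 'comment']
```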

--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #5
I found a complication with the shlex module. When I execute the following
fragment you'll notice that doubles are split into separate tokens. Is there
any way to avoid this?
source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"
print [tok for tok in toker]

Output:
['\n', '$', 'NAMRUN', '\n', 'Lz', '=', '0', '.', '15', '\n', 'nu', '=', '1',
'.', '08E', '-', '6', '\n']
--
===================================================================
Maarten van Reeuwijk Heat and Fluid Sciences
Phd student dept. of Multiscale Physics
www.ws.tn.tudelft.nl Delft University of Technology
Jul 18 '05 #6
Maarten van Reeuwijk <maarten@remove_this_ws.tn.tudelft.nl> schreef:
I found a complication with the shlex module. When I execute the
following fragment you'll notice that doubles are split into separate
tokens. Is there any way to avoid this?
From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>

wordchars
The string of characters that will accumulate into multi-character
tokens. By default, includes all ASCII alphanumerics and underscore.
source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

import shlex
import StringIO

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker.commenters = ""
toker.whitespace = " \t\r"
toker.wordchars = toker.wordchars + ".-$" # etc.
print [tok for tok in toker]

Output:

['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

Is this what you want?
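The same wordchars fix on Python 3 (a sketch: StringIO is now io.StringIO and print is a function; everything else carries over):

```python
import io
import shlex

source = "Lz = 0.15\nnu = 1.08E-6\n"
toker = shlex.shlex(io.StringIO(source))
toker.commenters = ""            # don't strip '#'-style comments
toker.whitespace = " \t\r"       # keep '\n' as a token
toker.wordchars += ".-$"         # let '.', '-', '$' accumulate into words
toks = list(toker)
print(toks)  # -> ['Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']
```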

--
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9
Jul 18 '05 #7
