Bytes | Software Development & Data Engineering Community
Parsing a file based on differing delimiters

I have a text file where the fields are delimited in various different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline. This means I can't effectively use split() except on a small
scale. For most of the file I can just call one of several functions I
wrote that read in just as much data as is required from the input
string, and return the value and modified string. Much of the code
therefore looks like this:

filedata = file('whatever').read()
firstWord, filedata = GetWord(filedata)
nextNumber, filedata = GetNumber(filedata)

This works, but is obviously ugly. Is there a cleaner alternative that
can avoid me having to re-assign data all the time (one that will 'consume'
the value from the stream)? I'm a bit unclear on the whole passing by
value/reference thing. I'm guessing that while GetWord gets a
reference to the 'filedata' string, assigning to that will just reseat
the reference and not change the original string.

The other problem is that parts of the format are potentially repeated
an arbitrary number of times and therefore a degree of lookahead is
required. If I've already extracted a token and then find out I need
it, putting it back is awkward. Yet there is nowhere near enough
complexity or repetition in the file format to justify a formal
grammar or anything like that.

All in all, in the basic parsing code I am doing a lot more operations
on the input data than I would like. I can see how I'd encapsulate
this behind functions if I was willing to iterate through the data
character by character like I would in C++. But I am hoping that
Python can, as usual, save me from the majority of this drudgery
somehow.

Any help appreciated.

--
Ben Sizer
Jul 18 '05 #1
Kylotan wrote:
I have a text file where the fields are delimited in various different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline.

What sadist designed it? ;-) Anyway...

I suggest a simple class which holds the filedata and an index into
it. Your functions such as GetWord(f) examine f.data from f.index
onwards, and advance f.index before returning the result. To
"push back", you just restore f.index to its previous value (you may
want to keep a small stack of saved positions, perhaps just one, for
the "undo", again in the simple class in question).
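
Sketched minimally, that suggestion might look like this (the names and the single-slot undo are illustrative, not from the thread):

```python
class Stream:
    """Holds the whole file text plus a current index into it."""
    def __init__(self, data):
        self.data = data
        self.index = 0
        self.mark = 0  # previous position, for a one-deep "undo"

    def get_until(self, delim):
        """Return the text up to (not including) delim and advance past it."""
        self.mark = self.index
        end = self.data.find(delim, self.index)
        if end < 0:
            end = len(self.data)
        chunk = self.data[self.index:end]
        self.index = end + 1
        return chunk

    def pushback(self):
        """Undo the most recent get_until by restoring the saved index."""
        self.index = self.mark

def GetWord(f):
    return f.get_until("~")

s = Stream("hello~42 world~")
word = GetWord(s)   # "hello"
s.pushback()
again = GetWord(s)  # "hello" once more
```

Because the index lives on the object, callers no longer re-assign the string; GetWord(f) simply advances f in place.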
Alex

Jul 18 '05 #2
Kylotan wrote:
I have a text file where the fields are delimited in various different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline. This means I can't effectively use split() except on a small
scale. For most of the file I can just call one of several functions I
wrote that read in just as much data as is required from the input
string, and return the value and modified string. Much of the code
therefore looks like this:

filedata = file('whatever').read()
firstWord, filedata = GetWord(filedata)
nextNumber, filedata = GetNumber(filedata)

This works, but is obviously ugly. Is there a cleaner alternative that
can avoid me having to re-assign data all the time (one that will 'consume'
the value from the stream)? I'm a bit unclear on the whole passing by
value/reference thing. I'm guessing that while GetWord gets a
reference to the 'filedata' string, assigning to that will just reseat
the reference and not change the original string.
The strategy to rebind is to wrap the reference into a mutable object and
pass that object around instead of the original reference.
The other problem is that parts of the format are potentially repeated
an arbitrary number of times and therefore a degree of lookahead is
required. If I've already extracted a token and then find out I need
it, putting it back is awkward. Yet there is nowhere near enough
complexity or repetition in the file format to justify a formal
grammar or anything like that.

All in all, in the basic parsing code I am doing a lot more operations
on the input data than I would like. I can see how I'd encapsulate
this behind functions if I was willing to iterate through the data
character by character like I would in C++. But I am hoping that
Python can, as usual, save me from the majority of this drudgery
somehow.


I've made a little Reader class that should do what you want. Of course the
actual parsing routines will differ, depending on your file format.

<code>
class EndOfData(Exception):
    pass

class Reader:
    def __init__(self, data):
        self.data = data
        self.positions = [0]

    def _getChunk(self, delim):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            end = len(self.data)
        self.positions.append(end + 1)
        return self.data[start:end]

    def rest(self):
        return self.data[self.positions[-1]:]

    def rewind(self):
        self.positions = [0]

    def unget(self):
        self.positions.pop()

    def getString(self):
        return self._getChunk("~")

    def getInteger(self):
        chunk = self._getChunk(" ")
        try:
            return int(chunk)
        except ValueError:
            self.unget()
            raise

# example usage:

sample = "abc~123 456 rst"
r = Reader(sample)

commands = {
    "i": r.getInteger,
    "s": r.getString,
    "u": lambda: r.unget() or "#unget " + r.rest(),
}

for key in "ssuiisuuisi":
    try:
        print commands[key]()
    except ValueError:
        print "#error"
</code>

Peter
Jul 18 '05 #3
On 21 Oct 2003 15:21:13 -0700, ky*****@hotmail.com (Kylotan) wrote:
I have a text file where the fields are delimited in various different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline. This means I can't effectively use split() except on a small
scale. For most of the file I can just call one of several functions I
wrote that read in just as much data as is required from the input
string, and return the value and modified string. Much of the code
therefore looks like this:

filedata = file('whatever').read()
firstWord, filedata = GetWord(filedata)
nextNumber, filedata = GetNumber(filedata)

This works, but is obviously ugly. Is there a cleaner alternative that
can avoid me having to re-assign data all the time (one that will 'consume'
the value from the stream)? I'm a bit unclear on the whole passing by
value/reference thing. I'm guessing that while GetWord gets a
reference to the 'filedata' string, assigning to that will just reseat
the reference and not change the original string.

The other problem is that parts of the format are potentially repeated
an arbitrary number of times and therefore a degree of lookahead is
required. If I've already extracted a token and then find out I need
it, putting it back is awkward. Yet there is nowhere near enough
complexity or repetition in the file format to justify a formal
grammar or anything like that.

A generator can look ahead by holding put-back info in its own state
without yielding a result until it has decided what to do. It can read
input line-wise and scan lines for patterns and store ambiguous info
for re-analysis if backup is needed. You can go character by character
or whip through lines of comments in bigger chunks, and recognize
alternative patterns with regular expressions. There are lots of options.

A generator wouldn't have to put it back, but if that is a convenient
way to go, you can define one with a put-back stack or queue by
including a mutable for that purpose as one of the initial arguments
in the initial generator call.

Communicating clearly and precisely should be more than enough
justification IMO ;-)

What you've said above sounds like approximately:

kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*

If it's not that complicated, why not complete the picture? I'd bet
you'll get several versions of tokenizers/parsers for it, and questions
as to what you want to do with the pieces. Maybe a tokenizer as a
generator that gives you a sequence of (token_type, token_data) tuples
would work. If you have nested structures, you can define start-of-nest
and end-of-nest tokens as operator tokens like ( OP, '(' ) and ( OP, ')' ).

Look at Andrew Dalke's recent post for a number of ideas and code you
might snip and adapt to your problem (I think this shortened url will
get you there):

http://groups.google.com/groups?q=rp...n&lr=&ie=UTF-8

All in all, in the basic parsing code I am doing a lot more operations
on the input data than I would like. I can see how I'd encapsulate
this behind functions if I was willing to iterate through the data
character by character like I would in C++. But I am hoping that
Python can, as usual, save me from the majority of this drudgery
somehow.

I suspect you could recognize bigger chunks with regular expressions,
or at least split them apart by splitting on a regex of delimiters
(which you can preserve in the split list by enclosing in parens).
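
As a rough sketch of that tokenizer idea, a generator yielding (token_type, token_data) tuples; the regex and token names are invented here to match the description, not taken from the real file format:

```python
import re

# One alternative per field kind: tilde-terminated string,
# whitespace-terminated number, newline-terminated identifier.
TOKEN_RE = re.compile(r"([^~]*)~|(\d+)\s|(\w+)\n")

def tokenize(text):
    """Yield (token_type, token_data) tuples from the raw file text."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("unrecognized input at %r" % text[pos:pos + 10])
        if m.group(1) is not None:
            yield ("STRING", m.group(1))
        elif m.group(2) is not None:
            yield ("NUMBER", int(m.group(2)))
        else:
            yield ("IDENT", m.group(3))
        pos = m.end()

tokens = list(tokenize("abc~123 def\n"))
```

A consumer that needs lookahead can pull from this generator into a small put-back list, exactly as described above.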

Any help appreciated.

HTH

Regards,
Bengt Richter
Jul 18 '05 #4
bo**@oz.net (Bengt Richter) wrote in message news:<bn**********@216.39.172.122>...
A generator can look ahead by holding put-back info in its own state
without yielding a result until it has decided what to do. It can read
input line-wise and scan lines for patterns and store ambiguous info
for re-analysis if backup is needed. You can go character by character
or whip through lines of comments in bigger chunks, and recognize alternative
patterns with regular expressions. There are lots of options.
Sadly none of these options seem obvious to me :) Basically 90% of
the time, I know exactly what type to expect. Other times, I am gonna
get one of several things back, where sometimes one of those things is
actually part of something totally different, so I need to leave it
there for the next routine. How would that be done with a generator?
Communicating clearly and precisely should be more than enough justification IMO ;-)

What you've said above sounds like approximately:

kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*

If it's not that complicated, why not complete the picture?
Because it would be a fairly flat grammar where each non-terminal
symbol has a very long rule of almost exclusively terminal symbols
describing what it contains. There's no recursiveness and very little
iteration or alternation in here. With all this in mind, I'd rather
keep all the logic for reading and assigning values in one place
rather than going through a parser middleman which will complicate the
code. Traditional tokenizers and lexers are also of little use since
many of the tokens are context-dependent.
Look and Andrew Dalke's recent post for a number of ideas and code you might
snip and adapt to your problem


All I found in a short search was something complex that appeared to
be an expression parser, which is not really what I need here.

Thanks,

Ben Sizer
Jul 18 '05 #5
Peter,

Thanks for your reply. I will probably use something similar to this
in the end. However, I was wondering if there's an obvious
implementation of multiple delimiters for the _getChunk() function?
The most obvious and practical example would be the ability to get the
next chunk up to any sort of whitespace, not just a space.

--
Ben Sizer
Jul 18 '05 #6
Kylotan wrote:
in the end. However, I was wondering if there's an obvious
implementation of multiple delimiters for the _getChunk() function?
The most obvious and practical example would be the ability to get the
next chunk up to any sort of whitespace, not just a space.


As far as I know, nothing short of regular expressions will do.

import re

def _getChunk(self, expr):
    start = self.positions[-1]
    if start >= len(self.data):
        raise EndOfData
    match = expr.search(self.data, start)
    if match:
        end = match.start()
        self.positions.append(match.end())
    else:
        end = len(self.data)
        self.positions.append(end)
    return self.data[start:end]

This would be called, e.g. with one or more whitespace characters as
the delimiter:

whites = re.compile(r"\s+")

def getString(self):
    return self._getChunk(self.whites)
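
A standalone illustration of the related splitting-on-a-regex idea, where capturing parentheses keep the delimiters in the result list (the sample data is made up):

```python
import re

# Split on tilde, newline, or a run of whitespace, keeping each
# delimiter in the output list thanks to the capturing group.
DELIMS = re.compile(r"(~|\n|\s+)")

parts = DELIMS.split("abc~123 456\ndef")
# parts alternates field, delimiter, field, delimiter, ...
fields = parts[::2]
```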
Peter

Jul 18 '05 #7
ky*****@hotmail.com (Kylotan) wrote in message news:<15*************************@posting.google.com>...
bo**@oz.net (Bengt Richter) wrote in message news:<bn**********@216.39.172.122>...
Look and Andrew Dalke's recent post for a number of ideas and code you might
snip and adapt to your problem


All I found in a short search was something complex that appeared to
be an expression parser, which is not really what I need here.


To me it sounds like you need a parser, so why not just bite the bullet
and use one?

From your description of the file format earlier in this thread, it
sounds like you just need a straightforward LL parser.
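
For illustration only, a hand-rolled recursive-descent (LL-style) parser over an invented record layout; the real format isn't shown in the thread, so the field structure here is an assumption:

```python
class Parser:
    """Hand-written LL-style parser with a simple cursor into the text."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def until(self, delim):
        """Consume and return everything up to the next delim."""
        end = self.text.find(delim, self.pos)
        if end < 0:
            raise ValueError("expected %r" % delim)
        chunk = self.text[self.pos:end]
        self.pos = end + 1
        return chunk

    def record(self):
        # record: name '~' number ' '   (invented layout)
        name = self.until("~")
        number = int(self.until(" "))
        return (name, number)

    def parse(self):
        records = []
        while self.pos < len(self.text):
            records.append(self.record())
        return records

records = Parser("abc~1 def~2 ").parse()
```

Each grammar rule becomes one method, which keeps the "what comes next" logic in exactly one place per construct.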

Andrae
Jul 18 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.
