Parsing a file with iterators

Luis Zarrabeitia

I need to parse a text file. The format is something like this:

TYPE1 metadata
data line 1
data line 2
....
data line N
TYPE2 metadata
data line 1
....
TYPE3 metadata
....

And so on. The type and metadata determine how to parse the following data
lines. When the parser fails to parse one of the lines, the next parser is
chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).

This doesn't work:

===
for line in input:
    parser = parser_from_string(line)
    parser(input)
===

because when the parser iterates over the input, it can't know that it has
finished processing the section until it reads the next "TYPE" line (actually,
until it reads the first line that it cannot parse, which, if everything went
well, should be the 'TYPE' line), but once it reads that line, it is no longer
available to the outer loop. I wouldn't like to leak the internals of the
parsers to the outside.

What could I do?
(to the curious: the format is a dialect of the E00 used in GIS)

--
Luis Zarrabeitia
Facultad de Matemática y Computación, UH
http://profesores.matcom.uh.cu/~kyrie

Oct 17 '08 #1


Eddie Corns

Luis Zarrabeitia <ky***@uh.cu> writes:

>What could I do?
(to the curious: the format is a dialect of the E00 used in GIS)

One simple way is to allow your "input" iterator to support pushing values
back into the input stream as soon as it finds an input it can't handle.

See http://code.activestate.com/recipes/502304/ for an example.
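
A minimal sketch of the push-back idea, written for Python 3. This is not the code from the linked recipe; `PushbackIterator`, `parse_numbers` and `parse_file` are hypothetical names, and the "data lines are integers" rule is only an assumption made to keep the example short. The section parser hands back the first line it cannot handle, so the outer loop sees the TYPE line again.

```python
class PushbackIterator:
    """Iterator wrapper that lets a consumer hand items back."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []              # items handed back, served last-in first-out

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        return next(self._it)

    def push(self, item):
        self._pushed.append(item)


def parse_numbers(lines):
    """Hypothetical section parser: consumes lines that are integers."""
    values = []
    for line in lines:
        if not line.strip().isdigit():
            lines.push(line)           # not ours: hand it back and stop
            break
        values.append(int(line))
    return values


def parse_file(raw_lines):
    lines = PushbackIterator(raw_lines)
    sections = []
    for line in lines:
        if not line.startswith("TYPE"):
            raise ValueError("expected a TYPE line, got %r" % line)
        type_id, metadata = line.split(" ", 1)
        sections.append((type_id, metadata.strip(), parse_numbers(lines)))
    return sections
```

The parsers only need to know that the input has a `push` method; nothing about how they recognise their own lines leaks to the outer loop.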

Oct 17 '08 #2

Marc 'BlackJack' Rintsch

On Fri, 17 Oct 2008 11:42:05 -0400, Luis Zarrabeitia wrote:


What could I do?
(to the curious: the format is a dialect of the E00 used in GIS)

Group the lines before processing and feed each group to the right parser:

# Python 2 code (print statement, itertools.imap)
import sys
from itertools import groupby, imap
from operator import itemgetter

def parse_a(metadata, lines):
    print 'parser a', metadata
    for line in lines:
        print 'a', line

def parse_b(metadata, lines):
    print 'parser b', metadata
    for line in lines:
        print 'b', line

def parse_c(metadata, lines):
    print 'parser c', metadata
    for line in lines:
        print 'c', line

def test_for_type(line):
    return line.startswith('TYPE')

def parse(lines):
    # Pair every data line with the TYPE line that precedes it.
    def tag():
        type_line = None
        for line in lines:
            if test_for_type(line):
                type_line = line
            else:
                yield (type_line, line)

    type2parser = {'TYPE1': parse_a,
                   'TYPE2': parse_b,
                   'TYPE3': parse_c }

    # Consecutive pairs sharing a TYPE line form one section.
    for type_line, group in groupby(tag(), itemgetter(0)):
        type_id, metadata = type_line.split(' ', 1)
        type2parser[type_id](metadata, imap(itemgetter(1), group))

def main():
    parse(sys.stdin)

if __name__ == '__main__':
    main()
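
The code above is Python 2. For anyone on Python 3, where `itertools.imap` is gone and `print` is a function, the same tag-then-group idea might look roughly like this. It is only a sketch: the `collect` parser and the `parsers` argument are made up for illustration, and results are returned instead of printed so the behaviour is easy to check.

```python
from itertools import groupby
from operator import itemgetter


def tag(lines):
    """Pair every data line with the most recent TYPE line."""
    type_line = None
    for line in lines:
        if line.startswith("TYPE"):
            type_line = line
        else:
            yield type_line, line


def parse(lines, parsers):
    results = []
    for type_line, group in groupby(tag(lines), key=itemgetter(0)):
        if type_line is None:
            raise ValueError("data line before any TYPE line")
        type_id, metadata = type_line.rstrip("\n").split(" ", 1)
        data = map(itemgetter(1), group)    # lazy in Python 3, like imap in Python 2
        results.append(parsers[type_id](metadata, data))
    return results


def collect(metadata, lines):
    """Hypothetical parser: just gathers the stripped data lines."""
    return metadata, [line.strip() for line in lines]
```

Two caveats follow from grouping on the TYPE line itself: a section with no data lines never shows up in the output, and two adjacent sections whose TYPE lines are identical are merged into one group.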

Oct 17 '08 #3

Paul McGuire

On Oct 17, 10:42 am, Luis Zarrabeitia <ky...@uh.cu> wrote:


<snip>

Pyparsing will take care of this for you, if you define a set of
alternatives and then parse/search for them. Here is an annotated
example. Note the ability to attach names to different fields of the
parser, and then how those fields are accessed after parsing.

"""
TYPE1 metadata
data line 1
data line 2
....
data line N
TYPE2 metadata
data line 1
....
TYPE3 metadata
....
"""

from pyparsing import *

# define basic element types to be used in data formats
integer = Word(nums)
ident = Word(alphas) | quotedString.setParseAction(removeQuotes)
zipcode = Combine(Word(nums,exact=5) + Optional("-" +
    Word(nums,exact=4)))
stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
    FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
    MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
    VA VI VT WA WI WV WY""".split())

# define data format for each type
DATA = Suppress("data")
type1dataline = Group(DATA + OneOrMore(integer))
type2dataline = Group(DATA + delimitedList(ident))
type3dataline = DATA + countedArray(ident)

# define complete expressions for each type - note different types
# may have different metadata
type1data = "TYPE1" + ident("name") + \
    OneOrMore(type1dataline)("data")
type2data = "TYPE2" + ident("name") + zipcode("zip") + \
    OneOrMore(type2dataline)("data")
type3data = "TYPE3" + ident("name") + stateAbbreviation("state") + \
    OneOrMore(type3dataline)("data")

# expression containing all different type alternatives
data = type1data | type2data | type3data

# search a test input string and dump the matched tokens by name
testInput = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 1 2 3 5 8 13 21
data 1 4 9 16 25 36
data 1 2 4 8 16 32 64
TYPE2 Benjamin 78704
data Larry, Curly, Moe
data Hewey,Dewey ,Louie
data Tom , Dick, Harry, Fred
data Thelma,Louise
TYPE3 Christopher WA
data 3 "Raspberry Red" "Lemon Yellow" "Orange Orange"
data 7 Grumpy Sneezy Happy Dopey Bashful Sleepy Doc
"""
for tokens in data.searchString(testInput):
    print tokens.dump()
    print tokens.name
    if tokens.state: print tokens.state
    for d in tokens.data:
        print " ",d
    print

Prints:

['TYPE1', 'Abercrombie', ['400', '26', '42', '66'], ['1', '1', '2',
'3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1',
'2', '4', '8', '16', '32', '64']]
- data: [['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8',
'13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8',
'16', '32', '64']]
- name: Abercrombie
Abercrombie
['400', '26', '42', '66']
['1', '1', '2', '3', '5', '8', '13', '21']
['1', '4', '9', '16', '25', '36']
['1', '2', '4', '8', '16', '32', '64']

['TYPE2', 'Benjamin', '78704', ['Larry', 'Curly', 'Moe'], ['Hewey',
'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma',
'Louise']]
- data: [['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'],
['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
- name: Benjamin
- zip: 78704
Benjamin
['Larry', 'Curly', 'Moe']
['Hewey', 'Dewey', 'Louie']
['Tom', 'Dick', 'Harry', 'Fred']
['Thelma', 'Louise']

['TYPE3', 'Christopher', 'WA', ['Raspberry Red', 'Lemon Yellow',
'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful',
'Sleepy', 'Doc']]
- data: [['Raspberry Red', 'Lemon Yellow', 'Orange Orange'],
['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
- name: Christopher
- state: WA
Christopher
WA
['Raspberry Red', 'Lemon Yellow', 'Orange Orange']
['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']

More info on pyparsing at http://pyparsing.wikispaces.com.

-- Paul

Oct 17 '08 #4

James Harris

On 17 Oct, 16:42, Luis Zarrabeitia <ky...@uh.cu> wrote:


What could I do?
(to the curious: the format is a dialect of the E00 used in GIS)

The main issue seems to be that you need to keep the 'current' line
data when a parser has decided it doesn't understand it so it can
still be used to select the next parser. The for loop in your example
uses the next() method which only returns the next and never the
current line. There are two easy options though:

1. Wrap the input file with your own object.
2. Use the linecache module and maintain a line number.

http://blog.doughellmann.com/2007/04...linecache.html
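
For what it's worth, option 2 could be sketched along these lines (Python 3). The parser protocol here is an assumption made for illustration: each hypothetical parser takes a file name and a line number and returns the number of the first line it could not handle, so the "current" line is never lost. `linecache.getline` is the real standard-library call; it returns an empty string past the end of the file.

```python
import linecache


def parse_numbers(filename, lineno):
    """Hypothetical section parser: reads integer lines starting at lineno.

    Returns the parsed values and the number of the first line it could
    not handle, so the caller can look at that line again.
    """
    values = []
    while True:
        line = linecache.getline(filename, lineno)
        if not line.strip().isdigit():     # '' at end of file also stops here
            return values, lineno
        values.append(int(line))
        lineno += 1


def parse_file(filename):
    sections = []
    lineno = 1                             # linecache counts lines from 1
    while True:
        line = linecache.getline(filename, lineno)
        if not line:                       # past the end of the file
            return sections
        if not line.startswith("TYPE"):
            raise ValueError("line %d: expected a TYPE line" % lineno)
        type_id, metadata = line.split(" ", 1)
        values, lineno = parse_numbers(filename, lineno + 1)
        sections.append((type_id, metadata.strip(), values))
```

Bear in mind that linecache keeps all lines of the file in memory, so this suits files of moderate size better than very large ones.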

--
HTH,
James

Oct 17 '08 #5

George Sakkis

On Oct 17, 12:45 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:


Group the lines before processing and feed each group to the right parser:

[…]

I like groupby and find it very powerful but I think it complicates
things here instead of simplifying them. I would instead create a
parser instance for every section as soon as the TYPE line is read and
then feed it one data line at a time (or if all the data lines must or
should be given at once, append them in a list and feed them all as
soon as the next section is found), something like:

class parse_a(object):
    def __init__(self, metadata):
        print 'parser a', metadata
    def parse(self, line):
        print 'a', line

# similar for parse_b and parse_c
# ...

def parse(lines):
    parse = None
    for line in lines:
        if test_for_type(line):
            type_id, metadata = line.split(' ', 1)
            parse = type2parser[type_id](metadata).parse
        else:
            parse(line)
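
The other variant mentioned above (collect the data lines in a list and hand them over once the next section starts) might look something like this in Python 3. It is a sketch only: `CollectingParser`, its `parse_all` method and the `type2parser` argument are hypothetical names chosen for the example.

```python
class CollectingParser:
    """Hypothetical parser that wants all of its data lines in one call."""

    def __init__(self, metadata):
        self.metadata = metadata

    def parse_all(self, lines):
        return self.metadata, [line.strip() for line in lines]


def parse(lines, type2parser):
    results = []
    parser = None
    buffered = []
    for line in lines:
        if line.startswith("TYPE"):
            if parser is not None:          # a new section closes the old one
                results.append(parser.parse_all(buffered))
            type_id, metadata = line.rstrip("\n").split(" ", 1)
            parser = type2parser[type_id](metadata)
            buffered = []
        elif parser is None:
            raise ValueError("data line before any TYPE line")
        else:
            buffered.append(line)
    if parser is not None:                  # flush the final section
        results.append(parser.parse_all(buffered))
    return results
```

The last section has to be flushed after the loop, since no further TYPE line arrives to close it. A section without data lines still reaches its parser, with an empty list.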

George

Oct 18 '08 #6
