
Python's CSV reader

I'm fairly new to python and am working on parsing some delimited text
files. I noticed that there's a nice CSV reading/writing module
included in the libraries.

My data files, however, are odd in that they are composed of lines with
alternating formats. (Essentially the rows are a header record and a
corresponding detail record on the next line. Each line type has a
different number of fields.)

Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Thanks for your insight,
Stephan

Aug 4 '05 #1
8 Replies


Stephan wrote:
Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?


Well, readlines/split really isn't bad. So long as the file fits
comfortably in memory:

import csv

# "filename" is the path to your data file
fi = open(filename)
lines = fi.readlines()
evens = iter(lines[0::2])
odds = iter(lines[1::2])
csv1 = csv.reader(evens)
csv2 = csv.reader(odds)

The trick is that the "csvfile" passed to csv.reader doesn't have to be a
real file; it just has to be an iterator that returns strings. If the
file is too big to fit in memory, you could piece together a pair of
iterators that read lines from the file lazily instead.
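
One possible way to build such a pair of lazy iterators is sketched below. It is written for Python 3 and uses itertools.tee and itertools.islice; the helper name paired_readers and the sample data are made up for illustration. It assumes no quoted field spans more than one line, and that the two readers are consumed in lockstep (otherwise tee has to buffer the lines one reader has not seen yet).

```python
import csv
import io
from itertools import islice, tee


def paired_readers(lines, **fmtparams):
    """Return two csv readers: one over lines 0, 2, 4, ... and one over 1, 3, 5, ..."""
    a, b = tee(lines)  # two independent views of the same line stream
    headers = csv.reader(islice(a, 0, None, 2), **fmtparams)
    details = csv.reader(islice(b, 1, None, 2), **fmtparams)
    return headers, details


# Hypothetical sample input; a real file object opened for reading works the same way.
sample = io.StringIO("John|Smith\nBeef|Potatos\nSusan|Jones\nChicken|Peas\n")
headers, details = paired_readers(sample, delimiter="|")
pairs = list(zip(headers, details))
```

Iterating with zip keeps the two readers in step, so only a line or two is held in memory at any time.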
Aug 4 '05 #2

Stephan wrote:
Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?


Yes, it can:

import csv
import sys

reader = csv.reader(sys.stdin)

while True:
    try:
        names = reader.next()
        values = reader.next()
    except StopIteration:
        break
    print dict(zip(names, values))

Python offers an elegant way to do the same using the zip() or
itertools.izip() function:

import csv
import sys
from itertools import izip

reader = csv.reader(sys.stdin)

for names, values in izip(reader, reader):
    print dict(izip(names, values))
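
For readers on Python 3, where the built-in zip() is already lazy and itertools.izip no longer exists, the same pairing idea might look like the sketch below. The sample input is invented for illustration. Passing the same reader to zip() twice makes each step pull two consecutive rows from it.

```python
import csv
import io

# Hypothetical input: each record is a line of names followed by a line of values.
sample = io.StringIO("name,age\nAlice,30\ncity,zip\nParis,75001\n")
reader = csv.reader(sample)

# zip(reader, reader) takes one row for "names", then the next row for "values".
records = [dict(zip(names, values)) for names, values in zip(reader, reader)]
```

Note that a trailing names line with no values line is silently dropped by zip().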

Now let's add some minimal error checking, and we are done:

import csv
import sys
from itertools import izip, chain

def check_orphan():
    raise Exception("Unexpected end of input")
    yield None

reader = csv.reader(sys.stdin)
for names, values in izip(reader, chain(reader, check_orphan())):
    if len(names) != len(values):
        if len(names) > len(values):
            raise Exception("More names than values")
        else:
            raise Exception("More values than names")
    print dict(izip(names, values))

Peter

Aug 4 '05 #3

Stephan wrote:
I'm fairly new to python and am working on parsing some delimited text
files. I noticed that there's a nice CSV reading/writing module
included in the libraries.

My data files, however, are odd in that they are composed of lines with
alternating formats. (Essentially the rows are a header record and a
corresponding detail record on the next line. Each line type has a
different number of fields.)

Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Thanks for your insight,
Stephan


The csv module should be suitable. The reader just takes each line,
parses it, then returns a list of strings. It doesn't matter if
different lines have different numbers of fields.

To get an idea of what I mean, try something like the following
(untested):

import csv

reader = csv.reader(open(filename))

while True:

    # Read the next "header" line; if there isn't one, exit the loop.
    # (The reader raises StopIteration at end of input rather than
    # returning an empty row.)
    try:
        header = reader.next()
    except StopIteration:
        break

    # Assume that there is a "detail" line if the preceding
    # "header" line exists
    detail = reader.next()

    # Print the parsed data
    print '-' * 40
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)

You could wrap this up into a class which returns (header, detail) pairs
and does better error handling, but the above code should illustrate the
basics.
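
As a rough illustration of that wrapping idea, here is a Python 3 sketch that uses a generator function rather than a class. The name header_detail_pairs and the sample data are hypothetical, and the only error handling shown is detecting a header line that has no detail line after it.

```python
import csv
import io


def header_detail_pairs(lines, **fmtparams):
    """Yield (header, detail) row pairs from alternating lines."""
    reader = csv.reader(lines, **fmtparams)
    for header in reader:
        # Pull the matching detail row; None means the input ended early.
        detail = next(reader, None)
        if detail is None:
            raise ValueError(
                "line %d: header without a detail line" % reader.line_num
            )
        yield header, detail


# Hypothetical sample input for illustration.
sample = io.StringIO("John|Smith\nBeef|Potatos|Dinner Roll|Ice Cream\n")
pairs = list(header_detail_pairs(sample, delimiter="|"))
```

A generator keeps the pairing logic in one place while still reading the file one pair at a time.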

--
Andrew McLean
Aug 4 '05 #4

Thank you all for these interesting examples and methods!

Supposing I want to use DictReader to bring in the CSV lines and tie
them to field names, (again, with alternating lines having different
fields), should I use two DictReaders as in Christopher's example
or is there a better way?

--
Stephan

Aug 4 '05 #5

Stephan wrote:
Thank you all for these interesting examples and methods!

You're welcome.

Supposing I want to use DictReader to bring in the CSV lines and tie
them to field names, (again, with alternating lines having different
fields), should I use two DictReaders as in Christopher's example
or is there a better way?


For a clean design you would need not just two DictReader instances, but one
DictReader for every two lines.
However, with the current DictReader implementation, the following works,
too:

import csv
import sys

reader = csv.DictReader(sys.stdin)

for record in reader:
    print record
    reader.fieldnames = None

Peter

Aug 4 '05 #6

Stephan wrote:
Thank you all for these interesting examples and methods!


You are welcome. One point. I think there have been at least two
different interpretations of precisely what your task is.

I had assumed that all the different "header" lines contained data for
the same fields in the same order, and similarly that all the "detail"
lines contained data for the same fields in the same order.

However, I think Peter has answered on the basis that you have records
consisting of pairs of lines, the first line being a header containing
field names specific to that record with the second line containing the
corresponding data.

It would help if you let us know which (if any) was correct.

--
Andrew McLean
Aug 5 '05 #7

Andrew McLean wrote:
You are welcome. One point. I think there have been at least two
different interpretations of precisely what your task is.

I had assumed that all the different "header" lines contained data for
the same fields in the same order, and similarly that all the "detail"
lines contained data for the same fields in the same order.


Indeed, you are correct. Peter's version is interesting in its own
right, but not precisely what I had in mind. However, from his example
I saw what I was missing: I didn't realize that you could reassign the
DictReader field names on the fly. Here is a rudimentary example of my
working code and the data it can parse.

-------------------------------------
John|Smith
Beef|Potatos|Dinner Roll|Ice Cream
Susan|Jones
Chicken|Peas|Biscuits|Cake
Roger|Miller
Pork|Salad|Muffin|Cookies
-------------------------------------

import csv

HeaderFields = ["First Name", "Last Name"]
DetailFields = ["Entree", "Side Dish", "Starch", "Dessert"]

reader = csv.DictReader(open("testdata.txt"), [], delimiter="|")

while True:
    try:
        # Read next "header" line (if there isn't one then exit the loop)
        reader.fieldnames = HeaderFields
        header = reader.next()

        # Read the next "detail" line
        reader.fieldnames = DetailFields
        detail = reader.next()

        # Print the parsed data
        print '-' * 40
        print "Header (%d fields): %s" % (len(header), header)
        print "Detail (%d fields): %s" % (len(detail), detail)

    except StopIteration:
        break
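
For comparison, a Python 3 take on the same approach might look like the sketch below. It relies on DictReader.fieldnames being assignable between reads, which holds in CPython's implementation but should be treated as an assumption here. The function name read_pairs is hypothetical, and the sample data is inlined instead of being read from testdata.txt.

```python
import csv
import io

HEADER_FIELDS = ["First Name", "Last Name"]
DETAIL_FIELDS = ["Entree", "Side Dish", "Starch", "Dessert"]


def read_pairs(lines):
    """Yield (header, detail) dicts, switching field names before each read."""
    reader = csv.DictReader(lines, fieldnames=HEADER_FIELDS, delimiter="|")
    while True:
        reader.fieldnames = HEADER_FIELDS
        try:
            header = next(reader)
        except StopIteration:
            return  # clean end of input
        reader.fieldnames = DETAIL_FIELDS
        try:
            detail = next(reader)
        except StopIteration:
            raise ValueError("header line without a detail line")
        yield header, detail


# Sample data inlined for illustration.
sample = io.StringIO(
    "John|Smith\n"
    "Beef|Potatos|Dinner Roll|Ice Cream\n"
    "Susan|Jones\n"
    "Chicken|Peas|Biscuits|Cake\n"
)
pairs = list(read_pairs(sample))
```

Unlike the loop above, this version treats a header with no following detail line as an error instead of quietly stopping.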

Regards,
-Stephan

Aug 8 '05 #8

Stephan wrote:
DictReader field names on the fly. Here is a rudimentary example of my
working code and the data it can parse.

-------------------------------------
John|Smith
Beef|Potatos|Dinner Roll|Ice Cream
Susan|Jones
Chicken|Peas|Biscuits|Cake
Roger|Miller
Pork|Salad|Muffin|Cookies
-------------------------------------


That sample data would have been valuable information in your original post.
Here's what becomes of your code if you apply the "zip trick" from my first
post (yes, I am sometimes stubborn):

import itertools
import csv

HeaderFields = ["First Name", "Last Name"]
DetailFields = ["Entree", "Side Dish", "Starch", "Dessert"]

instream = open("testdata.txt")

heads = csv.DictReader(instream, HeaderFields, delimiter="|")
details = csv.DictReader(instream, DetailFields, delimiter="|")

for header, detail in itertools.izip(heads, details):
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)

Peter

Aug 8 '05 #9
