473,386 Members | 1,873 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

Unexpected behaviour of csv module

I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in
the final field value returned, even though it is outside of the quotes.
I've produced a test program (see below) that demonstrates this. There
is a workaround, which is to not pass the csv reader the file iterator,
but rather a generator that returns lines from the file with the
trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the
field separator. They are also included in the preceding field value,
even if they are outside the quotations. My workaround wouldn't help here.

Anyway is this a bug or a feature? If it is a feature then I'm curious
as to why it is considered desirable behaviour.

- Andrew

import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "wb")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.close()

# Function to test a reader
def read_and_print(reader):
for line in reader:
print ",".join(['"%s"' % field for field in line])

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

# Now the workaround: a generator to strip the strings before the reader
decodes them
def stripped(input):
for line in input:
yield line.strip()
reader = csv.reader(stripped(open("test_data.csv", "rb")))
read_and_print(reader)

# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
read_and_print(reader)
Sep 24 '06 #1
7 2760
Andrew McLean wrote:
I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in
the final field value returned, even though it is outside of the quotes.
I've produced a test program (see below) that demonstrates this. There
is a workaround, which is to not pass the csv reader the file iterator,
but rather a generator that returns lines from the file with the
trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the
field separator. They are also included in the preceding field value,
even if they are outside the quotations. My workaround wouldn't help here.
A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]
>
Anyway is this a bug or a feature? If it is a feature then I'm curious
as to why it is considered desirable behaviour.
IMHO, a bug. In that state, it should be expecting another quotechar, a
delimiter, or a lineterminator. A case could be made for either (a)
ignore space characters (b) raise an exception (c) a or b depending on
an arg ..., ignore_trailing_space=False.

But it gets even more bizarre; see output from revised test script:

DOS_prompt>cat amclean2.py
import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "w")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.write('"g"xxx,"h" yyy,"i"zzz \n')
fout.write('Fred "Supercoder" Nerk,p,q\n')
fout.write('Fred "Supercoder\' Nerk,p,q\n')
fout.write('Fred \'Supercoder" Nerk,p,q\n')
fout.write('"Fred "Supercoder" Nerk",p,q\n')
fout.write('"Fred "Supercoder\' Nerk",p,q\n')
fout.write('"Fred \'Supercoder" Nerk",p,q\n')
fout.write('"Emoh Ruo", 123 Smith St, Sometown,p,q\n')
fout.write('""Emoh Ruo", 123 Smith St, Sometown","p","q"\n')
fout.close()

# Function to test a reader
def read_and_print(reader):
for line in reader:
# print ",".join(['"%s"' % field for field in line])
# sheesh
print repr(line)

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

DOS_prompt>\python25\python amclean2.py
['Field1', 'Field2', 'Field3 ']
['a', 'b', 'c ']
['d ', 'e', 'f ']
['gxxx', 'h yyy', 'izzz ']
['Fred "Supercoder" Nerk', 'p', 'q']
['Fred "Supercoder\' Nerk', 'p', 'q']
['Fred \'Supercoder" Nerk', 'p', 'q']
['Fred Supercoder" Nerk"', 'p', 'q']
['Fred Supercoder\' Nerk"', 'p', 'q']
['Fred \'Supercoder Nerk"', 'p', 'q']
['Emoh Ruo', ' 123 Smith St', ' Sometown', 'p', 'q']
['Emoh Ruo"', ' 123 Smith St', ' Sometown"', 'p', 'q']

Input like the 4th line (and subsequent lines) in the test file can not
have been produced by code which was following the usual algorithm for
quoting CSV fields. Either it is *concatenating* properly-quoted
segments (unlikely) or it is not doing CSV quoting at all or it is
blindly wrapping quotes around the field without doubling internal
quotes.

IMHO such problems should not be silently ignored.
# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
lineterminator is silently ignored by the reader.

Cheers,
John

Sep 25 '06 #2
John Machin wrote:
A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]
The problem with removing the spaces after they are received from the
csv reader is if you want to use DictReader. I like to use DictReader,
without passing it the field list. The module then reads the field list
from the first line, and in this situation you don't get an opportunity
to strip the spaces from that.
Sep 25 '06 #3

Andrew McLean wrote:
John Machin wrote:
A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]

The problem with removing the spaces after they are received from the
csv reader is if you want to use DictReader. I like to use DictReader,
without passing it the field list. The module then reads the field list
from the first line, and in this situation you don't get an opportunity
to strip the spaces from that.
You can fix that. The beauty of open source is that you can grab it
(Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
suits your needs. Go fer it!

Cheers,
John

Sep 25 '06 #4
One could argue that your CSV file is broken. Of course, since CSV is a
poorly specified format, that's a pretty weak statement. I don't remember
just what your original problem was, but it concerned itself with white
space as I recall. Have you tried setting the skipinitialspace parameter in
your call to create a reader object?

Skip

Sep 25 '06 #5
sk**@pobox.com wrote:
One could argue that your CSV file is broken.
Hi Skip,

His CSV file is mildly broken. The examples that I gave are even more
broken, and are typical of real world files created by clueless
developers from databases which contain quotes and commas in the data
(e.g. addresses). The brokenness is not the point at issue. The point
is that the csv module is weakly silent about the brokenness and in
some cases munges the data even further.
Of course, since CSV is a
poorly specified format, that's a pretty weak statement.
It would help if the csv module docs did specify what format it
expects/allows on reading, and what it does on writing. How to quote a
field properly isn't all that mindbogglingly difficult (leaving out
options like escapechar and more-than-minimal quoting):

qc = quotechar
if qc in field:
out = qc + field.replace(qc, qc+qc) + qc
elif delimiter in field or '\n' in field or '\r' in field:
out = qc + field + qc
else:
out = field

Notice how if there are any special characters in the input, the output
has a quotechar at each end. If not, it's broken, and detectably
broken:
abc"def
___^ unexpected quote inside unquoted field
"abc"def
_____^ after quote, expected quote, delimiter, or end-of-line

I don't remember
just what your original problem was, but it concerned itself with white
space as I recall. Have you tried setting the skipinitialspace parameter in
your call to create a reader object?
The problem has nothing to do with *initial* spaces; the OP's problem
cases involved *trailing* spaces. And that's only a subset of the real
problem: casual attitude towards fields that contain quotes but don't
start and/or end with quotes i.e. they have *not* been created by
applying the usual quoting algorithm to raw data.

HTH,
John

Sep 25 '06 #6
John Machin wrote:
You can fix that. The beauty of open source is that you can grab it
(Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
suits your needs. Go fer it!
Unfortunately the bits I should change are in _csv.c and, as I'm not
very proficient at C, that wouldn't be a good idea. Anyway, for the
specific brokenness of my CSV file, the simple workaround from my
original post is fine.
Sep 26 '06 #7
Andrew McLean wrote:
John Machin wrote:
You can fix that. The beauty of open source is that you can grab it
(Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
suits your needs. Go fer it!

Unfortunately the bits I should change are in _csv.c
We must be talking about different things. IIRC we were talking about
stripping spaces off each field, in your own code. You said that
DictReader was a problem. This can be fixed by changing each of the 3
occurrences of
row = self.reader.next()
in the DictReader.next method to
row = [x.strip() for x in self.reader.next()]
or, less kludgedly, by defining a function or method to do that.

Cheers,
John

Sep 26 '06 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Simon Wittber | last post by:
Greetings fellow Python People! I am using the latest debian-unstable Apache 1.3 and mod-python 2.7 with python 2.3.3 Yesterday I wrote a small mod-python script which returned the URL...
2
by: Gerhard Esterhuizen | last post by:
Hi, I am observing unexpected behaviour, in the form of a corrupted class member access, from a simple C++ program that accesses an attribute declared in a virtual base class via a chain of...
4
by: | last post by:
I have earlier used an HttpModule that did URL rewrites on the BeginRequest event. Now I am trying to use the same module in a different application on a new and upgraded machine (winxp sp2). ...
3
by: rimmer | last post by:
I'm writing an extension module in C in which I'm passing an array of floats from C to python. The code below illustrates a simple C function designed to output an array of floats. ---------...
4
by: conan | last post by:
This regexp '<widget class=".*" id=".*">' works well with 'grep' for matching lines of the kind <widget class="GtkWindow" id="window1"> on a XML .glade file However that's not true for the...
5
by: Ivan Velev | last post by:
Hello, Minimal example below - it gives me different output if I comment / uncomment the extra time.mktime call - note that this call is not related in any way to main logic flow. When...
0
by: MRAB | last post by:
I'm working on the sources for the regex module (_sre.c) but I've come across some behaviour that I wasn't aware of before: ('b', 'a') The regex module was modified to return this instead of...
3
by: rs387 | last post by:
Hi, I've found the following behaviour on importing a variable from a module somewhat odd. The behaviour is identical in Python 2.5 and 3.0b2. In summary, here's what happens. I have a...
10
by: nisp | last post by:
Hi all ! I'm trying to capture stderr of an external module I use in my python program. I'm doing this by setting up a class in my module overwriting the stderr file object method write. The...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.