I have a bunch of csv files that have the following characteristics:
- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline
What surprised me was that the csv reader included the trailing space in
the final field value returned, even though it is outside of the quotes.
I've produced a test program (see below) that demonstrates this. There
is a workaround, which is to not pass the csv reader the file iterator,
but rather a generator that returns lines from the file with the
trailing space stripped.
Interestingly, the same behaviour is seen if there are spaces before the
field separator: they are also included in the preceding field value,
even though they are outside the quotes. My workaround wouldn't help here.
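A workaround that might cover both cases is to remove the whitespace between a closing quote and the next comma or end of line before the reader sees it. This is only a rough sketch, written for Python 3: the helper name strip_after_quotes and the regular expression are my own, and it assumes no field contains an embedded quote or newline, and no field starts with whitespace followed by a comma (the pattern would misfire on those).

```python
import csv
import io
import re

# Whitespace that sits after a double quote and before a comma or end of line.
# Assumption: every such quote is a closing quote (see the caveats above).
_AFTER_QUOTE = re.compile(r'"[ \t]+(?=,|\r?\n?$)')

def strip_after_quotes(lines):
    """Yield each line with whitespace after closing quotes removed."""
    for line in lines:
        yield _AFTER_QUOTE.sub('"', line)

# Same contents as the test file, held in memory for illustration.
sample = io.StringIO(
    '"Field1","Field2","Field3" \n'
    '"a","b","c" \n'
    '"d" ,"e","f" \n'
)
rows = list(csv.reader(strip_after_quotes(sample)))
```

Spaces inside the quotes are left alone, since the pattern only matches whitespace that directly follows a quote character.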
Anyway, is this a bug or a feature? If it is a feature, then I'm curious
as to why it is considered desirable behaviour.
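One way to probe the question is the reader's strict dialect parameter. As far as I can tell, the default lenient mode appends anything that follows a closing quote to the field, whereas strict=True rejects such input with csv.Error. A small sketch, written for Python 3, with the sample line taken from the test data:

```python
import csv

line = ['"d" ,"e","f" \n']

# Default (lenient) parsing: the stray spaces end up in the field values.
lenient = list(csv.reader(line))

# Strict parsing: the same input is expected to raise csv.Error instead.
try:
    list(csv.reader(line, strict=True))
    strict_error = None
except csv.Error as exc:
    strict_error = exc
```

Here lenient holds the fields with the spaces attached, and strict_error holds the exception raised in strict mode.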
- Andrew
import csv
filename = "test_data.csv"
# Generate a test file - note the spaces before the newlines
fout = open(filename, "wb")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.close()
# Function to test a reader
def read_and_print(reader):
    for line in reader:
        print ",".join(['"%s"' % field for field in line])
# Read the test file - and print the output
reader = csv.reader(open(filename, "rb"))
read_and_print(reader)
# Now the workaround: a generator to strip the strings before the reader
# decodes them
def stripped(input):
    for line in input:
        yield line.strip()
reader = csv.reader(stripped(open("test_data.csv", "rb")))
read_and_print(reader)
# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
read_and_print(reader)