Troubles with CSV file

Hello!

I have a big CSV file that I must read and process.
Unfortunately, I can't figure out how to use the standard *csv* module in my
situation. The problem is that some records look like:

""read this, man"", 1

which should be decoded into:

"read this, man"
1

... which looks pretty "natural" to me. Instead I got:

read this
man""
1

output. In other words, the csv reader does not understand the use of "" here.
A quick experiment showed me that the *csv* module (with the default 'excel'
dialect) expects something like

"""read this, man""", 1

in my situation: the quotes actually must be tripled. I don't understand this
and can't figure out how to proceed with my CSV file. Maybe some
*alternative* CSV parser can help? Any suggestions are welcome.

Vladimir Ignatov
Jul 18 '05 #1
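
The behaviour described in the question can be reproduced with a few lines. This is a minimal sketch, assuming a current Python 3 and the default 'excel' dialect; the variable names are only for illustration.

```python
import csv

# The record as it appears in the file: doubled quotes around the text.
doubled = '""read this, man"", 1'
# The form the default dialect accepts: tripled quotes around the text.
tripled = '"""read this, man""", 1'

# csv.reader accepts any iterable of lines, so a one-element list is enough.
default_row = next(csv.reader([doubled]))
tripled_row = next(csv.reader([tripled]))

print(default_row)   # the comma inside the text splits the field
print(tripled_row)   # the quoted text survives as one field
```

With the doubled form the reader treats the leading "" as an empty quoted section and then splits on every comma; with the tripled form the outer quote delimits the field and each inner pair stands for one literal quotation mark.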
5 Replies


Vladimir Ignatov wrote:
I have a big CSV file that I must read and process.
Unfortunately, I can't figure out how to use the standard *csv* module in my
situation. The problem is that some records look like:

""read this, man"", 1

which should be decoded into:

"read this, man"
1


Do you have anything that already accepts this particular dialect?
It seems to me that the above could just as easily be interpreted
as three fields (using parentheses as delimiters) :

(""read this) ( man"") ( 1)

Is it possible that what you have is not really any standard CSV
format, but just something home-brewed? In that case, you may
well need to massage it before feeding it to the csv module.

Or, if you can define how your example works in terms of delimiters,
quoting and such, maybe there's a way to make the csv module handle
it without complaints.

As far as I can see, you want either the doubled quotation marks to
be treated as single quotation marks, or you want the outer quotation
marks to magically quote the whole string containing the comma even
though it contains quotation marks already. I don't think CSV
can handle the latter (and it's probably an impossible goal), so you
must really want the former. In that case, unfortunately, you
are also screwed: doubled quotation marks imply that 'doublequote'
is True, which in turn means 'quotechar' must be '"'. A field quoted
that way and containing literal quotation marks at both ends needs three
of them on each side, which is exactly what the Excel dialect expects.

Can you just blindly replace every doubled quotation mark ("") with a
tripled one (""") in the input string first? That might be the easiest
approach.
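
A rough sketch of that substitution, assuming Python 3 and the standard csv module. The function name parse_doubled_quote_line is made up for this example, and skipinitialspace=True is an extra assumption so that the space after the comma is dropped.

```python
import csv

def parse_doubled_quote_line(line):
    """Parse one line of the home-brewed format (hypothetical helper).

    Every "" is rewritten to three quotation marks so that the default
    'excel' dialect sees one delimiting quote plus one escaped quote.
    """
    fixed = line.replace('""', '"""')
    # skipinitialspace drops the blank that follows each comma.
    return next(csv.reader([fixed], skipinitialspace=True))

row = parse_doubled_quote_line('""read this, man"", 1')
print(row)
```

The replacement really is blind: a line that already uses tripled quotes, or that contains a legitimately empty quoted field (""), would be damaged by it, so this only suits files where "" never means anything else.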

-Peter
Jul 18 '05 #2

I have written a very simple CSV parser which uses a simple function
'unquote' to unquote quoted elements.
It would be *very* simple to amend unquote to handle double-quoted
elements.

http://www.voidspace.org.uk/atlantib...thonutils.html
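
The code of the linked module is not reproduced in this thread, so the following is only a hypothetical sketch of the idea, not that module's API. The names unquote and split_record are assumptions; the sketch treats "" as an opening or closing marker and ignores commas between the two.

```python
def unquote(field):
    """Turn ""text"" into "text"; other fields are only stripped."""
    field = field.strip()
    if len(field) >= 4 and field.startswith('""') and field.endswith('""'):
        return '"' + field[2:-2] + '"'
    return field

def split_record(line):
    """Split on commas, except inside a ""...""-wrapped section."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        if line.startswith('""', i):
            # A doubled quote toggles the "inside quotes" state.
            in_quotes = not in_quotes
            current.append('""')
            i += 2
        elif line[i] == ',' and not in_quotes:
            fields.append(unquote(''.join(current)))
            current = []
            i += 1
        else:
            current.append(line[i])
            i += 1
    fields.append(unquote(''.join(current)))
    return fields

record = split_record('""read this, man"", 1')
print(record)
```

One limitation of this sketch: a bare "" meant as an empty field is read as an opening marker, so such records would need separate handling.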

Regards,

Fuzzy
Jul 18 '05 #3

Vladimir -

Here is the CSV example that is provided with pyparsing (with some slight
edits). I wrote this for exactly the situation you describe - just
splitting on commas doesn't always do the right thing.

You can download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

==========================
# commasep.py
#
# Comma-separated list example, to illustrate the advantages of using
# the pyparsing commaSeparatedList as opposed to str.split(","):
# - leading and trailing whitespace is implicitly trimmed from list elements
# - list elements can be quoted strings, which can safely contain commas
#   without breaking into separate elements

from pyparsing import commaSeparatedList

testData = [
    "a,b,c,100.2,,3",
    "d, e, j k , m ",
    "'Hello, World', f, g , , 5.1,x",
    "John Doe, 123 Main St., Cleveland, Ohio",
    "Jane Doe, 456 St. James St., Los Angeles , California ",
    "",
]

# Show the naive split next to the pyparsing result for each line.
for line in testData:
    print("input:", repr(line))
    print("split:", line.split(","))
    print("parse:", commaSeparatedList.parseString(line))
    print()

==========================
Output:
input: 'a,b,c,100.2,,3'
split: ['a', 'b', 'c', '100.2', '', '3']
parse: ['a', 'b', 'c', '100.2', '', '3']

input: 'd, e, j k , m '
split: ['d', ' e', ' j k ', ' m ']
parse: ['d', 'e', 'j k', 'm']

input: "'Hello, World', f, g , , 5.1,x"
split: ["'Hello", " World'", ' f', ' g ', ' ', ' 5.1', 'x']
parse: ["'Hello, World'", 'f', 'g', '', '5.1', 'x']

input: 'John Doe, 123 Main St., Cleveland, Ohio'
split: ['John Doe', ' 123 Main St.', ' Cleveland', ' Ohio']
parse: ['John Doe', '123 Main St.', 'Cleveland', 'Ohio']

input: 'Jane Doe, 456 St. James St., Los Angeles , California '
split: ['Jane Doe', ' 456 St. James St.', ' Los Angeles ', ' California ']
parse: ['Jane Doe', '456 St. James St.', 'Los Angeles', 'California']

input: ''
split: ['']
parse: ['']

Jul 18 '05 #4

On Fri, 14 May 2004 14:08:15 +0400, "Vladimir Ignatov"
<vi******@colorpilot.com> declaimed the following in comp.lang.python:

output. In other words, the csv reader does not understand the use of "" here.
A quick experiment showed me that the *csv* module (with the default 'excel'
dialect) expects something like

"""read this, man""", 1

in my situation: the quotes actually must be tripled. I don't understand this

Which is standard behavior in almost all programming languages.
The first " signals the beginning of a quoted string. Within a quoted
string, double "s flag an escape, being replaced with a single " in the
text. Then a final " ends the quoted string.

"This is a ""quoted"" string"
becomes
This is a "quoted" string
internally.
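
The same escaping rule can be checked against the standard csv module. This is a small sketch assuming Python 3; the extra field 42 is added only to show a second column.

```python
import csv
import io

# Reading: the outer quotes delimit the field, each "" inside is one ".
parsed = next(csv.reader(['"This is a ""quoted"" string",42']))
print(parsed)

# Writing: the writer applies the same rule in the other direction.
buffer = io.StringIO()
csv.writer(buffer).writerow(['This is a "quoted" string', 42])
written = buffer.getvalue()
print(repr(written))
```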

I don't know why you got the "" on the trailing segment of your
text -- maybe a bug in the CSV module, as I'd parse your line like this
(use a fixed font):

        ""read this, man"", 1
start---|
end------| ie, an empty quoted string
unquoted--^^^^^^^^^
comma-split--------|
unquoted------------^^^^
start-------------------|
end----------------------| another empty quoted string
comma-split---------------|
unquoted-------------------^^

whereas

        """read this, man""", 1
start---|
end?-----| could be empty string
NO-doubled| no, it's a " inside the string
quoted-----^^^^^^^^^^^^^^
end?---------------------| end of string?
NO-doubled----------------| no, another " inside the string
end------------------------| not doubled so end of string
comma-split-----------------|
unquoted---------------------^^
--
==============================================================
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net | Bestiaria Support Staff
==============================================================
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 18 '05 #5

Dennis>         ""read this, man"", 1
Dennis> start---|
Dennis> end------| ie, an empty quoted string
Dennis> unquoted--^^^^^^^^^
Dennis> comma-split--------|
Dennis> unquoted------------^^^^
Dennis> start-------------------|
Dennis> end----------------------| another empty quoted string
Dennis> comma-split---------------|
Dennis> unquoted-------------------^^

I'm not sure what the "correct" interpretation of this should be, since no
separator was placed after the first '""' or before the second. Given that
the input is ill-defined, just about any output could be considered
"valid". ;-)

Skip

Jul 18 '05 #6
