Escaping commas within parens in CSV parsing?

Hi --

I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:

AAA, BBB, CCC (some text, right here), DDD

I want this to come back as:

["AAA", "BBB", "CCC (some text, right here)", "DDD"]

I think this is probably non-standard escaping, as I can't figure out
how to structure a csv dialect to handle it correctly. I can probably
hack this with regular expressions but I thought I'd check to see if
anyone had any quick suggestions for how to do this elegantly first.

Thanks!

Ramon

Jul 19 '05 #1
8 Replies



Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:

Ramon> AAA, BBB, CCC (some text, right here), DDD

Ramon> I want this to come back as:

Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]

Alas, there's no "escaping" at all in the line above. I see no obvious way
to distinguish one comma from another in this example. If you mean the fact
that the comma you want to retain is in parens, that's not escaping. Escape
characters don't appear in the output as they do in your example.
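
To make the distinction concrete, here is a small sketch of what the csv module can handle when the field really is quoted. It assumes Python 3 and a hypothetical input line in which the parenthesized field has been wrapped in double quotes, which is not what the data in the question looks like.

```python
import csv

# Hypothetical input: the field containing a comma is wrapped in double quotes.
quoted_line = 'AAA, BBB, "CCC (some text, right here)", DDD'

# skipinitialspace=True makes the reader ignore the blank after each comma,
# so the quote character is seen at the very start of the field.
quoted_row = next(csv.reader([quoted_line], skipinitialspace=True))
print(quoted_row)
```

Fed the unquoted line from the question, the same reader splits at every comma, including the one inside the parens.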

Ramon> I can probably hack this with regular expressions but I thought
Ramon> I'd check to see if anyone had any quick suggestions for how to
Ramon> do this elegantly first.

I see nothing obvious unless you truly mean that the beginning of each field
is all caps. In that case you could wrap a file object like this:

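import csv
import re

class FunnyWrapper:
    """Untested: quote each field, assuming every field starts with capitals."""
    def __init__(self, f):
        self.f = f

    def __iter__(self):
        return self

    def next(self):
        # Drop the line ending first so the closing quote is not placed after it.
        line = self.f.next().rstrip("\r\n")
        # A comma followed by capitals starts a new field: close and reopen the quotes.
        return '"' + re.sub(r', *([A-Z]+)', r'","\1', line) + '"'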

and use it like so:

reader = csv.reader(FunnyWrapper(open("somefile.csv", "rb")))
for row in reader:
    print row

(I'm not sure what the ramifications are of iterating over a file opened in
binary mode.)
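
For anyone on Python 3, where file objects no longer have a next() method, roughly the same idea can be written as a generator. This is only a sketch: the function name quote_caps_fields and the in-memory sample are made up for illustration, and it still assumes that every field starts with capital letters and that the data contains no double quotes.

```python
import csv
import io
import re

def quote_caps_fields(lines):
    """Yield each line with its fields wrapped in double quotes."""
    for line in lines:
        body = line.rstrip("\r\n")
        # A comma followed by capitals marks a field boundary.
        yield '"' + re.sub(r', *([A-Z]+)', r'","\1', body) + '"'

# In-memory stand-in for the real file (assumption for the example).
sample = io.StringIO("AAA, BBB, CCC (some text, right here), DDD\n")
rows = list(csv.reader(quote_caps_fields(sample)))
print(rows)
```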

Skip
Jul 19 '05 #2

Try this.
re.findall(r'(.+? \(.+?\))(?:,|$)',yourtexthere)

Jul 19 '05 #3

Oops, the above code doesn't quite work. Use this one instead.
re.findall(r'(.+? (?:\(.+?\))?)(?:,|$)',yourtexthere)
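
Another regex variant, offered as a sketch rather than as a tested fix for the pattern above: split on every comma that is not followed by a closing paren before the next opening paren. The helper name split_outside_parens is made up here, Python 3 is assumed, and nested or unbalanced parens are not handled.

```python
import re

# A comma counts as a separator unless a ")" comes up before any "(".
SEPARATOR = re.compile(r",(?![^(]*\))")

def split_outside_parens(text):
    """Split on commas that sit outside a single level of parentheses."""
    return [field.strip() for field in SEPARATOR.split(text)]

print(split_outside_parens("AAA, BBB, CCC (some text, right here), DDD"))
```

Unlike a findall over the field contents, a split keeps empty fields, so "A,,B" still yields three entries.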

Jul 19 '05 #4

Well, this doesn't have the terseness of an re solution, but it
shouldn't be hard to follow.
-- Paul

#~ This is a very crude first pass. It does not handle nested
#~ ()'s, nor ()'s inside quotes. But if your data does not
#~ stray too far from the example, this will probably do the job.

#~ Download pyparsing at http://pyparsing.sourceforge.net.
import pyparsing as pp

test = "AAA, BBB , CCC (some text, right here), DDD"

COMMA = pp.Literal(",")
LPAREN = pp.Literal("(")
RPAREN = pp.Literal(")")
parenthesizedText = LPAREN + pp.SkipTo(RPAREN) + RPAREN

nonCommaChars = "".join( [ chr(c) for c in range(32,127)
                           if c not in map(ord,list(",()")) ] )
nonCommaText = pp.Word(nonCommaChars)

commaListEntry = pp.Combine(pp.OneOrMore( parenthesizedText |
                                          nonCommaText ),adjacent=False)
commaListEntry.setParseAction( lambda s,l,t: t[0].strip() )

csvList = pp.delimitedList( commaListEntry )
print csvList.parseString(test)
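
If pulling in a parser library is not an option, the nested-paren case mentioned in the comments can be approximated with a small hand-written scanner that tracks paren depth. This is a different technique from the pyparsing grammar above, not a translation of it. The name split_top_level is made up, Python 3 is assumed, and quotes are still not handled.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside (possibly nested) parens."""
    fields, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            # A stray ")" with nothing open is kept as plain text.
            depth -= 1
        if ch == sep and depth == 0:
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields

print(split_top_level("AAA, BBB , CCC (some text, right here), DDD"))
```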

Jul 19 '05 #5

Why don't you use a different delimiter when you're writing the CSV?

Jul 19 '05 #6

fe******@gmail.com writes:
I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:

AAA, BBB, CCC (some text, right here), DDD

I want this to come back as:

["AAA", "BBB", "CCC (some text, right here)", "DDD"]


Quick and somewhat dirty: change your delimiter to a character that never
occurs in the fields (e.g. the null character '\0').

Example:
>>> s = 'AAA\0 BBB\0 CCC (some text, right here)\0 DDD'
>>> [f.strip() for f in s.split('\0')]
['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']

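But then you'd need to be certain there's no null character in the input
fields, by checking each one before writing it out:

class IllegalCharException(Exception):
    """Raised when a field already contains the column separator."""

colsep = '\0'

for field in inputs:    # inputs: the field values about to be joined
    if colsep in field:
        raise IllegalCharException('invalid chars in field %s' % field)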
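
Put together as a round trip, the check and the split might look like the sketch below. It assumes Python 3 and that you control the writing side; the names IllegalCharError, join_fields and split_fields are made up for the example.

```python
class IllegalCharError(ValueError):
    """Raised when a field already contains the column separator."""

def join_fields(fields, colsep="\0"):
    """Join fields with colsep, refusing any field that contains it."""
    for field in fields:
        if colsep in field:
            raise IllegalCharError("separator found in field %r" % field)
    return colsep.join(fields)

def split_fields(line, colsep="\0"):
    """Split a joined line back into stripped fields."""
    return [f.strip() for f in line.split(colsep)]

fields = ["AAA", "BBB", "CCC (some text, right here)", "DDD"]
round_trip = split_fields(join_fields(fields))
print(round_trip)
```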

If you need to stick with comma as a separator and the format is relatively
fixed, I'd probably use some parser module instead. Regular expressions are
nice too, but it is easy to make a mistake with those, and for non-trivial
stuff they tend to become write-only.

--
# Edvard Majakari Software Engineer
# PGP PUBLIC KEY available Soli Deo Gloria!

$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc substr(crypt(60281449,'es'),2,4),"\n";
Jul 19 '05 #7

Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:

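import re

def splitWithEscapedCommasInParens(s, trim=False):
    # One comma that follows "(" with no paren or comma in between,
    # and that has a ")" somewhere after it.
    pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
    # Swap such commas for "|" one at a time until none are left.
    # Assumes "|" never occurs in the data.
    while pat.search(s):
        s = pat.sub(r"\1|\2", s)
    # Split on the remaining commas, then restore the protected ones.
    fields = [x.replace("|", ",") for x in s.split(",")]
    if trim:
        fields = [x.strip() for x in fields]
    return fields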

Probably not the most efficient, but it's "the simplest thing that
works" for me :-)
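
The same mask, split, restore idea can also be sketched in a single pass by giving re.sub a replacement function instead of looping. This is a variant, not the code above: Python 3 is assumed, split_masking_parens is a made-up name, the placeholder '\0' is assumed never to occur in the data, and commas are only protected inside innermost (non-nested) paren groups.

```python
import re

PLACEHOLDER = "\0"  # assumed never to occur in the data

def split_masking_parens(s, trim=False):
    """Split on commas, leaving commas inside (...) groups alone."""
    # Hide the commas inside each paren group that contains no other parens.
    masked = re.sub(r"\([^()]*\)",
                    lambda m: m.group(0).replace(",", PLACEHOLDER), s)
    # Split on the commas that are left, then put the hidden ones back.
    fields = [f.replace(PLACEHOLDER, ",") for f in masked.split(",")]
    return [f.strip() for f in fields] if trim else fields

print(split_masking_parens("AAA, BBB, CCC (some text, right here), DDD", trim=True))
```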

Thanks again for all the quick responses.

Ramon

Jul 19 '05 #8

felciano <fe******@gmail.com> wrote:
Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:

def splitWithEscapedCommasInParens(s, trim=False):
    pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
    while pat.search(s):
        s = re.sub(pat,r"\1|\2",s)
    if trim:
        return [string.strip(string.replace(x,"|",",")) for x in
                string.split(s,",")]
    else:
        return [string.replace(x,"|",",") for x in string.split(s,",")]

Probably not the most efficient, but its "the simplest thing that
works" for me :-)

Thanks again for all the quick responses.


How about changing '(' or ')' into three double-quotes '"""'? That would
solve the splitting issue. But I'm not sure how you would get '(' or
')' back without much coding.

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Jul 21 '05 #9
