Escaping commas within parens in CSV parsing?

Hi --

I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:

AAA, BBB, CCC (some text, right here), DDD

I want this to come back as:

["AAA", "BBB", "CCC (some text, right here)", "DDD"]

I think this is probably non-standard escaping, as I can't figure out
how to structure a csv dialect to handle it correctly. I can probably
hack this with regular expressions but I thought I'd check to see if
anyone had any quick suggestions for how to do this elegantly first.

Thanks!

Ramon

Jul 19 '05 #1

Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:

Ramon> AAA, BBB, CCC (some text, right here), DDD

Ramon> I want this to come back as:

Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]

Alas, there's no "escaping" at all in the line above. I see no obvious way
to distinguish one comma from another in this example. If you mean the fact
that the comma you want to retain is in parens, that's not escaping. Escape
characters don't appear in the output as they do in your example.

Ramon> I can probably hack this with regular expressions but I thought
Ramon> I'd check to see if anyone had any quick suggestions for how to
Ramon> do this elegantly first.

I see nothing obvious unless you truly mean that the beginning of each field
is all caps. In that case you could wrap a file object like this:

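import csv
import re

class FunnyWrapper:
    """untested"""
    def __init__(self, f):
        self.f = f

    def __iter__(self):
        return self

    def next(self):
        # Strip the line ending first so the closing quote lands before
        # it, then quote every field that starts with capital letters.
        line = self.f.next().rstrip("\r\n")
        return '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'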

and use it like so:

reader = csv.reader(FunnyWrapper(open("somefile.csv", "rb")))
for row in reader:
    print row

(I'm not sure what the ramifications are of iterating over a file opened in
binary mode.)
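For anyone running this on Python 3, here is a minimal sketch of the same idea written as a generator. The name quote_fields and the in-memory sample are assumptions for illustration, not part of the post above. It still relies on every field starting with capital letters and on no field containing a double quote.

```python
import csv
import io
import re

def quote_fields(lines):
    """Yield each line with its all-caps-led fields wrapped in double quotes."""
    for line in lines:
        # Drop the line ending so the closing quote ends the last field.
        line = line.rstrip("\r\n")
        yield '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'

# Hypothetical sample input standing in for somefile.csv
sample = io.StringIO("AAA, BBB, CCC (some text, right here), DDD\n")

# csv.reader accepts any iterable of strings, so the generator can feed it.
rows = [[field.strip() for field in row]
        for row in csv.reader(quote_fields(sample))]
print(rows)
# [['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']]
```

A comma followed by a capital letter inside the parentheses would still be treated as a field boundary, so this only suits data shaped like the example.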

Skip
Jul 19 '05 #2
Try this.
re.findall(r'(.+? \(.+?\))(?:,|$)',yourtexthere)

Jul 19 '05 #3
Oops, the above code doesn't quite work. Use this one instead.
re.findall(r'(.+? (?:\(.+?\))?)(?:,|$)',yourtexthere)
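A hedged alternative in Python 3: instead of matching up to the next comma, match each entry as a run of characters that are neither commas nor parentheses, together with any whole parenthesized groups. The names ENTRY and split_outside_parens are illustrative only. This sketch drops empty fields and does not handle nested or unbalanced parentheses.

```python
import re

# An entry is one or more of: a character that is not a comma or a paren,
# or a complete "(...)" group. Commas inside a group never end an entry.
ENTRY = re.compile(r'(?:[^,()]|\([^()]*\))+')

def split_outside_parens(text):
    """Return the comma-separated entries of text, whitespace-trimmed."""
    return [m.group(0).strip() for m in ENTRY.finditer(text)]

print(split_outside_parens("AAA, BBB, CCC (some text, right here), DDD"))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```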

Jul 19 '05 #4
Well, this doesn't have the terseness of an re solution, but it
shouldn't be hard to follow.
-- Paul

#~ This is a very crude first pass. It does not handle nested
#~ ()'s, nor ()'s inside quotes. But if your data does not
#~ stray too far from the example, this will probably do the job.

#~ Download pyparsing at http://pyparsing.sourceforge.net.
import pyparsing as pp

test = "AAA, BBB , CCC (some text, right here), DDD"

COMMA = pp.Literal(",")
LPAREN = pp.Literal("(")
RPAREN = pp.Literal(")")
parenthesizedText = LPAREN + pp.SkipTo(RPAREN) + RPAREN

nonCommaChars = "".join( [ chr(c) for c in range(32,127)
                           if c not in map(ord,list(",()")) ] )
nonCommaText = pp.Word(nonCommaChars)

commaListEntry = pp.Combine(pp.OneOrMore( parenthesizedText |
                                          nonCommaText ),adjacent=False)
commaListEntry.setParseAction( lambda s,l,t: t[0].strip() )

csvList = pp.delimitedList( commaListEntry )
print csvList.parseString(test)
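If nested parentheses do matter, one standard-library-only option is to scan the string once and track the nesting depth. This is a sketch in Python 3 under that assumption, not a replacement for the grammar above; split_top_level is a hypothetical name, and quoted fields are still not handled.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside parentheses."""
    fields, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator at depth 0 closes the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    # Whatever remains after the last separator is the final field.
    fields.append("".join(current).strip())
    return fields

print(split_top_level("AAA, BBB , CCC (some text, right here), DDD"))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```

An unmatched closing parenthesis is kept as an ordinary character, and an unmatched opening one simply swallows the rest of the line into the current field.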

Jul 19 '05 #5
Why don't you use a different delimiter when you're writing the CSV?

Jul 19 '05 #6
fe******@gmail.com writes:
> I am trying to use the csv module to parse a column of values
> containing comma-delimited values with unusual escaping:
>
> AAA, BBB, CCC (some text, right here), DDD
>
> I want this to come back as:
>
> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]


Quick and somewhat dirty: change your delimiter to a character that never
occurs in fields (e.g. the null character '\0').

Example:
>>> s = 'AAA\0 BBB\0 CCC (some text, right here)\0 DDD'
>>> [f.strip() for f in s.split('\0')]
['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']

But then you'd need to be certain there's no null character in the input
lines by checking it:

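colsep = '\0'

# inputs: the field values about to be written out
# IllegalCharException: an application-defined exception class
for field in inputs:
    if colsep in field:
        raise IllegalCharException('invalid chars in field %s' % field)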
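When the writing side is under your control, the csv module can handle a non-comma delimiter on both ends. The following Python 3 sketch uses a tab rather than a null character, and the in-memory buffer stands in for a real file; both choices are assumptions made for illustration.

```python
import csv
import io

rows = [["AAA", "BBB", "CCC (some text, right here)", "DDD"]]

# Write the rows with a tab as the field separator.
buf = io.StringIO()
csv.writer(buf, delimiter="\t").writerows(rows)

# Read them back with the same delimiter; embedded commas need no escaping.
buf.seek(0)
parsed = list(csv.reader(buf, delimiter="\t"))
print(parsed)
# [['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']]
```

With csv.writer the explicit check is less critical, because a field that contains the delimiter is quoted automatically under the default quoting mode.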

If you need to stick with comma as a separator and the format is relatively
fixed, I'd probably use some parser module instead. Regular expressions are
nice too, but it is easy to make a mistake with those, and for non-trivial
stuff they tend to become write-only.

--
# Edvard Majakari Software Engineer
# PGP PUBLIC KEY available Soli Deo Gloria!

$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc substr(crypt(60281449,'es'),2,4),"\n";
Jul 19 '05 #7
Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:

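import re
import string

def splitWithEscapedCommasInParens(s, trim=False):
    # Matches one comma that sits inside a pair of parentheses
    pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
    # Swap such commas for "|", one per pass, until none remain
    while pat.search(s):
        s = re.sub(pat,r"\1|\2",s)
    # Split on the remaining commas, then restore the protected ones
    if trim:
        return [string.strip(string.replace(x,"|",",")) for x in
                string.split(s,",")]
    else:
        return [string.replace(x,"|",",") for x in string.split(s,",")]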

Probably not the most efficient, but it's "the simplest thing that
works" for me :-)

Thanks again for all the quick responses.

Ramon

Jul 19 '05 #8
felciano <fe******@gmail.com> wrote:
> Thanks for all the postings. I can't change delimiter in the source
> itself, so I'm doing it temporarily just to handle the escaping:
>
> def splitWithEscapedCommasInParens(s, trim=False):
>     pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
>     while pat.search(s):
>         s = re.sub(pat,r"\1|\2",s)
>     if trim:
>         return [string.strip(string.replace(x,"|",",")) for x in
>                 string.split(s,",")]
>     else:
>         return [string.replace(x,"|",",") for x in string.split(s,",")]
>
> Probably not the most efficient, but it's "the simplest thing that
> works" for me :-)
>
> Thanks again for all the quick responses.


How about changing '(' or ')' into three double-quotes '"""'? That will
solve the splitting issue. But I'm not sure how you would get '(' or ')'
back without much coding.

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Jul 21 '05 #9
