Escaping commas within parens in CSV parsing?

Hi --

I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:

AAA, BBB, CCC (some text, right here), DDD

I want this to come back as:

["AAA", "BBB", "CCC (some text, right here)", "DDD"]

I think this is probably non-standard escaping, as I can't figure out
how to structure a csv dialect to handle it correctly. I can probably
hack this with regular expressions but I thought I'd check to see if
anyone had any quick suggestions for how to do this elegantly first.

Thanks!

Ramon

Jul 19 '05 #1

Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:

Ramon> AAA, BBB, CCC (some text, right here), DDD

Ramon> I want this to come back as:

Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]

Alas, there's no "escaping" at all in the line above. I see no obvious way
to distinguish one comma from another in this example. If you mean the fact
that the comma you want to retain is in parens, that's not escaping. Escape
characters don't appear in the output as they do in your example.

Ramon> I can probably hack this with regular expressions but I thought
Ramon> I'd check to see if anyone had any quick suggestions for how to
Ramon> do this elegantly first.

I see nothing obvious unless you truly mean that the beginning of each field
is all caps. In that case you could wrap a file object so that every field is
quoted before the csv module sees it:

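    import re

    class FunnyWrapper:
        """untested"""
        def __init__(self, f):
            self.f = f

        def __iter__(self):
            return self

        def next(self):
            # Drop the line ending so the closing quote lands on the same line,
            # then turn each comma that starts an all-caps field into '","'.
            line = self.f.next().rstrip('\r\n')
            return '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'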

and use it like so:

    import csv

    reader = csv.reader(FunnyWrapper(open("somefile.csv", "rb")))
    for row in reader:
        print row

(I'm not sure what the ramifications are of iterating over a file opened in
binary mode.)
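
On Python 3 the same idea can be written as a generator, since csv.reader accepts any iterable of strings. What follows is only a minimal sketch: the names quote_fields and sample are made up for illustration, the input is read in text mode, and it assumes the fields contain no double quotes and that every field really does start with capital letters.

```python
import csv
import io
import re

def quote_fields(lines):
    """Yield each line with every field wrapped in double quotes.

    Assumption: a comma separates fields only when it is followed by
    optional spaces and at least one capital letter.
    """
    for line in lines:
        # Strip the line ending so the closing quote stays on the same line.
        line = line.rstrip("\r\n")
        yield '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'

# An in-memory stand-in for a file opened in text mode.
sample = io.StringIO("AAA, BBB, CCC (some text, right here), DDD\n")

# The quoted fields keep their leading spaces, so strip them afterwards.
rows = [[field.strip() for field in row]
        for row in csv.reader(quote_fields(sample))]
print(rows)
```

The comma inside the parentheses survives because it is followed by a lowercase letter, so the substitution never touches it.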

Skip
Jul 19 '05 #2
Try this.
    re.findall(r'(.+? \(.+?\))(?:,|$)', yourtexthere)

Jul 19 '05 #3
Oops, the above code doesn't quite work. Use this one instead.
    re.findall(r'(.+?(?:\(.+?\))?)(?:,|$)', yourtexthere)
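
The pattern can be tried against the sample line from the question. This is a minimal sketch for Python 3, not a general solution: it assumes the pattern is written with no literal space after the first `.+?`, and that a parenthesized group, when present, comes at the very end of its field. The names text, pattern and fields are just for illustration.

```python
import re

text = "AAA, BBB, CCC (some text, right here), DDD"

# A field is a lazy run of characters, optionally ending in one
# parenthesized group, followed by a comma or the end of the string.
pattern = re.compile(r'(.+?(?:\(.+?\))?)(?:,|$)')

# findall returns the captured group for each match; the fields still
# carry the space that followed each comma, so strip them.
fields = [field.strip() for field in pattern.findall(text)]
print(fields)
```

A field such as "CCC (a, b) more" would still be split at the inner comma, because the optional group is only honoured when a comma or the end of the line follows it directly.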

Jul 19 '05 #4
Well, this doesn't have the terseness of an re solution, but it
shouldn't be hard to follow.
-- Paul

    #~ This is a very crude first pass. It does not handle nested
    #~ ()'s, nor ()'s inside quotes. But if your data does not
    #~ stray too far from the example, this will probably do the job.

    #~ Download pyparsing at http://pyparsing.sourceforge.net.
    import pyparsing as pp

    test = "AAA, BBB , CCC (some text, right here), DDD"

    COMMA = pp.Literal(",")
    LPAREN = pp.Literal("(")
    RPAREN = pp.Literal(")")
    parenthesizedText = LPAREN + pp.SkipTo(RPAREN) + RPAREN

    nonCommaChars = "".join( [ chr(c) for c in range(32,127)
                               if c not in map(ord,list(",()")) ] )
    nonCommaText = pp.Word(nonCommaChars)

    commaListEntry = pp.Combine(pp.OneOrMore( parenthesizedText |
                                nonCommaText ),adjacent=False)
    commaListEntry.setParseAction( lambda s,l,t: t[0].strip() )

    csvList = pp.delimitedList( commaListEntry )
    print csvList.parseString(test)

Jul 19 '05 #5
Why don't you use a different delimiter when you're writing the CSV?
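
If you do control the program that writes the file, the csv module takes the delimiter as a parameter on both sides. A minimal round-trip sketch for Python 3, assuming a tab never occurs inside a field (the in-memory buffer stands in for a real file, and the names rows, buffer and parsed are made up here):

```python
import csv
import io

rows = [["AAA", "BBB", "CCC (some text, right here)", "DDD"]]

# Write with a tab delimiter so commas inside fields need no special handling.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)

# Read the data back with the same delimiter.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter="\t"))
print(parsed)
```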

Jul 19 '05 #6
fe******@gmail.com writes:
> I am trying to use the csv module to parse a column of values
> containing comma-delimited values with unusual escaping:
>
> AAA, BBB, CCC (some text, right here), DDD
>
> I want this to come back as:
>
> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]


Quick and somewhat dirty: change your delimiter to a character that never
occurs in the fields (e.g. the null character '\0').

Example:
    >>> s = 'AAA\0 BBB\0 CCC (some text, right here)\0 DDD'
    >>> [f.strip() for f in s.split('\0')]
    ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']

But then you'd need to be certain there's no null character in the input
by checking every field:

    class IllegalCharException(ValueError):
        """Raised when a field already contains the separator."""

    colsep = '\0'

    for field in inputs:    # inputs: the fields about to be written
        if colsep in field:
            raise IllegalCharException('invalid chars in field %s' % field)

If you need to stick with comma as a separator and the format is relatively
fixed, I'd probably use some parser module instead. Regular expressions are
nice too, but it is easy to make a mistake with those, and for non-trivial
stuff they tend to become write-only.
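
Short of pulling in a parser module, the part a parser gets right and a regex struggles with (keeping track of whether you are inside parentheses) fits in a few lines. The following is only a sketch for Python 3 under stated assumptions: split_outside_parens is a made-up name, quotes are not treated specially, and an unmatched ")" is simply ignored.

```python
def split_outside_parens(line, sep=","):
    """Split line on sep, ignoring separators inside parentheses."""
    fields = []
    current = []
    depth = 0  # how many "(" are currently open
    for ch in line:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator at depth 0 ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields

print(split_outside_parens("AAA, BBB, CCC (some text, right here), DDD"))
print(split_outside_parens("X (a, (b, c)), Y"))
```

Because it counts depth rather than matching a pattern, nested parentheses such as "X (a, (b, c))" stay in one field.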

--
# Edvard Majakari Software Engineer
# PGP PUBLIC KEY available Soli Deo Gloria!

$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc substr(crypt(60281449,'es'),2,4),"\n";
Jul 19 '05 #7
Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:

    import re

    def splitWithEscapedCommasInParens(s, trim=False):
        # Temporarily turn each comma inside parentheses into "|",
        # split on the remaining commas, then restore the commas.
        # Assumes the data itself never contains "|".
        pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
        while pat.search(s):
            s = pat.sub(r"\1|\2", s)
        fields = [x.replace("|", ",") for x in s.split(",")]
        if trim:
            return [x.strip() for x in fields]
        return fields

Probably not the most efficient, but it's "the simplest thing that
works" for me :-)
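
The temporary-delimiter trick can also be done in a single pass by letting re.sub call a function on each parenthesized group. This is a hedged variant for Python 3 rather than the function above: split_protecting_parens and PLACEHOLDER are made-up names, the placeholder is the null character suggested earlier in the thread, and nested parentheses are not handled.

```python
import re

PLACEHOLDER = "\0"  # assumed never to occur in the input

def split_protecting_parens(s, trim=False):
    """Split s on commas that are not inside a parenthesized group."""
    if PLACEHOLDER in s:
        raise ValueError("input already contains the placeholder character")
    # Swap the commas inside each (non-nested) group for the placeholder.
    protected = re.sub(r"\([^()]*\)",
                       lambda m: m.group(0).replace(",", PLACEHOLDER),
                       s)
    # Split on the commas that are left, then restore the protected ones.
    fields = [f.replace(PLACEHOLDER, ",") for f in protected.split(",")]
    return [f.strip() for f in fields] if trim else fields

print(split_protecting_parens("AAA, BBB, CCC (some text, right here), DDD",
                              trim=True))
```

Checking for the placeholder up front avoids silently turning a real "\0" in the data into a comma.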

Thanks again for all the quick responses.

Ramon

Jul 19 '05 #8
felciano <fe******@gmail.com> wrote:
> Thanks for all the postings. I can't change delimiter in the source
> itself, so I'm doing it temporarily just to handle the escaping:
>
>     def splitWithEscapedCommasInParens(s, trim=False):
>         pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
>         while pat.search(s):
>             s = re.sub(pat,r"\1|\2",s)
>         if trim:
>             return [string.strip(string.replace(x,"|",",")) for x in
>                     string.split(s,",")]
>         else:
>             return [string.replace(x,"|",",") for x in string.split(s,",")]
>
> Probably not the most efficient, but it's "the simplest thing that
> works" for me :-)
>
> Thanks again for all the quick responses.


How about changing '(' or ')' into three double-quotes '"""'? That would
solve the splitting issue, but I'm not sure how you would get the '(' and
')' back without much coding.

--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
Jul 21 '05 #9
