Hi --
I am trying to use the csv module to parse a column of values
containing comma-delimited values with unusual escaping:
AAA, BBB, CCC (some text, right here), DDD
I want this to come back as:
["AAA", "BBB", "CCC (some text, right here)", "DDD"]
I think this is probably non-standard escaping, as I can't figure out
how to structure a csv dialect to handle it correctly. I can probably
hack this with regular expressions but I thought I'd check to see if
anyone had any quick suggestions for how to do this elegantly first.
Thanks!
Ramon
Ramon> I am trying to use the csv module to parse a column of values
Ramon> containing comma-delimited values with unusual escaping:
Ramon> AAA, BBB, CCC (some text, right here), DDD
Ramon> I want this to come back as:
Ramon> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]
Alas, there's no "escaping" at all in the line above. I see no obvious way
to distinguish one comma from another in this example. If you mean the fact
that the comma you want to retain is in parens, that's not escaping. Escape
characters don't appear in the output as they do in your example.
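For comparison, here is a small sketch (written for Python 3, not taken from any post in this thread) of what the csv module does treat as protection for an embedded comma: the field has to be wrapped in quote characters in the data itself. The helper name parse_line is made up for the illustration, and skipinitialspace=True is assumed so that the blank after each comma is dropped.

```python
import csv

def parse_line(line):
    # skipinitialspace=True drops the blank after each delimiter, so a
    # field may start with a quote even though a space precedes it.
    return next(csv.reader([line], skipinitialspace=True))

unquoted = 'AAA, BBB, CCC (some text, right here), DDD'
quoted = 'AAA, BBB, "CCC (some text, right here)", DDD'

# Parentheses mean nothing to the csv module: the inner comma splits the field.
print(parse_line(unquoted))
# ['AAA', 'BBB', 'CCC (some text', 'right here)', 'DDD']

# Double quotes are the standard way to protect a comma inside a field.
print(parse_line(quoted))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```

So the data would parse as wanted only if whatever writes it quoted the third field.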
Ramon> I can probably hack this with regular expressions but I thought
Ramon> I'd check to see if anyone had any quick suggestions for how to
Ramon> do this elegantly first.
I see nothing obvious unless you truly mean that the beginning of each field
is all caps. In that case you could wrap a file object like this:

    import csv
    import re

    class FunnyWrapper:
        """untested"""
        def __init__(self, f):
            self.f = f
        def __iter__(self):
            return self
        def next(self):
            # Strip the line ending first so it does not end up inside
            # the closing quote of the last field.
            line = self.f.next().rstrip("\r\n")
            return '"' + re.sub(r',( *[A-Z]+)', r'","\1', line) + '"'

and use it like so:

    reader = csv.reader(FunnyWrapper(open("somefile.csv", "rb")))
    for row in reader:
        print row
(I'm not sure what the ramifications are of iterating over a file opened in
binary mode.)
Skip
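The same wrapping idea can be sketched as a generator for Python 3. This is only an illustration under the assumption stated above (every field starts with a capital letter, and no comma inside parentheses is followed by one). The name quote_caps_fields is made up here, and the regex is a slight variation that uses a lookahead and drops the blanks after each separating comma.

```python
import csv
import re

def quote_caps_fields(lines):
    """Yield each line with every field wrapped in double quotes.

    Assumption: a comma separates fields only when the next non-blank
    character is an upper-case ASCII letter.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        # Replace each separating comma (and the blanks after it) with '","'.
        yield '"' + re.sub(r', *(?=[A-Z])', '","', line) + '"'

sample = ["AAA, BBB, CCC (some text, right here), DDD\n"]
rows = list(csv.reader(quote_caps_fields(sample)))
print(rows)
# [['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']]
```

Any iterable of strings works as input, so an open text-mode file can be passed in place of the list. A field that already contains a double quote would break this scheme.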
Try this.
    re.findall(r'(.+? \(.+?\))(?:,|$)',yourtexthere)
Oops, the above code doesn't quite work. Use this one instead.
    re.findall(r'(.+? (?:\(.+?\))?)(?:,|$)',yourtexthere)
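A related findall pattern, offered only as a sketch for Python 3: it treats a field as a run of pieces, where each piece is either a character that is neither a comma nor an opening parenthesis, or a whole parenthesized group. The names FIELD and findall_fields are made up for the example. It assumes parentheses are balanced and not nested.

```python
import re

# One field = one or more pieces; a piece is either a single character
# other than ',' or '(', or a complete '(...)' group (commas allowed inside).
FIELD = re.compile(r'(?:[^,(]|\([^)]*\))+')

def findall_fields(text):
    # The pattern has no capturing groups, so findall returns whole matches.
    return [field.strip() for field in FIELD.findall(text)]

print(findall_fields("AAA, BBB, CCC (some text, right here), DDD"))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```

Note that empty fields (two commas in a row) are silently skipped by this pattern.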
Well, this doesn't have the terseness of an re solution, but it
shouldn't be hard to follow.
-- Paul
#~ This is a very crude first pass. It does not handle nested
#~ ()'s, nor ()'s inside quotes. But if your data does not
#~ stray too far from the example, this will probably do the job.
#~ Download pyparsing at http://pyparsing.sourceforge.net.
import pyparsing as pp
test = "AAA, BBB , CCC (some text, right here), DDD"
COMMA = pp.Literal(",")
LPAREN = pp.Literal("(")
RPAREN = pp.Literal(")")
parenthesizedText = LPAREN + pp.SkipTo(RPAREN) + RPAREN
nonCommaChars = "".join( [ chr(c) for c in range(32,127)
                           if c not in map(ord,list(",()")) ] )
nonCommaText = pp.Word(nonCommaChars)
commaListEntry = pp.Combine(pp.OneOrMore( parenthesizedText |
                                          nonCommaText ),adjacent=False)
commaListEntry.setParseAction( lambda s,l,t: t[0].strip() )
csvList = pp.delimitedList( commaListEntry )
print csvList.parseString(test)
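If nested parentheses do turn up in the data, one way to cope without a parser library is to walk the string and count parenthesis depth. The following is a standard-library sketch for Python 3, not a translation of the pyparsing grammar above. The name split_outside_parens is made up, and quotes get no special treatment.

```python
def split_outside_parens(text, sep=","):
    """Split text on sep, ignoring any sep that sits inside parentheses."""
    fields, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator at depth 0 ends the current field.
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    fields.append("".join(current).strip())
    return fields

print(split_outside_parens("AAA, BBB , CCC (some text, right here), DDD"))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
print(split_outside_parens("A (b, (c, d)), E"))
# ['A (b, (c, d))', 'E']
```

An unmatched opening parenthesis keeps the depth above zero, so everything after it ends up in one field.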
Why don't you use a different delimiter when you're writing the CSV?
fe******@gmail.com writes:
> I am trying to use the csv module to parse a column of values
> containing comma-delimited values with unusual escaping:
> AAA, BBB, CCC (some text, right here), DDD
> I want this to come back as:
> ["AAA", "BBB", "CCC (some text, right here)", "DDD"]
Quick and somewhat dirty: change your delimiter to a char that never exists in
fields (eg. null character '\0').
Example:
    >>> s = 'AAA\0 BBB\0 CCC (some text, right here)\0 DDD'
    >>> [f.strip() for f in s.split('\0')]
    ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
But then you'd need to be certain there's no null character in the input
lines by checking it:
    class IllegalCharException(Exception):
        """Raised when a field already contains the separator."""

    colsep = '\0'
    for field in inputs:
        if colsep in field:
            raise IllegalCharException('invalid chars in field %s' % field)
If you need to stick with comma as a separator and the format is relatively
fixed, I'd probably use some parser module instead. Regular expressions are
nice too, but it is easy to make a mistake with those, and for non-trivial
stuff they tend to become write-only.
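The separator swap and the safety check described above can be put together in a short sketch for Python 3, for the case where you control how the lines are written. The function names join_fields and split_on_sep are made up for the example, and a plain ValueError stands in for the custom exception.

```python
def join_fields(fields, colsep="\0"):
    """Join fields with colsep, refusing fields that already contain it."""
    for field in fields:
        if colsep in field:
            raise ValueError("separator found in field %r" % field)
    return colsep.join(fields)

def split_on_sep(line, colsep="\0"):
    """Split a line written by join_fields and trim surrounding blanks."""
    return [field.strip() for field in line.split(colsep)]

fields = ["AAA", "BBB", "CCC (some text, right here)", "DDD"]
line = join_fields(fields)
print(split_on_sep(line))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```

Checking at write time means a bad field is rejected before it can produce a line that splits wrongly later.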
--
# Edvard Majakari Software Engineer
# PGP PUBLIC KEY available Soli Deo Gloria!
$_ = '456476617264204d616a616b6172692c20612043687269737469616e20'; print
join('',map{chr hex}(split/(\w{2})/)),uc substr(crypt(60281449,'es'),2,4),"\n";
Thanks for all the postings. I can't change delimiter in the source
itself, so I'm doing it temporarily just to handle the escaping:
    import re
    import string

    def splitWithEscapedCommasInParens(s, trim=False):
        # Temporarily turn commas inside parentheses into "|", split on the
        # remaining commas, then restore the protected commas.
        pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
        while pat.search(s):
            s = re.sub(pat,r"\1|\2",s)
        if trim:
            return [string.strip(string.replace(x,"|",",")) for x in
                    string.split(s,",")]
        else:
            return [string.replace(x,"|",",") for x in string.split(s,",")]

Probably not the most efficient, but it's "the simplest thing that
works" for me :-)
Thanks again for all the quick responses.
Ramon
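For anyone trying this on Python 3, where the string module functions used above are gone, the same placeholder approach might look like the sketch below. It is an adaptation, not the code as posted: the snake_case name, the placeholder parameter and the check that the placeholder is absent from the input are all additions made for the illustration.

```python
import re

# Same pattern as in the post: a comma that follows an opening parenthesis
# with no other parenthesis or comma in between, and precedes a closing one.
_PAREN_COMMA = re.compile(r"(.+?\([^(),]*?),(.+?\).*)")

def split_with_commas_in_parens(s, trim=False, placeholder="|"):
    # Added guard: a placeholder already present in the data would be
    # turned into a comma on the way back.
    if placeholder in s:
        raise ValueError("placeholder %r already occurs in the input" % placeholder)
    # Protect one parenthesized comma per pass until none are left.
    while _PAREN_COMMA.search(s):
        s = _PAREN_COMMA.sub(lambda m: m.group(1) + placeholder + m.group(2), s)
    fields = [part.replace(placeholder, ",") for part in s.split(",")]
    return [f.strip() for f in fields] if trim else fields

print(split_with_commas_in_parens(
    "AAA, BBB, CCC (some text, right here), DDD", trim=True))
# ['AAA', 'BBB', 'CCC (some text, right here)', 'DDD']
```

Like the original, this assumes parentheses are balanced and not nested.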
felciano <fe******@gmail.com> wrote:
> Thanks for all the postings. I can't change delimiter in the source
> itself, so I'm doing it temporarily just to handle the escaping:
>
>     def splitWithEscapedCommasInParens(s, trim=False):
>         pat = re.compile(r"(.+?\([^\(\),]*?),(.+?\).*)")
>         while pat.search(s):
>             s = re.sub(pat,r"\1|\2",s)
>         if trim:
>             return [string.strip(string.replace(x,"|",",")) for x in
>                     string.split(s,",")]
>         else:
>             return [string.replace(x,"|",",") for x in string.split(s,",")]
>
> Probably not the most efficient, but it's "the simplest thing that
> works" for me :-)
> Thanks again for all the quick responses.
How about changing '(' or ')' into three double-quotes '"""'? That will
solve the splitting issue. But I'm not sure how you would get back '(' or
')' without much coding.
--
William Park <op**********@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell http://freshmeat.net/projects/bashdiff/