Trying to fix Invalid CSV File

Ryan Rosario

I have a very large CSV file that contains double quoted fields (since
they contain commas). Unfortunately, some of these fields also contain
other double quotes and I made the painful mistake of forgetting to
escape or double the quotes inside the field:

123,"Here is some, text "and some quoted text" where the quotes should
have been doubled",321

Has anyone dealt with this problem before? Any ideas of an algorithm I
can use for a Python script to create a new, repaired CSV file?

TIA,
Ryan

Aug 4 '08 #1

Subscribe Post Reply

4232

Emile van Sebille

Ryan Rosario wrote:

I have a very large CSV file that contains double quoted fields (since
they contain commas). Unfortunately, some of these fields also contain
other double quotes and I made the painful mistake of forgetting to
escape or double the quotes inside the field:

123,"Here is some, text "and some quoted text" where the quotes should
have been doubled",321

rec = '''123,"Here is some, text "and some quoted text" where the quotes
should have been doubled",321'''

import csv

csv.reader([rec.replace(',"',',"""')
.replace('",','""",')
.replace('"""',"'''")
.replace('"','""')
.replace("'''",'"')]).next()

['123', 'Here is some, text "and some quoted text" where the quotes
should have been doubled', '321']

:))

Emile

Has anyone dealt with this problem before? Any ideas of an algorithm I
can use for a Python script to create a new, repaired CSV file?

TIA,
Ryan
--
http://mail.python.org/mailman/listinfo/python-list

Aug 4 '08 #2

Ryan Rosario

On Aug 3, 10:38*pm, Emile van Sebille <em...@fenx.comwrote:

Ryan Rosario wrote:
I have a very large CSV file that contains double quoted fields (since
they contain commas). Unfortunately, some of these fields also contain
other double quotes and I made the painful mistake of forgetting to
escape or double the quotes inside the field:

123,"Here is some, text "and some quoted text" where the quotes should
have been doubled",321

rec = '''123,"Here is some, text "and some quoted text" where the quotes
should have been doubled",321'''

import csv

csv.reader([rec.replace(',"',',"""')
* * * * * * * * .replace('",','""",')
* * * * * * * * .replace('"""',"'''")
* * * * * * * * .replace('"','""')
* * * * * * * * .replace("'''",'"')]).next()

['123', 'Here is some, text "and some quoted text" where the quotes
should have been doubled', '321']

:))

Emile

Has anyone dealt with this problem before? Any ideas of an algorithm I
can use for a Python script to create a new, repaired CSV file?

TIA,
Ryan
--
http://mail.python.org/mailman/listinfo/python-list

Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

TIA,
Ryan

Aug 4 '08 #3

John Machin

On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:

>
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

You originally said "I have a very large CSV file that contains double
quoted fields (since they contain commas)". Are you now saying that
if a field contained a comma, you didn't wrap the field in quotes? Or
is this a separate question unrelated to your original problem?

Aug 4 '08 #4

Ryan Rosario

On Aug 4, 1:01*am, John Machin <sjmac...@lexicon.netwrote:

On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:

Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

You originally said "I have a very large CSV file that contains double
quoted fields (since they contain commas)". Are you now saying *that
if a field contained a comma, you didn't wrap the field in quotes? Or
is this a separate question unrelated to your original problem?

I enclosed all text fields within quotes. The problem is that I have
quotes embedded inside those text fields as well and I did not double/
escape them. Emile's snippet takes care of the escaping but it strips
the outer quotes from the text fields and if there are commas inside
the text field, the field is split into multiple fields. Of course, it
is possible that I am not using the snippet correctly I suppose.

Aug 4 '08 #5

John Machin

On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:

On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:

On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:

Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

You originally said "I have a very large CSV file that contains double
quoted fields (since they contain commas)". Are you now saying that
if a field contained a comma, you didn't wrap the field in quotes? Or
is this a separate question unrelated to your original problem?

I enclosed all text fields within quotes. The problem is that I have
quotes embedded inside those text fields as well and I did not double/
escape them. Emile's snippet takes care of the escaping but it strips
the outer quotes from the text fields and if there are commas inside
the text field, the field is split into multiple fields. Of course, it
is possible that I am not using the snippet correctly I suppose.

Without you actually showing how you are using it, I can only surmise:

Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least). Note carefully his output for one line is a *list* of
fields. The repr() of that list looks superficially like a line of csv
input. It looks like you are csv-reading it a second time, using
quotechar="'", after stripping off the enclosing []. If this guess is
not correct, please show what you are actually doing.

If (as you said) you require a fixed csv file, you need to read the
bad file line by line, use Emile's chain of replaces, and write each
fixed line out to the new file.

Aug 4 '08 #6

Emile van Sebille

John Machin wrote:

On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:
>On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:

>>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

<snip>

Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).

Exactly -- just print out the results of the passed argument:

>>>

rec.replace(',"',",'''").replace('",',"''',").repl ace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. Presumably, you've got a consistent field
count expectation per record. Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. :)

Emile

Aug 4 '08 #7

Ryan Rosario

On Aug 4, 8:30*am, Emile van Sebille <em...@fenx.comwrote:

John Machin wrote:
On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:
On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:

>On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?

<snip>

Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).

Exactly -- just print out the results of the passed argument:

*>>>
rec.replace(',"',",'''").replace('",',"''',").repl ace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. *Presumably, you've got a consistent field
count expectation per record. *Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. *:)

Emile

Thanks for your responses. I think John may be right that I am reading
it a second time. I will take a look at the CSV reader documentation
and see if that helps. Then once I run it I can see if I need to worry
about the comma-next-to-quote issue.

Aug 4 '08 #8

Larry Bates

Ryan Rosario wrote:

On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.comwrote:
>John Machin wrote:
>>On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:
On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:
>Thanks Emile! Works almost perfectly, but is there some way I can
>adapt this to quote fields that contain a comma in them?
<snip>

>>Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).
Exactly -- just print out the results of the passed argument:

> >>>
rec.replace(',"',",'''").replace('",',"''',").rep lace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. Presumably, you've got a consistent field
count expectation per record. Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. :)

Emile

Thanks for your responses. I think John may be right that I am reading
it a second time. I will take a look at the CSV reader documentation
and see if that helps. Then once I run it I can see if I need to worry
about the comma-next-to-quote issue.

This is a perfect demonstration of why tab delimited files are so much better
than comma and quote delimited. Virtually all software can handle table
delimited as well as comma and quote delimited, but you would have none of these
problems if you had used tab delimited. The chances of tabs being embedded in
most data is virtually nil.

-Larry

Aug 4 '08 #9

John Machin

On Aug 5, 6:56 am, Larry Bates <larry.ba...@websafe.com`wrote:

Ryan Rosario wrote:
On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.comwrote:
John Machin wrote:
On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:
On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?
<snip>

>Emile's snippet is pushing it through the csv reading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).
Exactly -- just print out the results of the passed argument:

rec.replace(',"',",'''").replace('",',"''',").repl ace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. Presumably, you've got a consistent field
count expectation per record. Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. :)

Emile

Thanks for your responses. I think John may be right that I am reading
it a second time. I will take a look at the CSV reader documentation
and see if that helps. Then once I run it I can see if I need to worry
about the comma-next-to-quote issue.

This is a perfect demonstration of why tab delimited files are so much better
than comma and quote delimited.

No, it's a perfect demonstration of what happens when a protocol is
not followed.

Virtually all software can handle table
delimited as well as comma and quote delimited, but you would have none of these
problems if you had used tab delimited. The chances of tabs being embedded in
most data is virtually nil.

There may be no tabs in *your* data. There is no guarantee that there
are no tabs in a VARCHAR(n) column in somebody else's database. I've
seen all of \x00, \t, \n, \r and \x1a (Ctrl-Z (EOF in CP/M, *DOS and
Windows text files)).

The possibilities include (1) Don't check (2) check if '\t' in field
and raise an exception (3) silently remove tabs; what do you
recommend?

Aug 5 '08 #10

Ryan Rosario

On Aug 4, 1:56*pm, Larry Bates <larry.ba...@websafe.com`wrote:

Ryan Rosario wrote:
On Aug 4, 8:30 am, Emile van Sebille <em...@fenx.comwrote:
John Machin wrote:
On Aug 4, 6:15 pm, Ryan Rosario <uclamath...@gmail.comwrote:
On Aug 4, 1:01 am, John Machin <sjmac...@lexicon.netwrote:
On Aug 4, 5:49 pm, Ryan Rosario <uclamath...@gmail.comwrote:
Thanks Emile! Works almost perfectly, but is there some way I can
adapt this to quote fields that contain a comma in them?
<snip>

>Emile's snippet is pushing it through thecsvreading process, to
demonstrate that his series of replaces works (on your *sole* example,
at least).
Exactly -- just print out the results of the passed argument:

rec.replace(',"',",'''").replace('",',"''',").repl ace('"','""').replace("'''",'"')

'123,"Here is some, text ""and some quoted text"" where the quotes
should have been doubled",321'

Where it won't work is if any of the field embedded quotes are next to
commas.

I'd run it against the file. *Presumably, you've got a consistent field
count expectation per record. *Any resulting record not matching is
suspect and will identify records this approach won't address.

There's probably better ways, but sometimes it's fun to create
executable line noise. *:)

Emile

Thanks for your responses. I think John may be right that I am reading
it a second time. I will take a look at theCSVreader documentation
and see if that helps. Then once I run it I can see if I need to worry
about the comma-next-to-quote issue.

This is a perfect demonstration of why tab delimited files are so much better
than comma and quote delimited. *Virtually all software can handle table
delimited as well as comma and quote delimited, but you would have none of these
problems if you had used tab delimited. *The chances of tabs being embedded in
most data is virtually nil.

-Larry

Thank you for all the help. I wasn't using Emile's code correctly. It
fixed 99% of the problem, reducing 30,000 bad lines to about 300. The
remaining cases were too difficult to pin a pattern on, so I just
spent an hour fixing those lines. It was typically just adding one
more " to one that was already there.

Next time I am going to be much more careful. Tab delimited is
probably better for my purpose, but I can definitely see there being
issues with invisible tab characters and other weirdness.

Ryan

Aug 6 '08 #11

Roel Schroeven

Ryan Rosario schreef:

Next time I am going to be much more careful. Tab delimited is
probably better for my purpose, but I can definitely see there being
issues with invisible tab characters and other weirdness.

No matter which delimiter you use, there will always be data that
includes that delimiter, and you need some way to deal with it.

I prefer the approach that esr suggests in "The Art of Unix Programming"
(http://www.catb.org/~esr/writings/ta...ch05s02.html): define a
delimiter (preferably but necessary one that doesn't occur frequently in
your data) and an escape character. On output, escape all occurrences of
delimiter and escape character in your data. On input, you can trivially
and unambiguously distinguish delimiters in the data from delimiters
between data, and unescape everything.

Cheers,
Roel

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven

Aug 6 '08 #12

by: Glenn A. Harlan | last post by:

Why am I receiving the below error when calling - Path.GetTempFileName() The directory name is invalid. Description: An unhandled exception occurred during the execution of the current web...

.NET Framework

[Slightly OT] Trying to simulate OOP with C

by: Deniz Bahar | last post by:

Hello, A couple days ago my friend (OOP guy) shows me what OOP was all about in C++. This morning I figured I can do pretty much the same thing with C (by putting function pointers in...

C / C++

"Length of the data to decrypt is invalid." when trying to decrypt TripleDes algorithm

by: Dica | last post by:

i've used the sample code from msdn to create an encyption/decryption assembly as found here: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnnetsec/html/SecNetHT10.asp i'm...

C# / C Sharp

Trying To Query Remote Machine Using WMI & ASP.NET Fails With "Invalid Class"

by: Winterminute | last post by:

I am trying to read a list of install programs using WMI with ASP.NET/C#. However, it fails with "Invalid Class". I have confirmed that if I query LOCALHOST then it works fine, but if I query a...

ASP.NET

trying to make my "fillvalues" function work...

by: Martin Jørgensen | last post by:

Hi, I'm trying to move a matlab program into c language. For those who knows matlab, this is the line I want to program in c: hx(1:nx,1:ny) = 0; % nx=10, ny=10 It works on a 2-dimensional...

C / C++

"Invalid Function" When accessing a 'MVFS' drive.

by: mtanq | last post by:

My project is an ASP-based web page, with C# as the language of choice. I have written a simple tree view that displays the structure of a file system mounted as 'MVFS' (On my system, it's W:/)....

C# / C Sharp

HELP! Trying to add Email Notification to Download File Code

by: matt | last post by:

I have used some free code for listing files for download, but I want to send an email to the administrator when the file has been downloaded. I have got some code in here that does it, but it will...

PHP

invalid conversion from int to file

by: fazulu deen | last post by:

Hi all, For the following code : file_ptr = fopen("pass_fail.txt", "a"); // error line 393 fdisplay(file_ptr, "Test Passed"); fclose(file_ptr);

C / C++

tinyp2p - trying to get it to work

by: Fantus | last post by:

Hello I am doing a small research and I found this: http://www.freedom-to-tinker.com/tinyp2p.html I tried to get it to work. So far I managed to run it as a server using a command:

Python

trying to match a string

by: arnimavidyarthy | last post by:

Hi, Hi, I am taking a string as an input from the user and it should only contain the chars:L , M or R I tried the folllowing in kodos but they are still not perfect:

Python

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing

The easy way to turn off automatic updates for Windows 10/11

by: Hystou | last post by:

Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...

Windows Server

Trying to fix Invalid CSV File

Similar topics