How to Parse a CSV formatted text file

Ram Laxman

Hi all,
I have a text file which has data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
I am a beginner at C/C++ programming.
I don't know how to tokenize the comma-separated values. I used the strtok
function, reading line by line with fgets, but it gives some weird
behavior. It does not strip out the "" fully. Does anybody have sample
code for this that I could use as a reference?

Ram Laxman

Nov 14 '05 #1

Phlip

Ram Laxman wrote:

I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.It doesnot stripout the "" fully.Could any body have sample
code for the same so that it will be helfful for my reference?

Parsing is tricky. Consider these rules:

- \n is absolute. All lines must be unbroken
- "" precedes , - so commas inside strings are text, not delimiters
- quotes inside "" need an escape, either \" or ""
- escapes need escapes - \\ is \

Try this project to learn more:

http://c2.com/cgi/wiki?MsWindowsResourceLint

First, we express those rules (one by one) as test cases:

TEST_(TestCase, pullNextToken_comma)
{

Source aSource("a , b\nc, \n d");

string
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("a", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("b", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("c", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("d", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("", token);
// EOF!

}

struct
TestTokens: TestCase
{

void
test_a_b_d(string input)
{
Source aSource(input);
string
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("a", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("b", token);
// token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("c", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("d", token);
token = aSource.pullNextToken(); CPPUNIT_ASSERT_EQUAL("", token);
// EOF!
}

};

TEST_(TestTokens, elideComments)
{
test_a_b_d("a b\n //c\n d");
test_a_b_d("a b\n//c \n d");
test_a_b_d("a b\n // c \"neither\" \n d");
test_a_b_d("a b\n // c \"neither\" \n d//");
test_a_b_d("//\na b\n // c \"neither\" \n d//");
test_a_b_d("//c\na b\n // c \"neither\" \n d//");
test_a_b_d("// c\na b\n // c \"neither\" \n d//");
test_a_b_d("//c \na b\n // c \"neither\" \n d//");
test_a_b_d("// \na b\n // c \"neither\" \n d//");
test_a_b_d(" // \na b\n // c \"neither\" \n d//");
}

TEST_(TestTokens, elideStreamComments)
{
test_a_b_d("a b\n /*c*/\n d");
test_a_b_d("a b\n/*c*/ \n d");
test_a_b_d("a b\n /* c \"neither\" */\n d");
test_a_b_d("a b\n /* c \"neither\" \n */ d//");
test_a_b_d("//\na b\n /* c \"neither\" */ \n d/**/");
test_a_b_d("//c\na b\n // c \"neither\" \n d/* */");
test_a_b_d("/* c\n*/a b\n // c \"neither\" \n d//");
test_a_b_d("//c \na b\n // c \"neither\" \n d//");
test_a_b_d("// \na b\n // c \"neither\" \n d//");
test_a_b_d(" // \na b\n // c \"neither\" \n d//");
}

Those tests re-use the fixture test_a_b_d() to ensure that every one of
those strings parses into a, b, & d, skipping (for whatever reason) c.

You will need tests that show slightly different behaviors. But write your
tests one at a time. I wrote every single line you see here, essentially in
order, and got it to work before adding the next line. Don't write all your
tests at once, because when programming you should never go more than 1~10
edits before passing all tests.

Now here's the source of Source (which means "source of tokens"):

class
Source
{
public:
Source(string const & rc = ""):
m_rc(rc),
m_bot(0),
m_eot(0)
{}

void setResource(string const & rc) { m_rc = rc; }
size_type getBOT() { return m_bot; }
string const & getPriorToken() { return m_priorToken; }
string const & getCurrentToken() { return m_currentToken; }

string const &
pullNextToken()
{
m_priorToken = m_currentToken;
extractNextToken();
return m_currentToken;
}

size_type
getLineNumber(size_type at)
{
size_type lineNumber = 1;

for(size_type idx(0); idx < at; ++idx)
if ('\n' == m_rc[idx])
++lineNumber;

return lineNumber;
}

string
getLine(size_type at)
{
size_type bol = m_rc.rfind('\n', at);
if (string::npos == bol) bol = 0; else ++bol;
size_type eol = m_rc.find('\n', at);
if (string::npos == eol) eol = m_rc.length(); else ++eol;
return m_rc.substr(bol, eol - bol);
}

private:

string const &
extractNextToken()
{
char static const delims[] = " \t\n,";

m_bot = m_rc.find_first_not_of(delims, m_eot);

if (string::npos == m_bot)
m_currentToken = "";
else if (m_rc[m_bot] == '"')
m_currentToken = parseString();
else if (m_rc.substr(m_bot, 2) == "//")
{
if (skipUntil("\n"))
return extractNextToken();
}
else if (m_rc.substr(m_bot, 2) == "/*")
{
if (skipUntil("*/"))
return extractNextToken();
}
/* else if (m_rc.substr(m_bot, 1) == "#")
{
string line = getLine(m_bot);
size_type at(0);
while(isspace(line[at]) && at < line.size()) ++at;
if ('#' == line[at])
{
m_eot = m_bot + 1;
if (skipUntil("\n"))
return extractNextToken();
}
}*/
else
{
m_eot = m_rc.find_first_of(" \n,/", m_bot);
m_currentToken = m_rc.substr(m_bot, m_eot - m_bot);
}

if (!m_currentToken.empty() && '#' == m_currentToken[0])
{
// assert(m_rc.substr(m_bot, 1) == "#");
string line = getLine(m_bot);
size_type at(0);
// check the bound before reading line[at]
while(at < line.size() && isspace(line[at])) ++at;

if ('#' == line[at])
{
--m_eot;
if (skipUntil("\n"))
return extractNextToken();
}
}
return m_currentToken;
}

bool
skipUntil(char const * delimiter)
{
m_eot = m_rc.find(delimiter, m_eot + 1);

if (string::npos == m_eot)
{
m_currentToken = "";
return false;
}
m_eot += strlen(delimiter);
return true;
}

char
parseStringChar()
{
if (m_rc[m_eot] == '\\')
{
m_eot += 1;
char escapee(m_rc[m_eot++]);

switch (escapee)
{
case 'n' : return '\n';
case 'r' : return '\r';
case 't' : return '\t';
case '0' : return '\0';
case '\\': return '\\';
case 'a' : return '\a';
default : // TODO \x, \v \b, \f
if (isdigit(escapee))
{
string slug = m_rc.substr(m_eot - 1, 3);
return char(strtol(slug.c_str(), NULL, 8));
}
else
//assert(false);
return escapee;
}
}
else if (m_rc[m_eot] == '"' && m_rc[m_eot+1] == '"')
m_eot++;

return m_rc[m_eot++];
}

string
parseString()
{
m_eot = m_bot + 1;
string z;

while ( m_eot < m_rc.length() &&
( m_rc[m_eot] != '"' ||
m_rc[m_eot + 1] == '"' ) )
z += parseStringChar();

if (m_eot < m_rc.length())
m_eot += 1;

return z;
}

string m_rc;
size_type m_bot;
size_type m_eot;
string m_priorToken;
string m_currentToken;
};

That looks really ugly & long, because it hides so much behind such a narrow
interface. (I don't know if I copied all of it in, either.) But it
demonstrates (possibly) correct usage of std::string and std::vector.

Do not copy my source into your editor and try to run it. It will not parse
CSV. Start your project like this:

#include <assert.h>
#include <string>
#include <vector>
typedef std::vector<std::string> strings_t;

strings_t parse(std::string input)
{
strings_t result;
return result;
}

int main()
{
assert("a" == parse("a,b")[0]);
}

If that compiles, it >will< crash if you run it.

Now fix parse() so that it _only_ does not crash, and passes this test. Make
the implementation as stupid as you like.

Then add a test:

assert("a" == parse("a,b")[0]);
assert("b" == parse("a,b")[1]);

Keep going. Make the implementation just a little better after each test.
Write a set of tests for each of the parsing rules I listed. When the new
parse() function is full-featured, put it to work in your program.
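
As a rough sketch of where the first two tests might lead (this is not code from the post above; it assumes plain comma splitting and knows nothing about quotes or escapes yet, and run_parse_tests is a made-up name):

```cpp
#include <cassert>
#include <string>
#include <vector>

typedef std::vector<std::string> strings_t;

// Deliberately simple: splits on every comma and ignores quoting rules.
strings_t parse(std::string const & input)
{
    strings_t result;
    std::string::size_type begin = 0;

    for (;;)
    {
        std::string::size_type comma = input.find(',', begin);
        if (comma == std::string::npos)
        {
            result.push_back(input.substr(begin));   // last field on the line
            return result;
        }
        result.push_back(input.substr(begin, comma - begin));
        begin = comma + 1;                            // step over the comma
    }
}

// The two tests from above, plus one more that pins down the field count.
void run_parse_tests()
{
    assert("a" == parse("a,b")[0]);
    assert("b" == parse("a,b")[1]);
    assert(2u == parse("a,b").size());
}
```

The next test to add would be one this version fails, for example a quoted field with a comma inside it.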

All programs should be written by generating long lists of simple tests like
this. That keeps the bug count very low, and prevents wasting hours and
hours with a debugger.

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #2

Willem

Ram wrote:
) Hi all,
) I have a text file which have data in CSV format.
) "empno","phonenumber","wardnumber"
) 12345,2234353,1000202
) 12326,2243653,1000098
) Iam a beginner of C/C++ programming.
) I don't know how to tokenize the comma separated values.I used strtok
) function reading line by line using fgets.but it gives some weird
) behavior.It doesnot stripout the "" fully.Could any body have sample
) code for the same so that it will be helfful for my reference?

Here's a tip: Look for a library that scans CSV files.

And if you really want to do it yourself, you really don't want to be using
stuff like strtok. Assuming you have one complete line in memory, you're
better off searching for the commas (and quotes) yourself, that's really
not so hard. Just put NULs where the commas are, and point to the
beginning of the strings (just after the comma). You can then pass these
pointers as strings to another parsing routine that turns stuff without
quotes into integers, and stuff with quotes into strings or whatever.
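
A minimal sketch of that idea, assuming the line is already in a writable buffer (for example one filled by fgets). The names split_in_place and check_split_in_place are made up for illustration, and the quotes are left in the fields so a later routine can tell strings from numbers:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Splits one line in place: each comma outside quotes becomes '\0' and the
// start of every field is recorded. The buffer must outlive the pointers.
std::vector<char *> split_in_place(char * line)
{
    std::vector<char *> fields;
    bool in_quotes = false;

    fields.push_back(line);                 // first field starts the line
    for (char * p = line; *p != '\0'; ++p)
    {
        if (*p == '"')
            in_quotes = !in_quotes;         // commas inside quotes are text
        else if (*p == ',' && !in_quotes)
        {
            *p = '\0';                      // terminate the current field
            fields.push_back(p + 1);        // next field starts after it
        }
        else if (*p == '\n' && !in_quotes)
        {
            *p = '\0';                      // drop the newline fgets() keeps
            break;
        }
    }
    return fields;
}

void check_split_in_place()
{
    char line[] = "12345,\"a, b\",1000202\n";
    std::vector<char *> f = split_in_place(line);
    assert(f.size() == 3u);
    assert(std::strcmp(f[0], "12345") == 0);
    assert(std::strcmp(f[1], "\"a, b\"") == 0);
    assert(std::strcmp(f[2], "1000202") == 0);
}
```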
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Nov 14 '05 #3

Phlip

Willem wrote:

Ram wrote:
) Hi all,
) I have a text file which have data in CSV format.
) "empno","phonenumber","wardnumber"
) 12345,2234353,1000202
) 12326,2243653,1000098
) Iam a beginner of C/C++ programming.
) I don't know how to tokenize the comma separated values.I used strtok
) function reading line by line using fgets.but it gives some weird
) behavior.It doesnot stripout the "" fully.Could any body have sample
) code for the same so that it will be helfful for my reference?

Here's a tip: Look for a library that scans CSV files.

Hi Willem! Welcome to the first hard projects of this semester. So far, a
professor somewhere has assumed their class was reading the right chapters
in their tutorial, and has hit them with the first non-Hello World project.

Someone just posted the same question to news:comp.programming .

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #4

Mike Wahler

"Ram Laxman" <ra********@india.com> wrote in message
news:24**************************@posting.google.c om...

Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.It doesnot stripout the "" fully.Could any body have sample
code for the same so that it will be helfful for my reference?

Ram Laxman

#include <cstdlib>
#include <fstream>
#include <ios>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
std::ifstream ifs("csv.txt");
if(!ifs)
{
std::cerr << "Cannot open input\n";
return EXIT_FAILURE;
}

const std::streamsize width(15);
std::cout << std::left;

std::string line;
while(std::getline(ifs, line))
{
std::string tok1;
std::istringstream iss(line);
while(std::getline(iss, tok1, ','))
{
if(tok1.find('"') != std::string::npos)
{
std::string tok2;
std::istringstream iss(tok1);
while(std::getline(iss, tok2, '"'))
{
if(!tok2.empty())
std::cout << std::setw(width) << tok2;
}
}
else
std::cout << std::setw(width) << tok1;

std::cout << ' ';

}
std::cout << " \n";
}

if(!ifs && !ifs.eof())
std::cerr << "Error reading input\n";

return 0;
}

Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Output:

empno           phonenumber     wardnumber
12345           2234353         1000202
12326           2243653         1000098

-Mike

Nov 14 '05 #5

Jon Bell

In article <mg*****************@newsread1.news.pas.earthlink. net>,
Mike Wahler <mk******@mkwahler.net> wrote:

[code snipped]

Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Try changing the first line so one of the tokens contains a comma, e.g.

"empno","phone, number","wardnumber"

;-)

I started to work on a solution, too, and then I thought about embedded
commas, and went, "uh oh..."
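
To make the problem concrete, here is a small sketch of what plain splitting on ',' does to that line. The function name naive_split is made up here; it only reuses the std::getline-with-delimiter call from the code upthread:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Naive split: every comma is a delimiter, quoted or not.
std::vector<std::string> naive_split(std::string const & line)
{
    std::vector<std::string> pieces;
    std::istringstream iss(line);
    std::string piece;
    while (std::getline(iss, piece, ','))
        pieces.push_back(piece);
    return pieces;
}

void check_naive_split()
{
    std::vector<std::string> p =
        naive_split("\"empno\",\"phone, number\",\"wardnumber\"");
    assert(p.size() == 4u);             // three fields were intended
    assert(p[1] == "\"phone");          // the quoted field is cut in two
    assert(p[2] == " number\"");
}
```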

--
Jon Bell <jt*******@presby.edu> Presbyterian College
Dept. of Physics and Computer Science Clinton, South Carolina USA

Nov 14 '05 #6

Phlip

Mike Wahler wrote:

#include <cstdlib>

Hi Mike!

I just wanted to be the first to remind you that the FAQ advises against
doing others' homework - fun though it may be. (Advising the newbie to throw
in a few Design Patterns is better sport, of course...)

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #7

Mike Wahler

"Phlip" <ph*******@yahoo.com> wrote in message
news:Yt*******************@newssvr16.news.prodigy. com...

Mike Wahler wrote:
#include <cstdlib>
Hi Mike!

I just wanted to be the first to remind you that the FAQ advises against
doing others' homework - fun though it may be.

Yes, I realize that.
(Advising the newbie to throw
in a few Design Patterns is better sport, of course...)

I very much doubt that the code would be accepted 'as is'
by an instructor -- unless the student can explain it --
in which case he would have actually studied and learned... :-)
Anyway, it seems that OP isn't quite sure whether he's learning
C or C++.

-Mike

Nov 14 '05 #8

Mike Wahler

"Jon Bell" <jt*******@presby.edu> wrote in message
news:c0**********@jtbell.presby.edu...

In article <mg*****************@newsread1.news.pas.earthlink. net>,
Mike Wahler <mk******@mkwahler.net> wrote:

[code snipped]
Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Try changing the first line so one of the tokens contains a comma, e.g.

"empno","phone, number","wardnumber"

;-)

I started to work on a solution, too, and then I thought about embedded
commas, and went, "uh oh..."

Well, yes I did think about bad input, but thought I'd leave that
to the OP. IOW I gave a very 'literal' answer that only addressed
the exact input cited by the OP. :-)
-Mike

Nov 14 '05 #9

Jon Bell

In article <vG******************@newsread1.news.pas.earthlink .net>,
Mike Wahler <mk******@mkwahler.net> wrote:

Well, yes I did think about bad input, but thought I'd leave that
to the OP. IOW I gave a very 'literal' answer that only addressed
the exact input cited by the OP. :-)

It would be interesting to find out if the instructor actually intended
the students to go whole hog and deal with embedded commas, escaped
quotes, etc. If it's an introductory programming course, it's quite
possible they don't need to worry about those details for the purposes of
the assignment.

--
Jon Bell <jt*******@presby.edu> Presbyterian College
Dept. of Physics and Computer Science Clinton, South Carolina USA

Nov 14 '05 #10

Derk Gwen

ra********@india.com (Ram Laxman) wrote:
# Hi all,
# I have a text file which have data in CSV format.
# "empno","phonenumber","wardnumber"
# 12345,2234353,1000202
# 12326,2243653,1000098
# Iam a beginner of C/C++ programming.
# I don't know how to tokenize the comma separated values.I used strtok
# function reading line by line using fgets.but it gives some weird
# behavior.It doesnot stripout the "" fully.Could any body have sample
# code for the same so that it will be helfful for my reference?

This is probably a type 3 language, so you can probably use a finite
state machine. If you're just beginning, that can be an intimidating
bit of jargon, but FSMs are actually easy to understand, and if you
want to be a programmer, you have to understand them. They pop up all
over the place.

You can use #defines to abstract the FSM with something like

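#include <stdio.h>
#include <stdlib.h>

/* Each action macro starts with ';' so that it ends whatever came before it. */
#define FSM(name) static int name(FILE *file) {int ch=0,m=0,n=0; char *s=0;
#define endFSM return -1;}

#define state(name) name: ch = fgetc(file); e_##name: switch (ch) {
#define endstate ;} return -1;

#define is(character) ;case character:
#define any ;default:
#define next(name) ;goto name
#define emove(name) ;goto e_##name
#define final(name,value) name: e_##name: free(s); return value;

/* shift keeps s NUL-terminated. discard drops the pointer without freeing it,
   so the got_*() callee is assumed to own s (which may be NULL for an empty field). */
#define shift ;if (n+2>m) {m = 2*(n+2); s = realloc(s,m);} s[n++] = ch; s[n] = 0
#define discard ;m = n = 0; s = 0
#define dispose ;free(s) discard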

static void got_empno(char *s);
static void got_phonenumber(char *s);
static void got_wardnumber(char *s);
static void got_csventry(void);

FSM(csv_parser)
state(empno)
is('"') next(quoted_empno)
is(EOF) next(at_end)
is(',') got_empno(s) discard next(phonenumber)
any shift next(empno)
endstate
state(quoted_empno)
is('"') next(empno)
is(EOF) next(at_end_in_string)
any shift next(quoted_empno)
endstate
state(phonenumber)
is('"') next(quoted_phonenumber)
is(EOF) next(at_end_in_entry)
is(',') got_phonenumber(s) discard next(wardnumber)
any shift next(phonenumber)
endstate
state(quoted_phonenumber)
is('"') next(phonenumber)
is(EOF) next(at_end_in_string)
any shift next(quoted_phonenumber)
endstate
state(wardnumber)
is('"') next(quoted_wardnumber)
is(EOF)
got_wardnumber(s); got_csventry() discard
next(at_end)
is('\n')
got_wardnumber(s); got_csventry() discard
next(empno)
is(',') got_wardnumber(s) discard next(unexpected_field)
any shift next(wardnumber)
endstate
state(quoted_wardnumber)
is('"') next(wardnumber)
is(EOF) next(at_end_in_string)
any shift next(quoted_wardnumber)
endstate
final(at_end,0)
final(at_end_in_string,1)
final(unexpected_field,2)
final(at_end_in_entry,3)
endFSM

....
int rc = csv_parser(stdin);
// calls
// got_empno(empno-string)
// got_phonenumber(phonenumber-string)
// got_wardnumber(wardnumber-string)
// got_csventry()
// for each entry
switch (rc) {
case -1: fputs("parser failure\n",stderr); break;
case 1: fputs("end of file in a string\n",stderr); break;
case 2: fputs("too many fields\n",stderr); break;
case 3: fputs("end of file in the middle of an entry\n",stderr); break;
}
....
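
For comparison, the same two-states-per-field idea can be sketched without macros. This is C++ rather than C, it works on one line already in memory, and it accepts any number of fields; the names State, split_fsm and check_split_fsm are made up for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

enum State { Unquoted, Quoted };

// One pass over the line; the state says how to read the next character.
std::vector<std::string> split_fsm(std::string const & line)
{
    std::vector<std::string> fields;
    std::string current;
    State state = Unquoted;

    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        char ch = line[i];
        switch (state)
        {
        case Unquoted:
            if (ch == '"')
                state = Quoted;                 // opening quote
            else if (ch == ',')
            {
                fields.push_back(current);      // a naked comma ends the field
                current.clear();
            }
            else
                current += ch;
            break;
        case Quoted:
            if (ch == '"')
                state = Unquoted;               // closing quote
            else
                current += ch;                  // commas are text in here
            break;
        }
    }
    fields.push_back(current);                  // last field has no comma after it
    return fields;
}

void check_split_fsm()
{
    std::vector<std::string> f =
        split_fsm("\"empno\",\"phone, number\",\"wardnumber\"");
    assert(f.size() == 3u);
    assert(f[1] == "phone, number");
}
```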

--
Derk Gwen http://derkgwen.250free.com/html/index.html
I have no idea what you just said.
I get that alot.

Nov 14 '05 #11

Jack Klein

On Sat, 07 Feb 2004 18:38:10 GMT, "Mike Wahler"
<mk******@mkwahler.net> wrote in comp.lang.c:

"Ram Laxman" <ra********@india.com> wrote in message
news:24**************************@posting.google.c om...
Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.It doesnot stripout the "" fully.Could any body have sample
code for the same so that it will be helfful for my reference?

Ram Laxman

#include <cstdlib>
#include <fstream>
#include <ios>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

[snip]

Mike, please do NOT post C++ code to messages crossposted to
comp.lang.c.

Thanks

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.contrib.andrew.cmu.edu/~a...FAQ-acllc.html

Nov 14 '05 #12

Mike Wahler

"Jack Klein" <ja*******@spamcop.net> wrote in message
news:n7********************************@4ax.com...

[snip]

Mike, please do NOT post C++ code to messages crossposted to
comp.lang.c.

Oops, sorry, wasn't paying attention. Thanks for the heads-up.

-Mike

Nov 14 '05 #13

bartek

ra********@india.com (Ram Laxman) wrote in
news:24**************************@posting.google.c om:

Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.It doesnot stripout the "" fully.Could any body have sample
code for the same so that it will be helfful for my reference?

Check out the amazing Spirit framework.
It's available from Boost libraries: http://www.boost.org

Nov 14 '05 #14

Mark McIntyre

On 7 Feb 2004 09:39:14 -0800, in comp.lang.c , ra********@india.com
(Ram Laxman) wrote:

Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.

yes, you need to handle that sort of stuff yourself. Personally I'd
use strtok on this sort of data, since embedded commas should not
exist. Consider the 1st line a special case.

--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.angelfire.com/ms3/bchambless0/welcome_to_clc.html>
----== Posted via Newsfeed.Com - Unlimited-Uncensored-Secure Usenet News==----
http://www.newsfeed.com The #1 Newsgroup Service in the World! >100,000 Newsgroups
---= 19 East/West-Coast Specialized Servers - Total Privacy via Encryption =---

Nov 14 '05 #15

Joe Wright

Mark McIntyre wrote:

On 7 Feb 2004 09:39:14 -0800, in comp.lang.c , ra********@india.com
(Ram Laxman) wrote:
Hi all,
I have a text file which have data in CSV format.
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
Iam a beginner of C/C++ programming.
I don't know how to tokenize the comma separated values.I used strtok
function reading line by line using fgets.but it gives some weird
behavior.

yes, you need to handle that sort of stuff yourself. Personally I'd
use strtok on this sort of data, since embedded commas should not
exist. Consider the 1st line a special case.

I don't know of a 'Standard' defining .csv but this is normal output
from Visual FoxPro..

first,last
"Mac "The Knife" Peter","Boswell, Jr."

But strangely, Excel reads it back wrong. Go figure.
"Failure is not an option. With M$ it is bundled with every package."

The format started with dBASE I think and goes something like this..

Fields are alphanumerics separated by commas. Fields of type 'Character'
are further delimited with '"' so that they might contain comma and '"'
itself. The Rules are something like this..

The first field begins with the first character on the line.
Fields end at a naked ',' comma or '\n' newline.
Delimited fields begin with '"' and end with '"' and comma or newline.
The delimiters are not a literal part of the field. Any comma or '"'
within the delimiters are literals.
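
A rough sketch of those rules as I read them, assuming the trailing newline has already been removed from the line. The names split_foxpro and check_split_foxpro are made up for illustration and are not anything FoxPro or dBASE provides:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> split_foxpro(std::string const & line)
{
    std::vector<std::string> fields;
    std::string::size_type i = 0;
    std::string::size_type const n = line.size();

    for (;;)
    {
        std::string field;
        if (i < n && line[i] == '"')
        {
            ++i;                                   // skip the opening delimiter
            // Copy until a '"' that is followed by ',' or the end of the line.
            while (i < n &&
                   !(line[i] == '"' && (i + 1 == n || line[i + 1] == ',')))
                field += line[i++];
            if (i < n) ++i;                        // skip the closing delimiter
        }
        else
        {
            while (i < n && line[i] != ',')        // naked field: up to a comma
                field += line[i++];
        }
        fields.push_back(field);

        if (i >= n) break;                         // end of line ends the record
        ++i;                                       // step over the comma
    }
    return fields;
}

void check_split_foxpro()
{
    std::vector<std::string> f =
        split_foxpro("\"Mac \"The Knife\" Peter\",\"Boswell, Jr.\"");
    assert(f.size() == 2u);
    assert(f[0] == "Mac \"The Knife\" Peter");
    assert(f[1] == "Boswell, Jr.");
}
```

Note that under these rules a field whose text contains a '"' immediately followed by a comma is cut short at that point.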

--
Joe Wright http://www.jw-wright.com
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Nov 14 '05 #16

Mark McIntyre

On Sun, 08 Feb 2004 14:43:05 GMT, in comp.lang.c , Joe Wright
<jo********@earthlink.net> wrote:

Mark McIntyre wrote:

yes, you need to handle that sort of stuff yourself. Personally I'd
use strtok on this sort of data, since embedded commas should not
exist. Consider the 1st line a special case.

I don't know of a 'Standard' defining .csv but this is normal output
from Visual FoxPro..

snip example w/ embedded commas.

Interesting, but the OP's data was employee numbers, phone numbers and
ward numbers. I find Occam's Razor to be efficient in such cases.

--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.angelfire.com/ms3/bchambless0/welcome_to_clc.html>

Nov 14 '05 #17

David Harmon

On Sun, 08 Feb 2004 14:43:05 GMT in comp.lang.c++, Joe Wright
<jo********@earthlink.net> was alleged to have written:

I don't know of a 'Standard' defining .csv but this is normal output
from Visual FoxPro..

first,last
"Mac "The Knife" Peter","Boswell, Jr."

But strangely, Excel reads it back wrong.

Excel expects quotes within the field to be doubled. In fact, I would
go so far as to say FoxPro is wrong.

More lenient parsing would recognize a quote not followed by a comma or
newline as contained within the field. This creates some ambiguities,
since quoted fields can also contain newline.
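
A sketch of the doubled-quote convention only, assuming one record per line, so quoted newlines and the more lenient reading are not handled. The names split_doubled and check_split_doubled are made up for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> split_doubled(std::string const & line)
{
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;

    for (std::string::size_type i = 0; i < line.size(); ++i)
    {
        char ch = line[i];
        if (in_quotes)
        {
            if (ch != '"')
                field += ch;                       // ordinary text, commas too
            else if (i + 1 < line.size() && line[i + 1] == '"')
            {
                field += '"';                      // "" stands for one quote
                ++i;
            }
            else
                in_quotes = false;                 // a lone quote closes the field
        }
        else if (ch == '"')
            in_quotes = true;
        else if (ch == ',')
        {
            fields.push_back(field);
            field.clear();
        }
        else
            field += ch;
    }
    fields.push_back(field);
    return fields;
}

void check_split_doubled()
{
    std::vector<std::string> f =
        split_doubled("\"Mac \"\"The Knife\"\" Peter\",\"Boswell, Jr.\"");
    assert(f.size() == 2u);
    assert(f[0] == "Mac \"The Knife\" Peter");
    assert(f[1] == "Boswell, Jr.");
}
```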

There is no standard, but see http://www.wotsit.org/download.asp?f=csv

Nov 14 '05 #18

Gordon Burditt

> I have a text file which have data in CSV format.

What *IS* CSV format? The following "definition by example"
isn't very complete.

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Your examples do not handle the corner cases where a string
contains commas, quotes, and/or newlines. If your definition
introduces an "escape" character, also worry about strings
consisting of several of those characters. Also, can single
quotes be used in place of double quotes? Can a single quote
match a double quote or vice versa?

Also it isn't explained what isn't a valid CSV format. How
about these:

,,,,,,,,,,,,,,,,,,,,
,""""""""""""""""""""""""""",
,"""""""""""""""""""""""""""",
,""""""""""""""""""""""""""""",
"\\\\\\\\\\\\\\\\\"
"\\\\\\\\\\\\\\\\\\"
"\\\\\\\\\\\\\\\\\\\"
"""""""""""""""""""""""
""""""""""""""""""""""""
"""""""""""""""""""""""""
""""""""""""""""""""""""""

Gordon L. Burditt

Nov 14 '05 #19

Dietmar Kuehl

Joe Wright <jo********@earthlink.net> wrote:

I don't know of a 'Standard' defining .csv but this is normal output
from Visual FoxPro..

first,last
"Mac "The Knife" Peter","Boswell, Jr."

But strangely, Excel reads it back wrong. Go figure.
"Failure is not an option. With M$ it is bundled with every package."

So, you are saying this is not at all a homework assignment but rather a
request from a Microsoft engineer asking for correct code dealing with
their files?
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/>

Nov 14 '05 #20

Joe Wright

Dietmar Kuehl wrote:

Joe Wright <jo********@earthlink.net> wrote:
I don't know of a 'Standard' defining .csv but this is normal output
from Visual FoxPro..

first,last
"Mac "The Knife" Peter","Boswell, Jr."

But strangely, Excel reads it back wrong. Go figure.
"Failure is not an option. With M$ it is bundled with every package."

So, you are saying this is not at all a homework assignment but rather a
request from a Microsoft engineer asking for correct code dealing with
their files?

I'm not sure I follow you. It's certainly not my homework assignment.
--
Joe Wright http://www.jw-wright.com
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Nov 14 '05 #21

Phlip

Joe Wright wrote:

I'm not sure I follow you. It's certainly not my homework assignment.

The question is not whether it's homework (or whether a very similar
question arrived within a few hours).

The question is whether the group is asked to do someone's learning for
them.

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #22

Programmer Dude

David Harmon wrote:

"Mac "The Knife" Peter","Boswell, Jr."

But strangely, Excel reads it back wrong.

Excel expects quotes within the field to be doubled. In fact, I
would go so far as to say FoxPro is wrong.

I would agree FoxPro is wrong. It appears to require (based on the
spec presented upthread), seeing the two-character sequence (",) or
the two-character sequence ("<newline>). That is, if the field
started with the (") character.

I might look for the three-character sequence (",") (or "<newline>),
but I still think this is a broken spec. Without being able to
escape the double-quote, you simply can't guarantee that there isn't
a valid delimiter sequence in the string.

Also, this spec requires control of the CSV *emitter* (which, to me,
lacks robustness). The spec requires the writer of the values be
sure to not include spaces--in this case, between the final quote
and the comma. I'd rather a CSV reader that can handle:

" " , foobar , 42 , "Hello, World!" ,, , "Jonas ""J"" Jamison",

Without worrying about padding spaces around the commas.

What's maybe more an issue is how quotes are escaped. One standard
(used by MS and others) is doubling the quote. The other common one
uses an "escape char", such as the backslash. A *really* good CSV
parser should, IMO, detect both *AND* allow for single-quoting as
well as double-quoting.
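
As a sketch of what such a reader might look like (one line at a time, with padding spaces ignored outside quotes, either quote character accepted, and both doubling and backslash treated as escapes). The names split_lenient and check_split_lenient are made up, and this is one possible reading of the wish list above, not a tested library:

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> split_lenient(std::string const & line)
{
    std::vector<std::string> fields;
    std::string::size_type i = 0;
    std::string::size_type const n = line.size();

    for (;;)
    {
        std::string field;
        while (i < n && line[i] == ' ') ++i;          // leading padding

        if (i < n && (line[i] == '"' || line[i] == '\''))
        {
            char const quote = line[i++];             // remember which quote opened
            while (i < n)
            {
                char ch = line[i++];
                if (ch == '\\' && i < n)
                    field += line[i++];               // backslash escape
                else if (ch == quote && i < n && line[i] == quote)
                {
                    field += quote;                   // doubled quote
                    ++i;
                }
                else if (ch == quote)
                    break;                            // closing quote
                else
                    field += ch;
            }
            while (i < n && line[i] != ',') ++i;      // skip padding up to the comma
        }
        else
        {
            while (i < n && line[i] != ',')
                field += line[i++];
            // Strip trailing padding from a naked field.
            std::string::size_type last = field.find_last_not_of(' ');
            field.erase(last == std::string::npos ? 0 : last + 1);
        }
        fields.push_back(field);

        if (i >= n) break;
        ++i;                                          // step over the comma
    }
    return fields;
}

void check_split_lenient()
{
    std::vector<std::string> f = split_lenient(
        "\" \" , foobar , 42 , \"Hello, World!\" ,, , \"Jonas \"\"J\"\" Jamison\",");
    assert(f.size() == 8u);
    assert(f[0] == " ");
    assert(f[1] == "foobar");
    assert(f[3] == "Hello, World!");
    assert(f[6] == "Jonas \"J\" Jamison");
    assert(f[7] == "");
}
```

One cost of accepting single quotes is that a naked field starting with an apostrophe is read as a quoted field.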

--
|_ CJSonnack <Ch***@Sonnack.com> _____________| How's my programming? |
|_ http://www.Sonnack.com/ ___________________| Call: 1-800-DEV-NULL |
|_____________________________________________|___ ____________________|

Nov 14 '05 #23
