How to Parse a CSV formatted text file

Ram Laxman

Hi all,
I have a text file that contains data in CSV format:
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
I am a beginner at C/C++ programming.
I don't know how to tokenize the comma-separated values. I used the strtok
function, reading line by line with fgets, but it behaves strangely:
it does not fully strip out the "". Does anybody have sample
code for this that I could use as a reference?

Ram Laxman


Nov 14 '05 #1


Phlip

Ram Laxman wrote:

I have a text file that contains data in CSV format:
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
I am a beginner at C/C++ programming.
I don't know how to tokenize the comma-separated values. I used the strtok
function, reading line by line with fgets, but it behaves strangely:
it does not fully strip out the "". Does anybody have sample
code for this that I could use as a reference?

Parsing is tricky. Consider these rules:

- \n is absolute. All lines must be unbroken
- "" precedes , - so commas inside strings are text, not delimiters
- quotes inside "" need an escape, either \" or ""
- escapes need escapes - \\ is \

Try this project to learn more:

http://c2.com/cgi/wiki?MsWindowsResourceLint

First, we express those rules (one by one) as test cases:

TEST_(TestCase, pullNextToken_comma)
{

    Source aSource("a , b\nc, \n d");

    string
    token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("a", token);
    token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("b", token);
    token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("c", token);
    token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("d", token);
    token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("", token);
                                      // EOF!

}

struct
TestTokens: TestCase
{

    void
    test_a_b_d(string input)
    {
        Source aSource(input);
        string
        token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("a", token);
        token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("b", token);
    //  token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("c", token);
        token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("d", token);
        token = aSource.pullNextToken();  CPPUNIT_ASSERT_EQUAL("", token);
                                          // EOF!
    }

};

TEST_(TestTokens, elideComments)
{
test_a_b_d("a b\n //c\n d");
test_a_b_d("a b\n//c \n d");
test_a_b_d("a b\n // c \"neither\" \n d");
test_a_b_d("a b\n // c \"neither\" \n d//");
test_a_b_d("//\na b\n // c \"neither\" \n d//");
test_a_b_d("//c\na b\n // c \"neither\" \n d//");
test_a_b_d("// c\na b\n // c \"neither\" \n d//");
test_a_b_d("//c \na b\n // c \"neither\" \n d//");
test_a_b_d("// \na b\n // c \"neither\" \n d//");
test_a_b_d(" // \na b\n // c \"neither\" \n d//");
}

TEST_(TestTokens, elideStreamComments)
{
test_a_b_d("a b\n /*c*/\n d");
test_a_b_d("a b\n/*c*/ \n d");
test_a_b_d("a b\n /* c \"neither\" */\n d");
test_a_b_d("a b\n /* c \"neither\" \n */ d//");
test_a_b_d("//\na b\n /* c \"neither\" */ \n d/**/");
test_a_b_d("//c\na b\n // c \"neither\" \n d/* */");
test_a_b_d("/* c\n*/a b\n // c \"neither\" \n d//");
test_a_b_d("//c \na b\n // c \"neither\" \n d//");
test_a_b_d("// \na b\n // c \"neither\" \n d//");
test_a_b_d(" // \na b\n // c \"neither\" \n d//");
}

Those tests re-use the fixture test_a_b_d() to ensure that every one of
those strings parses into a, b, and d, skipping (for whatever reason) c.

You will need tests that show slightly different behaviors. But write your
tests one at a time. I wrote every single line you see here, essentially in
order, and got it to work before adding the next line. Don't write all your
tests at once, because when programming you should never go more than 1~10
edits before passing all tests.

Now here's the source of Source (which means "source of tokens"):

// Assumed context, not shown in the original post: #include <string>,
// <cstring>, <cstdlib> and <cctype>, plus using std::string; and
// typedef string::size_type size_type;
class
Source
{
public:
    Source(string const & rc = ""):
        m_rc(rc),
        m_bot(0),
        m_eot(0)
    {}

    void setResource(string const & rc) { m_rc = rc; }
    size_type getBOT() { return m_bot; }
    string const & getPriorToken() { return m_priorToken; }
    string const & getCurrentToken() { return m_currentToken; }

    string const &
    pullNextToken()
    {
        m_priorToken = m_currentToken;
        extractNextToken();
        return m_currentToken;
    }

    size_type
    getLineNumber(size_type at)
    {
        size_type lineNumber = 1;

        for(size_type idx(0); idx < at; ++idx)
            if ('\n' == m_rc[idx])
                ++lineNumber;

        return lineNumber;
    }

    string
    getLine(size_type at)
    {
        size_type bol = m_rc.rfind('\n', at);
        if (string::npos == bol) bol = 0; else ++bol;
        size_type eol = m_rc.find('\n', at);
        if (string::npos == eol) eol = m_rc.length(); else ++eol;
        return m_rc.substr(bol, eol - bol);
    }

private:

    string const &
    extractNextToken()
    {
        char static const delims[] = " \t\n,";

        m_bot = m_rc.find_first_not_of(delims, m_eot);

        if (string::npos == m_bot)
            m_currentToken = "";
        else if (m_rc[m_bot] == '"')
            m_currentToken = parseString();
        else if (m_rc.substr(m_bot, 2) == "//")
        {
            if (skipUntil("\n"))
                return extractNextToken();
        }
        else if (m_rc.substr(m_bot, 2) == "/*")
        {
            if (skipUntil("*/"))
                return extractNextToken();
        }
    /*  else if (m_rc.substr(m_bot, 1) == "#")
        {
            string line = getLine(m_bot);
            size_type at(0);
            while(at < line.size() && isspace(line[at])) ++at;
            if ('#' == line[at])
            {
                m_eot = m_bot + 1;
                if (skipUntil("\n"))
                    return extractNextToken();
            }
        }*/
        else
        {
            m_eot = m_rc.find_first_of(" \n,/", m_bot);
            m_currentToken = m_rc.substr(m_bot, m_eot - m_bot);
        }

        if ('#' == m_currentToken[0])
        {
            // assert(m_rc.substr(m_bot, 1) == "#");
            string line = getLine(m_bot);
            size_type at(0);
            // check the bound before reading line[at]
            while(at < line.size() && isspace(line[at])) ++at;

            if ('#' == line[at])
            {
                --m_eot;
                if (skipUntil("\n"))
                    return extractNextToken();
            }
        }
        return m_currentToken;
    }

    bool
    skipUntil(char const * delimiter)
    {
        m_eot = m_rc.find(delimiter, m_eot + 1);

        if (string::npos == m_eot)
        {
            m_currentToken = "";
            return false;
        }
        m_eot += strlen(delimiter);
        return true;
    }

    char
    parseStringChar()
    {
        if (m_rc[m_eot] == '\\')
        {
            m_eot += 1;
            char escapee(m_rc[m_eot++]);

            switch (escapee)
            {
                case 'n' : return '\n';
                case 'r' : return '\r';
                case 't' : return '\t';
                case '0' : return '\0';
                case '\\': return '\\';
                case 'a' : return '\a';
                default  : // TODO \x, \v \b, \f
                    if (isdigit(escapee))
                    {
                        string slug = m_rc.substr(m_eot - 1, 3);
                        return char(strtol(slug.c_str(), NULL, 8));
                    }
                    else
                        //assert(false);
                        return escapee;
            }
        }
        else if (m_rc[m_eot] == '"' && m_rc[m_eot+1] == '"')
            m_eot++;

        return m_rc[m_eot++];
    }

    string
    parseString()
    {
        m_eot = m_bot + 1;
        string z;

        while ( m_eot < m_rc.length() &&
                ( m_rc[m_eot] != '"' ||
                  m_rc[m_eot + 1] == '"' ) )
            z += parseStringChar();

        if (m_eot < m_rc.length())
            m_eot += 1;

        return z;
    }

    string m_rc;
    size_type m_bot;
    size_type m_eot;
    string m_priorToken;
    string m_currentToken;
};

That looks really ugly & long, because it hides so much behind such a narrow
interface. (I don't know if I copied all of it in, either.) But it
demonstrates (possibly) correct usage of std::string and std::vector.

Do not copy my source into your editor and try to run it. It will not parse
CSV. Start your project like this:

#include <assert.h>
#include <string>
#include <vector>
typedef std::vector<std::string> strings_t;

strings_t parse(std::string input)
{
    strings_t result;
    return result;
}

int main()
{
    assert("a" == parse("a,b")[0]);
}

If that compiles, it >will< crash if you run it.

Now fix parse() so that it _only_ does not crash, and passes this test. Make
the implementation as stupid as you like.

Then add a test:

assert("a" == parse("a,b")[0]);
assert("b" == parse("a,b")[1]);

Keep going. Make the implementation just a little better after each test.
Write a set of tests for each of the parsing rules I listed. When the new
parse() function is full-featured, put it to work in your program.
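
For instance, after the second test, parse() might look something like the sketch below. This is only one possible intermediate step, assuming plain comma splitting with std::getline; it knows nothing yet about quotes or escapes, which later tests would force in. The function name runTests is made up for illustration:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> strings_t;

// Deliberately naive: split on every comma, no quote handling yet.
strings_t parse(std::string input)
{
    strings_t result;
    std::istringstream in(input);
    std::string token;
    while (std::getline(in, token, ','))
        result.push_back(token);
    return result;
}

// The two tests written so far; call this from main().
void runTests()
{
    assert("a" == parse("a,b")[0]);
    assert("b" == parse("a,b")[1]);
}
```

The next test to add would be one that this version fails, such as a quoted field containing a comma.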

All programs should be written by generating long lists of simple tests like
this. That keeps the bug count very low, and prevents wasting hours and
hours with a debugger.

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #2

Willem

Ram wrote:
) Hi all,
) I have a text file that contains data in CSV format:
) "empno","phonenumber","wardnumber"
) 12345,2234353,1000202
) 12326,2243653,1000098
) I am a beginner at C/C++ programming.
) I don't know how to tokenize the comma-separated values. I used the strtok
) function, reading line by line with fgets, but it behaves strangely:
) it does not fully strip out the "". Does anybody have sample
) code for this that I could use as a reference?

Here's a tip: Look for a library that scans CSV files.

And if you really want to do it yourself, you really don't want to be using
stuff like strtok. Assuming you have one complete line in memory, you're
better off searching for the commas (and quotes) yourself; that's really
not so hard. Just put NULs where the commas are, and point to the
beginning of the strings (just after the comma). You can then pass these
pointers as strings to another parsing routine that turns stuff without
quotes into integers, and stuff with quotes into strings or whatever.
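
A rough sketch of that idea, assuming the line sits in a writable, NUL-terminated buffer and that commas inside double quotes must not split a field. The names splitInPlace and checkSplitInPlace are made up for illustration; the quotes are left in each field so that a later routine can tell strings from numbers:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical helper: overwrite each field-separating comma with '\0'
// and return pointers to the start of every field. The buffer is modified.
std::vector<char*> splitInPlace(char* line)
{
    std::vector<char*> fields;
    bool inQuotes = false;
    fields.push_back(line);             // first field starts at the beginning

    for (char* p = line; *p != '\0'; ++p)
    {
        if (*p == '"')
            inQuotes = !inQuotes;       // toggling also copes with "" pairs
        else if (*p == ',' && !inQuotes)
        {
            *p = '\0';                  // terminate the current field
            fields.push_back(p + 1);    // next field starts after the comma
        }
    }
    return fields;
}

void checkSplitInPlace()
{
    char line[] = "\"phone, number\",12345,2234353";
    std::vector<char*> f = splitInPlace(line);
    assert(f.size() == 3);
    assert(std::strcmp(f[0], "\"phone, number\"") == 0);
    assert(std::strcmp(f[1], "12345") == 0);
    assert(std::strcmp(f[2], "2234353") == 0);
}
```

A line read with fgets usually still ends in '\n', so that would need to be stripped before splitting.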
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Nov 14 '05 #3

Phlip

Willem wrote:

Ram wrote:
) Hi all,
) I have a text file that contains data in CSV format:
) "empno","phonenumber","wardnumber"
) 12345,2234353,1000202
) 12326,2243653,1000098
) I am a beginner at C/C++ programming.
) I don't know how to tokenize the comma-separated values. I used the strtok
) function, reading line by line with fgets, but it behaves strangely:
) it does not fully strip out the "". Does anybody have sample
) code for this that I could use as a reference?

Here's a tip: Look for a library that scans CSV files.

Hi Willem! Welcome to the first hard projects of this semester. So far, a
professor somewhere has assumed their class was reading the right chapters
in their tutorial, and has hit them with the first non-Hello World project.

Someone just posted the same question to news:comp.programming.

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #4

Mike Wahler

"Ram Laxman" <ra********@india.com> wrote in message
news:24**************************@posting.google.com...

Hi all,
I have a text file that contains data in CSV format:
"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098
I am a beginner at C/C++ programming.
I don't know how to tokenize the comma-separated values. I used the strtok
function, reading line by line with fgets, but it behaves strangely:
it does not fully strip out the "". Does anybody have sample
code for this that I could use as a reference?

Ram Laxman

#include <cstdlib>
#include <fstream>
#include <ios>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream ifs("csv.txt");
    if(!ifs)
    {
        std::cerr << "Cannot open input\n";
        return EXIT_FAILURE;
    }

    const std::streamsize width(15);
    std::cout << std::left;

    std::string line;
    while(std::getline(ifs, line))
    {
        std::string tok1;
        std::istringstream iss(line);
        while(std::getline(iss, tok1, ','))
        {
            if(tok1.find('"') != std::string::npos)
            {
                std::string tok2;
                std::istringstream iss(tok1);
                while(std::getline(iss, tok2, '"'))
                {
                    if(!tok2.empty())
                        std::cout << std::setw(width) << tok2;
                }
            }
            else
                std::cout << std::setw(width) << tok1;

            std::cout << ' ';

        }
        std::cout << " \n";
    }

    if(!ifs && !ifs.eof())
        std::cerr << "Error reading input\n";

    return 0;
}

Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Output:

empno           phonenumber     wardnumber
12345           2234353         1000202
12326           2243653         1000098

-Mike

Nov 14 '05 #5

Jon Bell

In article <mg*****************@newsread1.news.pas.earthlink.net>,
Mike Wahler <mk******@mkwahler.net> wrote:

[code snipped]

Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Try changing the first line so one of the tokens contains a comma, e.g.

"empno","phone, number","wardnumber"

;-)

I started to work on a solution, too, and then I thought about embedded
commas, and went, "uh oh..."
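
To make the problem concrete, here is a small sketch that splits on commas the same way the outer std::getline loop in the snipped code does. The names naiveSplit and showEmbeddedCommaProblem are hypothetical and are not part of Mike's program:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits on every comma, ignoring quotes.
std::vector<std::string> naiveSplit(const std::string& line)
{
    std::vector<std::string> tokens;
    std::istringstream iss(line);
    std::string tok;
    while (std::getline(iss, tok, ','))
        tokens.push_back(tok);
    return tokens;
}

void showEmbeddedCommaProblem()
{
    // Three quoted fields, but the comma inside the second one splits it.
    std::vector<std::string> t =
        naiveSplit("\"empno\",\"phone, number\",\"wardnumber\"");
    assert(t.size() == 4);
    assert(t[1] == "\"phone");
    assert(t[2] == " number\"");
}
```

The header row comes back as four tokens instead of three, so every column after the embedded comma shifts by one.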

--
Jon Bell <jt*******@presby.edu> Presbyterian College
Dept. of Physics and Computer Science Clinton, South Carolina USA

Nov 14 '05 #6

Phlip

Mike Wahler wrote:

#include <cstdlib>

Hi Mike!

I just wanted to be the first to remind you that the FAQ advises against
doing others' homework - fun though it may be. (Advising the newbie to throw
in a few Design Patterns is better sport, of course...)

--
Phlip
http://www.xpsd.org/cgi-bin/wiki?Tes...UserInterfaces

Nov 14 '05 #7

Mike Wahler

"Phlip" <ph*******@yahoo.com> wrote in message
news:Yt*******************@newssvr16.news.prodigy.com...

Mike Wahler wrote:
#include <cstdlib>
Hi Mike!

I just wanted to be the first to remind you that the FAQ advises against
doing others' homework - fun though it may be.

Yes, I realize that.
(Advising the newbie to throw
in a few Design Patterns is better sport, of course...)

I very much doubt that the code would be accepted 'as is'
by an instructor -- unless the student can explain it --
in which case he would have actually studied and learned... :-)
Anyway, it seems that OP isn't quite sure whether he's learning
C or C++.

-Mike

Nov 14 '05 #8

Mike Wahler

"Jon Bell" <jt*******@presby.edu> wrote in message
news:c0**********@jtbell.presby.edu...

In article <mg*****************@newsread1.news.pas.earthlink.net>,
Mike Wahler <mk******@mkwahler.net> wrote:

[code snipped]
Input file:

"empno","phonenumber","wardnumber"
12345,2234353,1000202
12326,2243653,1000098

Try changing the first line so one of the tokens contains a comma, e.g.

"empno","phone, number","wardnumber"

;-)

I started to work on a solution, too, and then I thought about embedded
commas, and went, "uh oh..."

Well, yes I did think about bad input, but thought I'd leave that
to the OP. IOW I gave a very 'literal' answer that only addressed
the exact input cited by the OP. :-)
-Mike

Nov 14 '05 #9

Jon Bell

In article <vG******************@newsread1.news.pas.earthlink.net>,
Mike Wahler <mk******@mkwahler.net> wrote:

Well, yes I did think about bad input, but thought I'd leave that
to the OP. IOW I gave a very 'literal' answer that only addressed
the exact input cited by the OP. :-)

It would be interesting to find out if the instructor actually intended
the students to go whole hog and deal with embedded commas, escaped
quotes, etc. If it's an introductory programming course, it's quite
possible they don't need to worry about those details for the purposes of
the assignment.

--
Jon Bell <jt*******@presby.edu> Presbyterian College
Dept. of Physics and Computer Science Clinton, South Carolina USA

Nov 14 '05 #10
