How to parse a file in C++

AdrianH
Expert 100+
P: 1,251
Assumptions
I assume that you know, or can look up, the functions I describe here, and that you have a basic understanding of C++ programming.


FYI
Although I have called this article “How to Parse a File in C++”, we are mostly lexing a file: breaking a stream down into its component parts while disregarding the syntax that stream contains. Parsing goes further and takes the syntax into account in order to make sense of the stream.

Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.

I didn’t use the title “How to do Lexical Analysis in C++” because most of you probably don’t know what that means. If you do, then I apologise.


Introduction
Hi, last time I showed you all how to parse a file in C. In this article, I will now address how to parse a file in C++.

For those who haven’t read that article, please read its section on Streams and Files, as this is the same for C++ as it is for C. However, when using the C++ streams, instead of stdin, stdout and stderr, you use cin, cout and cerr respectively.


Buffering and Double Buffering
Double buffering means to dump from one buffer into another prior to processing/displaying. In C++ all of the standard streams are buffered, although cerr is flushed after every output operation.


Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.

Parsing Without Double Buffering
To parse a file without double buffering is not always possible. The only way to do it would be to read and store only numbers.

E.g. here is a sample file:
    1, 2, 3, 4, 5
    6, 7, 8, 9, 10

To read that in without double buffering you could loop around the following:
    // CODE FRAGMENT 1
    // Assumes #include <iostream> and using namespace std;
    int itemsParsed = 0;
    int items[5];
    // Testing the extraction itself means itemsParsed only counts values
    // that were actually read.
    for (itemsParsed = 0; itemsParsed < 5 && cin >> items[itemsParsed]; ++itemsParsed) {
      if (itemsParsed != 4 && cin.peek() == ',') {
        cin.ignore(1);  // clear out comma
      }
    }
    if (!cin.good()) {
      // check what flag was set and act appropriately
      //...
      if (!cin.eof()) {
        cin.clear(); // Clear the error flag (unless it is eof)
      }
    }

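Note that a comma, where present, must follow its number immediately, because the fragment does not skip whitespace before checking for the comma. It also does not reject input in which a comma is missing altogether. There can be zero or more whitespace characters after the comma, since the extraction operator skips leading whitespace. A whitespace character can be a regular space, a tab, a vertical tab (rarely ever used), a form feed, a carriage return or a line feed.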
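
To make "loop around the following" concrete, here is a minimal sketch that wraps the fragment in a function and calls it until the stream runs dry. The names readRow and readAllRows are hypothetical, and the row width of five is taken from the sample file; this is one way to arrange it, not the only one.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads up to `count` integers separated by commas
// and/or whitespace. Returns how many values were actually read.
std::size_t readRow(std::istream& in, int* items, std::size_t count)
{
    std::size_t parsed = 0;
    while (parsed < count && in >> items[parsed]) {
        ++parsed;
        if (parsed != count && in.peek() == ',') {
            in.ignore(1);  // clear out comma
        }
    }
    return parsed;
}

// Hypothetical helper: keeps reading rows of five until nothing more can
// be read, and returns every value in file order.
std::vector<int> readAllRows(std::istream& in)
{
    std::vector<int> values;
    int row[5];
    std::size_t n;
    while ((n = readRow(in, row, 5)) > 0) {
        values.insert(values.end(), row, row + n);
    }
    return values;
}
```

Because both helpers take an istream&, they can be handed cin, an ifstream or an istringstream without change.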

Code Fragment 1A is a bit simpler as it separates the normal code flow from the exceptional one using C++ exception handling.
    // CODE FRAGMENT 1A
    // Assumes #include <cassert>, #include <iostream> and using namespace std;
    int itemsParsed = 0;
    int items[5];
    // Turn on exceptions. eofbit is left out of the mask so that a value
    // ending exactly at end-of-file still counts as a successful read.
    cin.exceptions(ios::failbit | ios::badbit);
    try {
      for (itemsParsed = 0; itemsParsed < 5; ++itemsParsed) {
        cin >> items[itemsParsed];
        if (itemsParsed != 4 && cin.peek() == ',') {
          cin.ignore(1);  // clear out comma
        }
      }
    }
    catch (const ios_base::failure&) {  // catch by reference, not by value
      assert(!cin.good());
      // check what flag was set and act appropriately
      //...
      if (!cin.eof()) {
        cin.clear(); // Clear the error flag (unless it is eof)
      }
    }

Both Code Fragment 1 and Code Fragment 1A are patterned after Code Fragment 1 in How to Parse a File in C. Some may argue that the C code is more readable. This may be true in some cases, but the C code lacks one thing: it is meant only for base types.

In C++, the extraction operator (‘>>’) lets you do something different. You can overload that operator and make it read in any type you want, just as if that type were part of the language. Under the hood it is only a call to a function. One could do something similar in C, but it would look like a function call. Operator overloading is syntactic sugar that turns an operator into a callable function. Some say it isn’t necessary, others say that it makes code cleaner. I have no opinion either way; it is just another way of doing the same thing. I think the saying goes “same s**t, different shovel”. ;)

The following code fragment shows just how to use this sweetened syntax to your advantage.
    #include <cassert>
    #include <iostream>
    using namespace std;

    // CODE 1
    class Point2D
    {
      int x, y;

    public:
      Point2D() : x(0), y(0) {}

      int getX() const { return x; }
      int getY() const { return y; }

      void setX(int x) { this->x = x; }
      void setY(int y) { this->y = y; }
    };

    istream& operator>>(istream& is, Point2D& point)
    {
      ios_base::iostate oldIOState = is.exceptions();
      // Turn on exceptions for the stream being read (not cin). eofbit is
      // left out so a value ending exactly at end-of-file still succeeds.
      is.exceptions(ios::failbit | ios::badbit);
      try {
        int val;
        is >> val;
        point.setX(val);
        if (is.peek() == ',') {
          is.ignore(1);
        }
        else {
          // With failbit in the exception mask, setstate() itself throws
          // ios_base::failure; the explicit throw is only a fallback.
          is.setstate(ios_base::failbit);
          throw ios_base::failure("Missing comma separator");
        }
        is >> val;
        point.setY(val);
      }
      catch (const ios_base::failure&) {
        assert(!is.good());
        // check what flag was set and act appropriately
        //...
        is.exceptions(oldIOState);  // restoring old IO exception handling
        throw; // there is no way to recover the stream without more info
      }
      is.exceptions(oldIOState);  // restoring old IO exception handling
      return is;
    }

    int main()
    {
      Point2D point;
      try {
        cin >> point;
        cout << "(" << point.getX() << ", " << point.getY() << ")" << endl;
      }
      catch (const ios_base::failure&) {
        if (cin.bad()) {
          cout << "cin bad" << endl;
        }
        if (cin.fail()) {
          cout << "cin failed" << endl;
        }
        if (cin.eof()) {
          cout << "cin hit eof" << endl;
        }
        if (!cin.eof()) {
          cin.clear(); // Clear the error flag (unless it is eof)
        }
      }
    }

What CODE 1 does is create a class and overload the extraction operator, allowing you to extract data from a stream straight into an object of that class. You never have to write this code again. Additionally, you can overload the insertion operator (‘<<’) so that a point prints the way it does in main() without your having to write the output out explicitly every time. I leave that as an exercise for the reader.
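
For readers who want to compare notes on that exercise, here is one possible sketch of an insertion operator. It repeats a cut-down Point2D so that it stands alone, and it assumes the getters are const so the operator can take the point by const reference. The toString helper is hypothetical and exists only to show the operator in use.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Cut-down copy of the article's class, repeated so this sketch compiles
// on its own. The const getters are an assumption of this sketch.
class Point2D
{
    int x, y;

public:
    Point2D() : x(0), y(0) {}

    int getX() const { return x; }
    int getY() const { return y; }

    void setX(int newX) { x = newX; }
    void setY(int newY) { y = newY; }
};

// Writes the point in the same "(x, y)" form that main() in CODE 1 prints.
std::ostream& operator<<(std::ostream& os, const Point2D& point)
{
    return os << "(" << point.getX() << ", " << point.getY() << ")";
}

// Hypothetical helper: formats a point through the operator above.
std::string toString(const Point2D& point)
{
    std::ostringstream out;
    out << point;
    return out.str();
}
```

With this in place, the output line in main() could shrink to cout << point << endl;.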

You should note that I am using the interface functions to write to the class. If I wanted to couple the extraction operator closely with the class, then I might not want to read the data into a temporary variable and then copy it over to the class. Avoiding that requires making the extraction operator a friend of the class. The modifications follow:
    // CODE 1A

    // SEE NOTE BELOW regarding next two lines
    class Point2D;
    istream& operator>>(istream& is, Point2D& point);

    class Point2D
    {
      int x, y;
      friend
        istream& operator>>(istream& is, Point2D& point);
    public:
      Point2D() : x(0), y(0) {}

      int getX() const { return x; }
      int getY() const { return y; }

      void setX(int x) { this->x = x; }
      void setY(int y) { this->y = y; }
    };

    istream& operator>>(istream& is, Point2D& point)
    {
      ios_base::iostate oldIOState = is.exceptions();
      // Turn on exceptions for the stream being read (not cin). eofbit is
      // left out so a value ending exactly at end-of-file still succeeds.
      is.exceptions(ios::failbit | ios::badbit);
      try {
        is >> point.x;  // friend access: no temporary variable needed
        if (is.peek() == ',') {
          is.ignore(1);
        }
        else {
          // With failbit in the exception mask, setstate() itself throws
          // ios_base::failure; the explicit throw is only a fallback.
          is.setstate(ios_base::failbit);
          throw ios_base::failure("Missing comma separator");
        }
        is >> point.y;
      }
      catch (const ios_base::failure&) {
        assert(!is.good());
        // check what flag was set and act appropriately
        //...
        is.exceptions(oldIOState);  // restoring old IO exception handling
        throw; // there is no way to recover the stream without more info
      }
      is.exceptions(oldIOState);  // restoring old IO exception handling
      return is;
    }

NOTE: The two declarations at the top of CODE 1A, right after the SEE NOTE BELOW comment, are important. Declare or define the function before the class that names it as a friend. If you don’t, some compilers will complain about friend injection, which is a deprecated feature, or that the function was not declared.

To do regular expression parsing of a stream, you need a regular expression library. That is beyond the scope of this document. The standard stream classes have no equivalent of scanf’s character classes. You would have to implement them, or something like them, yourself, or download a non-standard library that someone else has created. For this, I would recommend boost.org as a good resource. Many Boost libraries are presented to the C++ committee for possible inclusion in the next standards revision; the regular expression library that has been part of the standard since C++11 (the <regex> header) grew out of Boost.Regex.
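
As a rough idea of what "implement them yourself" can look like, the sketch below mimics a scanf character class such as %[abc] using only peek() and get(). The function name readCharsIn is hypothetical, and the set of accepted characters is passed as a plain string rather than a range syntax.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: extracts the longest run of characters from `in`
// that all appear in `allowed`, and leaves the first non-matching
// character in the stream.
std::string readCharsIn(std::istream& in, const std::string& allowed)
{
    std::string result;
    for (;;) {
        const int next = in.peek();
        if (next == std::char_traits<char>::eof()) {
            break;  // nothing left to look at
        }
        if (allowed.find(static_cast<char>(next)) == std::string::npos) {
            break;  // next character is outside the class
        }
        result += static_cast<char>(in.get());
    }
    return result;
}
```

Note that peek() sets eofbit when it reaches the end of the input, so a caller that wants to keep using the stream afterwards may need to call clear().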


Parsing Using Double Buffering
In the previous section, I showed how to not double buffer the data. There are times however, when this is not possible.

Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read to an internal buffer and then copied to your programme’s data space. You can then do with it as you wish, by either processing it further or displaying it.

To read in a whitespace-delimited string, you can still use the extraction operator, but use it on a string or a char array. A string takes care of allocation for you.

NOTE: Call the width() function on the input stream before using the extraction operator on a char array, or you may overrun your buffer. Its parameter includes the terminating null character. Alternatively, you can use setw(), but you must include the <iomanip> header file. Either way, the width applies to the next extraction only.
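
A minimal sketch of the setw() approach follows. The function name readWord is hypothetical; taking the array by reference lets the sketch derive the width from the array size instead of repeating a number.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <sstream>

// Hypothetical helper: reads one whitespace-delimited word into a
// fixed-size char array without overrunning it. At most N - 1 characters
// are stored, followed by the terminating null character.
template <std::size_t N>
bool readWord(std::istream& in, char (&buffer)[N])
{
    // The width is reset after every extraction, so set it each time.
    in >> std::setw(static_cast<int>(N)) >> buffer;
    return !in.fail();
}
```

A word longer than the buffer is not lost: the characters that did not fit stay in the stream and are picked up by the next extraction.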

To read in a line, use the getline() function from the string library (#include <string>). Like extraction into a string, it takes care of allocation for you. Alternatively, use the native istream::getline(), but then you must specify the size of the buffer.
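
The two flavours of getline() can be sketched side by side as follows. The wrapper names readLines and readLineInto are hypothetical; the calls inside them are the standard ones described above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects every line of the stream using the
// string-library getline(), which grows the string as needed.
std::vector<std::string> readLines(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Hypothetical helper: reads one line with the member istream::getline().
// At most size - 1 characters are stored; if the line does not fit, the
// stream's failbit is set and this returns false.
bool readLineInto(std::istream& in, char* buffer, std::streamsize size)
{
    in.getline(buffer, size);
    return !in.fail();
}
```

The failure case is the main practical difference: the string version simply keeps growing, while the char array version stops and flags the stream when a line is too long.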


Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. However, unlike in C, where you need to define your functions to take c-strings, you can call functions that already accept istreams and ostreams. To do this, you use a stringstream (bidirectional), istringstream (input only) or ostringstream (output only), all declared in the <sstream> header. This can simplify your design and allows you to easily debug existing systems that use C++ streams.

The following is a simple example of using a bidirectional stringstream.
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    using namespace std;

    int main()
    {
      stringstream ss;
      char buffer1[20] = {}, buffer2[20] = {};

      ss << "hello there";
      // setw() caps each extraction at the buffer size, terminator included
      ss >> setw(sizeof buffer1) >> buffer1 >> setw(sizeof buffer2) >> buffer2;
      cout << buffer1 << endl;  // prints "hello"
      cout << buffer2 << endl;  // prints "there"
    }

Since stringstream inherits from istream and ostream, you can use it just like you would one of those classes. Further, istringstream inherits from istream and ostringstream inherits from ostream.
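
One common use of this, sketched here under assumptions, is to read a file a line at a time and then treat each line as its own small stream. The function name readTable is hypothetical, and the comma handling mirrors Code Fragment 1 rather than adding any new rules.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the stream line by line, then parses each
// line through an istringstream into a row of integers.
std::vector<std::vector<int>> readTable(std::istream& in)
{
    std::vector<std::vector<int>> rows;
    std::string line;
    while (std::getline(in, line)) {      // stream -> string
        std::istringstream fields(line);  // string -> stream again
        std::vector<int> row;
        int value;
        while (fields >> value) {
            row.push_back(value);
            if (fields.peek() == ',') {
                fields.ignore(1);  // clear out comma
            }
        }
        if (!row.empty()) {
            rows.push_back(row);
        }
    }
    return rows;
}
```

Unlike the fixed five-item loop, this keeps the row boundaries of the file, so rows of different lengths come back as they were written.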


Binary files
As in the “How to Parse a File in C” document, binary files are beyond this document’s scope. If there is enough interest, I will write about them in another document.


Conclusion
Parsing a file is not very difficult, but it is certainly different than in C. Though you can read into a char array as in C, it is not recommended unless you have good reason to do so and take appropriate precautions.

If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.


Adrian


This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.




Revision History:
25/05/2007 11:06 ADT
  • Initial Post
26/05/2007 08:19 ADT
  • Used wrong tag to close code block near end. Fixed
  • Bolding not working in code block, needed to use comment and reference lines instead.
  • Forgot to set state and throw exception when comma not found in CODE 1 & CODE 1A. Fixed.
  • Title reference to “How to Parse a File in C” was wrong, Fixed.
  • Made reference to BOOST.org as a good resource for libraries.
  • Tried to make Parsing Using Triple Buffering clearer and highlight the differences compared to C.
  • Updated Conclusion.

29/05/2007 12:23 ADT
  • Removed reference to stdio buffer and replace with just buffer.
  • Added FYI at beginning of document.


May 26 '07 #1
5 Comments


Expert 10K+
P: 11,448
Thanks Jos. It isn't a nitpick, it was an error. ;)

I will try and read your stuff too. Just got to go now to the market.


Adrian
Happy shopping ;-) I have another little nitpick: strictly speaking the verb 'parsing'
is much more than a lexical analysis on the incoming character stream. Maybe
it would be handy to mention that shift of meaning w.r.t. parsing in a compiler
technology way of the definition and the less strictly defined term where what
you are writing about is meant.

kind regards,

Jos
May 26 '07 #2

AdrianH
Expert 100+
P: 1,251
Happy shopping ;-) I have another little nitpick: strictly speaking the verb 'parsing'
is much more than a lexical analysis on the incoming character stream. Maybe
it would be handy to mention that shift of meaning w.r.t. parsing in a compiler
technology way of the definition and the less strictly defined term where what
you are writing about is meant.

kind regards,

Jos
Hmmmm. I generally agree with these definitions when it comes to parsing. I guess what I am talking about is using C/C++ as a lexer and leaving the parsing up to the user defined programme.

Perhaps I will make a note in it somewhere; I don’t think most people would know what lexing is.


Adrian
May 26 '07 #3

P: 12
when ever you write the code do show us the Output too.
i learn c, c++ months pass by, till today i cant fluent it. what happen to me.
Jun 24 '07 #4

AdrianH
Expert 100+
P: 1,251
when ever you write the code do show us the Output too.
i learn c, c++ months pass by, till today i cant fluent it. what happen to me.
Noted. I'll update sometime in the future.


Adrian
Jul 10 '07 #5

P: 55
Why got two cout but only one output is display.

By the way, what is stream ?

Thanks for your help.
Mar 2 '08 #6