I am assuming that you know, or can look up, the functions described here and that you have a basic understanding of C++ programming.
FYI
Although I have called this article “How to Parse a File in C++”, we are actually mostly lexing a file, which means breaking a stream down into its component parts while disregarding the syntax that stream contains. Parsing goes a step further and applies the syntax in order to make sense of those parts.
Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.
I didn’t use the title “How to do Lexical Analysis in C++” because most of you probably don’t know what that means. If you do, then I apologise.
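To make the distinction concrete, here is a minimal sketch. The names Token, lex and isNumberList are made up for this illustration and are not part of any library. lex only splits the text into pieces, while isNumberList checks that the pieces form a comma separated list of numbers.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token type, for this illustration only.
struct Token {
    enum Kind { Number, Comma, Unknown } kind;
    std::string text;
};

// Lexing: break the text into tokens without checking their order.
std::vector<Token> lex(const std::string& input)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;                       // whitespace only separates tokens
        }
        else if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({Token::Number, input.substr(start, i - start)});
        }
        else if (c == ',') {
            tokens.push_back({Token::Comma, ","});
            ++i;
        }
        else {
            tokens.push_back({Token::Unknown, std::string(1, input[i])});
            ++i;
        }
    }
    return tokens;
}

// Parsing: check that the tokens follow the rule "number (comma number)*".
bool isNumberList(const std::vector<Token>& tokens)
{
    if (tokens.empty()) {
        return false;
    }
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        Token::Kind expected = (i % 2 == 0) ? Token::Number : Token::Comma;
        if (tokens[i].kind != expected) {
            return false;
        }
    }
    return tokens.size() % 2 == 1;     // the list must end on a number
}
```

The text "1 2" lexes without complaint into two number tokens, but isNumberList rejects it because the comma between them is missing. That is the difference between the two stages.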
Introduction
Hi, last time I showed you all how to parse a file in C. In this article, I will now address how to parse a file in C++.
For those who haven’t read that article, please read its section on Streams and Files, as it applies to C++ just as it does to C. However, when using the C++ streams, you use cin, cout and cerr instead of stdin, stdout and stderr respectively.
Buffering and Double Buffering
Double buffering means copying data from one buffer into another prior to processing or displaying it. In C++, the standard stream classes are buffered (cerr flushes its buffer after every output operation).
Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.
Parsing Without Double Buffering
Parsing a file without double buffering is not always possible. It only works when everything you read and store is a number.
For example, here is a sample file:
- 1, 2, 3, 4, 5
- 6, 7, 8, 9, 10
- // CODE FRAGMENT 1
- int itemsParsed = 0;
- int items[5];
- while (itemsParsed < 5 && cin >> items[itemsParsed]) {
-     ++itemsParsed; // count only successful extractions
-     if (itemsParsed < 5 && cin.peek() == ',') {
-         cin.ignore(1); // clear out comma
-     }
- }
- if (!cin.good()) {
-     // check what flag was set and act appropriately
-     //...
-     if (!cin.eof()) {
-         cin.clear(); // Clear the error flag (unless it is eof)
-     }
- }
Code Fragment 1a is a bit simpler as it separates the normal code flow from the exceptional one using C++ exception handling.
- // CODE FRAGMENT 1A
- int itemsParsed = 0;
- int items[5];
- // turn on exceptions (eofbit is included, so reaching end of file also throws)
- cin.exceptions(ios::failbit | ios::badbit | ios::eofbit);
- try {
-     for (itemsParsed = 0; itemsParsed < 5; ++itemsParsed) {
-         cin >> items[itemsParsed];
-         if (itemsParsed != 4 && cin.peek() == ',') {
-             cin.ignore(1); // clear out comma
-         }
-     }
- }
- catch (const ios_base::failure&) {
-     assert(!cin.good()); // needs #include <cassert>
-     // check what flag was set and act appropriately
-     //...
-     if (!cin.eof()) {
-         cin.clear(); // Clear the error flag (unless it is eof)
-     }
- }
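The fragments above can be tried without a file by feeding the sample data through an istringstream. The following is only a sketch of that idea: readRow and readSample are hypothetical helper names, and a vector is used instead of the fixed array so that the number of items parsed is simply the size of the result.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads up to maxItems comma separated integers
// and returns the values that were extracted successfully.
std::vector<int> readRow(std::istream& in, std::size_t maxItems)
{
    std::vector<int> items;
    int value = 0;
    while (items.size() < maxItems && in >> value) {
        items.push_back(value);
        // Skip the separator only if another item is expected.
        if (items.size() < maxItems && in.peek() == ',') {
            in.ignore(1);
        }
    }
    return items;
}

// Reads both rows of the sample data into one flat list.
std::vector<int> readSample()
{
    std::istringstream file("1, 2, 3, 4, 5\n6, 7, 8, 9, 10\n");
    std::vector<int> all;
    for (int row = 0; row < 2; ++row) {
        std::vector<int> items = readRow(file, 5);
        all.insert(all.end(), items.begin(), items.end());
    }
    return all;
}
```

Because the extraction operator skips leading whitespace, the space after each comma and the newline between the rows need no special handling. If a row contains something that is not a number, readRow returns what it has read so far and leaves the fail flag set on the stream for the caller to inspect.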
In C++, the extraction operator (‘>>’) allows you to do something different. You can overload that operator and make it read in anything you want, just as if it were part of the language. In fact, it is only a call to a function. One could do something similar in C, but it would look like a function call. Operator overloading is just syntactic sugar that lets an operator stand in for a callable function. Some say it isn’t necessary, others say that it makes the code cleaner. I have no opinion either way; it is just another way of doing the same thing. I think the saying goes “same s**t, different shovel”. ;)
The following code fragment shows just how to use this sweetened syntax to your advantage.
- #include <iostream>
- #include <cassert>
- using namespace std;
- // CODE 1
- class Point2D
- {
-     int x, y;
- public:
-     Point2D() : x(0), y(0) {}
-     int getX() const { return x; }
-     int getY() const { return y; }
-     void setX(int x) { this->x = x; }
-     void setY(int y) { this->y = y; }
- };
- istream& operator>>(istream& is, Point2D& point)
- {
-     ios_base::iostate oldIOState = is.exceptions();
-     is.exceptions(ios::failbit | ios::badbit | ios::eofbit); // turn on exceptions for this stream
-     try {
-         int val;
-         is >> val;
-         point.setX(val);
-         if (is.peek() == ',') {
-             is.ignore(1);
-         }
-         else {
-             // failbit is in the exception mask, so setstate() itself throws;
-             // the explicit throw is only reached if it does not
-             is.setstate(ios_base::failbit);
-             throw ios_base::failure("Missing comma separator");
-         }
-         is >> val;
-         point.setY(val);
-     }
-     catch (const ios_base::failure&) {
-         assert(!is.good());
-         // check what flag was set and act appropriately
-         //...
-         is.exceptions(oldIOState); // restoring old IO exception handling
-         throw; // there is no way to recover the stream without more info
-     }
-     is.exceptions(oldIOState); // restoring old IO exception handling
-     return is;
- }
- int main()
- {
-     Point2D point;
-     try {
-         cin >> point;
-         cout << "(" << point.getX() << ", " << point.getY() << ")" << endl;
-     }
-     catch (const ios_base::failure&) {
-         if (cin.bad()) {
-             cout << "cin bad" << endl;
-         }
-         if (cin.fail()) {
-             cout << "cin failed" << endl;
-         }
-         if (cin.eof()) {
-             cout << "cin hit eof" << endl;
-         }
-         if (!cin.eof()) {
-             cin.clear(); // Clear the error flag (unless it is eof)
-         }
-     }
- }
You should note that I am using the interface functions to write to the class. If I wanted to couple the extraction operator closely with the class, I might not want to read the data into a temporary variable and then copy it over to the class. That requires making the extraction operator a friend of the class. The modifications follow:
- // CODE 1A
- // SEE NOTE BELOW regarding next two lines
- class Point2D;
- istream& operator>>(istream& is, Point2D& point);
- class Point2D
- {
-     int x, y;
-     friend
-     istream& operator>>(istream& is, Point2D& point);
- public:
-     Point2D() : x(0), y(0) {}
-     int getX() const { return x; }
-     int getY() const { return y; }
-     void setX(int x) { this->x = x; }
-     void setY(int y) { this->y = y; }
- };
- istream& operator>>(istream& is, Point2D& point)
- {
-     ios_base::iostate oldIOState = is.exceptions();
-     is.exceptions(ios::failbit | ios::badbit | ios::eofbit); // turn on exceptions for this stream
-     try {
-         is >> point.x; // friend access: read straight into the member
-         if (is.peek() == ',') {
-             is.ignore(1);
-         }
-         else {
-             // failbit is in the exception mask, so setstate() itself throws;
-             // the explicit throw is only reached if it does not
-             is.setstate(ios_base::failbit);
-             throw ios_base::failure("Missing comma separator");
-         }
-         is >> point.y;
-     }
-     catch (const ios_base::failure&) {
-         assert(!is.good());
-         // check what flag was set and act appropriately
-         //...
-         is.exceptions(oldIOState); // restoring old IO exception handling
-         throw; // there is no way to recover the stream without more info
-     }
-     is.exceptions(oldIOState); // restoring old IO exception handling
-     return is;
- }
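As a point of comparison, an extraction operator does not have to throw at all. The sketch below is an alternative design, not the approach used in the code above: it reports a malformed point by setting failbit and leaves it to the caller to test the stream. It also reads the separator with the extraction operator, so unlike the peek() version it tolerates whitespace before the comma. The class is cut down to what the example needs.

```cpp
#include <istream>
#include <sstream>

// Same shape as the Point2D class above, reduced for this sketch.
class Point2D
{
    int x, y;
    friend std::istream& operator>>(std::istream& is, Point2D& point);
public:
    Point2D() : x(0), y(0) {}
    int getX() const { return x; }
    int getY() const { return y; }
};

// Reports a malformed point through failbit instead of throwing
// (unless the caller has enabled exceptions on the stream).
std::istream& operator>>(std::istream& is, Point2D& point)
{
    int x = 0, y = 0;
    char separator = '\0';
    if (is >> x >> separator >> y) {
        if (separator == ',') {
            point.x = x;               // commit only after a complete, valid read
            point.y = y;
        }
        else {
            is.setstate(std::ios_base::failbit);
        }
    }
    return is;
}
```

A caller can then write something like "if (cin >> point)" and handle the failure in the else branch. The point is left untouched when the input is malformed, because the values are only copied in after both numbers and the comma have been read.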
To do regular expression parsing of a stream, you need a regular expression library, which is beyond the scope of this document. The standard C++ streams have no equivalent of scanf’s character classes. You would have to implement them (or something like them) yourself, or download a non-standard library that someone else has created. For this, I would recommend Boost (boost.org) as a good resource. Its libraries are often presented to the C++ committee for possible inclusion in later revisions of the standard, and since C++11 the standard library includes a <regex> header.
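Implementing something like a character class yourself does not take much code. The following is a rough sketch rather than a full replacement for scanf: readWhile and isLowerAZ are hypothetical names, and the pair behaves roughly like the "%[a-z]" conversion, except that it returns an empty string instead of failing when nothing matches.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical character class: accepts the same characters as "[a-z]".
bool isLowerAZ(char c)
{
    return c >= 'a' && c <= 'z';
}

// Hypothetical helper: extracts characters for as long as accept() says so
// and leaves the first rejected character in the stream.
std::string readWhile(std::istream& is, bool (*accept)(char))
{
    std::string result;
    for (;;) {
        int next = is.peek();
        if (next == std::char_traits<char>::eof() ||
            !accept(static_cast<char>(next))) {
            break;
        }
        result += static_cast<char>(is.get());
    }
    return result;
}
```

Since the rejected character stays in the stream, a call to readWhile can be followed directly by an ordinary extraction, for example to read a name and then the number that comes after it.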
Parsing Using Double Buffering
In the previous section, I showed how to not double buffer the data. There are times however, when this is not possible.
Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read to an internal buffer and then copied to your programme’s data space. You can then do with it as you wish, by either processing it further or displaying it.
To read in a whitespace delimited string, you can still use the extraction operator, but use it on a string or a char array.
NOTE: When using the extraction operator on a char array, call the width() function on the input stream first or you may overrun your buffer. Its parameter includes the terminating null character. Alternatively, you can use setw(), but you must include the <iomanip> header file.
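A short sketch of both options follows. The function names readWord and readShortWord are made up for this example, and the 8 byte buffer is only chosen to make the truncation easy to see.

```cpp
#include <iomanip>
#include <istream>
#include <sstream>
#include <string>

// Extracting into a std::string: the string grows to fit the word.
std::string readWord(std::istream& is)
{
    std::string word;
    is >> word;
    return word;
}

// Extracting into a char array: setw(8) limits the extraction to
// 7 characters plus the terminating null character.
std::string readShortWord(std::istream& is)
{
    char word[8] = {};
    is >> std::setw(sizeof word) >> word;
    return std::string(word);
}
```

The width is reset after each extraction, so it has to be set again before every read into a char array. Characters that did not fit stay in the stream and are picked up by the next extraction.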
To read in a line, use the getline() function from the string library (#include <string>). Like the extraction operator on a string, it takes care of allocation for you. Alternatively, use the native istream::getline(), but you must specify the size of the buffer.
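The two variants might be used as in the following sketch. readLines and readLineFixed are hypothetical names, and the 16 byte buffer is an arbitrary size picked for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads every line with std::getline, which grows the string as needed.
std::vector<std::string> readLines(std::istream& is)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(is, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Reads one line with the member istream::getline into a fixed buffer.
// If the line does not fit, failbit is set and the rest stays in the stream.
std::string readLineFixed(std::istream& is)
{
    char buffer[16] = {};
    is.getline(buffer, sizeof buffer);
    return std::string(buffer);
}
```

With the member version the caller has to check the stream after every call, since a line that is too long is reported only through the fail flag. This is one reason the string version is usually the more comfortable choice.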
Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. However, unlike in C, where you need to define your functions to take c-strings, you can call functions that already accept istreams and ostreams. To do this, you use a stringstream (bidirectional), istringstream (input only) or ostringstream (output only). This can simplify your design and allows you to easily debug existing systems that use C++ streams.
The following is a simple example of using a bidirectional stringstream.
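- #include <iomanip>
- #include <iostream>
- #include <sstream>
- using namespace std;
- int main()
- {
-     stringstream ss;
-     char buffer1[20] = {}, buffer2[20] = {};
-     ss << "hello there";
-     // setw() limits each extraction to the buffer size, terminator included
-     ss >> setw(sizeof buffer1) >> buffer1 >> setw(sizeof buffer2) >> buffer2;
-     cout << buffer1 << endl;
-     cout << buffer2 << endl;
- }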
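The decoupling mentioned above could look something like the following sketch. The functions sumNumbers, writeTotal and totalOf are hypothetical and only serve to show that code written against istream and ostream runs unchanged on in-memory streams.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical function written against the stream interface only.
// It works the same with cin, a file stream or a string stream.
int sumNumbers(std::istream& in)
{
    int total = 0;
    int value = 0;
    while (in >> value) {
        total += value;
    }
    return total;
}

// Writes the result to any output stream.
void writeTotal(std::ostream& out, int total)
{
    out << "total: " << total << '\n';
}

// Runs both functions on in-memory streams; no file or console is needed.
std::string totalOf(const std::string& text)
{
    std::istringstream in(text);
    std::ostringstream out;
    writeTotal(out, sumNumbers(in));
    return out.str();
}
```

In the real programme the same two functions would be called with cin and cout, or with file streams. For debugging or testing, the string streams let you supply the input and inspect the output as ordinary strings.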
Binary files
As in the “How to Parse in C” document, binary files are beyond this document’s scope. If there is enough interest, I will write about them in another document.
Conclusion
Parsing a file is not very difficult, but it is certainly different than in C. Although you can read into a char array as in C, it is not recommended unless you have good reason to do so and take appropriate precautions.
If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.
Adrian
This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
Revision History:
25/05/2007 11:06 ADT
- Initial Post
- Used wrong tag to close code block near end. Fixed
- Bolding not working in code block, needed to use comment and reference lines instead.
- Forgot to set state and throw exception when comma not found in CODE 1 & CODE 1A. Fixed.
- Title reference to “How to Parse a File in C” was wrong, Fixed.
- Made reference to BOOST.org as a good resource for libraries.
- Tried to make Parsing Using Triple Buffering clearer and highlight the differences compared to C.
- Updated Conclusion.
29/05/2007 12:23 ADT
- Removed reference to stdio buffer and replaced with just buffer.
- Added FYI at beginning of document.