How to Parse a File in C

AdrianH
Assumptions
I am assuming that you know, or are capable of looking up, the functions I am about to describe here, and that you have some rudimentary understanding of C programming.


FYI
Although I have called this article “How to Parse a File in C”, we are actually mostly lexing a file, which is the breaking down of a stream into its component parts, disregarding the syntax that stream contains. Parsing actually includes the syntax in order to make sense of it.

Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.

I didn’t use the title “How to do Lexical Analysis in C” because most of you probably don’t know what that means. If you do, then I apologise.


Introduction
The question of how to parse a file has come up on TSDN fairly frequently, so I’ve decided to write something on the subject to help everyone without having to repeat ourselves over and over again. I also address safety when using these C functions as they can be dangerous if misused.

I am going to first define some terms to give you some background understanding.


Streams and Files
I should state right now, for those new to C, that it can consider many things as a file, including the keyboard (a.k.a. terminal in or standard input) and the display (terminal out or standard output). There is also a third stream called standard error, which is used to output error messages and by default is output to the terminal (this paragraph is the last time you will hear about standard error directly in this document). The three are accessed through the objects stdin, stdout and stderr.

These three files (or streams) are open at the start of any terminal application, as defined by the C Standards Committee, and can do anything that a regular file can do except seek to an arbitrary position in the file. Because of this, the IO routines are described in terms of streams rather than files (a file being a specialisation, or more concrete example, of a stream).

To understand this, think of some flowing water (i.e. a stream of water) coming out of a sink faucet. Stick your finger into it. Consider that with your finger, you are ‘reading’ all the molecules of water flowing past it. But once the water is past, you cannot move your finger along the stream and read what you have just read again. Those water molecules have flowed beyond your reach down the drain. You also cannot move your finger back past the faucet’s mouth; you have to be patient and wait for the ‘data’ to come to you.

A file is actually more specialised. It allows you to seek around the file, which means that you can point to any location in the file to read from it or write to it. A file such as this is called random access, since you can read it in any order you wish.
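To make the difference concrete, here is a minimal sketch, assuming a POSIX-like system where seeking on a pipe or terminal typically fails. The helper names is_seekable() and byte_at() are made up for this illustration and are not part of stdio.

```c
#include <stdio.h>

/* Returns 1 if the stream supports seeking, 0 if it behaves like a pure
   stream (typically a pipe or a terminal). Hypothetical helper. */
int is_seekable(FILE *fp)
{
    /* Seeking zero bytes from the current position moves nothing, but it
       fails on streams that cannot seek at all. */
    return fseek(fp, 0L, SEEK_CUR) == 0;
}

/* Reads the byte stored at the given offset of a seekable stream.
   Returns the byte, or EOF if the seek fails or the offset is past the end. */
int byte_at(FILE *fp, long offset)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return EOF;
    return fgetc(fp);
}
```

On a regular file you can call byte_at() with offsets in any order. On a stream that cannot seek, is_seekable() should report 0 and the only option left is to read the data as it arrives.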


Buffering and Double Buffering
Double buffering means to dump from one buffer into another prior to processing/displaying.

Buffered Functions
The stdio.h header declares several file functions prefixed with f: fopen(), fread(), fwrite(), fflush(), fscanf(), fprintf(), fgetc(), fputc(), fseek() and fclose()[*]. All of these functions are buffered, which means that if you were to read a byte and then read a second byte, only one disk read would occur.

Using these functions requires the use of a FILE. FILE is a struct which houses all of the “stuff” required to do what is needed.

Only one disk read occurs because, on the first read, not just a single byte but a whole chunk of data is read in and stored in a buffer. This buffer resides somewhere in memory, allocated by the stdio library. The library then copies the requested data to the data space that you have specified. When the next byte is requested, the stdio library doesn’t have to request the data from the drive as it already has a chunk of data in memory. This can speed up reading considerably.

Writing is similar. Unless you fill up the buffer, explicitly flush the buffer, close the file or, in the case of a line buffered stream such as a terminal, output a ‘\n’ character, the data will not be sent to the file.
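A minimal sketch of controlling that write buffer, assuming nothing beyond standard C. The helper name write_buffered() is hypothetical; setvbuf() and fflush() are the real stdio calls doing the work.

```c
#include <stdio.h>

/* Writes msg to fp through a fully buffered stream and then forces the
   buffer out. Must be called before any other operation on fp, because
   setvbuf() is only valid on a freshly opened stream.
   Returns 0 on success, -1 on failure. Hypothetical helper. */
int write_buffered(FILE *fp, const char *msg)
{
    /* Ask stdio to allocate a buffer of BUFSIZ bytes and to hold data
       until the buffer is full or explicitly flushed. */
    if (setvbuf(fp, NULL, _IOFBF, BUFSIZ) != 0)
        return -1;

    if (fputs(msg, fp) == EOF)   /* sits in the stdio buffer for now */
        return -1;

    if (fflush(fp) == EOF)       /* handed over to the operating system */
        return -1;

    return 0;
}
```

Without the fflush() call, a short message would normally stay in memory until the buffer filled up or the stream was closed.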

When reading in bytes or strings of data, you are actually double buffering. This is because it is first read to an internal stdio buffer and then copied to your buffer.

When reading in numbers it is only single buffered as it is read in to the internal stdio buffer, processed there and the value is then written to the memory location you specified.

[*] scanf() and printf() are convenience functions that use the FILE objects stdin and stdout respectively, without having to be told. getchar() and putchar() likewise use stdin and stdout, and can replace fgetc() and fputc() respectively.


Non-buffered Functions
There are also several file functions that are not prefixed with f, namely open(), read(), write(), lseek() and close(). Note that there is no flush function. This is because these are low level calls and the only buffer associated with them is the one provided by the programmer. It should be known that this is not always the case: it can depend on the implementation of the filesystem, so it is under the operating system’s control. Also, there are no equivalents of fscanf() and fprintf(), since this is a very simple interface.

These functions are associated with an int called a file descriptor. It keeps track of all of the “stuff” required to do what is needed just like how a FILE does. However, the tracking is usually done by the operating system not the application.
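As a small sketch of the descriptor interface, assuming a POSIX system (read() lives in unistd.h, not in standard C). The helper name count_bytes_fd() and the 512 byte chunk size are arbitrary choices for this illustration.

```c
#include <sys/types.h>
#include <unistd.h>

/* Counts the bytes that can still be read from a file descriptor by
   calling read() directly. Returns the count, or -1 on a read error.
   Hypothetical helper. */
long count_bytes_fd(int fd)
{
    char chunk[512];   /* the only user-space buffer involved is ours */
    long total = 0;
    ssize_t n;

    /* read() returns 0 at end of input and -1 on error */
    while ((n = read(fd, chunk, sizeof chunk)) > 0)
        total += n;

    return (n < 0) ? -1 : total;
}
```

Note that the descriptor is a plain int and every call goes to the operating system; there is no FILE and no stdio buffer in between.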

As I said before, the buffering is implementation dependent. Terminal services (under POSIX compliant systems) are buffered; clearing those buffers requires a call to tcflush() (or, on some systems, the equivalent ioctl() request), passing the file descriptor and a queue selector (TCIFLUSH discards input that has been received but not read, TCOFLUSH discards output that has been written but not transmitted, and TCIOFLUSH does both).

Disk IO can become double buffered. This is because some operating systems do not allow reading anything less than a sector at a time. To allow for it, the library may read the sector into an internal buffer, copy what is requested and discard the rest. This can significantly slow down your code, as already read data is not kept track of and the same sector will be reread when you request the next few bytes.

A file descriptor can be made into a FILE by using the fdopen() function. This can be very useful when you open up a pipe.
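A minimal sketch of that conversion, assuming a POSIX system. The helper name read_int_from_fd() is made up for this example; fdopen(), fscanf(), fclose() and close() are the real calls.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes fdopen() under strict -std= modes */
#include <stdio.h>
#include <unistd.h>

/* Wraps a readable file descriptor in a FILE and parses one integer
   from it. The descriptor is always closed before returning.
   Returns 1 on success, 0 on failure. Hypothetical helper. */
int read_int_from_fd(int fd, int *out)
{
    FILE *fp = fdopen(fd, "r");
    int ok;

    if (fp == NULL) {
        close(fd);        /* fdopen() failed, so the descriptor is still ours */
        return 0;
    }

    ok = (fscanf(fp, "%d", out) == 1);
    fclose(fp);           /* also closes the underlying descriptor */
    return ok;
}
```

Once the descriptor has been wrapped, all of the buffered stdio functions, including fscanf(), become available on what was a raw pipe.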


Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.

I’m not going to be using file descriptors. From here on, I will only be using FILE streams. This is because they are faster and have a lot of features not offered by raw file descriptors.

Parsing Without Double Buffering
To parse a file without double buffering is not always possible. The only way to do it would be to read and store only numbers.

E.g. here is a sample file:
    1, 2, 3, 4, 5
    6, 7, 8, 9, 10

To read that in without double buffering you could loop around the following:
    /* CODE FRAGMENT 1 */
    int itemsParsed = 0;
    int items[5];
    itemsParsed =
        scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2], &items[3],
              &items[4]);

Note that the commas are required in the input stream. The spaces, however, each represent zero or more whitespace characters. A whitespace character can be a regular space, a tab, a vertical tab (rarely ever used), a carriage return or a line feed.

Also note that scanf() returns the number of items parsed that are not literals. I.e. the commas and the spaces are ignored. You should be looking at this value to ensure that you have received all the items you were expecting.
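A minimal sketch of such a check inside a loop, assuming the input is shaped like the sample file above. It uses fscanf() on a FILE rather than scanf() so that it can be pointed at any stream; count_rows() is a hypothetical helper name.

```c
#include <stdio.h>

/* Reads rows of five comma separated integers until the end of input.
   Returns the number of complete rows, or -1 if a row was malformed.
   Hypothetical helper. */
int count_rows(FILE *fp)
{
    int items[5];
    int rows = 0;
    int n;

    /* Keep going only while all five items of a row were stored. */
    while ((n = fscanf(fp, "%d, %d, %d, %d, %d",
                       &items[0], &items[1], &items[2],
                       &items[3], &items[4])) == 5)
        rows++;

    /* EOF means the input ran out cleanly before a new row started;
       any other value means a row stopped short. */
    return (n == EOF) ? rows : -1;
}
```

The leading %d skips the newline left over from the previous row, which is why the same format string can be reused for every row.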

    /* CODE 1 */
    #include <stdio.h>
    #include <string.h>    /* for memset() */

    int main(void)
    {
        int itemsParsed = 0;
        int items[5];

        memset(items, 0, sizeof(items));    /* Set all items to zero */

        /* Read in 5 comma separated values */
        itemsParsed =
            scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2],
                  &items[3], &items[4]);
        printf("itemsParsed = %d\n", itemsParsed);
        printf("%d, %d, %d, %d, %d\n", items[0], items[1], items[2], items[3],
               items[4]);

        memset(items, 0, sizeof(items));    /* Set all items to zero */

        /* Read in 4 comma separated values, the first preceded by a comma */
        itemsParsed =
            scanf(", %d, %d, %d, %d", &items[0], &items[1], &items[2],
                  &items[3]);
        printf("itemsParsed = %d\n", itemsParsed);
        printf("%d, %d, %d, %d\n", items[0], items[1], items[2], items[3]);
        return 0;
    }

Now what CODE 1 does is to read in 5 comma separated values and then try to read in a comma with 4 more comma separated values.

Try it out. It is also attached to the end of this document as CODE1.zip if you have problems cutting and pasting. If you were to type in 1,2,3,4 it will read in 4 numbers. If you type in 1,2,,3,4,5 it will read in 2 numbers and then 3 numbers.

Now try 1,2,3,4,5. What happened? It did not allow you to type in any more data; it simply said that it didn’t parse anything for the second scanf() call. Why? Because the second format string starts with a literal comma, but the next character waiting in the stream was the newline left over from the end of the line you typed. A literal character does not skip whitespace by itself. If you want to ensure that you will read in all the whitespaces prior to a literal character, you must precede it with a space.
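The effect can be reproduced without typing anything by using sscanf() on an in-memory string. This is only a sketch; strict_comma() and lenient_comma() are hypothetical helper names.

```c
#include <stdio.h>

/* Expects a comma as the very next character, then a number.
   Returns the number of items stored (0 or 1). */
int strict_comma(const char *text, int *value)
{
    return sscanf(text, ",%d", value);
}

/* Same, but the leading space in the format skips any whitespace
   (including a left-over newline) before the comma. */
int lenient_comma(const char *text, int *value)
{
    return sscanf(text, " ,%d", value);
}
```

Given the text "\n,7", strict_comma() stores nothing because the newline does not match the comma, while lenient_comma() skips the newline and stores 7.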

But what if you wanted to skip all whitespaces except a carriage return or line feed? To do this, you would need to use a character class. A character class is a very simplified regular expression. It will read in one or more characters specified by that class. For instance:

    /* CODE FRAGMENT 2 */
    (void)scanf("%*[ \t\v]");

will read in and discard all spaces, tabs or vertical tabs. I am not storing the number of items parsed because not only do I not care if it has read in anything, but a starred (‘*’) parameter is not included in the number of elements parsed, so it wouldn’t tell me anything anyway. Only parameters that are stored to some location are included in the number of elements parsed return value.

If I wanted to know if I had read in any spaces, tabs or vertical tabs, I could use the “%n” specifier, which stores how many characters have been consumed from the input so far by that scanf() call, up to the point where it encounters the “%n” specifier. E.g.:

    /* CODE FRAGMENT 3 */
    int bytesRead = 0;
    (void)scanf("%*[ \t\v]%n", &bytesRead);

NOTE: I have initialised bytesRead to zero. If I didn’t and no spaces, tabs or vertical tabs were read, scanf() would stop at the failed character class before reaching “%n”, so bytesRead would not be updated and its value would be indeterminate.

I’ve also cast the return value to (void). This is because some compilers will warn that I am ignoring the return value. Return values should not normally be ignored; however, in this case it is ok to do so. Casting to void tells the compiler, “Yes, I am aware that I am ignoring the return value and it is legitimate.”
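Putting the character class and “%n” together, a minimal sketch might look like the following. It uses fscanf() so that any stream can be passed in, and skip_blanks() is a hypothetical helper name.

```c
#include <stdio.h>

/* Discards leading spaces, tabs and vertical tabs from fp, leaving any
   carriage return or line feed in place.
   Returns how many characters were discarded. Hypothetical helper. */
int skip_blanks(FILE *fp)
{
    int skipped = 0;   /* stays 0 if the character class matches nothing */

    /* The starred class stores nothing, so the return value is of no
       use here; the count comes from %n instead. */
    (void)fscanf(fp, "%*[ \t\v]%n", &skipped);
    return skipped;
}
```

Because the newline is not part of the class, a caller can still detect the end of a line after calling skip_blanks().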


Parsing Using Double Buffering
In the previous section, I showed how to not double buffer the data. There are times however, when this is not possible.

Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read to an internal stdio buffer and then copied to your programme’s data space. You can then do with it whatever you wish, by either further processing it or displaying it.

To read in a whitespace delimited string, you can use scanf()’s format specifier “%<bufferSize-1>s”, where bufferSize is the size of the buffer you are writing to and the width is written as a literal number (e.g. “%15s” for a 16 byte buffer). Never use “%s” alone without a width, as this will lead to buffer overflow errors, making your programme insecure and causing hard to find bugs. scanf() appends the terminating null character (‘\0’) itself when the conversion succeeds; setting the last element of the buffer to ‘\0’ as well is a cheap extra precaution that ensures the buffer can always be used as a c-string.
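One common way to keep the width and the buffer size in step is to build the format string from a macro. The following is a sketch only; WORD_MAX, STR() and read_word() are names invented for this illustration, and the value 15 is arbitrary.

```c
#include <stdio.h>

#define WORD_MAX 15          /* longest word accepted; illustrative value */
#define STR_(x) #x
#define STR(x)  STR_(x)      /* turns WORD_MAX into the text "15" */

/* Reads one whitespace delimited word of at most WORD_MAX characters.
   buf must have room for WORD_MAX + 1 bytes.
   Returns 1 on success, 0 otherwise. Hypothetical helper. */
int read_word(FILE *fp, char *buf)
{
    /* The format expands to "%15s": never more than 15 characters
       are stored, followed by the terminating '\0'. */
    return fscanf(fp, "%" STR(WORD_MAX) "s", buf) == 1;
}
```

A word longer than WORD_MAX is not lost: the remainder stays in the stream and is returned by the next call, so the caller may want to detect and handle that case.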

To read in a string using delimiters other than or in addition to whitespaces, use the negated character class. “%<bufferSize-1>[^:]” states that it will read in a string of at most bufferSize-1 bytes consisting of characters that are not colons (‘:’). Use the same precautions here as I stated with the “%s” specifier.
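As a sketch of the negated character class in use, assume a line of the form key:value. The 32 byte buffers and the helper name split_pair() are arbitrary choices for this example, and sscanf() is used so that no typed input is needed.

```c
#include <stdio.h>

/* Splits "key:value" into two buffers of 32 bytes each.
   Returns 1 if both parts were found, 0 otherwise. Hypothetical helper. */
int split_pair(const char *line, char key[32], char value[32])
{
    /* %31[^:]   up to 31 characters that are not a colon
       :         the colon itself
       %31[^\n]  up to 31 characters up to the end of the line */
    return sscanf(line, "%31[^:]:%31[^\n]", key, value) == 2;
}
```

Unlike “%s”, a character class does not skip leading whitespace and does not stop at spaces, so the value part may contain blanks.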


Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. This enables you to pass it a string with ease and can simplify debugging.

To read in a line, you can use getline(), which is not part of standard C but is available as a GNU C library extension and was later standardised in POSIX.1-2008. It will automatically allocate the memory you require. You can also use getdelim(), which has the same origin. It is very similar to getline() but you can choose a single delimiter.
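A minimal sketch of a getline() loop, assuming a system that provides it (glibc or any POSIX.1-2008 library). The helper name longest_line() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes getline() under strict -std= modes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Returns the length of the longest line in fp, not counting the
   newline. Hypothetical helper. */
long longest_line(FILE *fp)
{
    char *line = NULL;       /* getline() allocates and grows this for us */
    size_t capacity = 0;
    ssize_t len;
    long longest = 0;

    while ((len = getline(&line, &capacity, fp)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            len--;           /* do not count the newline */
        if (len > longest)
            longest = len;
    }

    free(line);              /* one free() covers every line that was read */
    return longest;
}
```

The same buffer is reused for each line, so anything you want to keep beyond the next call has to be copied out first.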

From there, you can use sscanf() to parse the string in the same way I described with scanf(). You can also use atoi(), atol(), atof(), strtol(), strtoul(), strtod(), strtoll(), strtok() and other c-string manipulators when you use this triple buffering technique.
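Once a line is sitting in your own string, a sketch of pulling comma separated integers out of it with strtol() might look like this. parse_ints() is a hypothetical helper, and for brevity it does not check errno for out of range values.

```c
#include <stdlib.h>

/* Parses comma separated integers from a string.
   Stores up to max values in out and returns how many were stored.
   Hypothetical helper. */
size_t parse_ints(const char *line, long *out, size_t max)
{
    size_t count = 0;
    const char *p = line;

    while (count < max) {
        char *end;
        long v = strtol(p, &end, 10);    /* skips leading whitespace itself */

        if (end == p)                    /* no digits found: stop */
            break;
        out[count++] = v;

        p = end;
        while (*p == ' ' || *p == '\t')  /* allow blanks before the comma */
            p++;
        if (*p != ',')                   /* no separator: end of the list */
            break;
        p++;                             /* step over the comma */
    }
    return count;
}
```

Because the function only ever sees a string, it can be exercised with literal test data and has no idea whether the line originally came from a file, a pipe or the keyboard.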


Binary files
I’ve not said anything about binary files up to this point. This is because of endian incompatibility. Endianness refers to the order in which the bytes of a binary value are stored. Motorola CPUs usually store values big-endian, Intel CPUs use little-endian, and network protocols usually use big-endian. If you are writing a programme that is storing data to disk or passing it over a network, such that it may be read from another computer using a different endian scheme, you need to be aware of this problem and how to correct for it.

It is not difficult to do, but it is not trivial either. What you need to do is ensure you are using a common scheme, both on the read and the write end. This is done by either using macros or functions that will reverse the bytes (if necessary) prior to writing them and reverse the bytes again (if necessary) after reading them but before processing them.
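As a small taste only, here is one way to fix on a common scheme for a 32 bit value. Note that this sketch does not reverse bytes conditionally as described above; it builds the big-endian byte sequence with shifts, which gives the same result on any host without needing to know the host’s byte order. put_be32() and get_be32() are hypothetical helper names.

```c
#include <stdint.h>

/* Stores v into buf as four bytes, most significant byte first
   (big-endian), regardless of the host CPU's byte order. */
void put_be32(unsigned char buf[4], uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Rebuilds the value from four big-endian bytes. */
uint32_t get_be32(const unsigned char buf[4])
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

The writer would pass the four bytes from put_be32() to fwrite(), and the reader would fread() four bytes and hand them to get_be32().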

This is beyond the scope of this document. If there is enough interest, I will write about it in another document.


Conclusion
Parsing a file is not very difficult and you can safely use scanf() and its relatives if you take appropriate precautions.

If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.


Adrian

Revision History
20/05/2007 13:00
  • Corrected grammatical error
20/05/2007 13:45
  • Corrected error wrt what a FILE is
29/05/2007 12:44
  • Added FYI at beginning of document


This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
Attached Files
File Type: zip CODE1.zip (392 Bytes, 1002 views)
Jun 4 '07 #1
1 Comment


Could you post an example of triple buffering? One that shows how to process the data chunks while reading them from a file and writing them to another file (stream) in chunks.
Any help would be greatly appreciated.
Sep 15 '10 #2