How to Parse a File in C


Assumptions
I am assuming that you know, or are capable of looking up, the functions I describe here and have a rudimentary understanding of C programming.

FYI
Although I have called this article “How to Parse a File in C”, we are actually mostly lexing a file, which is the breaking down of a stream into its component parts, disregarding the syntax that stream contains. Parsing goes further and takes the syntax into account in order to make sense of the stream.

Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.

I didn’t use the title “How to do Lexical Analysis in C” because most of you probably don’t know what that means. If you do, then I apologise.

Introduction
The question of how to parse a file has come up on TSDN fairly frequently, so I’ve decided to write something on the subject to help everyone without having to repeat ourselves over and over again. I also address safety when using these C functions as they can be dangerous if misused.

I am going to first define some terms to give you some background understanding.

Streams and Files
I should state right now, for those new to C, that it can consider many things as a file, including the keyboard (a.k.a. terminal in or standard input) and the display (terminal out or standard output). There is also a third stream called standard error, which is used for error messages and by default is output to the terminal (this paragraph is the last time you will hear about standard error directly in this document). The three are accessed through the objects stdin, stdout and stderr.

These three files (or streams) are open at the start of any terminal application as defined by the C Standards Committee, and can do anything that a regular file can do except seek to an arbitrary position in the file. Because of this, the IO routines are usually described as applying to streams rather than files (a file being a more specialised, more concrete kind of stream).

To understand this, think of some flowing water (i.e. a stream of water) coming out of a sink faucet. Stick your finger into it. Consider that with your finger, you are ‘reading’ all the molecules of water flowing past it. But once it is past, you cannot move your finger along the stream and read what you have just read again. Those water molecules have flowed beyond your reach down the drain. You also cannot move your finger back past the faucet’s mouth; you have to be patient and wait for the ‘data’ to come to you.

A file is actually more specialised. It allows you to seek around the file, which means that you can point to any location in the file to read from it or write to it. A file such as this is called a random-access file, since you can read it in any order you wish.
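As a rough sketch of what seeking looks like in practice, the helper below (read_byte_at() is a made-up name for illustration) jumps to an arbitrary offset with fseek() and reads one byte there. On a stream that cannot seek, such as a pipe or a terminal, fseek() is expected to fail and the helper reports EOF.

```c
#include <stdio.h>

/* Hypothetical helper: return the byte at `offset` in a seekable stream,
 * or EOF if the stream cannot seek there (as with a pipe or a terminal)
 * or if the offset lies past the end of the data. */
int read_byte_at(FILE *fp, long offset)
{
    if (fseek(fp, offset, SEEK_SET) != 0)
        return EOF;        /* not seekable, or invalid offset */
    return fgetc(fp);      /* EOF here means there is no byte at offset */
}
```

Calling read_byte_at(fp, 3) and then read_byte_at(fp, 0) on the same regular file reads backwards, which is exactly what the water analogy says a pure stream cannot do.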

Buffering and Double Buffering
Double buffering means to dump from one buffer into another prior to processing/displaying.

Buffered Functions
The header stdio.h declares several file functions prefixed with f: fopen(), fread(), fwrite(), fflush(), fscanf(), fprintf(), fgetc(), fputc(), fseek() and fclose()[*]. All of these functions are buffered, which means that if you were to read a byte and then read a second byte, only one disk read would occur.

Using these functions requires the use of a FILE. FILE is a struct which houses all of the “stuff” required to do what is needed.

Only one disk read occurs because on the first read, not just a single byte but a whole chunk of data is read in and stored in a buffer. This buffer resides somewhere in memory, allocated by the stdio library. The library then copies the requested data to the data space that you have specified. When the next byte is requested, the stdio library doesn’t have to request the data from the drive as it already has a chunk of data in memory. This can speed up reading considerably.

Writing is similar. Unless you fill up the buffer, explicitly flush the buffer or, in the case of a line-buffered stream such as a terminal, output a ‘\n’ character, the data will not be sent to the file.
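The effect of write buffering can be observed with a small experiment. This is only a sketch: file_size() and demo_flush() are hypothetical helpers, and the exact behaviour before the flush is implementation dependent. The stream is explicitly made fully buffered with setvbuf(), a short line is written, and the size of the file on disk is measured before and after fflush().

```c
#include <stdio.h>

/* Hypothetical helper: size in bytes of the file at `path`, or -1 on error. */
long file_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long size;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0L, SEEK_END) != 0) {
        fclose(fp);
        return -1;
    }
    size = ftell(fp);
    fclose(fp);
    return size;
}

/* Hypothetical helper: write `text` to `path` through a fully buffered
 * stream and record the on-disk size before and after fflush().
 * Returns 0 on success, -1 if the file could not be opened. */
int demo_flush(const char *path, const char *text, long *before, long *after)
{
    FILE *fp = fopen(path, "w");

    if (fp == NULL)
        return -1;
    setvbuf(fp, NULL, _IOFBF, BUFSIZ);  /* full buffering, library-owned buffer */
    fputs(text, fp);                    /* lands in the stdio buffer only */
    *before = file_size(path);          /* usually 0: nothing written yet */
    fflush(fp);                         /* push the buffer out to the file */
    *after = file_size(path);
    fclose(fp);
    return 0;
}
```

On a typical implementation *before comes back as 0 even when the text ends in ‘\n’, because a fully buffered stream does not treat the newline specially.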

When reading in bytes or strings of data, you are actually double buffering. This is because it is first read to an internal stdio buffer and then copied to your buffer.

When reading in numbers it is only single buffered as it is read in to the internal stdio buffer, processed there and the value is then written to the memory location you specified.

[*] scanf() and printf() are convenience functions that use the FILE streams stdin and stdout respectively, without having to be told. getchar() and putchar() likewise use stdin and stdout, and can replace fgetc() and fputc() respectively.

Non-buffered Functions
There are also several file functions that are not prefixed with f, namely open(), read(), write(), lseek() and close(). These come from POSIX rather than from standard C. Note that there is no flush function. This is because these are low level calls and the only buffer associated with them is the one provided by the programmer. Be aware that this is not always the case: it can depend on the implementation of the filesystem, and so is under the control of the operating system. Also, there are no equivalents of fscanf() and fprintf(), since this is a very simple interface.

These functions are associated with an int called a file descriptor. It keeps track of all of the “stuff” required to do what is needed just like how a FILE does. However, the tracking is usually done by the operating system not the application.

As I said before, the buffering is implementation dependent. Terminal services (under POSIX compliant systems) are buffered; to flush them requires a call to tcflush() (or an equivalent ioctl() on some systems), passing the file descriptor and a queue selector (TCIFLUSH to discard buffered input, TCOFLUSH to discard output that has not yet been transmitted and TCIOFLUSH to do both).

Disk IO can become double buffered. This is because some operating systems do not allow for reading in anything less than a sector at a time. So to allow for it, the library may read the sector into an internal buffer, copy what is requested and discard the rest. This can significantly slow down the operation of your code, as already read data is not kept track of and the same sector will be read again when you ask for the next few bytes.

A file descriptor can be made into a FILE by using the fdopen() function. This can be very useful when you open up a pipe.
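A minimal sketch of the pipe case, assuming a POSIX system: parse_int_from_pipe() is a hypothetical helper that writes some text into a pipe with the raw write() call, wraps the read end in a FILE with fdopen(), and then parses it with fscanf(). The text is assumed to be short enough to fit in the pipe without blocking.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fdopen() under strict ISO modes */
#include <stdio.h>
#include <unistd.h>              /* pipe(), write(), close(): POSIX only */

/* Hypothetical helper: push `len` bytes of `text` through a pipe and
 * parse one int from the read end using buffered stdio.
 * Returns 1 if an int was stored in *out, 0 otherwise. */
int parse_int_from_pipe(const char *text, size_t len, int *out)
{
    int fds[2];
    FILE *in;
    int parsed;

    if (pipe(fds) != 0)
        return 0;
    if (write(fds[1], text, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    close(fds[1]);               /* the reader will now see end-of-file */

    in = fdopen(fds[0], "r");    /* file descriptor -> FILE stream */
    if (in == NULL) {
        close(fds[0]);
        return 0;
    }
    parsed = fscanf(in, "%d", out);
    fclose(in);                  /* also closes fds[0] */
    return parsed == 1;
}
```

Once the descriptor is wrapped, fclose() takes over responsibility for closing it, so close() should not be called on it again.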

Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.

I’m not going to be using file descriptors. From here on in, I will only be using FILE streams. This is because they are faster and have a lot of features not offered by the raw file descriptors.

Parsing Without Double Buffering
To parse a file without double buffering is not always possible. The only way to do it would be to read and store only numbers.

E.g. here is a sample file:

1, 2, 3, 4, 5
6, 7, 8, 9, 10

To read that in without double buffering you could loop around the following:

/* CODE FRAGMENT 1 */
int itemsParsed = 0;
int items[5];

itemsParsed =
    scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2], &items[3],
          &items[4]);

Note that the commas are required in the input stream. Each space in the format string, however, matches zero or more whitespace characters. A whitespace character can be a regular space, a tab, a vertical tab (rarely ever used), a form feed, a carriage return or a line feed.

Also note that scanf() returns the number of items that were parsed and stored. Literals are not counted, i.e. the commas and the spaces are ignored. You should be looking at this value to ensure that you have received all the items you were expecting.
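To illustrate looping on the return value, here is a small sketch that uses fscanf() on an arbitrary FILE instead of scanf() on stdin. The helper name read_rows() and its total parameter are made up for this example; the loop simply stops at the first row that does not yield all five items.

```c
#include <stdio.h>

/* Hypothetical helper: read rows of five comma-separated integers from
 * `fp`. Returns the number of complete rows read and stores the sum of
 * every value in *total. */
int read_rows(FILE *fp, long *total)
{
    int items[5];
    int rows = 0;

    *total = 0;
    /* Continue only while all five items of a row were parsed. */
    while (fscanf(fp, "%d, %d, %d, %d, %d",
                  &items[0], &items[1], &items[2],
                  &items[3], &items[4]) == 5) {
        int i;

        for (i = 0; i < 5; i++)
            *total += items[i];
        rows++;
    }
    return rows;
}
```

The line break between rows needs no special handling here because %d skips leading whitespace, including line feeds, before it starts converting.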

/* CODE 1 */
#include <stdio.h>
#include <string.h>    /* memset() */

int main(void)
{
    int itemsParsed = 0;
    int items[5];

    memset(items, 0, sizeof(items));    /* Set all items to zero */

    /* Read in 5 comma separated values */
    itemsParsed =
        scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2],
              &items[3], &items[4]);
    printf("itemsParsed = %d\n", itemsParsed);
    printf("%d, %d, %d, %d, %d\n", items[0], items[1], items[2], items[3],
           items[4]);

    memset(items, 0, sizeof(items));    /* Set all items to zero */

    /* Read in 4 comma separated values, the first preceded by a comma */
    itemsParsed =
        scanf(", %d, %d, %d, %d", &items[0], &items[1], &items[2],
              &items[3]);
    printf("itemsParsed = %d\n", itemsParsed);
    printf("%d, %d, %d, %d\n", items[0], items[1], items[2], items[3]);

    return 0;
}

Now what CODE 1 does is to read in 5 comma separated values and then try to read in a comma with 4 more comma separated values.

Try it out. It is also attached to the end of this document as CODE1.zip if you have problems cutting and pasting. If you type in 1,2,3,4 it will read in 4 numbers; if you type in 1,2,,3,4,5 it will read in 2 numbers and then 3 numbers.

Now try 1,2,3,4,5. What happened? It did not allow you to type in any more data; it simply reported that it didn’t parse anything for the second scanf() call. Why? Because the second format string starts with a literal comma, but the next character waiting in the stream was the newline left over from pressing Enter, so the match failed immediately. If you want to skip any whitespace in front of a literal character, you must precede it with a space in the format string.
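The difference a leading space makes can be checked without typing anything in by using sscanf() on a string. In this sketch, parse_pair_strict() and parse_pair_lenient() are hypothetical names; the only difference between them is the space in front of the comma in the format string.

```c
#include <stdio.h>

/* No space before the comma: the comma must follow the number directly. */
int parse_pair_strict(const char *text, int *a, int *b)
{
    return sscanf(text, "%d,%d", a, b);
}

/* Space before the comma: any whitespace (including a newline) in front
 * of the comma is skipped before the literal is matched. */
int parse_pair_lenient(const char *text, int *a, int *b)
{
    return sscanf(text, "%d ,%d", a, b);
}
```

Given the text "1\n,2", the strict version stores only the first number and returns 1, while the lenient version skips the newline, matches the comma and returns 2.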

But what if you wanted to skip all whitespace except a carriage return or line feed? To do this, you would need to use a character class. A character class is a very simplified regular expression. It will read in one or more characters specified by that class. For instance:

/* CODE FRAGMENT 2 */
(void)scanf("%*[ \t\v]");

will read in and discard all spaces, tabs or vertical tabs. I am not storing the number of items parsed because not only do I not care if it has read in anything, but a starred (‘*’) parameter is not included in the number of elements parsed, so it wouldn’t tell me anything anyway. Only parameters that are stored to some location are included in the number of elements parsed return value.

If I wanted to know if I had read in any spaces, tabs or vertical tabs, I could use the “%n” specifier, which stores how many characters of input this call has consumed by the time it reaches the “%n” in the format string. E.g.:

/* CODE FRAGMENT 3 */
int bytesRead=0;

(void)scanf("%*[ \t\v]%n", &bytesRead);

NOTE: I have initialised bytesRead to zero. If I didn’t and no spaces, tabs or vertical tabs were read, scanf() would stop before reaching “%n” and would not update bytesRead, so its value would be indeterminate.

I’ve also cast the return value to (void). This is because some compilers will warn that I am ignoring the return value. Return values should not normally be ignored, but in this case it is OK to do so. Casting to void tells the compiler, “Yes, I am aware that I am ignoring the return value and it is legitimate.”
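The same idea can be tried on a string with sscanf(). The helper name count_leading_blanks() below is an assumption made for this sketch; it returns how many spaces, tabs or vertical tabs sit at the start of the text.

```c
#include <stdio.h>

/* Hypothetical helper: number of leading spaces, tabs and vertical tabs
 * in `text`. Line feeds and carriage returns are deliberately not in
 * the class, so counting stops at them. */
int count_leading_blanks(const char *text)
{
    int consumed = 0;   /* stays 0 when the class matches nothing */

    (void)sscanf(text, "%*[ \t\v]%n", &consumed);
    return consumed;
}
```

Without the initialisation to zero, a string that starts with a non-blank character would leave consumed unset, because the failed class match stops the scan before “%n” is processed.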

Parsing Using Double Buffering
In the previous section, I showed how to not double buffer the data. There are times however, when this is not possible.

Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read to an internal stdio buffer and then copied to your programme’s data space. You can then do with it as you wish, by either further processing it or displaying it.

To read in a whitespace delimited string, you can use scanf()’s format specifier “%<bufferSize-1>s”, where bufferSize is the size of the buffer you are writing to. Never use “%s” alone without a width specified, as this will lead to buffer overflow errors, making your programme insecure and causing hard to find bugs. To ensure that your buffer is NUL (‘\0’) terminated so that it can be used as a C string, you should always set the last element in the buffer to ‘\0’.

To read in a string using delimiters other than or in addition to whitespace, use the negated character class. “%<bufferSize-1>[^:]” states that it will read in a string of up to bufferSize-1 bytes consisting of characters that are not colons (‘:’). Use the same precautions here as I stated with the “%s” specifier.
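Here is a small sketch combining both bounded specifiers. FIELD_SIZE and split_key_value() are hypothetical names, and the width 7 written in the format string is FIELD_SIZE - 1 spelled out by hand, so the two must be kept in step.

```c
#include <stdio.h>

#define FIELD_SIZE 8    /* hypothetical buffer size for this example */

/* Hypothetical helper: split "key:value" into two bounded buffers.
 * Returns the number of fields stored (0, 1 or 2). */
int split_key_value(const char *text, char key[FIELD_SIZE],
                    char value[FIELD_SIZE])
{
    key[0] = '\0';      /* keep both buffers valid C strings on failure */
    value[0] = '\0';

    /* %7[^:] reads up to 7 non-colon characters, ':' matches the
     * delimiter itself, %7s reads up to 7 non-whitespace characters. */
    return sscanf(text, "%7[^:]:%7s", key, value);
}
```

If the key is longer than seven characters, only the first seven are stored, the literal colon then fails to match and the function returns 1, which is the signal that the input did not fit.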

Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. This enables you to pass it a string with ease and can simplify debugging.

To read in a line, you can use getline(), which is not part of standard C but began as a GNU C library extension and is now specified by POSIX.1-2008. It will automatically allocate the memory you require. You can also use getdelim(), which has the same origin. It is very similar to getline() but you can choose a single delimiter.

From there, you can use sscanf() to parse the string in the same way I described with scanf(). You can also use atoi(), atol(), atof(), strtol(), strtoul(), strtod(), strtoll(), strtok() and other c-string manipulators when you use this triple buffering technique.
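As one possible sketch of this technique, the code below reads a line at a time and hands each line, as a plain string, to a parser built on strtol(). Note the substitution: it uses the standard fgets() with a fixed 256 byte buffer rather than getline(), so a longer line would be processed in pieces. The names parse_csv_ints() and sum_csv_stream() are invented for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to `max` comma-separated integers from
 * the string `line`. Returns how many values were stored in `out`. */
size_t parse_csv_ints(const char *line, long *out, size_t max)
{
    size_t n = 0;

    while (n < max) {
        char *end;
        long v = strtol(line, &end, 10);   /* skips leading whitespace */

        if (end == line)
            break;                         /* no number found here */
        out[n++] = v;
        line = end;
        while (*line == ' ' || *line == '\t')
            line++;                        /* allow blanks before the comma */
        if (*line != ',')
            break;                         /* end of the list */
        line++;                            /* step over the comma */
    }
    return n;
}

/* Hypothetical helper: sum every integer found in a stream, one line at
 * a time. The parser never touches the stream, only the line buffer. */
long sum_csv_stream(FILE *fp)
{
    char line[256];
    long values[16];
    long total = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        size_t i, n = parse_csv_ints(line, values, 16);

        for (i = 0; i < n; i++)
            total += values[i];
    }
    return total;
}
```

Because parse_csv_ints() only sees a string, it can be exercised directly with literals such as "1, 2, 3" when debugging, with no file involved.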

Binary files
I’ve not said anything about binary files up to this point. This is because of endian incompatibility. Endianness refers to the byte order in which a binary value is stored. Motorola CPUs usually store values in big-endian order, Intel CPUs use little-endian, and network protocols usually use big-endian. If you are writing a programme that is storing data to disk or passing it over a network, such that it may be read from another computer using a different endian scheme, you need to be aware of this problem and how to correct for it.

It is not difficult to do, but it is not trivial either. What you need to do is ensure you are using a common scheme, both on the read and the write end. This is done by either using macros or functions that will reverse the bytes (if necessary) prior to writing them and reverse the bytes again (if necessary) after reading them but before processing them.

This is beyond the scope of this document. If there is enough interest, I will write about it in another document.

Conclusion
Parsing a file is not very difficult and you can safely use scanf() and its relatives if you take appropriate precautions.

If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.

Adrian

Revision History
20/05/2007 13:00

Corrected grammatical error

20/05/2007 13:45

Corrected error wrt what a FILE is

29/05/2007 12:44

Added FYI at beginning of document

This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.

Attached Files

CODE1.zip (392 Bytes, 1183 views)

Jun 4 '07 #1


Hafeez khan

Could you post an example of triple buffering. To show how to process the data-chunks, while reading data-chunks from file and writing to another file(stream) in chunks.
Any help would be greatly appreciated.

Sep 15 '10 #2
