How to Parse a File in C

AdrianH
Assumptions
I am assuming that you know, or are capable of looking up, the functions I am about to describe here, and that you have a basic understanding of C programming.


FYI
Although I have called this article “How to Parse a File in C”, we are actually mostly lexing a file, which is the breaking down of a stream into its component parts, disregarding the syntax that the stream contains. Parsing goes further and uses the syntax in order to make sense of those parts.

Think of lexing as reading in a bunch of words, and parsing as reading in a sentence. Each word means something, but without the context of the sentence, it doesn’t mean anything very useful.

I didn’t use the title “How to do Lexical Analysis in C” because most of you probably don’t know what that means. If you do, then I apologise.


Introduction
The question of how to parse a file has come up on TSDN fairly frequently, so I’ve decided to write something on the subject to help everyone without having to repeat ourselves over and over again. I also address safety when using these C functions as they can be dangerous if misused.

I am going to first define some terms to give you some background understanding.


Streams and Files
I should state right now, for those new to C, that C can treat many things as a file, including the keyboard (a.k.a. terminal in or standard input) and the display (terminal out or standard output). There is also a third stream called standard error, which is used for error messages and by default is sent to the terminal (this paragraph is the last time you will hear about standard error directly in this document). The three are accessed through the objects stdin, stdout and stderr.

These three files (or streams) are open at the start of any program, as defined by the C standard, and when they are connected to a terminal they can do anything that a regular file can do except seek to an arbitrary position. Because of this, the IO routines are described as operating on streams rather than files (a file being a specialisation, or more concrete example, of a stream).

To understand this, think of some flowing water (i.e. a stream of water) coming out of a sink faucet. Stick your finger into it. Consider that with your finger, you are ‘reading’ all the molecules of water flowing past it. But once the water is past, you cannot move your finger along the stream and read what you have just read again. Those water molecules have flowed beyond your reach down the drain. You also cannot move your finger back past the faucet’s mouth; you have to be patient and wait for the ‘data’ to come to you.

A file is more specialised. It allows you to seek around, which means that you can point to any location in the file to read from it or write to it. A file such as this is called random access since you can read it in any order you wish.


Buffering and Double Buffering
Double buffering means copying data from one buffer into another prior to processing or displaying it.

Buffered Functions
The stdio.h header declares several file functions prefixed with f: fopen(), fread(), fwrite(), fflush(), fscanf(), fprintf(), fgetc(), fputc(), fseek() and fclose()[*]. All of these functions are buffered, which means that if you were to read a byte and then read a second byte, normally only one disk read would occur.

Using these functions requires a FILE, which you access through a pointer (FILE *). FILE is an object type (typically a struct) which houses all of the “stuff” the library needs to manage the stream.

Only one disk read occurs because on the first read, not just a single byte but a whole chunk of data is read in and stored in a buffer. This buffer resides somewhere in memory, allocated by the stdio library. The library then copies the requested data to the data space that you have specified. When the next byte is requested, the stdio library doesn’t have to request it from the drive as it already has it in the buffer. This can speed up reading considerably.

Writing is similar. Unless you fill up the buffer, explicitly flush it, close the stream or, in the case of a line-buffered stream such as a terminal, output a ‘\n’ character, the data will not be sent to the file.

When reading in bytes or strings of data, you are actually double buffering. This is because it is first read to an internal stdio buffer and then copied to your buffer.

When reading in numbers it is only single buffered, as the text is read into the internal stdio buffer, converted there, and only the resulting value is written to the memory location you specified.

[*] scanf() and printf() are convenience functions that use the FILE streams stdin and stdout respectively, without having to be told. getchar() and putchar() likewise use stdin and stdout, and can replace fgetc() and fputc() respectively.


Non-buffered Functions
There are also several file functions that are not prefixed with f, namely open(), read(), write(), lseek() and close(). These come from POSIX rather than from standard C. Note that there is no flush function. This is because these are low-level calls and the only buffer associated with them is the one provided by the programmer. This is not always the case, however: buffering can depend on the implementation of the filesystem and so is under the operating system’s control. Also, fscanf() and fprintf() cannot be used with a file descriptor directly, since this is a very simple interface.

These functions are associated with an int called a file descriptor. It identifies all of the “stuff” required to do what is needed, just as a FILE does. However, the tracking is usually done by the operating system, not the application.

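As I said before, the buffering is implementation dependent. Terminal devices (under POSIX compliant systems) are buffered. To discard what is sitting in their buffers, call tcflush() (implemented on some systems through ioctl()), passing the file descriptor and a queue selector: TCIFLUSH discards input that has been received but not read, TCOFLUSH discards output that has been written but not transmitted, and TCIOFLUSH does both. To wait until pending output has actually been transmitted, use tcdrain() instead.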

Disk IO can become double buffered. This is because some operating systems do not allow reading anything less than a sector at a time. To allow for it, the library may read the sector into an internal buffer, copy what is requested and discard the rest. This can significantly slow down your code, as data that has already been read is not kept track of and the same sector will be read again if you then read the next few bytes.

A file descriptor can be wrapped in a FILE stream by using the fdopen() function. This can be very useful when you open up a pipe.


Parsing a File
Parsing a file can be done quite simply using the described buffering techniques.

I’m not going to be using file descriptors. From here on in, I will only be using FILE streams. This is because they are usually faster and have a lot of features not offered by raw file descriptors.

Parsing Without Double Buffering
Parsing a file without double buffering is not always possible. The only way to do it is to read and store only numbers.

E.g. here is a sample file:

    1, 2, 3, 4, 5
    6, 7, 8, 9, 10

To read that in without double buffering you could loop around the following:

    /* CODE FRAGMENT 1 */
    int itemsParsed = 0;
    int items[5];
    itemsParsed =
        scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2], &items[3],
              &items[4]);

Note that the commas are required in the input stream. The spaces in the format, however, each match zero or more whitespace characters. A whitespace character can be a regular space, a tab, a vertical tab (rarely ever used), a form feed, a carriage return or a line feed.

Also note that scanf() returns the number of items parsed and stored; literals such as the commas and the spaces are not counted. You should check this value to ensure that you have received all the items you were expecting.


    /* CODE 1 */
    #include <stdio.h>
    #include <string.h>    /* memset() */

    int main(void)
    {
        int itemsParsed = 0;
        int items[5];

        memset(items, 0, sizeof(items));    /* Set all items to zero */

        /* Read in 5 comma separated values */
        itemsParsed =
            scanf("%d, %d, %d, %d, %d", &items[0], &items[1], &items[2],
                  &items[3], &items[4]);
        printf("itemsParsed = %d\n", itemsParsed);
        printf("%d, %d, %d, %d, %d\n", items[0], items[1], items[2], items[3],
               items[4]);

        memset(items, 0, sizeof(items));    /* Set all items to zero */

        /* Read in 4 comma separated values preceded by a comma */
        itemsParsed =
            scanf(", %d, %d, %d, %d", &items[0], &items[1], &items[2],
                  &items[3]);
        printf("itemsParsed = %d\n", itemsParsed);
        printf("%d, %d, %d, %d\n", items[0], items[1], items[2], items[3]);
        return 0;
    }

Now what CODE 1 does is to read in 5 comma separated values and then try to read in a comma with 4 more comma separated values.

Try it out. It is also attached to the end of this document as CODE1.zip if you have problems cutting and pasting. If you type in 1,2,3,4 it will read in 4 numbers; if you type in 1,2,,3,4,5 it will read in 2 numbers and then 3 numbers.

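Now try 1,2,3,4,5. What happened? It did not allow you to type in any more data; it simply said that it didn’t parse anything for the second scanf() call. Why? Because the next thing the second format expected was a literal comma, but the next character in the input was the newline left over from pressing Enter. A literal character in the format does not skip whitespace, so the match failed straight away. If you want to ensure that you will read in all the whitespace prior to a literal character, you must precede it with a space in the format string.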

But what if you wanted to skip all whitespace except a carriage return or line feed? To do this, you would need to use a character class. A character class is a very simplified regular expression. It will read in one or more characters specified by that class. For instance:

    /* CODE FRAGMENT 2 */
    (void)scanf("%*[ \t\v]");

will read in and discard all spaces, tabs or vertical tabs. I am not storing the number of items parsed because not only do I not care whether it has read in anything, but a starred (‘*’) parameter is not included in the number of elements parsed, so it wouldn’t tell me anything anyway. Only parameters that are stored to some location are included in the number of elements parsed return value.

If I wanted to know if I had read in any spaces, tabs or vertical tabs, I could use the “%n” specifier, which stores how many characters have been read from the input so far by this call, up to the point where the “%n” appears in the format string. E.g.:

    /* CODE FRAGMENT 3 */
    int bytesRead = 0;
    (void)scanf("%*[ \t\v]%n", &bytesRead);

NOTE: I have initialised bytesRead to zero. If I didn’t and no spaces, tabs or vertical tabs were read, scanf() would stop before reaching the “%n” and would not update bytesRead, so its value would be indeterminate.

I’ve also cast the return value to (void). This is because some compilers will warn that I am ignoring the return value. Return values should not normally be ignored, however in this case it is ok to do so. Casting to void tells the compiler, “Yes, I am aware that I am ignoring the return value and it is legitimate.”


Parsing Using Double Buffering
In the previous section, I showed how to not double buffer the data. There are times however, when this is not possible.

Reading in a string or series of characters intrinsically requires the use of double buffering. The data is read to an internal stdio buffer and then copied to your programme’s data space. You can then do with it as you wish, either processing it further or displaying it.

To read in a whitespace delimited string, you can use scanf()’s format specifier “%<bufferSize-1>s”, where bufferSize is the size of the buffer you are writing to. Never use “%s” alone without a width, as this will lead to buffer overflow errors, making your programme insecure and causing hard to find bugs. scanf() appends the terminating NUL (‘\0’) itself, which is why the width must leave one byte spare. As an extra safeguard, so that the buffer can always be used as a c-string, you can also set the last element in the buffer to ‘\0’.

To read in a string using delimiters other than or in addition to whitespace, use the negated character class. “%<bufferSize-1>[^:]” states that it will read in a string of at most bufferSize-1 bytes consisting of characters that are not colons (‘:’). Use the same precautions here as I stated with the “%s” specifier.


Parsing Using Triple Buffering
Yes, you can buffer the buffer’s buffer. Why would you want to do this? One reason I can think of is to decouple parts of your code from a stream. This enables you to pass it a string with ease and can simplify debugging.

To read in a line, you can use getline(). It is not part of standard C; it began as a GNU C library extension and has been part of POSIX since POSIX.1-2008. It will automatically allocate the memory you require (you must free() it when you are done). You can also use getdelim(), which has the same origin. It is very similar to getline() but you can choose a single delimiter.

From there, you can use sscanf() to parse the string in the same way I described with scanf(). You can also use atoi(), atol(), atof(), strtol(), strtoul(), strtod(), strtoll(), strtok() and other c-string manipulators when you use this triple buffering technique.


Binary files
I’ve not said anything about binary files up to this point. This is because of endian incompatibility. Endianness refers to the byte order of the binary representation stored. Motorola CPUs usually use big-endian, Intel CPUs use little-endian, and network protocols usually use big-endian. If you are writing a programme that is storing data to disk or passing it over a network, such that it may be read from another computer using a different endian scheme, you need to be aware of this problem and how to correct for it.

It is not difficult to do, but it is not trivial either. What you need to do is ensure you are using a common scheme, both on the read and the write end. This is done by either using macros or functions that will reverse the bytes (if necessary) prior to writing them and reverse the bytes again (if necessary) after reading them but before processing them.

This is beyond the scope of this document. If there is enough interest, I will write about it in another document.


Conclusion
Parsing a file is not very difficult and you can safely use scanf() and its relatives if you take appropriate precautions.

If you have any questions or find anything unclear, feel free to post a message and I will get back to you and/or update the document when I can.


Adrian

Revision History
20/05/2007 13:00
  • Corrected grammatical error
20/05/2007 13:45
  • Corrected error wrt what a FILE is
29/05/2007 12:44
  • Added FYI at beginning of document


This document is protected under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
Attached Files
File Type: zip CODE1.zip (392 Bytes, 1184 views)
Jun 4 '07 #1
Hafeez khan
Could you post an example of triple buffering? I would like to see how to process the data chunks while reading them from a file and writing them to another file (stream) in chunks.
Any help would be greatly appreciated.
Sep 15 '10 #2
