
Parsing large files

hi,

I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, and in some cases gigabytes. Reading
them with getline slows my program down a lot; it takes more than 15-20
minutes just to read them. I'd like to know about efficient ways to
read these files. Any ideas?

TIA
Aditya

Sep 13 '06 #1
2 Replies


aditya.raghunath wrote:
I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, and in some cases gigabytes. Reading
them with getline slows my program down a lot; it takes more than 15-20
minutes just to read them. I'd like to know about efficient ways to
read these files. Any ideas?
Getline reads them as strings, copying each one. That wastes time both
allocating a variably sized block of memory and then copying into it
with the CPU. A hard drive has a DMA channel that its driver can
exploit, but reads into strings probably can't take advantage of it.

Then, your OS, and possibly your C++ implementation, buffer the file
ahead of the string. This is partly because the read-write head is
flying over the file anyway, so the drive's buffer might as well take
the data in, and partly because some Standard Library implementations
also buffer the file.
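
If you suspect the library-side buffer, you can offer the stream a
bigger one yourself; whether the library honors the request is
implementation-defined. A minimal sketch, where the buffer size and the
file name "big.txt" are stand-ins:

#include <fstream>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);  // 1 MB; the size is an arbitrary guess
    std::ifstream in;
    in.rdbuf()->pubsetbuf(&buf[0], buf.size());  // must precede open()
    in.open("big.txt");
    // ... read as before ...
    return 0;
}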

One way to fix this is to not use getline(), and to not copy the string
at all. Stream each byte of the file into your program, and use a state
table to parse it and decide what to do with each byte. This technique
makes better use of the read-ahead buffers, and it ought to lead to a
better design.
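
A minimal sketch of the state-table idea; the two states, the
word-counting task, and the file name are all invented for
illustration:

#include <fstream>
#include <iostream>

int main()
{
    std::ifstream in("big.txt", std::ios::binary);
    enum State { IN_SPACE, IN_WORD } state = IN_SPACE;
    unsigned long words = 0;
    char c;
    while (in.get(c))  // one byte at a time; the streambuf does the real I/O
    {
        bool space = (c == ' ' || c == '\t' || c == '\n' || c == '\r');
        if (state == IN_SPACE && !space) { ++words; state = IN_WORD; }
        else if (state == IN_WORD && space) state = IN_SPACE;
    }
    std::cout << words << " words\n";
    return 0;
}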

Another way is to use OS-specific functions (which are off-topic here)
to map the file into memory. Then you can point into the file with a
real C++ pointer. If you run this pointer from one end of the file to
the other, you should exploit the DMA channel between the hard drive
and memory quite effectively. And if your pointer instead skips around,
you will at least go through only the OS's virtual paging mechanism to
read and write the actual file, with no intervening OS or C++ buffers.
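
For concreteness only, since the details belong on an OS-specific
group: on a POSIX system the mapping might look like this sketch. The
file name is a placeholder and most error handling is omitted:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <iostream>

int main()
{
    int fd = open("big.txt", O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) < 0) return 1;
    void* map = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) return 1;
    const char* p = static_cast<const char*>(map);
    unsigned long newlines = 0;
    for (off_t i = 0; i < st.st_size; ++i)  // run the pointer end to end
        if (p[i] == '\n') ++newlines;
    std::cout << newlines << " lines\n";
    munmap(map, st.st_size);
    close(fd);
    return 0;
}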

The next way is to use OS-specific functions that batch together many
commands to your hard drive's driver. Obviously only an OS-specific
newsgroup can advise you about those.

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!
Sep 13 '06 #2

In article <11*********************@h48g2000cwc.googlegroups.com>,
ad**************@gmail.com says...
hi,

I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, and in some cases gigabytes. Reading
them with getline slows my program down a lot; it takes more than 15-20
minutes just to read them. I'd like to know about efficient ways to
read these files.
You might try opening them with fopen and reading them with fgets
instead. With quite a few standard library implementations, that gives
a substantial speed improvement.
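
Something along these lines; the file name and buffer size are just
placeholders:

#include <cstdio>

int main()
{
    std::FILE* f = std::fopen("big.txt", "r");
    if (!f) return 1;
    char line[4096];  // one fixed buffer, reused for every line
    while (std::fgets(line, sizeof line, f))
    {
        // parse `line` in place; no per-line allocation or copying
    }
    std::fclose(f);
    return 0;
}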

There are also quite a few platform-dependent optimizations. For
example, on Windows you can often gain a substantial amount of speed by
opening files in binary (untranslated) mode, but doing the same on UNIX
or anything very similar normally won't make any difference at all.
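
The change is one flag in either API; "big.txt" again stands in for
your file:

#include <cstdio>
#include <fstream>

int main()
{
    // The "b" / std::ios::binary flags suppress newline translation on
    // Windows; on UNIX they are accepted but change nothing.
    std::FILE* f = std::fopen("big.txt", "rb");
    std::ifstream in("big.txt", std::ios::in | std::ios::binary);
    if (f) std::fclose(f);
    return 0;
}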

--
Later,
Jerry.

The universe is a figment of its own imagination.
Sep 13 '06 #3
