Parsing large files

hi,

I'm trying to read text files and then parse them. Some of these files
are several hundred megabytes, or in some cases gigabytes, in size. Reading
them with getline slows my program down a lot; it takes more than 15-20
minutes just to read them. I want to know efficient ways to read these files.
Any ideas??

TIA
Aditya

Sep 13 '06 #1
aditya.raghunath wrote:
> I'm trying to read text files and then parse them. Some of these files
> are several hundred megabytes, or in some cases gigabytes, in size. Reading
> them with getline slows my program down a lot; it takes more than 15-20
> minutes just to read them. I want to know efficient ways to read these files.
> Any ideas??
Getline reads each line into a string, copying every one of them. That wastes
time twice: first allocating a block of memory of unpredictable size, then
copying the bytes with the CPU. A hard drive has a DMA channel that its driver
can exploit, but per-string copies probably can't take advantage of it.

On top of that, your OS, and possibly your C++ runtime, buffer the file ahead
of the string. This is partly because the read-write head is flying over the
file anyway, so the drive's buffer might as well take the data in, and partly
because some Standard Library implementations add their own buffering.
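
One implementation-defined knob worth benchmarking is handing the stream a
bigger buffer before you open the file. Whether pubsetbuf() helps at all
depends on your Standard Library, so treat this as a sketch to measure, not a
guaranteed win (the file name and buffer size are placeholders):

#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector<char> buf(1 << 20);              // 1 MB buffer; tune to taste
    std::ifstream in;
    in.rdbuf()->pubsetbuf(&buf[0], buf.size());  // must precede open() to take effect
    in.open("huge.txt");                         // placeholder file name

    std::string line;
    while (std::getline(in, line))
    {
        // parse line as before
    }
}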

One way to fix this is not to use getline(), and not to copy the string at
all. Stream each byte of the file into your program, and have the program use
a state table to parse it and decide what to do with each byte. This technique
makes better use of the read-ahead buffers, and it ought to lead to a better
design.
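
A minimal sketch of that approach: pull bytes straight from the stream buffer
and feed them to a two-state machine that just splits on whitespace. The
states and the "count a field" action are stand-ins for whatever your real
parse needs:

#include <cstdio>
#include <fstream>
#include <string>

int main()
{
    std::ifstream in("huge.txt", std::ios::binary);   // placeholder file name
    std::streambuf* sb = in.rdbuf();
    const int eof = std::char_traits<char>::eof();

    enum { InDelimiter, InField } state = InDelimiter;
    std::string field;
    long fields = 0;

    for (int c = sb->sbumpc(); c != eof; c = sb->sbumpc())
    {
        bool delim = (c == ' ' || c == '\t' || c == '\n' || c == '\r');

        if (state == InDelimiter && !delim)
        {
            field += char(c);
            state = InField;
        }
        else if (state == InField)
        {
            if (delim)
            {
                ++fields;                      // act on the completed field here
                field.clear();
                state = InDelimiter;
            }
            else
            {
                field += char(c);
            }
        }
    }
    if (state == InField)
        ++fields;                              // file may not end with a delimiter

    std::printf("%ld fields\n", fields);
}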

Another way is to use OS-specific functions (which are off-topic here) to map
the file into memory. Then you can point into the file with a real C++
pointer. If you run that pointer from one end of the file to the other, you
directly exploit the DMA channel between the hard drive and memory. And if
your pointer instead skips around, you at least go through only the OS's
virtual paging mechanism to read the actual file, with no intervening OS or
C++ buffers.
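
A rough sketch of that, using POSIX mmap() (so strictly off-topic here; on
Win32 the equivalents are CreateFileMapping()/MapViewOfFile()). The error
handling is minimal and the scan just counts newlines, to show the pointer
walking the whole file:

#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char* path = "huge.txt";                 // placeholder file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }

    void* mapping = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapping == MAP_FAILED) { std::perror("mmap"); return 1; }

    const char* data = static_cast<const char*>(mapping);
    long lines = 0;
    for (const char* p = data; p != data + st.st_size; ++p)  // run end to end
        if (*p == '\n')
            ++lines;

    std::printf("%ld lines\n", lines);

    munmap(mapping, st.st_size);
    close(fd);
}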

The next way is to use OS-specific functions that batch together many
commands to your hard drive's driver. Obviously only an OS-specific
newsgroup can advise you on those.

--
Phlip
http://www.greencheese.us/ZeekLand <-- NOT a blog!!!
Sep 13 '06 #2
In article <11*********************@h48g2000cwc.googlegroups.com>,
ad**************@gmail.com says...
> hi,
>
> I'm trying to read text files and then parse them. Some of these files
> are several hundred megabytes, or in some cases gigabytes, in size. Reading
> them with getline slows my program down a lot; it takes more than 15-20
> minutes just to read them. I want to know efficient ways to read these files.
You might try opening them with fopen and reading them with fgets
instead. There are quite a few standard libraries for which that
provides a substantial speed improvement.

There are also quite a few platform-dependent optimizations. For
example, on Windows you can often gain a substantial amount of speed by
opening files in binary (untranslated) mode, but doing the same on UNIX
or anything very similar normally won't make any difference at all.
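
Putting those two suggestions together, a minimal sketch (the file name and
buffer size are placeholders; note that in binary mode a Windows build will
hand your parser the raw '\r' bytes, since nothing strips "\r\n" for you any
more):

#include <cstdio>

int main()
{
    // "rb" = binary/untranslated mode: often a measurable win on Windows,
    // normally a no-op on UNIX.
    std::FILE* f = std::fopen("huge.txt", "rb");
    if (!f) { std::perror("fopen"); return 1; }

    static char line[64 * 1024];          // big enough for your longest line
    long lines = 0;
    while (std::fgets(line, sizeof line, f))
    {
        ++lines;                          // parse the line here
    }

    std::fclose(f);
    std::printf("%ld lines\n", lines);
}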

--
Later,
Jerry.

The universe is a figment of its own imagination.
Sep 13 '06 #3
