Efficient techniques to handle large binary files

Dear comp.lang.c++,
I'm interested in knowing the general techniques used to handle large
binary files (>10GB) efficiently, such as tuning the filebuf, etc.
Reading chunk by chunk seems to be a popular choice even though it
complicates the algorithm implementation. I am interested in knowing
when & what to apply in frequently encountered scenarios. For example,
suppose I have to remove certain data patterns from a huge file. Is
running the pattern search algorithm on a chunk by chunk basis the
only option?
Thank you.
KK

Mar 3 '06 #1
pe******@gmail.com wrote:
> Reading chunk by chunk seems to be popular choice even though it
> complicates the algorithm implementation.
If you can read 10GB into memory, just do so and do the processing
from there... Personally, I wouldn't even try to do it this way, even
on a machine with that much physical memory or at least that much
virtual memory.
> I am interested in knowing
> when & what to apply in frequently encountered scenarios.
In general, I find that the approach taken for formatted I/O works
quite well and a similar approach can be taken for binary I/O, too:
you start off with some elementary input operations for basic types,
encapsulating them into appropriate operators, i.e. for input using
'operator>>()'. For binary I/O you should use a new class similar to
'std::istream' but not 'std::istream' itself because this class is
used for formatted I/O. The important abstraction is that it
internally uses a stream buffer ('std::streambuf') and obtains
individual characters from there. Once you have input operations for
the basic building blocks (e.g. integers, doubles, blobs) you
would layer input for data structures on top of them.
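
To make the layering concrete, here is a minimal sketch of such a class.
The names 'binary_istream', 'read_le', 'record' and 'parse_record' are
made up for illustration, and the on-disk format (little-endian
integers, IEEE-754 doubles) is an assumption, not something the
original question specifies.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical binary counterpart to std::istream: it only holds a
// stream buffer and a failure flag, and does no formatting at all.
class binary_istream {
public:
    explicit binary_istream(std::streambuf* sb) : sb_(sb), ok_(sb != nullptr) {}
    explicit operator bool() const { return ok_; }

    // Read exactly n raw bytes; a short read marks the stream as failed.
    binary_istream& read_bytes(char* dst, std::streamsize n) {
        if (ok_ && sb_->sgetn(dst, n) != n) ok_ = false;
        return *this;
    }

private:
    std::streambuf* sb_;
    bool ok_;
};

// Elementary operation: unsigned integer stored little-endian (assumed format).
template <typename UInt>
binary_istream& read_le(binary_istream& in, UInt& v) {
    unsigned char b[sizeof(UInt)];
    if (in.read_bytes(reinterpret_cast<char*>(b), sizeof b)) {
        UInt r = 0;
        for (std::size_t i = sizeof b; i-- > 0;) r = static_cast<UInt>((r << 8) | b[i]);
        v = r;
    }
    return in;
}

inline binary_istream& operator>>(binary_istream& in, std::uint32_t& v) { return read_le(in, v); }
inline binary_istream& operator>>(binary_istream& in, std::uint64_t& v) { return read_le(in, v); }

// Elementary operation: double stored as IEEE-754 binary64 (assumed format).
inline binary_istream& operator>>(binary_istream& in, double& d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits = 0;
    if (in >> bits) std::memcpy(&d, &bits, sizeof d);
    return in;
}

// A data structure layered on top of the elementary operations.
struct record {
    std::uint32_t id;
    double value;
};

inline binary_istream& operator>>(binary_istream& in, record& r) {
    return in >> r.id >> r.value;
}

// Usage helper: parse one record from an in-memory byte string.
inline bool parse_record(const std::string& bytes, record& out) {
    std::stringbuf sb(bytes, std::ios_base::in);
    binary_istream in(&sb);
    return static_cast<bool>(in >> out);
}
```

The same 'binary_istream' would work unchanged on top of a
'std::filebuf' opened in binary mode, since it only talks to the
'std::streambuf' interface.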

Even though the basic input operations might operate on individual
characters or on relatively small entities, the added processing for
these elements typically does not matter at all because the actual
I/O waits are much bigger. This way, the block structure of the
file is nicely abstracted in the file buffer (or whatever stream buffer
you are using).
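
As a rough illustration of letting the file buffer deal with the blocks,
the sketch below scans a file one character at a time while the
'std::filebuf' does the block reads behind the scenes. The function name
'count_byte' and the 1 MiB default are arbitrary choices for this
example. Whether 'pubsetbuf()' actually installs the supplied buffer is
implementation-defined, so treat the buffer size as a hint at best and
measure before relying on it.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Count how often `target` occurs in the file at `path`.
// Returns -1 if the file cannot be opened.
inline long long count_byte(const std::string& path, char target,
                            std::size_t buffer_size = 1u << 20) {
    // Declared before the filebuf so that it outlives it.
    std::vector<char> buffer(buffer_size);

    std::filebuf fb;
    // Request a larger buffer before opening the file. The effect is
    // implementation-defined; some libraries ignore the request.
    fb.pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));

    if (!fb.open(path, std::ios_base::in | std::ios_base::binary)) return -1;

    using traits = std::filebuf::traits_type;
    long long count = 0;
    // The loop sees single characters; the filebuf refills its buffer
    // block by block whenever it runs empty.
    for (traits::int_type c = fb.sbumpc();
         !traits::eq_int_type(c, traits::eof());
         c = fb.sbumpc()) {
        if (traits::to_char_type(c) == target) ++count;
    }
    return count;
}
```

The processing code never mentions chunks: a call such as
'count_byte("data.bin", '\0')' (the file name is just a placeholder)
looks the same for a 10-byte file and a 10GB one.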
> For example,
> if I have to remove certain data patterns from a huge file. Is
> processing the pattern search algorithm on chunk by chunk basis, the
> only bet?


Since the patterns typically won't respect the block structure of the
file, a plain chunk by chunk search does not necessarily work: a match
may straddle a chunk boundary. If your pattern search is, however,
relatively simple, you might indeed just fill a buffer and grow it
when necessary to detect patterns. One thing you might
want to try is using 'std::istreambuf_iterator<char>()' to read your
file and buffer the matched portions of the pattern before sending
them on. However, many implementations of
'std::istreambuf_iterator<char>()' are not really that good.
Alternatively, you might want to process
individual characters obtained directly from the stream buffer
instead of going through a stream buffer iterator.
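
A minimal sketch of the last variant, assuming a single fixed byte
pattern: characters are pulled straight from the stream buffer with
'sbumpc()', and only the bytes that currently look like the start of a
match are held back. The held bytes always equal a prefix of the
pattern, so a counter is enough and no separate buffer is needed. The
function names are made up for this example, and the fallback on a
mismatch uses the Knuth-Morris-Pratt failure table, which is my choice
here rather than anything prescribed above.

```cpp
#include <cstddef>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Copy `in` to `out`, dropping every non-overlapping occurrence of
// `pattern` (scanning left to right). An empty pattern copies everything.
inline void remove_pattern(std::streambuf& in, std::streambuf& out,
                           const std::string& pattern) {
    using traits = std::char_traits<char>;
    const std::size_t m = pattern.size();

    // KMP failure table: fail[i] is the length of the longest proper
    // prefix of pattern[0..i] that is also a suffix of it.
    std::vector<std::size_t> fail(m, 0);
    for (std::size_t i = 1, k = 0; i < m; ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = fail[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        fail[i] = k;
    }

    std::size_t held = 0;  // bytes held back; they equal pattern[0..held)
    for (traits::int_type c = in.sbumpc();
         !traits::eq_int_type(c, traits::eof());
         c = in.sbumpc()) {
        const char ch = traits::to_char_type(c);
        // Mismatch: release the held bytes that can no longer be part
        // of a match and keep the longest prefix that still might be.
        while (held > 0 && pattern[held] != ch) {
            const std::size_t keep = fail[held - 1];
            out.sputn(pattern.data(), static_cast<std::streamsize>(held - keep));
            held = keep;
        }
        if (m > 0 && pattern[held] == ch) {
            if (++held == m) held = 0;  // full match: drop it
        } else {
            out.sputc(ch);
        }
    }
    // End of input: a partial match is ordinary data after all.
    out.sputn(pattern.data(), static_cast<std::streamsize>(held));
}

// Usage helper for in-memory data.
inline std::string remove_pattern_from_string(const std::string& input,
                                              const std::string& pattern) {
    std::stringbuf in(input, std::ios_base::in);
    std::stringbuf out(std::ios_base::out);
    remove_pattern(in, out, pattern);
    return out.str();
}
```

For a real file you would pass two 'std::filebuf' objects opened in
binary mode. Memory use stays bounded by the pattern length no matter
how big the file is, and chunk boundaries never show up in the code.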

The best approach depends on what you really want to do and whether
you want to retain most or at least some data after reading it.
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence
Mar 3 '06 #2
