Efficient techniques to handle large binary files

Dear comp.lang.c++,
I'm interested in knowing the general techniques used to handle large
binary files (>10GB) efficiently, such as tweaking the filebuf, etc.
Reading chunk by chunk seems to be the popular choice even though it
complicates the algorithm implementation. I am interested in knowing
when & what to apply in frequently encountered scenarios. For example,
if I have to remove certain data patterns from a huge file, is
running the pattern search algorithm on a chunk by chunk basis the
only option?
Thank you.
KK

Mar 3 '06 #1
pe******@gmail.com wrote:
> Reading chunk by chunk seems to be the popular choice even though it
> complicates the algorithm implementation.
If you can read 10GB into memory, you could just do so and process
the data from there. Personally, I wouldn't even try to do it this
way, even if I had a machine with this amount of memory or at least
this amount of virtual memory.
> I am interested in knowing
> when & what to apply in frequently encountered scenarios.
In general, I find that the approach taken for formatted I/O works
quite well and a similar approach can be taken for binary I/O, too:
you start off with some elementary input operations for basic types,
encapsulating them into appropriate operators, i.e. for input using
'operator>>()'. For binary I/O you should use a new class similar to
'std::istream' but not 'std::istream' itself, because that class is
used for formatted I/O. The important abstraction is that it
internally uses a stream buffer ('std::streambuf') and obtains
individual characters from there. Once you have input operations for
the basic building blocks (e.g. integers, doubles, blobs, etc.) you
layer input for data structures on top of them.

Even though the basic input operations might operate on individual
characters or on relatively small entities, the added processing for
these elements typically does not matter at all because the actual
I/O waits are much longer. This way, the block structure of the
file is nicely abstracted in the file buffer (or whatever stream buffer
you are using).
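
As a rough illustration of leaving the blocking to the file buffer, the sketch below reads a file one character at a time through a 'std::filebuf' that was handed a caller-sized buffer. The helper name 'count_bytes' is hypothetical. Note that the effect of 'pubsetbuf()' on a file buffer is implementation-defined, and it is called before 'open()' here because some implementations ignore it afterwards; whether a larger buffer helps at all is something to measure.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: counts the bytes in a file while reading it
// character by character. Returns -1 if the file cannot be opened.
long long count_bytes(const std::string& path, std::size_t buffer_size) {
    std::vector<char> buffer(buffer_size);  // must outlive the filebuf below
    std::filebuf fb;

    // Ask the file buffer to use our storage. The standard leaves the
    // effect implementation-defined, so treat this as a request only.
    fb.pubsetbuf(buffer.data(), static_cast<std::streamsize>(buffer.size()));

    if (!fb.open(path, std::ios_base::in | std::ios_base::binary)) return -1;

    using traits = std::filebuf::traits_type;
    long long count = 0;
    // Each sbumpc() is normally served from the buffer; the file buffer
    // refills it from the operating system only when it runs empty.
    while (!traits::eq_int_type(fb.sbumpc(), traits::eof())) ++count;

    fb.close();
    return count;
}
```

The loop body never mentions chunks; the chunking happens inside the stream buffer.
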
> For example,
> if I have to remove certain data patterns from a huge file, is
> running the pattern search algorithm on a chunk by chunk basis the
> only option?


Since the patterns typically won't respect the block structure of the
file, this does not necessarily work. If your pattern search is,
however, relatively simple, you might indeed just fill a buffer and
enlarge it when necessary to detect patterns. One thing you might
want to try is using 'std::istreambuf_iterator<char>()' to read your
file and buffer the partially matched portion of a pattern before
sending it on. However, many implementations of
'std::istreambuf_iterator<char>()' are not really that good.
Alternatively, you might want to process individual characters
obtained directly from the stream buffer instead of going through a
stream buffer iterator.

The best approach depends on what you really want to do and whether
you want to retain most or at least some data after reading it.
--
<mailto:di***********@yahoo.com> <http://www.dietmar-kuehl.de/>
<http://www.eai-systems.com> - Efficient Artificial Intelligence
Mar 3 '06 #2
