find a pattern in binary file

vizzz

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea

Jun 27 '08 #1


Kai-Uwe Bux

vizzz wrote:

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?

You could try std::search() with istreambuf_iterator< unsigned char >.

However:

(a) It is not clear that you will get good performance. Some implementations
are not really all that good with stream iterators.

(b) I am not sure whether search() is allowed to use backtracking
internally, in which case you cannot use it with stream iterators. You
should check.

(c) Even if search finds an occurrence, it reports the result as an
iterator. I do not know of a convenient way to convert that into an offset.
Maybe, rolling your own is not all that bad. You could read the file in
chunks (keeping the last three characters from the previous block) and use
std::search() on the blocks. With the right blocksize, this could be really
fast.
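
As a rough sketch of the chunked approach, generalised to a pattern of any length (the overlap is the pattern length minus one, i.e. three bytes for a four-byte pattern): the names find_in_stream and demo_chunked_search, the default block size, and the -1 "not found" convention are illustrative assumptions, not something from this thread. It also keeps track of the file offset, which covers point (c).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns the byte offset of the first match in the stream, or -1 if the
// pattern does not occur. block_size is a tuning knob (assumed default).
long long find_in_stream(std::istream& in,
                         const std::vector<unsigned char>& pattern,
                         std::size_t block_size = 1 << 16)
{
    assert(!pattern.empty() && block_size > 0);
    const std::size_t overlap = pattern.size() - 1;
    std::vector<unsigned char> buf(overlap + block_size);
    std::size_t carried = 0;     // bytes kept from the previous block
    long long buf_offset = 0;    // file offset of buf[0]

    for (;;) {
        in.read(reinterpret_cast<char*>(buf.data() + carried),
                static_cast<std::streamsize>(block_size));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) return -1;             // nothing new to look at
        const std::size_t filled = carried + got;

        const auto end = buf.begin() + static_cast<std::ptrdiff_t>(filled);
        const auto hit =
            std::search(buf.begin(), end, pattern.begin(), pattern.end());
        if (hit != end) return buf_offset + (hit - buf.begin());

        // Keep the tail so a match straddling two blocks is still seen.
        const std::size_t keep = std::min(overlap, filled);
        if (keep < filled) {
            std::copy(end - static_cast<std::ptrdiff_t>(keep), end,
                      buf.begin());
        }
        buf_offset += static_cast<long long>(filled - keep);
        carried = keep;
    }
}

// Usage sketch: a deliberately tiny block size makes the match straddle
// a block boundary.
long long demo_chunked_search()
{
    const unsigned char raw[] = {'a', 'b', 'c', 0x65, 0x0A,
                                 0x10, 0x10, 'x', 'y', 'z'};
    const std::string data(reinterpret_cast<const char*>(raw), sizeof raw);
    std::istringstream in(data);
    const std::vector<unsigned char> pattern = {0x65, 0x0A, 0x10, 0x10};
    return find_in_stream(in, pattern, 4);
}
```

For a real file, the stream would be a std::ifstream opened with std::ios::binary.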
If your OS allows memory mapping of the file, you could do that and use
std::search() with unsigned char * on the whole thing. That could be the
fastest way, but will leave the realm of standard C++.
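
On a POSIX system, the memory-mapped variant might look something like this. It is only a sketch: open, fstat, mmap and munmap are POSIX calls rather than standard C++, the function name find_in_mapped_file is made up, and errors are folded into the same -1 as "not found" for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

#include <fcntl.h>      // open (POSIX)
#include <sys/mman.h>   // mmap, munmap (POSIX)
#include <sys/stat.h>   // fstat (POSIX)
#include <unistd.h>     // close (POSIX)

// Maps the whole file read-only and runs std::search over it.
// Returns the offset of the first match, or -1 (not found or I/O error).
long long find_in_mapped_file(const char* path,
                              const std::vector<unsigned char>& pattern)
{
    assert(!pattern.empty());
    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (::fstat(fd, &st) != 0 || st.st_size == 0) {
        ::close(fd);
        return -1;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    void* map = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return -1;

    const unsigned char* first = static_cast<const unsigned char*>(map);
    const unsigned char* last = first + size;
    const unsigned char* hit =
        std::search(first, last, pattern.begin(), pattern.end());
    const long long result =
        (hit == last) ? -1 : static_cast<long long>(hit - first);

    ::munmap(map, size);
    return result;
}
```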
Best

Kai-Uwe Bux

Jun 27 '08 #2

Ivan

On Jun 20, 1:11 pm, vizzz <andrea.visin...@gmail.com> wrote:

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea

Hmmm... I had a look at this and ran across a simple problem. How do
you read a binary file and just echo the hex value of each byte to the screen?
The issue is the c++ read function doesn't return number of bytes
read... so on the last read into a buffer how do you know how many
characters to print?

Thanks,
Ivan Novick
http://www.mycppquiz.com

Jun 27 '08 #3

Kai-Uwe Bux

Ivan wrote:

On Jun 20, 1:11 pm, vizzz <andrea.visin...@gmail.com> wrote:
Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea

Hmmm... I had a look at this and ran across a simple problem. How do
you read a binary file and just echo the hex value of each byte to the screen?

#include <iostream>
#include <ostream>
#include <fstream>
#include <iterator>
#include <iomanip>
#include <algorithm>
#include <cassert>

class print_hex {

    std::ostream * ostr_ptr;
    unsigned int line_length;
    unsigned int index;

public:

    print_hex ( std::ostream & str_ref, unsigned int length )
        : ostr_ptr( &str_ref )
        , line_length ( length )
        , index ( 0 )
    {}

    // Print one byte as two hex digits; end the line after line_length bytes.
    void operator() ( unsigned char ch ) {
        ++index;
        if ( index >= line_length ) {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << '\n';
            index = 0;
        } else {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << ' ';
        }
    }

};

int main ( int argn, char ** args ) {
    assert( argn == 2 );
    // Open in binary mode so that no newline translation alters the bytes.
    std::ifstream in ( args[1], std::ios::binary );
    std::for_each( std::istreambuf_iterator< char >( in ),
                   std::istreambuf_iterator< char >(),
                   print_hex( std::cout, 25 ) );
    std::cout << '\n';
}

The issue is the c++ read function doesn't return number of bytes
read... so on the last read into a buffer how do you know how many
characters to print?

Have a look at readsome().

Best

Kai-Uwe Bux

Jun 27 '08 #4

Eric Pruneau

"vizzz" <an*************@gmail.com> wrote in message news:
aa******************************...legroups.com...

Hi there,
I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea

Check out boost::regex

http://www.boost.org/doc/libs/1_35_0...tml/index.html

Jun 27 '08 #5

James Kanze

On Jun 20, 10:43 pm, Kai-Uwe Bux <jkherci...@gmx.net> wrote:

vizzz wrote:

I need to find a hex pattern like 0x650A1010 in a binary
file. I can write a small algorithm that scans the whole file
for the match, but the file is huge, and I'm worried about
performance. Is there any STL method for a fast search?

You could try std::search() with istreambuf_iterator< unsigned char >.

That's very problematic. istreambuf_iterator< unsigned char >
will expect a basic_streambuf< unsigned char >, which isn't
defined by the standard (and you're not allowed to define it).
A number of implementations do provide a generic version of
basic_streambuf, but since the standard doesn't say what the
generic version should do, they tend to differ. (I remember
sometime back someone posting in fr.comp.lang.c++ that he had
problems because g++ and VC++ provide incompatible generic
versions.)

It would, I suppose, be possible to use istream_iterator<
unsigned char >, provided the file was opened in binary mode,
and you reset skipws. I have my doubts about the performance of
this solution, but it's probably worth a try---if the
performance turns out to be acceptable, you won't get much
simpler.

Except, of course, that search requires forward iterators, and
won't (necessarily) work with input iterators.

[...]

Maybe, rolling your own is not all that bad. You could read
the file in chunks (keeping the last three characters from the
previous block) and use std::search() on the blocks. With the
right blocksize, this could be really fast.

A lot depends on other possible constraints. He didn't say, but
his example was to look for 0x650A1010, not the sequence 0x65,
0x0A, 0x10, 0x10. If what he is really looking for is a four
byte word, correctly aligned, then as long as the block size is
a multiple of 4, he could use search() with an
iterator::value_type of uint32_t. For arbitrary positions and
sequences, on the other hand, some special handling might be
necessary for cases where the sequence spans a block boundary.
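
A minimal sketch of the aligned-word case, assuming the block has been read directly into a buffer of std::uint32_t (so its size is automatically a multiple of 4). The names to_word and find_aligned_word are hypothetical. Building the search key with memcpy from the four pattern bytes, in file order, sidesteps any assumption about the host's byte order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Packs four bytes, in file order, into a word exactly as they would
// appear in a buffer that was filled by a raw read.
std::uint32_t to_word(const unsigned char* bytes)
{
    std::uint32_t word;
    std::memcpy(&word, bytes, sizeof word);
    return word;
}

// Returns the byte offset (always a multiple of 4) of the first word
// equal to key, or -1 if there is none.
long long find_aligned_word(const std::vector<std::uint32_t>& words,
                            std::uint32_t key)
{
    const auto hit = std::find(words.begin(), words.end(), key);
    if (hit == words.end()) return -1;
    return static_cast<long long>(hit - words.begin()) * 4;
}
```

A buffer like this could be filled with something along the lines of in.read(reinterpret_cast<char*>(words.data()), words.size() * 4); an occurrence that is not 4-byte aligned would be missed by design.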

When I had to do something similar, I reserved a guard zone in
front of my buffer, and used a BM search in the buffer. When
the BM search would have taken me beyond the end of the buffer,
I copied the last N bytes of the buffer into the end of the
guard zone before reading the next block, and started my next
search from them. This would probably make keeping track of the
offset a bit tricky (I didn't need the offset), and for the best
performance on the system I was using then, I had to respect
alignment of the buffer as well, which also added some extra
complexity. (But I got the speed we needed:-).)
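
For the Boyer-Moore part alone, a present-day sketch could lean on std::boyer_moore_searcher. Note the assumptions: that class only arrived with C++17, so it is certainly not what was used in the code described above, and the guard-zone and alignment handling are not shown here. The function name find_with_boyer_moore is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Searches one in-memory buffer with the C++17 Boyer-Moore searcher.
// Returns the offset of the first match within the buffer, or -1.
long long find_with_boyer_moore(const std::vector<unsigned char>& buffer,
                                const std::vector<unsigned char>& pattern)
{
    assert(!pattern.empty());
    // The searcher precomputes its skip tables from the pattern once;
    // it can be reused across many buffers.
    const std::boyer_moore_searcher<
        std::vector<unsigned char>::const_iterator>
        searcher(pattern.begin(), pattern.end());

    const auto hit = std::search(buffer.begin(), buffer.end(), searcher);
    if (hit == buffer.end()) return -1;
    return static_cast<long long>(hit - buffer.begin());
}
```

When searching block by block, the searcher would be constructed once outside the read loop so the table setup cost is paid only once.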

If your OS allows memory mapping of the file, you could do
that and use std::search() with unsigned char * on the whole
thing. That could be the fastest way, but will leave the realm
of standard C++.

If the entire file will fit into memory, perhaps just reading it
all into memory, and then using std::search, would be an
appropriate solution. Or perhaps not: it's often faster to use
a somewhat smaller buffer, and manage the "paging" yourself.

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Jun 27 '08 #6

James Kanze

On Jun 21, 2:13 am, Kai-Uwe Bux <jkherci...@gmx.net> wrote:

Ivan wrote:
On Jun 20, 1:11 pm, vizzz <andrea.visin...@gmail.com> wrote:

Hmmm... I had a look at this and ran across a simple
problem. How do you read a binary file and just echo the
hex value of each byte to the screen?

#include <iostream>
#include <ostream>
#include <fstream>
#include <iterator>
#include <iomanip>
#include <algorithm>
#include <cassert>

class print_hex {

    std::ostream * ostr_ptr;
    unsigned int line_length;
    unsigned int index;

public:

    print_hex ( std::ostream & str_ref, unsigned int length )
        : ostr_ptr( &str_ref )
        , line_length ( length )
        , index ( 0 )
    {}

    void operator() ( unsigned char ch ) {
        ++index;
        if ( index >= line_length ) {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << '\n';
            index = 0;
        } else {
            (*ostr_ptr) << std::hex << std::setw(2) << std::setfill( '0' )
                        << (unsigned int)(ch) << ' ';

Wouldn't it be preferable to set the formatting flags in the
constructor? I'd also provide an "indent" argument; if index
were 0, I'd output indent spaces, otherwise a single space---or
perhaps the best solution would be to provide a start of line
and a separator string to the constructor, then:

(*ostr_ptr)
    << (inLineCount == 0 ? startString : separString)
    << std::setw( 2 ) << (unsigned int)( ch ) ;
++ inLineCount ;
if ( inLineCount == lineLength ) {
    (*ostr_ptr) << endString ;
    inLineCount = 0 ;
}

(This supposes that hex and fill were set in the constructor.)
Given the copying that's going on, I'd also simulate move
semantics, so that the final destructor could do something like:

if ( inLineCount != 0 ) {
    (*ostr_ptr) << endString ;
}

        }
    }
};

int main ( int argn, char ** args ) {
    assert( argn == 2 );
    std::ifstream in ( args[1], std::ios::binary );
    std::for_each( std::istreambuf_iterator< char >( in ),
                   std::istreambuf_iterator< char >(),
                   print_hex( std::cout, 25 ) );

Unless you're doing something relatively generic, with support
for different separators, etc., this really looks like a case of
for_each abuse.

std::cout << '\n';

Which results in one new line too many if the number of elements
just happened to be an exact multiple of the line length.

About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:

std::cout << dump( someObject ) << std::endl ;

The code that ends up getting called in the << operator is:

IOSave saver( dest ) ;
dest.fill( '0' ) ;
dest.setf( std::ios::hex, std::ios::basefield ) ;
char const* baseStr = "" ;
if ( (dest.flags() & std::ios::showbase) != 0 ) {
    baseStr = "0x" ;
    dest.unsetf( std::ios::showbase ) ;
}
unsigned char const* const
    end = myObj + sizeof( T ) ;
for ( unsigned char const* p = myObj ; p != end ; ++ p ) {
    if ( p != myObj ) {
        dest << ' ' ;
    }
    dest << baseStr << std::setw( 2 ) << (unsigned int)( *p ) ;
}

(Note that there's extra code there to support my personal
preference: a "0x" with a small x, even if std::ios::uppercase
is specified.)
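
To make the fragment above easier to try out, here is one way the surrounding pieces could be filled in. This is a guess at the shape, not the actual library code: the Dump<T> layout, the dump() helper and dump_to_string() are assumptions, the IOSave class is replaced by saving and restoring flags and fill by hand, and the showbase handling is reduced to simply switching showbase off.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Holds a pointer to the object's bytes; valid only while the object lives.
template <typename T>
class Dump {
public:
    explicit Dump(const T& obj)
        : bytes_(reinterpret_cast<const unsigned char*>(&obj)) {}

    friend std::ostream& operator<<(std::ostream& dest, const Dump& d)
    {
        // Stand-in for IOSave: remember the state we are about to change.
        const std::ios::fmtflags saved_flags = dest.flags();
        const char saved_fill = dest.fill('0');
        dest.setf(std::ios::hex, std::ios::basefield);
        dest.unsetf(std::ios::showbase);  // simplification, no "0x" prefix

        for (std::size_t i = 0; i != sizeof(T); ++i) {
            if (i != 0) dest << ' ';
            dest << std::setw(2) << static_cast<unsigned int>(d.bytes_[i]);
        }

        dest.flags(saved_flags);
        dest.fill(saved_fill);
        return dest;
    }

private:
    const unsigned char* bytes_;
};

// Helper function so the template argument is deduced automatically.
template <typename T>
Dump<T> dump(const T& obj)
{
    return Dump<T>(obj);
}

// Convenience wrapper for examples: renders the dump into a std::string.
template <typename T>
std::string dump_to_string(const T& obj)
{
    std::ostringstream out;
    out << dump(obj);
    return out.str();
}
```

The output for anything other than byte arrays depends on the platform's object representation (byte order, padding), which is the point of using it for experimenting.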

}
The issue is the c++ read function doesn't return number of
bytes read... so on the last read into a buffer how do you
know how many characters to print?

Have a look at readsome().

Yes, have a look at it. Read its specification very carefully.
Because if you do, you'll realize that it is absolutely
worthless here.

The function he's looking for is istream::gcount(), which
returns the number of bytes read by the last unformatted read.
His basic loop would be:

while ( input.read( &buffer[ 0 ], buffer.size() ) ) {
    process( buffer.begin(), buffer.end() ) ;
}
process( buffer.begin(), buffer.begin() + input.gcount() ) ;

(But IMHO, istream really isn't appropriate for binary; if I'm
really working with a binary file, I'll drop down to the system
API.)
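
Spelled out as a complete hex dump, the read()/gcount() loop might look like the sketch below. The function names hex_dump and hex_of, the default buffer size, and the single-space separator are illustrative choices; only the loop structure comes from the post.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes every byte of input to out as two hex digits, space separated.
void hex_dump(std::istream& input, std::ostream& out,
              std::size_t buffer_size = 4096)
{
    assert(buffer_size > 0);
    std::vector<char> buffer(buffer_size);
    const std::ios::fmtflags saved_flags = out.flags();
    const char saved_fill = out.fill('0');
    out << std::hex;

    bool first = true;
    auto process = [&](std::size_t count) {
        for (std::size_t i = 0; i != count; ++i) {
            if (!first) out << ' ';
            first = false;
            out << std::setw(2)
                << static_cast<unsigned int>(
                       static_cast<unsigned char>(buffer[i]));
        }
    };

    // Full blocks: read() succeeds only if the whole buffer was filled.
    while (input.read(buffer.data(),
                      static_cast<std::streamsize>(buffer.size()))) {
        process(buffer.size());
    }
    // Final partial block: gcount() says how many bytes actually arrived.
    process(static_cast<std::size_t>(input.gcount()));

    out.flags(saved_flags);
    out.fill(saved_fill);
}

// Usage sketch on an in-memory stream.
std::string hex_of(const std::string& bytes, std::size_t buffer_size = 4096)
{
    std::istringstream in(bytes);
    std::ostringstream out;
    hex_dump(in, out, buffer_size);
    return out.str();
}
```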


Jun 27 '08 #7

James Kanze

On Jun 21, 3:59 am, "Eric Pruneau" <eric.prun...@cgocable.ca> wrote:

"vizzz" <andrea.visin...@gmail.com> wrote in message news:
aad55897-6560-4fd7-ae4f-5b8cc810f...@a70g2000hsh.googlegroups.com...

I need to find a hex pattern like 0x650A1010 in a binary file.
I can write a small algorithm that scans the whole file for the match,
but the file is huge, and I'm worried about performance.
Is there any STL method for a fast search?
Andrea

Check out boost::regex

Which requires a forward iterator, and so can't be used on data
in a file (for which he'll have at best an input iterator).

Also, if he's only looking for a fixed string, it's likely to be
significantly slower than some other algorithms.

http://www.boost.org/doc/libs/1_35_0...tml/index.html

Jun 27 '08 #8

James Kanze

On Jun 22, 12:49 am, Kai-Uwe Bux <jkherci...@gmx.net> wrote:

James Kanze wrote:

[...]

Unless you're doing something relatively generic, with
support for different separators, etc., this really looks
like a case of for_each abuse.

Actually, with regard to for_each, I am growing more and more
comfortable using it.

I'm actually pretty comfortable using it too. Regretfully, we
seem to be a minority, and the programmers having to maintain my
code find it "unnatural", and that it hurts readability, to move
the contents of a loop out into a separate class. Unless that
class is in some way "reusable", i.e. it represents some more
general application.

[...]

std::cout << '\n';

Which results in one new line too many if the number of
elements just happened to be an exact multiple of the line
length.

You are making up specs :-)

You started it:-). You decided that he needed newlines in the
sequence to begin with. (OK: somebody did say something about
megabytes somewhere. But maybe he has a very, very wide
screen.)

But seriously: you are right, of course.

About the only real use for this sort of output I've found is
debugging or experimenting, but there, I use it often enough
that I've a generic Dump<T> class (and a generic function which
returns it, for automatic type deduction), so that I can write
things like:

std::cout << dump( someObject ) << std::endl ;

[snip]

Hm, I never had a use for hex dumping objects. But, maybe I
should try that out.

I didn't really, for the longest time (which is why it isn't at
my site---I only added it to the library very recently). Even
now, most of its use is for "experimenting": for trying to guess
the representation of some type in an undocumented format, for
example.

On the other hand, if I ever find time to write up an article on
how to correctly use iostream, I'll probably include it, because
it is a good example of how to handle arbitrary formatting for
any possible type.


Jun 27 '08 #9

James Kanze

On Jun 21, 8:57 pm, Mirco Wahab <wahab-m...@gmx.de> wrote:

vizzz wrote:
Maybe explaining my goal can be useful.
In JPEG 2000 files (jp2) there are several boxes, each made of a 4-byte length,
a 4-byte type, and then data.
I must check whether a box exists by searching the file
(boxes can be anywhere in the whole file) for the box type
(e.g. 0x650A1010).

What is the largest file size and on which system do you want
this to happen?

The C-memchr is, on modern compilers, very very fast (it does
8 byte alignment on the pointer, scans 32 or 64 bit at a time
by bit ops and so on.)

Maybe. I'm not familiar with the jpeg format, but somehow, I'd
be a bit surprised if the 4 byte value isn't required to be
aligned. And if it's aligned, treating the buffer as an array
of uint32_t, and using std::find, will almost certainly be
significantly faster than memchr.

You can't simply beat that one.

Actually, you almost always can.
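
For comparison, the memchr-based scan under discussion could be sketched as follows, so that it can be timed against the std::find and std::search variants on real data. The function name find_with_memchr is made up, and nothing here says which approach wins on a given system; that has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Uses memchr to jump to each candidate first byte, then memcmp to
// confirm the full pattern. Returns the offset of the first match or -1.
long long find_with_memchr(const unsigned char* data, std::size_t size,
                           const unsigned char* pattern,
                           std::size_t pattern_size)
{
    assert(pattern_size > 0);
    if (size < pattern_size) return -1;

    const unsigned char* p = data;
    // Last position at which a full pattern still fits.
    const unsigned char* const last_start = data + (size - pattern_size);

    while (p <= last_start) {
        const void* hit = std::memchr(
            p, pattern[0], static_cast<std::size_t>(last_start - p) + 1);
        if (hit == nullptr) return -1;
        p = static_cast<const unsigned char*>(hit);
        if (std::memcmp(p, pattern, pattern_size) == 0) {
            return static_cast<long long>(p - data);
        }
        ++p;  // false alarm, continue just past this candidate
    }
    return -1;
}
```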


Jun 27 '08 #10
