Searching for byte string in a binary file.

I took a C course some time ago, but I'm only now beginning to use it,
for a personal pet project. My current stumbling-block is finding an
efficient way to find a match between the beginning of a "counted"
string and data in a binary file. Given...

#include <stdio.h>

int main(int argc, char *argv[])
{
    char bstring[255];
    int bstring_length, match_length;
    long match_location;
    FILE *input_file, *output_file;

    if(argc != 3){
        printf(" Correct usage:\n %s input_filename output_filename \n", argv[0]);
        return 1;
    }
    if((input_file = fopen(argv[1], "rb")) == NULL){
        printf("Error opening %s for input\n", argv[1]);
        return 1;
    }
    if((output_file = fopen(argv[2], "wb")) == NULL){
        printf("Error opening %s for output\n", argv[2]);
        return 1;
    }
    ...

Later on, bstring and bstring_length get initialized. bstring may
contain \0's, so the "logical length" (which will not exceed 255) is
actually stored in bstring_length. What I'm trying to do is to find the
location and length of the longest match from the left side of bstring.
If that's not clear, here's an example...

bstring[0] = 'a';
bstring[1] = '\0';
bstring[2] = 'b';
bstring[3] = 'c';
bstring[4] = 'd';
bstring_length = 5;

Assume that the sequence is not matched anywhere in the file. However,
assume that the sequence 'a', '\0', 'b', 'c' does exist in the file.
What's the most efficient search that returns match_length ( 4 ) and
ftell() of the beginning of the match? If it's a function, should I
declare in main() like so?

int bstring_length, *match_length;
long *match_location ;

I assume that the function prototype would be something like...

int findmatch( bstring[255], bstring_length, match_length, match_location) ;

By passing pointers, I assume that both match_location and
match_length will be modified and available to main(). Or is my
understanding of pointers lacking? findmatch() should return 0 if a
match is found, and 1 if no match. Would memory-mapped files help here?

--
Walter Dnes; my email address is *ALMOST* like wz*******@waltdnes.org
Delete the "z" to get my real address. If that gets blocked, follow
the instructions at the end of the 550 message.
Nov 14 '05 #1
"Walter Dnes (delete the 'z' to get my real address)" <wz*******@waltdnes.org> wrote:

> I took a C course some time ago, but I'm only now beginning to use it,
> for a personal pet project. My current stumbling-block is finding an
> efficient way to find a match between the beginning of a "counted"
> string and data in a binary file. Given...
>
> #include <stdio.h>

Also include <stdlib.h>, see below.

> int main(int argc, char *argv[])
> {
>     char bstring[255];
>     int bstring_length, match_length;
>     long match_location;
>     FILE *input_file, *output_file;
>
>     if(argc != 3){
>         printf(" Correct usage:\n %s input_filename output_filename \n", argv[0]);
>         return 1;

Make that

return EXIT_FAILURE;

not all OSes return 1 for failure and 0 for success, and with the macros
EXIT_FAILURE and EXIT_SUCCESS, defined in <stdlib.h>, you're always on
the safe side.

Another common convention is to print error messages to stderr and not
mix them with the normal output of your program. So better use

fprintf( stderr, "Correct usage:\n %s input_filename output_filename \n",
argv[0]);

>     }
>     if((input_file = fopen(argv[1], "rb")) == NULL){
>         printf("Error opening %s for input\n", argv[1]);
>         return 1;
>     }
>     if((output_file = fopen(argv[2], "wb")) == NULL){
>         printf("Error opening %s for output\n", argv[2]);
>         return 1;
>     }
>     ...
>
> Later on, bstring and bstring_length get initialized. bstring may
> contain \0's, so the "logical length" (which will not exceed 255) is
> actually stored in bstring_length. What I'm trying to do is to find the
> location and length of the longest match from the left side of bstring.
> If that's not clear, here's an example...
>
> bstring[0] = 'a';
> bstring[1] = '\0';
> bstring[2] = 'b';
> bstring[3] = 'c';
> bstring[4] = 'd';
> bstring_length = 5;
>
> Assume that the sequence is not matched anywhere in the file. However,
> assume that the sequence 'a', '\0', 'b', 'c' does exist in the file.
> What's the most efficient search that returns match_length ( 4 ) and
> ftell() of the beginning of the match? If it's a function, should I
> declare in main() like so?
>
> int bstring_length, *match_length;
> long *match_location ;
>
> I assume that the function prototype would be something like...
>
> int findmatch( bstring[255], bstring_length, match_length, match_location) ;

Sorry, that's not a prototype; a prototype has to give the type of each
parameter.

> By passing pointers, I assume that both match_location and
> match_length will be modified and available to main(). Or is my
> understanding of pointers lacking? findmatch() should return 0 if a
> match is found, and 1 if no match.

When you want to set 'match_location' within find_match() to a long
value that's visible in main() you need to change that a bit. You
must define 'match_location' in main() as long, i.e. have there

long match_location;

and then pass a pointer to that variable to the function. So your
prototype for the function would look like

int find_match( char buf[ 255 ], int blen, int mlen, long *loc );

(I intentionally changed the names to be used within the function
from the ones you defined in main() to make it more obvious what
belongs to main() and what belongs to find_match().) If you don't
intend to change the char array within the function it probably
would be better to make the first argument "const char buf[ 255 ]".

You now would call that function like

find_match( bstring, bstring_length, match_length, &match_location );

(please note the '&' in front of 'match_location', it tells that
you pass the address of 'match_location' and not its value).
Within find_match() you would then assign the position of the match
to '*loc', i.e. you write it into the memory location 'loc' points
to.

By the way, using names like 'bstring' and 'bstring_length' is a bit
misleading since 'bstring' isn't a string (a real string can't have
embedded '\0' characters), it's just an array of chars. So it might
be prudent to give it a name that doesn't make people assume that
it's going to be a string.

> Would memory-mapped files help here?

The rest of your questions on how to do the matching in the fastest,
most effective way are not really C questions but more about what
kind of algorithm to use. That's better discussed in groups like
comp.programming since it isn't really about C. And a google
search for e.g. "Boyer-Moore algorithm" (just to name one) will
probably come up with a lot of interesting links (and show you
how many possible algorithms have been developed for that problem
over the years). E.g.

http://www-igm.univ-mlv.fr/~lecroq/string/

looks rather interesting. Finally, questions about memory-mapping
files are off-topic here because that can only be done with
system-specific extensions to C.
Regards, Jens
--
\ Jens Thoms Toerring ___ Je***********@physik.fu-berlin.de
\__________________________ http://www.toerring.de
Nov 14 '05 #2
On 16 May 2004 11:39:57 GMT, Je***********@physik.fu-berlin.de, <Je***********@physik.fu-berlin.de> wrote:

> E.g. http://www-igm.univ-mlv.fr/~lecroq/string/ looks rather interesting).

Thanks. That appears to be exactly what I was looking for.

> Make that
>
> return EXIT_FAILURE;
>
> not all OSes return 1 for failure and 0 for success, and with the macros
> EXIT_FAILURE and EXIT_SUCCESS, defined in <stdlib.h>, you're always on
> the safe side.
>
> Another common convention is to print error messages to stderr and not
> mix them with the normal output of your program. So better use
>
> fprintf( stderr, "Correct usage:\n %s input_filename output_filename \n",
> argv[0]);


Thanks for the tips. Note that I've changed the subject. Another
question; is this the correct way to define TRUE/FALSE?

const int TRUE = (1==1);
const int FALSE = (1!=1);

--
Walter Dnes; my email address is *ALMOST* like wz*******@waltdnes.org
Delete the "z" to get my real address. If that gets blocked, follow
the instructions at the end of the 550 message.
Nov 14 '05 #3
"Walter Dnes (delete the 'z' to get my real address)" <wz*******@waltdnes.org> writes:

> Thanks for the tips. Note that I've changed the subject. Another
> question; is this the correct way to define TRUE/FALSE?
>
> const int TRUE = (1==1);
> const int FALSE = (1!=1);


It is correct, but indicates a lack of understanding. Read the
FAQ. In addition to the FAQ commentary on this issue, it should
be noted that the value of variables, even those defined as
constant, cannot be used in constant expressions.

9.1: What is the right type to use for Boolean values in C? Why
isn't it a standard type? Should I use #defines or enums for
the true and false values?

A: C does not provide a standard Boolean type, in part because
picking one involves a space/time tradeoff which can best be
decided by the programmer. (Using an int may be faster, while
using char may save data space. Smaller types may make the
generated code bigger or slower, though, if they require lots of
conversions to and from int.)

The choice between #defines and enumeration constants for the
true/false values is arbitrary and not terribly interesting (see
also questions 2.22 and 17.10). Use any of

#define TRUE  1             #define YES 1
#define FALSE 0             #define NO  0

enum bool {false, true};    enum bool {no, yes};

or use raw 1 and 0, as long as you are consistent within one
program or project. (An enumeration may be preferable if your
debugger shows the names of enumeration constants when examining
variables.)

Some people prefer variants like

#define TRUE (1==1)
#define FALSE (!TRUE)

or define "helper" macros such as

#define Istrue(e) ((e) != 0)

These don't buy anything (see question 9.2 below; see also
questions 5.12 and 10.2).

9.2: Isn't #defining TRUE to be 1 dangerous, since any nonzero value
is considered "true" in C? What if a built-in logical or
relational operator "returns" something other than 1?

A: It is true (sic) that any nonzero value is considered true in C,
but this applies only "on input", i.e. where a Boolean value is
expected. When a Boolean value is generated by a built-in
operator, it is guaranteed to be 1 or 0. Therefore, the test

if((a == b) == TRUE)

would work as expected (as long as TRUE is 1), but it is
obviously silly. In fact, explicit tests against TRUE and
FALSE are generally inappropriate, because some library
functions (notably isupper(), isalpha(), etc.) return,
on success, a nonzero value which is not necessarily 1.
(Besides, if you believe that "if((a == b) == TRUE)" is
an improvement over "if(a == b)", why stop there? Why not
use "if(((a == b) == TRUE) == TRUE)"?) A good rule of thumb
is to use TRUE and FALSE (or the like) only for assignment
to a Boolean variable or function parameter, or as the return
value from a Boolean function, but never in a comparison.

The preprocessor macros TRUE and FALSE (and, of course, NULL)
are used for code readability, not because the underlying values
might ever change. (See also questions 5.3 and 5.10.)

Although the use of macros like TRUE and FALSE (or YES
and NO) seems clearer, Boolean values and definitions can
be sufficiently confusing in C that some programmers feel
that TRUE and FALSE macros only compound the confusion, and
prefer to use raw 1 and 0 instead. (See also question 5.9.)

References: K&R1 Sec. 2.6 p. 39, Sec. 2.7 p. 41; K&R2 Sec. 2.6
p. 42, Sec. 2.7 p. 44, Sec. A7.4.7 p. 204, Sec. A7.9 p. 206; ISO
Sec. 6.3.3.3, Sec. 6.3.8, Sec. 6.3.9, Sec. 6.3.13, Sec. 6.3.14,
Sec. 6.3.15, Sec. 6.6.4.1, Sec. 6.6.5; H&S Sec. 7.5.4 pp. 196-7,
Sec. 7.6.4 pp. 207-8, Sec. 7.6.5 pp. 208-9, Sec. 7.7 pp. 217-8,
Sec. 7.8 pp. 218-9, Sec. 8.5 pp. 238-9, Sec. 8.6 pp. 241-4;
"What the Tortoise Said to Achilles".

--
int main(void){char p[]="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz.\
\n",*q="kl BIcNBFr.NKEzjwCIxNJC";int i=sizeof p/2;char *strchr();int putchar(\
);while(*q){i+=strchr(p,*q++)-p;if(i>=(int)sizeof p)i-=sizeof p-1;putchar(p[i]\
);}return 0;}
Nov 14 '05 #4
>"Walter Dnes (delete the 'z' to get my real address)"
><wz*******@waltdnes.org> writes:
>> Thanks for the tips. Note that I've changed the subject. Another
>> question; is this the correct way to define TRUE/FALSE?
>> const int TRUE = (1==1);
>> const int FALSE = (1!=1);

In article <news:87************@blp.benpfaff.org>
Ben Pfaff <bl*@cs.stanford.edu> writes:
> It is correct, but indicates a lack of understanding. Read the
> FAQ. In addition to the FAQ commentary on this issue, it should
> be noted that the value of variables, even those defined as
> constant, cannot be used in constant expressions.


I think it is also worth adding that C99 *does* now have a built-in
boolean type. The spelling of this type (and its values) is
deliberately sneaky, and you are supposed to "#include <stdbool.h>"
to make the names "bool", "true", and "false" available.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.
Nov 14 '05 #5

"Chris Torek" <no****@torek.net> wrote in message news:c8********@news2.newsguy.com...

> I think it is also worth adding that C99 *does* now have a built-in
> boolean type. The spelling of this type (and its values) is
> deliberately sneaky, and you are supposed to "#include <stdbool.h>"
> to make the names "bool", "true", and "false" available.


As a matter of interest, what was the motivation for this?

--
John.
Nov 14 '05 #6
"John L" <jl@lammtarra.fslife.co.uk> writes:

> "Chris Torek" <no****@torek.net> wrote in message
> news:c8********@news2.newsguy.com...
>> I think it is also worth adding that C99 *does* now have a built-in
>> boolean type. The spelling of this type (and its values) is
>> deliberately sneaky, and you are supposed to "#include <stdbool.h>"
>> to make the names "bool", "true", and "false" available.
>
> As a matter of interest, what was the motivation for this?


Are you asking about the motivation for adding the type, or for
requiring the use of <stdbool.h> to make the names visible?

The motivation for adding the type seems obvious to me; if it's not
obvious to you, ask again.

The reason for adding <stdbool.h> is to avoid breaking existing code.
"bool", "true", and "false" are the natural names for the boolean type
and its values, but making them keywords would have broken any code
that declared them as identifiers. Making them macros that don't
appear without a "#include <stdbool.h>" avoids this problem (existing
code is unlikely to use a header by that name).

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #7
In <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:

> "John L" <jl@lammtarra.fslife.co.uk> writes:
>> "Chris Torek" <no****@torek.net> wrote in message
>> news:c8********@news2.newsguy.com...
>>> I think it is also worth adding that C99 *does* now have a built-in
>>> boolean type. The spelling of this type (and its values) is
>>> deliberately sneaky, and you are supposed to "#include <stdbool.h>"
>>> to make the names "bool", "true", and "false" available.
>> As a matter of interest, what was the motivation for this?
>
> Are you asking about the motivation for adding the type, or for
> requiring the use of <stdbool.h> to make the names visible?
>
> The motivation for adding the type seems obvious to me; if it's not
> obvious to you, ask again.


The motivation for adding the type is not obvious to me. I've coded a lot
in C without needing such a type. A simple naming convention for flag
variables was more than enough. No need for the TRUE/FALSE nonsense,
either.

> The reason for adding <stdbool.h> is to avoid breaking existing code.


This is obvious to me. I wish *all* the C99 additions were made with
such concern in mind...

Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Da*****@ifh.de
Nov 14 '05 #8
Da*****@cern.ch (Dan Pop) writes:

> In <ln************@nuthaus.mib.org> Keith Thompson <ks***@mib.org> writes:

[regarding bool]

>> Are you asking about the motivation for adding the type, or for
>> requiring the use of <stdbool.h> to make the names visible?
>>
>> The motivation for adding the type seems obvious to me; if it's not
>> obvious to you, ask again.
>
> The motivation for adding the type is not obvious to me. I've coded a lot
> in C without needing such a type. A simple naming convention for flag
> variables was more than enough. No need for the TRUE/FALSE nonsense,
> either.


This is a case where the motivation could only be non-obvious to
an expert. Regardless of whether it is useful, many, many
libraries define their own constants for truth and falsity, and
many of these conflict with each other. Adding standard
constants should help to reduce these conflicts.
--
"I should killfile you where you stand, worthless human." --Kaz
Nov 14 '05 #9
On 17 May 2004 04:24:30 GMT, Chris Torek, <no****@torek.net> wrote:

>> "Walter Dnes (delete the 'z' to get my real address)"
>> <wz*******@waltdnes.org> writes:
>>> Thanks for the tips. Note that I've changed the subject. Another
>>> question; is this the correct way to define TRUE/FALSE?
>>> const int TRUE = (1==1);
>>> const int FALSE = (1!=1);
>
> In article <news:87************@blp.benpfaff.org>
> Ben Pfaff <bl*@cs.stanford.edu> writes:
>> It is correct, but indicates a lack of understanding. Read the
>> FAQ. In addition to the FAQ commentary on this issue, it should
>> be noted that the value of variables, even those defined as
>> constant, cannot be used in constant expressions.
>
> I think it is also worth adding that C99 *does* now have a built-in
> boolean type. The spelling of this type (and its values) is
> deliberately sneaky, and you are supposed to "#include <stdbool.h>"
> to make the names "bool", "true", and "false" available.


Thank you very much for that pointer...

[/usr] find / -name stdbool*
/usr/lib/gcc-lib/i386-linux/3.0.4/include/stdbool.h
/usr/lib/gcc-lib/i386-linux/2.95.4/include/stdbool.h

[23:28:42][/home/waltdnes] gcc --version
2.95.4

OK, my system defaults to version 2.95.4, so...

[/home/waltdnes] cat /usr/lib/gcc-lib/i386-linux/2.95.4/include/stdbool.h

/* stdbool.h for GNU. */
#ifndef __STDBOOL_H__
#define __STDBOOL_H__ 1

/* The type `bool' must promote to `int' or `unsigned int'. The constants
   `true' and `false' must have the value 0 and 1 respectively. */
typedef enum
  {
    false = 0,
    true = 1
  } bool;

/* The names `true' and `false' must also be made available as macros. */
#define false false
#define true true

/* Signal that all the definitions are present. */
#define __bool_true_false_are_defined 1

#endif /* stdbool.h */

I'll simply "#include <stdbool.h>" and be done with it.

--
Walter Dnes; my email address is *ALMOST* like wz*******@waltdnes.org
Delete the "z" to get my real address. If that gets blocked, follow
the instructions at the end of the 550 message.
Nov 14 '05 #10
