473,382 Members | 1,400 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

Search a binary file for a string... again! (its to slow)

Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText)))
{
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}
-------------------------------------------------------------------------
Nov 14 '05 #1
7 2593
spike wrote:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...
First of all: Your strequal function should be named differently, since
the standard reserves any name with "str..." for itself. Secondly, it
is basically a slow implementation of "strcmp"
int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}
You should better code it like something like

int strequal(char sText[],char sBuff[])
{
int i;
for(i=0;i<STRSIZE;i++)
if(sText[i] != sBuff[i])
return FALSE;

return TRUE;
}
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= .... fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);


That is what really kills you! Read 4 bytes, check them, rewind 3
bytes and start again.

Find something better here!
Nov 14 '05 #2
"spike" <jo**@ljungh.se> wrote in message
news:66**************************@posting.google.c om...
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText))) {
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}
-------------------------------------------------------------------------


Have a bigger buffer than 4 chars! Use at least 4k (or more really - spash
out 32k). That should make things much better ( less function calls and io
requests ) - and do not use fsetpos/fgetpos. Just write a looping function
first that reads in chunks of data -- after this, write a function that goes
through a buffer a character at a time, and keep track how much of the
search sting you have found so far.

Other than that consider using an algorithm such as Boyer-Moore
Nov 14 '05 #3

"spike" <jo**@ljungh.se> wrote in message
news:66**************************@posting.google.c om...
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...
Yup, it's just a *guess*. See below.
Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}


Some guesses of my own:

Instead of 'rolling your own' function like that, have
you considered trying the 'memcmp()' function from the
standard library? It's likely optimized for your
platform.

But note that the language doesn't specify any particular
'performance', only behavior.

Another option to look into (in case it's the actual i/o
that's the bottleneck and not the memory parsing) is to
see if your implementation offers any 'extension' functions
which would probably have more 'intimate' knowledge of your
file system and exploit it.

But really, all this is just guessing, I suggest you use
a profiler to find out where the bottleneck *really* is.
I've found that it's often not where one's 'logic' says
it should be.

-Mike
Nov 14 '05 #4

"spike" <jo**@ljungh.se> wrote in message
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it
faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}
You can optimise this by writing it

int mystrncmp(const char *s1, const char *s2, int n)
{
int i;

for(i=0;i<n;i++)
if(s1[i] != s2[i])
return s2[i] - s1[i];
return 0;
}
This will save some cycles. You can also replace with the stdlib function
strncmp(), which may be even faster as coded in assembly.
int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText))) You might save a bit of time by simply calling fgetc(), which is designed to
return single characters.
{

Most of the time the first character in the buffer won't match the first
character in the string. So you can avoid the overhead of the function call
by simply rejecting it. if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
You are doing this on each character read. This will slow you down.
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}

The way to speed up is to have a more sophisticated buffer.

Read characters until you come up with a match in the first position.
Then read enough characters for the string into the buffer.
If you have a match, you have a match.

Otherwise, search the buffer for the next matching character to the first
letter, shift left, and read more in. If the alphabet is large and the
string quite short you should frequently flush the buffer, in which case you
go back to reading characters one at a time.

Shifting characters is expensive, but a circular buffer involves the modulus
operation, which is also expensive. To get even more speed, you can declare
an oversize buffer, as large as you like, and keep a travelling pointer to
the start position. Only when you hit the end do you need to move some
characters to the beginning and reset the start pointer. (If the buffer
empties then of course you go back to the initial start position for free).
Nov 14 '05 #5


Malcolm wrote:
"spike" <jo**@ljungh.se> wrote in message
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it
faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4
Remember the above define! We'll come back to it
later.

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

You can optimise this by writing it

int mystrncmp(const char *s1, const char *s2, int n)
{
int i;

for(i=0;i<n;i++)
if(s1[i] != s2[i])
return s2[i] - s1[i];
return 0;
}
This will save some cycles. You can also replace with the stdlib function
strncmp(), which may be even faster as coded in assembly.
int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
Latent bug in the code here. STRSIZE was defined
as 4. "name" is a 5 character string including
the NULL byte '/0' at the end. Therefore a NULL was
written *somewhere* (implementation specific).
It is very possible (but not guaranteed) that
the compiler allocated storage for sText and
sBuff right after each other.
If that was the case, the NULL byte
was written into the first character of sBuff.
Later on, when you read data into sBuff, that
NULL byte would be overwritten with some character
from the file and sText would no longer be
a null-terminated string. This would make
for some "interesting" behavior and probably
prevent you from using any of the standard library
functions.

Take the advice of the other posters about
using standard library functions and larger buffer
sizes, too.
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=


(sizeof(sText)))

You might save a bit of time by simply calling fgetc(), which is designed to
return single characters.
{


Most of the time the first character in the buffer won't match the first
character in the string. So you can avoid the overhead of the function call
by simply rejecting it.
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);


You are doing this on each character read. This will slow you down.


The comment was absolutely correct. (Sorry, I don't
recall who made it.) You are (conceptually) reading
4 characters, then taking 3 steps back, and reading
4 characters again, 3 of which you have already read
once. While this works, your question was about
speed, and this sequence above looks like a real
pig.
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}


The way to speed up is to have a more sophisticated buffer.

Read characters until you come up with a match in the first position.
Then read enough characters for the string into the buffer.
If you have a match, you have a match.

Otherwise, search the buffer for the next matching character to the first
letter, shift left, and read more in. If the alphabet is large and the
string quite short you should frequently flush the buffer, in which case you
go back to reading characters one at a time.

Shifting characters is expensive, but a circular buffer involves the modulus
operation, which is also expensive. To get even more speed, you can declare
an oversize buffer, as large as you like, and keep a travelling pointerto
the start position. Only when you hit the end do you need to move some
characters to the beginning and reset the start pointer. (If the buffer
empties then of course you go back to the initial start position for free).



--
Ñ
"It is impossible to make anything foolproof because fools are so
ingenious" - A. Bloch

Nov 14 '05 #6
Joerg Schoen wrote:
spike wrote:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make
it faster? It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...
First of all: Your strequal function should be named differently,
since the standard reserves any name with "str..." for itself.
Secondly, it is basically a slow implementation of "strcmp"

.... snip ...
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);


That is what really kills you! Read 4 bytes, check them, rewind 3
bytes and start again.

Find something better here!


Here's a revision to what I published to the OP's original
request. He must have not seen that. Some slight changes have
been made to ensure that we can have binary file access, which is
not necessarily possible via stdin. This never backtracks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

/* The difference between a binary and a text file, on read,
is the conversion of end-of-line delimiters. What those
delimiters are does not affect the action. There is also
a potential problem with 0x1a EOF markers in DOS/Windoze.

This is a version of Knuth-Morris-Pratt algorithm. The
point of using this is to avoid any backtracking in file
reading, and thus avoiding any use of buffer arrays.
*/

size_t chrcount; /* debuggery, count of input chars, zeroed */

/* --------------------- */

/* Almost straight out of Sedgewick */
/* The next array indicates what index in id should next be
compared to the current char. Once the (lgh - 1)th char
has been successfully compared, the id has been found.
The array is formed by comparing id to itself. */
void initnext(int *next, const char *id, int lgh)
{
int i, j;

assert(lgh > 0);
next[0] = -1; i = 0; j = -1;
while (i < lgh) {
while ((j >= 0) && (id[i] != id[j])) j = next[j];
i++; j++;
next[i] = j;
}
#if (0)
for (i = 0; i < lgh; i++)
printf("id[%d] = '%c' next[%d] = %d\n",
i, id[i], i, next[i]);
#endif
} /* initnext */

/* --------------------- */

/* reads f without rewinding until either EOF or *marker
has been found. Returns EOF if not found. At exit the
last matching char has been read, and no further. */
int kmpffind(const char *marker, int lgh, int *next, FILE *f)
{
int j; /* char position in marker to check */
int ch; /* current char */

assert(lgh > 0);
j = 0;
while ((j < lgh) && (EOF != (ch = getc(f)))) {
chrcount++;
while ((j >= 0) && (ch != marker[j])) j = next[j];
j++;
}
return ch;
}

/* --------------------- */

int binfsrch(const char *marker, FILE *f)
{
int *next;
int lgh;
int ch;
int items; /* count of markers found */

if (!(next = malloc(strlen(marker) * sizeof *next))) {
puts("No memory");
exit(EXIT_FAILURE);
}
else {
lgh = strlen(marker);
initnext(next, marker, lgh);
items = 0;
while (EOF != kmpffind(marker, lgh, next, f)) {
/* found it, dump the following printable chars */
items++;
printf("%d %s : \"", items, marker);
while (isprint(ch = getc(f))) {
chrcount++;
putchar(ch);
}
puts("\"");
if (EOF == ch) break;
else chrcount++;
}
return items;
}
} /* binfsrch */

/* --------------------- */

int main(int argc, char **argv)
{
FILE *f;

f = stdin;
if (3 == argc) {
if (!(f = fopen(argv[2], "rb"))) {
printf("Can't open %s\n", argv[2]);
exit(EXIT_FAILURE);
}
argc--;
}
if (2 != argc) {
puts("Usage: binfsrch name [binaryfile]");
puts(" (file defaults to stdin text mode)");
}
else if (binfsrch(argv[1], f)) {
printf("\"%s\" : found\n", argv[1]);
}
else printf("\"%s\" : not found\n", argv[1]);
printf("%lu chars\n", (unsigned long)chrcount);
return 0;
} /* main binfsrch */

--
Chuck F (cb********@yahoo.com) (cb********@worldnet.att.net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net> USE worldnet address!

Nov 14 '05 #7
On 28 Feb 2004 12:22:33 -0800, jo**@ljungh.se (spike) wrote:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText[i] == sBuff[i])
{
iRetval = iRetval && TRUE;
I don't know if this contributes to your timing problem but this can
never change the value of iRetval and is effectively a no-op.
}
else
{
iRetval = iRetval && FALSE;
Similarly, this can only and will always set iRetval to FALSE.
Furthermore, once iRetval is FALSE, it can never become TRUE so there
is no point in continuing the loop.
}
}
return iRetval;
}

snip
<<Remove the del for email>>
Nov 14 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

4
by: Nikos | last post by:
Hi... I would like to search for a hex string (for example: "E903") inside a binary file... Although I open the file correctly, how do I search hex values? Thanks in advance! Nikos
1
by: Andrew | last post by:
Hi, im trying to create a small function which can create a binary tree from the entries in a text file and after that we can perform all the usual operations like search, insert and delete etc....
1
by: Eric | last post by:
Hi: I have two files. I search pattern ":" from emails text file and save email contents into a database. Another search pattern " field is blank. Please try again.", vbExclamation + vbOKOnly...
1
by: vinothg | last post by:
Hi , I have a binary file which contains 30,000 strings of 20 bytes each.I need to search for a string in the file to see whether the particular string exists. The sample code which i wrote...
33
by: Bertram Trabant | last post by:
Hello, Im working on a little LAN game in the style of old text-only MUD's and would need to find a way to search for a string in a text file (for example for usernames). I know it works in the way...
7
by: elliotng.ee | last post by:
I have a text file that contains a header 32-bit binary. For example, the text file could be: %%This is the input text %%test.txt Date: Tue Dec 26 14:03:35 2006...
1
by: hn.ft.pris | last post by:
I have the following code: Tree.h defines a simple binary search tree node structure ########## FILE Tree.h ################ #ifndef TREE_H #define TREE_H //using namespace std; template...
16
by: Computer geek | last post by:
Hello, I am new to VB.NET and programming in general. I have taught myself a lot of the basics with vb.net but am still quite the novice. I am working on a little application now and I need some...
10
by: rory | last post by:
I can't seem to append a string to the end of a binary file. I'm using the following code: fstream outFile("test.exe", ios::in | ios::out | ios::binary | ios::ate | ios::app)...
11
!NoItAll
by: !NoItAll | last post by:
I need to search a buffer for a string of characters. The buffer is essentially a binary file (it's an image) and contains lots of chr(0) characters. The string I need to locate also contains...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.