Finding strings in binary files

Hello everyone,

I wrote, simply as an exercise, a small piece of code to find 'strings'
(defined as a run of at least 3 ASCII characters followed by a non-ASCII
character) in binary files.

The purpose of the program is to serve as a simple replacement for the
Unix 'strings' command and to be 100% ANSI C. Unfortunately, it runs
noticeably slower than the original 'strings' from the fileutils package.

Maybe someone has some hints on how to improve performance while keeping
the code pure ANSI C.

Any other remarks on obvious or not-so-obvious errors are highly
appreciated, too.

#v+

/* Seek for ASCII-strings in binary streams and output them including
 * their byte-position in the stream
 */

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define MAXSTRING 128
#define ISSTRSIZE 3

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i=0, j=MAXSTRING;
    int ch;
    char *buf;

    switch(argc) {
    case 0:
    case 1:
        if( !(inp=fdopen(fileno(stdin), "r")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    case 2:
        if( !(inp=fopen(argv[1], "rb")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

    if( !(buf = malloc(MAXSTRING)) )
        return EXIT_FAILURE;
    while( !feof(inp) ) {

        ch = fgetc(inp);

        if ( isprint(ch) ) {
            if( i>j ) {
                buf = realloc(buf, j*2);
                j *= 2;
            }
            buf[i++] = (char) ch;
        }
        else {
            if( i>ISSTRSIZE ) {
#ifdef POSITION
                printf("%6lu: ", ftell(inp)-i-1);
#endif
                buf[i] = '\0';
                puts(buf);
            }
            i=0;
        }
    }

    free(buf);
    return EXIT_SUCCESS;
}

#v-

TIA & Greets, Rob

--
The Enterprise meets God, and it's a child, a computer, or a C program.
Nov 14 '05 #1
Robert Manea <ro*@nova.hbx.us> wrote:
I wrote, simply as an exercise, a small piece of code to find 'strings'
(defined as a run of at least 3 ASCII characters followed by a non-ASCII
character) in binary files.

The purpose of the program is to serve as a simple replacement for the
Unix 'strings' command and to be 100% ANSI C. Unfortunately, it runs
noticeably slower than the original 'strings' from the fileutils package.

Maybe someone has some hints on how to improve performance while keeping
the code pure ANSI C.

Any other remarks on obvious or not-so-obvious errors are highly
appreciated, too.

/* Seek for ASCII-strings in binary streams and output them including
 * their byte-position in the stream
 */

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define MAXSTRING 128
#define ISSTRSIZE 3

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i=0, j=MAXSTRING;
    int ch;
    char *buf;

    switch(argc) {
    case 0:
    case 1:
        if( !(inp=fdopen(fileno(stdin), "r")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    case 2:
        if( !(inp=fopen(argv[1], "rb")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

Why do you open stdin with "r" but a file with "rb"? What you get from
stdin could be a binary file (e.g. via a pipe).

    if( !(buf = malloc(MAXSTRING)) )

Some people might object to using the logical negation operator in this
case for stylistic reasons, and I would think

    if ( ( buf = malloc( MAXSTRING ) ) == NULL )

could make your intentions easier to see.

        return EXIT_FAILURE;
    while( !feof(inp) ) {

        ch = fgetc(inp);

feof() doesn't work as you seem to assume. It only returns a
useful value _after_ you have tried to read something. Why don't
you go for the much simpler

    while ( ( ch = fgetc( inp ) ) != EOF ) {

That should cover all cases nicely, both end of file and read errors.
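
The read loop suggested above can be looked at in isolation. What follows is only a minimal sketch: count_printable is a hypothetical helper that is not part of the posted program, and all it does is count printable bytes. The point it tries to show is that the value returned by fgetc() is compared against EOF before it is used, so no separate feof() call is needed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Count the printable bytes in a stream, one character at a time.
 * count_printable is a hypothetical helper used only for illustration. */
unsigned long count_printable(FILE *fp)
{
    unsigned long n = 0;
    int ch;   /* int, not char, so that EOF stays distinguishable from data */

    assert(fp != NULL);
    /* fgetc() returns EOF at end of file and on a read error, so the loop
     * body only ever sees values that were actually read. */
    while ((ch = fgetc(fp)) != EOF) {
        if (isprint(ch))
            n++;
    }
    return n;
}
```

If the caller needs to tell a read error from a normal end of file, ferror(fp) can be checked once after the loop has finished.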

        if ( isprint(ch) ) {
            if( i>j ) {

That should be "i >= j" - if i is already as large as j, then with buf[ i ]
you would already be one past the end of the buffer.

                buf = realloc(buf, j*2);
                j *= 2;
            }
            buf[i++] = (char) ch;
        }
        else {
            if( i>ISSTRSIZE ) {
#ifdef POSITION
                printf("%6lu: ", ftell(inp)-i-1);
#endif
                buf[i] = '\0';

You might need another check here - if i > j - 1 this would write past
the end of the buffer.

                puts(buf);
            }
            i=0;
        }
    }

    free(buf);
    return EXIT_SUCCESS;
}


I guess that part of the reason the original strings implementation is
faster is that it reads larger chunks of the file into memory at once and
then operates on that buffer instead of calling fgetc() for each
character. That's something you could also implement. But since its
authors aren't bound by strict ANSI C conformance, they can also use
additional platform-dependent tricks like mmap() where available...
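
Reading in larger chunks, as described above, might look roughly like the following. This is a sketch under assumptions rather than the way the real 'strings' does it: count_printable_chunked and CHUNK_SIZE are made-up names, the chunk size is not tuned, and the function merely counts printable bytes to keep the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define CHUNK_SIZE 4096   /* illustrative value, not tuned */

/* Count printable bytes, reading the stream CHUNK_SIZE bytes at a time.
 * Hypothetical helper used only for illustration. */
unsigned long count_printable_chunked(FILE *fp)
{
    unsigned char chunk[CHUNK_SIZE];
    unsigned long n = 0;
    size_t got, k;

    assert(fp != NULL);
    /* fread() returns the number of bytes stored, so the valid indices
     * are 0 .. got-1; the last chunk is usually shorter than CHUNK_SIZE. */
    while ((got = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        for (k = 0; k < got; k++) {
            if (isprint(chunk[k]))
                n++;
        }
    }
    return n;
}
```

Using unsigned char for the buffer keeps every value passed to isprint() inside the range that function accepts.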
Regards, Jens
--
  \   Jens Thoms Toerring  ___  Je***********@physik.fu-berlin.de
   \__________________________  http://www.toerring.de
Nov 14 '05 #2
"Robert Manea" <ro*@nova.hbx.us> wrote in message
news:4s***********@rob.unisolblade.de...
I wrote, simply as an exercise, a small piece of code to find 'strings'
(defined as a run of at least 3 ASCII characters followed by a non-ASCII
character) in binary files.

ITYM at least 3 printable characters.

The purpose of the program is to serve as a simple replacement for the
Unix 'strings' command and to be 100% ANSI C. Unfortunately, it runs
noticeably slower than the original 'strings' from the fileutils package.

Maybe someone has some hints on how to improve performance while keeping
the code pure ANSI C.

Any other remarks on obvious or not-so-obvious errors are highly
appreciated, too.

#v+

/* Seek for ASCII-strings in binary streams and output them including
 * their byte-position in the stream
 */

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define MAXSTRING 128
#define ISSTRSIZE 3

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i=0, j=MAXSTRING;
    int ch;
    char *buf;

    switch(argc) {
    case 0:
    case 1:
        if( !(inp=fdopen(fileno(stdin), "r")) ) {

fdopen() and fileno() are not ANSI C functions. You can simply write:

    inp = stdin;

Or perhaps use freopen() to open stdin in binary mode.

            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    case 2:
        if( !(inp=fopen(argv[1], "rb")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

    if( !(buf = malloc(MAXSTRING)) )
        return EXIT_FAILURE;
    while( !feof(inp) ) {
        ch = fgetc(inp);

    while ((ch = getc(inp)) != EOF) {

For reasons given elsewhere.

        if ( isprint(ch) ) {
            if( i>j ) {
                buf = realloc(buf, j*2);

Always use a temporary pointer and test the result for success.

                j *= 2;
            }
            buf[i++] = (char) ch;
        }
        else {
            if( i>ISSTRSIZE ) {
#ifdef POSITION
                printf("%6lu: ", ftell(inp)-i-1);
#endif
                buf[i] = '\0';
                puts(buf);
            }
            i=0;
        }
    }

    free(buf);
    return EXIT_SUCCESS;
}


You may be able to improve performance by using fread() to read larger
chunks. Also, there is no reason to buffer the entire string; once you've
collected enough chars to decide it's a string, you can simply write them to
stdout and then continue to copy stdin to stdout until you see a
non-printable character. This might make a big performance difference if I
offered a rather large text file as input :).
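
The idea of not buffering the whole string can be sketched as a small state machine that is fed one chunk at a time. Everything named here is an assumption made for illustration: struct scanner, scan_chunk, scan_finish and the MIN_RUN threshold of 4 do not come from the thread. Only the first few characters of a run are held back; once the run is long enough they are flushed and every further character goes straight to the output.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define MIN_RUN 4   /* illustrative threshold, chosen for this sketch */

/* State that has to survive from one chunk to the next. */
struct scanner {
    char   pending[MIN_RUN];  /* start of a run that is not yet long enough */
    size_t run;               /* length of the current printable run */
};

/* Feed n bytes to the scanner. Runs of at least MIN_RUN printable
 * characters are written to out, one per line. */
void scan_chunk(struct scanner *s, const unsigned char *data, size_t n, FILE *out)
{
    size_t k;

    assert(s != NULL && data != NULL && out != NULL);
    for (k = 0; k < n; k++) {
        if (isprint(data[k])) {
            if (s->run < MIN_RUN - 1) {
                s->pending[s->run] = (char) data[k];  /* still undecided */
            } else {
                if (s->run == MIN_RUN - 1)            /* run just qualified */
                    fwrite(s->pending, 1, MIN_RUN - 1, out);
                fputc(data[k], out);                  /* no further buffering */
            }
            s->run++;
        } else {
            if (s->run >= MIN_RUN)
                fputc('\n', out);
            s->run = 0;
        }
    }
}

/* Call once after the last chunk so that a run reaching the end of the
 * input is still terminated with a newline. */
void scan_finish(struct scanner *s, FILE *out)
{
    assert(s != NULL && out != NULL);
    if (s->run >= MIN_RUN)
        fputc('\n', out);
    s->run = 0;
}
```

A caller would zero-initialise a struct scanner, pass each block returned by fread() to scan_chunk(), and call scan_finish() at the end. Because the run length lives in the struct, a string that straddles two chunks is still reported as one line.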

Alex
Nov 14 '05 #3

"Robert Manea" <ro*@nova.hbx.us> wrote in message

    while( !feof(inp) ) {

        ch = fgetc(inp);

There's no gross inefficiency here, but you are making two function calls
that could be replaced with a single macro call to getc(). As others have
noted, the use of feof() is incorrect anyway, though it hardly matters (it
means the last call to fgetc() will return EOF, which you handle as a normal
character).
This could well speed you up.

        if ( isprint(ch) ) {
            if( i>j ) {
                buf = realloc(buf, j*2);

You need a test here for out of memory.

                j *= 2;
            }
            buf[i++] = (char) ch;
        }
        else {
            if( i>ISSTRSIZE ) {
#ifdef POSITION
                printf("%6lu: ", ftell(inp)-i-1);
#endif
                buf[i] = '\0';
                puts(buf);
            }
            i=0;
        }
    }

    free(buf);
    return EXIT_SUCCESS;
}
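
The out-of-memory test asked for above, together with the bounds checks raised earlier in the thread, could be folded into one small helper. This is a sketch, not code from any of the posters: append_char is a hypothetical name, and it assumes the buffer was obtained from malloc() with a capacity of at least 1. It grows the buffer through a temporary pointer and always leaves room for the terminating '\0'.

```c
#include <assert.h>
#include <stdlib.h>

/* Append ch to *buf, which holds *len characters in *cap bytes, growing
 * the buffer when needed so that a terminating '\0' always still fits.
 * Returns 1 on success and 0 if memory ran out; *buf stays valid either
 * way. Hypothetical helper used only for illustration. */
int append_char(char **buf, size_t *cap, size_t *len, int ch)
{
    assert(buf != NULL && *buf != NULL && cap != NULL && len != NULL);
    assert(*cap > 0 && *len < *cap);

    if (*len + 1 >= *cap) {            /* need room for ch and the '\0' */
        size_t new_cap = *cap * 2;
        char *tmp = realloc(*buf, new_cap);
        if (tmp == NULL)
            return 0;                  /* the old block is untouched */
        *buf = tmp;
        *cap = new_cap;
    }
    (*buf)[(*len)++] = (char) ch;
    (*buf)[*len] = '\0';
    return 1;
}
```

Assigning the result of realloc() directly to the only copy of the pointer would lose the original block on failure, which is why the temporary is used.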

Nov 14 '05 #4
Segfault in module "Alex Fraser" - dump details are as follows:

Thanks a lot for your suggestions, Jens and Alex! I followed your
advice and achieved a real boost in speed.

For anyone interested here is the new version (I'm sure it still isn't
perfect, but way faster than the one before) including some benchmark
results.

First of all the benchmarks:

Tested on the following file:
$ dd if=/dev/urandom of=foo bs=1024 count=131072
$ ls -lh foo
-rw-rw-r-- 1 robert robert 128M 29. Jul 19:50 foo

$ time strings foo > /dev/null
7,01s user 0,46s system 99% cpu 7,544 total

$ time ./my_strings_OLD > /dev/null
8,60s user 0,65s system 99% cpu 9,292 total

$ time ./my_strings_NEW > /dev/null
2,02s user 0,48s system 98% cpu 2,547 total
And The Code:

#v+

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define ISSTRSIZE 3
#define READBUF 2048

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i=0, k=0, c_read;
    char buf[ISSTRSIZE+1], rbuf[READBUF];

    switch(argc) {
    case 0:
    case 1:
        inp = stdin;
        break;
    case 2:
        if( !(inp=fopen(argv[1], "rb")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

    while ( (c_read=fread(rbuf, 1, READBUF, inp)) > 0 ) {

        while(k <= c_read) { /* Not really sure here if '<' or '<=' */
            if ( isprint(rbuf[k]) ) {
                if(i<ISSTRSIZE)
                    buf[i++] = rbuf[k];
                else if(i == ISSTRSIZE) {
                    buf[ISSTRSIZE] = '\0';
                    fputs(buf, stdout);
                    fputc(rbuf[k], stdout);
                    i++;
                }
                else {
                    fputc(rbuf[k], stdout);
                    i++;
                }
            }
            else {
                if (i>ISSTRSIZE)
                    putchar('\n');
                i=0;
            }
            k++;
        }
        k=0;
    }

    return EXIT_SUCCESS;
}

#v-

Greets, Rob

--
The Enterprise meets God, and it's a child, a computer, or a C program.
Nov 14 '05 #5

On Thu, 29 Jul 2004, Robert Manea wrote:

Thanks a lot for your suggestions, Jens and Alex! I followed your
advice and achieved a real boost in speed.

[I tried profiling your original program with gprof, but it ran too
quickly to generate any data, even on several-megabyte inputs. It
looks like you have enough disk to run gigantic tests; have you
tried profiling the code to see where its bottlenecks are? Google
'gprof manual'.]

As for your code, it may well be as fast as possible. So I'm
going to inflict style tips on it.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define ISSTRSIZE 3
#define READBUF 2048

Neither of these names seems really correct. 'ISSTRSIZE' sounds
like a boolean, and 'READBUF' sounds like an action. But both of
them are really integer buffer sizes.

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i=0, k=0, c_read;
    char buf[ISSTRSIZE+1], rbuf[READBUF];

    switch(argc) {
    case 0:
    case 1:
        inp = stdin;
        break;
    case 2:
        if( !(inp=fopen(argv[1], "rb")) ) {
            perror("Error");
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

    while ( (c_read=fread(rbuf, 1, READBUF, inp)) > 0 ) {

        while(k <= c_read) { /* Not really sure here if '<' or '<=' */

A bad sign. 'c_read' is the number of bytes read from the file,
correct? And at the beginning of this loop, 'k' is... [scan the
file looking for initialization of 'k'...] zero. And you're...
[scan the file looking for increment...] incrementing 'k' and
accessing 'rbuf[k]' for each 'k'. So if 'c_read' is 'READBUF',
then 'k' ought to go only up to 'READBUF-1'. You meant '<', not
'<='. (This is almost always a safe bet in C.)

            if ( isprint(rbuf[k]) ) {
                if(i<ISSTRSIZE)
                    buf[i++] = rbuf[k];
                else if(i == ISSTRSIZE) {
                    buf[ISSTRSIZE] = '\0';
                    fputs(buf, stdout);
                    fputc(rbuf[k], stdout);
                    i++;
                }
                else {
                    fputc(rbuf[k], stdout);
                    i++;
                }

Here you write 'i++' three times in three different control
branches. Only one increment is really needed. Pull it out of
the branches into the body of the enclosing 'if'.

            }
            else {
                if (i>ISSTRSIZE)
                    putchar('\n');
                i=0;
            }
            k++;
        }
        k=0;

The re-initialization of 'k' is shoved all the way down here,
far from where it's used. This is bad. (As with the 'i++', you're
duplicating code in the wrong places rather than putting it in the
right place to begin with.)

    }

    return EXIT_SUCCESS;
}


Rewriting to incorporate all these style changes, we have:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_STRING_SIZE 3
#define BUFFER_SIZE 2048

int main(int argc, char *argv[])
{
    FILE *inp;
    size_t i, c_read;
    char buf[MIN_STRING_SIZE];  /* start of a run, held back until it is long enough */
    char rbuf[BUFFER_SIZE];     /* one chunk of input */

    switch(argc) {
    case 0:
    case 1:
        inp = stdin;
        break;
    case 2:
        inp = fopen(argv[1], "rb");
        if (inp == NULL) {
            fprintf(stderr, "Could not open file '%s'\n", argv[1]);
            return EXIT_FAILURE;
        }
        break;
    default:
        fprintf(stderr, "Syntax: %s [File]\n", argv[0]);
        return EXIT_FAILURE;
    }

    i = 0;  /* length of the current run of printable characters */
    while ((c_read = fread(rbuf, 1, sizeof rbuf, inp)) > 0)
    {
        size_t k;
        for (k=0; k < c_read; ++k) {
            /* Cast to unsigned char: a negative char passed to isprint() is undefined. */
            if (isprint((unsigned char) rbuf[k])) {
                if (i < MIN_STRING_SIZE)
                    buf[i] = rbuf[k];
                else if (i == MIN_STRING_SIZE) {
                    /* The '*' precision must be an int, not a size_t. */
                    printf("%.*s", (int) sizeof buf, buf);
                    putchar(rbuf[k]);
                }
                else {
                    putchar(rbuf[k]);
                }
                ++i;
            }
            else {
                if (i > MIN_STRING_SIZE)
                    putchar('\n');
                i = 0;
            }
        }
    }

    return EXIT_SUCCESS;
}

The scope of 'i' is still kind of icky-looking to me, and I don't
like the three-way branch depending on the comparison of 'i' and
'MIN_STRING_SIZE'; but I'm not sure there's a better approach that
would retain this general algorithm.

HTH,
-Arthur

Nov 14 '05 #6
