Parsing Binary Files - C# / C Sharp

Hemang Shah

Hello fellow coders!

OK, I'm trying to write a very simple application in C#. (Yes, it's my first
program.)

What I want to do is:

1) Open a binary file
2) Search this file for a particular string
3) Close the file

Is there anything special I should do because this is a binary file?

Any code examples would be greatly appreciated.

Thank You

Hemang Shah

Nov 16 '05 #1

Jon Skeet [C# MVP]

Hemang Shah <v-*****@microsoft.com> wrote:

> OK, I'm trying to write a very simple application in C#. (Yes, it's my first
> program.)
>
> What I want to do is:
>
> 1) Open a binary file
> 2) Search this file for a particular string
> 3) Close the file
>
> Is there anything special I should do because this is a binary file?

Well, if you're trying to search for a *string*, you'll need to know
the encoding - or by "string" do you mean "sequence of bytes"?

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 16 '05 #2

Hemang Shah

Hello Jon

I'm trying to search for occurrences of "OU=" in the binary file, so yes, it's a
sequence of bytes.

If I open the file in a hex viewer, I can see these and search for them. Rather
than opening the file in a hex viewer every time, I want to write a utility
to search for them and display them.

I did find some code online which opens the file in binary mode and displays
it in a text box.
But what you see in the text box is not the same as what you see in the hex viewer.
Moreover, I don't really understand the code.

Here is the code:

void DisplayFile()
{
    int nCols = 16;
    FileStream inStream = new FileStream(chosenfile, FileMode.Open,
        FileAccess.Read);
    long nBytesToRead = inStream.Length;
    if (nBytesToRead > 65536/4)
        nBytesToRead = 65536/4;
    int nLines = (int)(nBytesToRead/nCols) + 1;
    string [] lines = new string[nLines];
    int nBytesRead = 0;

    for (int i=0 ; i<nLines ; i++)
    {
        StringBuilder nextLine = new StringBuilder();
        nextLine.Capacity = 4*nCols;

        for (int j = 0 ; j<nCols ; j++)
        {
            int nextByte = inStream.ReadByte();
            nBytesRead++;
            if (nextByte < 0 || nBytesRead > 65536)
                break;
            char nextChar = (char)nextByte;
            if (nextChar < 16)
                nextLine.Append(" x0" + string.Format("{0,1:X}",
                    (int)nextChar));
            else if
                (char.IsLetterOrDigit(nextChar) ||
                char.IsPunctuation(nextChar))
                nextLine.Append(" " + nextChar + " ");
            else
                nextLine.Append(" x" + string.Format("{0,2:X}",
                    (int)nextChar));
        }
        lines[i] = nextLine.ToString();
    }

    inStream.Close();
    this.textBoxContents.Lines = lines;
}

Thank You

Nov 16 '05 #3

Jon Skeet [C# MVP]

Hemang Shah <v-*****@microsoft.com> wrote:

> I'm trying to search for occurrences of "OU=" in the binary file, so yes, it's a
> sequence of bytes.

But OU= is a sequence of *characters*. Do you mean you're looking for
the sequence of bytes which form the ASCII encoding for "OU="? I
suspect that's what you're after.

> If I open the file in a hex viewer, I can see these and search for them. Rather
> than opening the file in a hex viewer every time, I want to write a utility
> to search for them and display them.
>
> I did find some code online which opens the file in binary mode and displays
> it in a text box.
> But what you see in the text box is not the same as what you see in the hex viewer.
> Moreover, I don't really understand the code.

The first thing is to ditch that code. It's bad in many, many ways.

I don't have time to write some sample code for you right now, but I'll
try tomorrow afternoon. Basically, you should read the file in chunks,
and then look through each chunk for the correct sequence, knowing that it might
go across a "chunk boundary".

Nov 16 '05 #4

Hemang Shah

Yes, you are right, that is what I'm trying to achieve: a sequence of
*characters*, which I thought made up a string.

I can send you a sample of the type of files I'm trying to read if you like.

I would really appreciate it if you could write me a sample; that would be
going above and beyond!

You can write it tomorrow or whenever you can. Or you can point me to some
good resources which would teach or explain the logic behind it.

Reading in chunks makes sense. Some of the files that I'll be parsing
will be 16 to 80 GB in size or more, but I'll only have to parse the first
few hundred MB of data to get the "OU=".

Thanks a lot again in advance.

Hemang

Nov 16 '05 #5

Niki Estner

"Hemang Shah" <v-*****@microsoft.com> wrote
news:eG*************@TK2MSFTNGP14.phx.gbl...

> Yes, you are right, that is what I'm trying to achieve: a sequence of
> *characters*, which I thought made up a string.

I think what Jon was trying to say is that *bytes* and *characters* are two
different things. In .NET, characters are Unicode (UTF-16) characters, i.e.
each char has a size of 2 bytes. You can convert these to a variety of binary
representations (including plain ASCII) which have a different layout.
Now, in your binary file, do you want to look for occurrences of a string in
*Unicode* representation or ASCII (or some other) representation?
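
To make the byte/character distinction concrete, here is a minimal sketch (not code from anyone in this thread) that prints the bytes "OU=" turns into under two encodings. The class and method names are made up for illustration; `Encoding.ASCII`, `Encoding.Unicode` and `BitConverter.ToString` are standard .NET calls.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Returns the bytes of 'text' in the given encoding as a hex string, e.g. "4F-55-3D".
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        // ASCII: one byte per character
        Console.WriteLine(ToHex("OU=", Encoding.ASCII));   // 4F-55-3D

        // UTF-16 little-endian (what .NET calls Encoding.Unicode): two bytes per character
        Console.WriteLine(ToHex("OU=", Encoding.Unicode)); // 4F-00-55-00-3D-00
    }
}
```

So if the hex viewer shows `4F 55 3D`, the file most likely stores the text as ASCII (or an ASCII-compatible encoding); `4F 00 55 00 3D 00` would point to UTF-16 instead.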

> ...
> I would really appreciate it if you could write me a sample; that would be
> going above and beyond!

Here's a little sample I've come up with.
It reads binary blocks of data from a file, then tests every possible
position. After that, it copies the last few bytes of the buffer to the
beginning and continues reading after them, so it can find matches on "chunk
boundaries". (I think it works.)
Note that this is not the fastest searching algorithm (google for
"Boyer-Moore" for more info), but I'd guess that in your case the hard disk is the
bottleneck anyway.

using System;
using System.IO;
using System.Text;

class BinarySearch
{
    static void Main()
    {
        string stringToLookFor = "7777";
        string filePath = @"C:\SomePath\pi.txt";

        // convert the string to a binary (ASCII) representation
        byte[] bufferToLookFor = Encoding.ASCII.GetBytes(stringToLookFor);

        int matchCounter = 0; // count matches for nicer output

        // open the file in binary mode
        using (Stream stream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            byte[] readBuffer = new byte[16384]; // our input buffer
            int bytesRead;                       // number of bytes read
            int offset = 0;                      // bytes carried over from the previous chunk
            long bufferPos = 0;                  // file position of readBuffer[0]
            while ((bytesRead = stream.Read(readBuffer, offset, readBuffer.Length - offset)) > 0)
            {
                int valid = offset + bytesRead;  // bytes in the buffer that hold real data

                // test every position where a complete match still fits
                for (int i = 0; i <= valid - bufferToLookFor.Length; i++)
                {
                    bool match = true;
                    for (int j = 0; j < bufferToLookFor.Length; j++)
                        if (bufferToLookFor[j] != readBuffer[i + j])
                        {
                            match = false;
                            break;
                        }
                    if (match)
                    {
                        Console.WriteLine("{0,5}. \"{1}\" found at {2:x}",
                            ++matchCounter, stringToLookFor, bufferPos + i);
                        //return;
                    }
                }

                // keep the last (pattern length - 1) bytes so that matches on
                // "chunk boundaries" are found in the next pass
                int keep = Math.Min(bufferToLookFor.Length - 1, valid);
                bufferPos += valid - keep;
                Array.Copy(readBuffer, valid - keep, readBuffer, 0, keep);
                offset = keep;
            }
        }
        if (matchCounter == 0)
            Console.WriteLine("No match found");
    }
}
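
For anyone curious about the faster algorithm mentioned above: what follows is a rough sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore, and not the full algorithm. It searches a single in-memory byte array only; the class and method names are hypothetical, and combining it with chunked reading is left out.

```csharp
using System;
using System.Text;

static class HorspoolSearch
{
    // Index of the first occurrence of 'pattern' in 'data', or -1 if there is none.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        int n = pattern.Length;
        if (n == 0) return 0;

        // shift[b] = how far the window may move when its last byte is b
        int[] shift = new int[256];
        for (int k = 0; k < 256; k++) shift[k] = n;
        for (int k = 0; k < n - 1; k++) shift[pattern[k]] = n - 1 - k;

        int i = 0;
        while (i <= data.Length - n)
        {
            // compare the window against the pattern from right to left
            int j = n - 1;
            while (j >= 0 && data[i + j] == pattern[j]) j--;
            if (j < 0) return i;

            // skip ahead based on the byte under the window's last position
            i += shift[data[i + n - 1]];
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("3.14159777726535");
        byte[] pattern = Encoding.ASCII.GetBytes("7777");
        Console.WriteLine(IndexOf(data, pattern)); // 7
    }
}
```

The gain comes from the shift table: when the last byte in the window does not occur in the pattern at all, the search jumps a full pattern length instead of one byte. For a three-byte pattern such as "OU=" the benefit is small, which fits the remark that the disk is probably the bottleneck.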
Niki

Nov 16 '05 #6

Jon Skeet [C# MVP]

Hemang Shah <v-*****@microsoft.com> wrote:

> I would really appreciate it if you could write me a sample; that would be
> going above and beyond!

Is the sample Niki provided okay for you? (I like the idea of copying
the buffer - nice simple way of dealing with boundaries.)

Nov 16 '05 #7

Hemang Shah

Thank you Niki & Jon

I took the sample and it worked for me. I was able to get proper matches.

Now I have some questions, if you don't mind me asking:

1) I guess the code right now is case sensitive with respect to the string we
want to search for. Is that correct?
2) If I want it to be case insensitive, do I type the string in every
possible combination and search for each of those byte sequences, or is there a
better solution?
3) After a match is found, I want to read a number of characters
following the match and display them. The number of characters
is not fixed; it could be one word or it could be a sentence.
I will know where it ends because it is terminated by another search string.
4) I don't understand the copying of the buffer so that we can check across
boundaries. I understand the concept, but I cannot follow the code from there.
Also, how do I fetch the info if it lies across a boundary?
5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
could it be any size?

I hope I was able to ask the right questions.

Thank You

Hemang.

Nov 16 '05 #8

Jon Skeet [C# MVP]

Hemang Shah <v-*****@microsoft.com> wrote:

> Thank you Niki & Jon
>
> I took the sample and it worked for me. I was able to get proper matches.
>
> Now I have some questions, if you don't mind me asking:
>
> 1) I guess the code right now is case sensitive with respect to the string we
> want to search for. Is that correct?

Yes.

> 2) If I want it to be case insensitive, do I type the string in every
> possible combination and search for each of those byte sequences, or is there a
> better solution?

Well, you could supply multiple byte arrays, and check whether the nth
byte is any of the acceptable ones, rather than just a single
acceptable one. You then just supply a lower case version and an upper
case version - you don't need to come up with every combination.
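
A minimal sketch of that idea, assuming ASCII data so that the lower-case and upper-case arrays have the same length. The class and method names are made up for illustration and this is not code posted by anyone in the thread.

```csharp
using System;
using System.Text;

static class TwoCaseMatcher
{
    // True if, starting at 'start', every byte of 'data' equals either the
    // lower-case or the upper-case pattern byte at the same position.
    public static bool MatchesAt(byte[] data, int start, byte[] lower, byte[] upper)
    {
        if (start < 0 || start + lower.Length > data.Length)
            return false;
        for (int j = 0; j < lower.Length; j++)
        {
            byte b = data[start + j];
            if (b != lower[j] && b != upper[j])
                return false;
        }
        return true;
    }

    static void Main()
    {
        byte[] lower = Encoding.ASCII.GetBytes("ou=");
        byte[] upper = Encoding.ASCII.GetBytes("OU=");
        byte[] data = Encoding.ASCII.GetBytes("cn=x,Ou=Sales");

        Console.WriteLine(MatchesAt(data, 5, lower, upper)); // True
        Console.WriteLine(MatchesAt(data, 0, lower, upper)); // False
    }
}
```

In the chunked search, a check like this would take the place of the inner byte-by-byte comparison loop.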

> 3) After a match is found, I want to read a number of characters
> following the match and display them. The number of characters
> is not fixed; it could be one word or it could be a sentence.
> I will know where it ends because it is terminated by another search string.

To what extent is this *really* a binary file? Pretty much everything
you've said has been in terms of text.

> 4) I don't understand the copying of the buffer so that we can check across
> boundaries. I understand the concept, but I cannot follow the code from there.

I haven't actually looked at Niki's code myself.

> Also, how do I fetch the info if it lies across a boundary?
> 5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
> could it be any size?

It could be set to any size. I'd usually use about 32K myself.

Nov 16 '05 #9

Niki Estner

"Hemang Shah" <v-*****@microsoft.com> wrote in
news:eq**************@TK2MSFTNGP09.phx.gbl...

> Thank you Niki & Jon
>
> I took the sample and it worked for me. I was able to get proper matches.
>
> Now I have some questions, if you don't mind me asking:
>
> 1) I guess the code right now is case sensitive with respect to the string we
> want to search for. Is that correct?

Yes.

> 2) If I want it to be case insensitive, do I type the string in every
> possible combination and search for each of those byte sequences, or is there a
> better solution?

I'd convert the input string to uppercase, and convert each byte in the
buffer to uppercase too before comparing.
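
Roughly like this, as a sketch that assumes the file holds ASCII text, so only the bytes for 'a' to 'z' need converting. The helper names are hypothetical, and the search runs over a single in-memory array to keep the example short.

```csharp
using System;
using System.Text;

static class UpperCaseMatcher
{
    // Upper-cases a single ASCII byte; bytes outside 'a'..'z' are returned unchanged.
    public static byte ToUpperAscii(byte b)
    {
        return (b >= (byte)'a' && b <= (byte)'z') ? (byte)(b - 32) : b;
    }

    // Index of the first case-insensitive match of 'text' in 'data', or -1.
    public static int IndexOfIgnoreCase(byte[] data, string text)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(text.ToUpperInvariant());
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && ToUpperAscii(data[i + j]) == pattern[j])
                j++;
            if (j == pattern.Length)
                return i;
        }
        return -1;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("cn=x,oU=Sales");
        Console.WriteLine(IndexOfIgnoreCase(data, "ou=")); // 5
    }
}
```

The pattern is upper-cased once up front, while the buffer bytes are converted on the fly, so the buffer itself is never modified.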

> 3) After a match is found, I want to read a number of characters
> following the match and display them. The number of characters
> is not fixed; it could be one word or it could be a sentence.
> I will know where it ends because it is terminated by another search
> string.

If you have the offset in the file, you can use Stream.Seek & Stream.Read to
do that.
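
As a sketch of what that could look like: the helper below seeks to a position, reads a capped number of bytes, and cuts the text at a terminator. The helper name, the "," terminator and the 256-byte cap are assumptions for illustration only; the real terminator depends on the file format. A `MemoryStream` stands in for the file so the example runs on its own.

```csharp
using System;
using System.IO;
using System.Text;

static class ValueReader
{
    // Reads up to 'maxBytes' bytes starting at 'position' and returns the ASCII
    // text up to (not including) 'terminator', or everything read if the
    // terminator does not occur.
    public static string ReadUntil(Stream stream, long position, string terminator, int maxBytes)
    {
        stream.Seek(position, SeekOrigin.Begin);

        byte[] buffer = new byte[maxBytes];
        int total = 0, n;
        // Read may return fewer bytes than requested, so loop until full or end of stream
        while (total < maxBytes && (n = stream.Read(buffer, total, maxBytes - total)) > 0)
            total += n;

        string text = Encoding.ASCII.GetString(buffer, 0, total);
        int end = text.IndexOf(terminator, StringComparison.Ordinal);
        return end >= 0 ? text.Substring(0, end) : text;
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("xxOU=Sales,OU=Support,yy");
        using (Stream stream = new MemoryStream(data))
        {
            // a match for "OU=" at offset 2 means the value starts at offset 5
            Console.WriteLine(ReadUntil(stream, 5, ",", 256)); // Sales
        }
    }
}
```

Because this is a separate read at a known file offset, it does not matter whether the value crosses a chunk boundary in the search loop. Seeking does move the stream position, though, so a search loop sharing the same stream would need to restore its position afterwards (or use a second stream).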

> 4) I don't understand the copying of the buffer so that we can check across
> boundaries. I understand the concept, but I cannot follow the code from
> there.

Try using a short buffer (e.g. 20 bytes) and a short file, and step through
the code with the debugger. IMO that's generally the best way to see what a
program does.

> Also, how do I fetch the info if it lies across a boundary?

As I said, I'd use a separate Stream.Read call to extract that info.

> 5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
> could it be any size?

It's set to 16 KB (16384 bytes), not 16 bytes. Disk access is typically
performed in 4 KB pages, so the buffer should be at least 4 KB (otherwise the
same page may have to be read more than once). I usually make it a little
bigger so that the overhead of calling into the OS isn't incurred as often.
If you don't care about performance (e.g. for testing or debugging), you can
make it any size as long as it's bigger than the search string.

> I hope I was able to ask the right questions.

There are no stupid questions. Only stupid answers...

Niki

Nov 16 '05 #10
