Reading Binary Files

I need to split a large binary file into two binary files. The file contains
a delimiter (say, a newline). I need to split it so that the first file runs
up to the newline and the second file runs from the newline to the end of the
file. Kindly let me know whether this is possible.

Thanks
Rohith
Dec 26 '05 #1

Just open a System.IO.FileStream against the file. Read it in chunks
into a byte[] and examine each chunk for your delimiter. Write the chunks
to a first FileStream and then, once past the delimiter, to a second one.
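
A minimal sketch of that chunked approach, written in Python rather than .NET to keep it short. The function name, the 64 KB chunk size, and the choice to drop the delimiter byte itself are assumptions for illustration, not part of the advice above. It handles a single-byte delimiter only; a longer delimiter could straddle two chunks and be missed.

```python
import io


def split_at_delimiter(src, first, second, delimiter=b"\n", chunk_size=64 * 1024):
    """Copy src into `first` up to the first delimiter byte, the rest into `second`.

    Assumes a single-byte delimiter. The delimiter itself is not written to
    either output. Returns True if the delimiter was found.
    """
    found = False
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        if found:
            # Past the delimiter: everything else belongs to the second file.
            second.write(chunk)
            continue
        pos = chunk.find(delimiter)
        if pos < 0:
            first.write(chunk)
        else:
            found = True
            first.write(chunk[:pos])
            second.write(chunk[pos + len(delimiter):])
    return found


# Usage with in-memory streams; real code would pass files opened in "rb"/"wb" mode.
source = io.BytesIO(b"abc\ndef\nghi")
part_one, part_two = io.BytesIO(), io.BytesIO()
split_at_delimiter(source, part_one, part_two)
```

Only one chunk is held in memory at a time, so the size of the input file does not matter for memory use.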

David
Dec 26 '05 #2
Yes, this will work. But I have a huge binary file, nearly 1 GB. Is there an
alternative way to find the delimiter position with checking every chunk and
looping through it?

Dec 26 '05 #3
Sorry, without looping through every chunk of data.
Dec 26 '05 #4
I don't see how. How would you know where the delimiter is?

David
Dec 26 '05 #5
Thanks David. If I check every byte for a delimiter, it would be a huge
performance hit... I was wondering whether there is an alternative
solution for this?

Dec 26 '05 #6

First:
Are you sure that the binary data CANNOT contain a newline?
If it can.....Oh well.

Are there any data length headers embedded in the binary data?
If so, you can possibly seek right to the position you need.
Most binary files have header fields that provide data lengths or offsets.

Without knowing anything about the structure of this file, it is difficult to be more helpful.

Good luck
Bill
Dec 26 '05 #7
Hi Bill,

"Bill Butler" wrote:
> First:
> Are you sure that the binary data CANNOT contain a newline?
> If it can.....Oh well.

The newline need not be the delimiter. Actually, my requirement is that I have
to serialize two binary files into a single binary file and then deserialize
it. The delimiter can be anything, for that matter. I just need a way to find
the position of the delimiter in that file.

> Are there any data length headers embedded in the Binary data?
> If so, you can possibly seek right to the position you need.
> Most binary files have header fields that provide datalengths or offsets.

The thing is that I will not know the actual position. I will know only the
delimiter.

> Without knowing anything about the structure of this file, it is difficult to be more helpful.

Regarding the structure, it's only a raw chunk of bytes.

Thanks
Rohith
Dec 26 '05 #8

Typically you would prepend a header onto the file indicating, say the
number of files contained, their names and offsets. Then you can seek
around in the file to find the offsets.
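
One possible layout for such a header, sketched in Python rather than .NET. The field widths (a 4-byte count, 2-byte name length, 8-byte payload length, all little-endian) and the function names are assumptions made for this example only. Payloads are held in memory here for brevity; with gigabyte-sized files you would stream them instead.

```python
import io
import struct


def write_container(out, files):
    """Write a header describing each file, followed by the raw payloads.

    `files` is a list of (name, payload_bytes) pairs.
    """
    out.write(struct.pack("<I", len(files)))
    for name, payload in files:
        encoded = name.encode("utf-8")
        out.write(struct.pack("<H", len(encoded)))
        out.write(encoded)
        out.write(struct.pack("<Q", len(payload)))
    for _, payload in files:
        out.write(payload)


def read_directory(src):
    """Parse the header and return [(name, absolute_offset, length), ...]."""
    (count,) = struct.unpack("<I", src.read(4))
    entries = []
    for _ in range(count):
        (name_len,) = struct.unpack("<H", src.read(2))
        name = src.read(name_len).decode("utf-8")
        (length,) = struct.unpack("<Q", src.read(8))
        entries.append((name, length))
    # Payloads start immediately after the header, in the same order.
    offset = src.tell()
    directory = []
    for name, length in entries:
        directory.append((name, offset, length))
        offset += length
    return directory


def read_member(src, offset, length):
    """Seek straight to one payload; no scanning for a delimiter is needed."""
    src.seek(offset)
    return src.read(length)
```

Because the lengths are stored up front, the payloads may contain any byte values, including newlines.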

David
Dec 26 '05 #9

"Rohith" wrote:
> <snip>
> NewLine need not be a delimiter. Actually my requirement is that I have to
> serialize two binary files in a single binary file and then deserialize it.
> The delimiter can be anything for that matter. I just need a way to find the
> position of the delimiter in that file.


If YOU are the one responsible for combining the data and then separating it, is there any reason
why you can't have a header in the file? If you could include a header, you could easily include the
sizes/offsets of the raw chunks. Then you would have no need of a delimiter.
If your hands are tied and you can do nothing more than a delimiter, then you have problems. You
need to choose a delimiter that CANNOT exist in the binary data, but ANY value can exist in binary
data. You would need to scan your data to make sure that the delimiter is acceptable, and then find
a way to keep track of what the delimiter was.
If your only option is to use a delimiter, you have no choice but to search for it linearly,
and you may need to have a multi-byte delimiter if every 8-bit combination exists in the data.
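
If it does come down to a linear search with a multi-byte delimiter, the one subtlety is that the delimiter can straddle two read buffers. A rough Python sketch (not .NET; the function name and chunk size are made up for illustration) that carries the last few bytes of each chunk over into the next search window:

```python
import io


def find_delimiter(src, delimiter, chunk_size=64 * 1024):
    """Return the offset of the first occurrence of `delimiter`, or -1.

    Offsets are relative to the stream position when the function is called.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    overlap = len(delimiter) - 1  # bytes that could begin a split match
    tail = b""                    # carried over from the previous chunk
    base = 0                      # offset of the first byte of the window
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(delimiter)
        if pos >= 0:
            return base + pos
        # Keep only enough bytes to catch a delimiter split across chunks.
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)


# Usage: the delimiter b"XYZ" is split across the 3-byte reads.
offset = find_delimiter(io.BytesIO(b"aaXYZbb"), b"XYZ", chunk_size=3)
```

This only locates the delimiter; it cannot tell you whether the same byte sequence also occurs inside the payload, which is the problem described above.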

I personally would fight for the header.

Good luck,
Bill

Dec 26 '05 #10
Rohith <Ro****@discussions.microsoft.com> wrote:
> Thanks David. If checking through every byte for a delimiter, then it would
> be a huge performance blow... I was actually confused whether there is
> alternate solution for this?


The cost of looking through memory is likely to be much smaller than
the IO cost in the first place.

As Bill suggested though, if you're the one who gets to combine the
files, it's easy - just include the lengths of each file.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 27 '05 #11
You're gonna have to read the file to split it anyway. If there is no
header that tells where the delimiter is or if you cannot create one,
then you will have to read the file in manually.

Typically, you would read a certain amount at a time into a memory
buffer, for example 4K, then search that buffer for the delimiter.

The performance should not be too bad.

Dec 27 '05 #12
Thanks for the replies.

I would not be able to add a header to the files, as I have a set of binary
files from a previous version of my application that do not have a header. Now
my new requirement is to add two separate binary files to a single binary file
and deserialize them. But if I add a header to the new files, I will not be
able to identify which files to split and which not.

Dec 28 '05 #13
Sure you can.

Add the header to the Compound files only.
Start it off with a MAGIC String of bytes that remains the same.
Although there is a tiny possibility of incorrectly identifying a Simple file as being Compound, you
can control how tiny by extending the MAGIC String length.

Also

If the binary data is not random there will be some sequences of bytes that are FAR more likely than
others. Careful selection of the MAGIC String can effectively eliminate a false positive.

Good Luck
Bill
Dec 28 '05 #14
As it's a huge file (nearly 2 GB), it would not be easy to form magic bytes
that will be present only once. Also, the text present in the binary file will
not be the same. So to find the magic bytes, do I have to search through the
file every time before serializing?

Dec 28 '05 #15
Rohith <Ro****@discussions.microsoft.com> wrote:
> As it's a huge file (nearly 2 GB), it would not be easy to form magic bytes
> that will be present only once.

They'd only have to not be present at the start of the old files.
Compare that with your delimiter idea which relies on the delimiter
*never* being present.

> Also the text present in the binary file will
> not be the same. So to find the magic bytes do I have to search through the
> file every time before serializing?

No, you'd look for the magic bytes at the start of the file when
*deserializing*.

--
Jon Skeet
Dec 28 '05 #16

"Jon Skeet [C# MVP]" wrote:
> No, you'd look for the magic bytes at the start of the file when
> *deserializing*.

But how do I ensure that files from the previous version of my application do
not have these magic bytes at the start of the file? Also, how do I identify
the magic bytes?
Dec 28 '05 #17
Rohith <Ro****@discussions.microsoft.com> wrote:
> "Jon Skeet [C# MVP]" wrote:
> > No, you'd look for the magic bytes at the start of the file when
> > *deserializing*.
>
> But how do I ensure that my previous version (application) files do not
> have these magic bytes at the start of the file?


Well, what are these files? Many file formats already have a magic
number at the start of the file.

It seems to me that if you're only now considering how to deal with the
problem, then you've got that problem whether you use extra headers or
not. I don't see how your delimiter idea is any better (and it strikes
me as more likely to be a lot worse).

Where are these files coming from? Can you change existing ones when
you upgrade to a new version of your software?

If you generate a random sequence of 16 bytes, the chances of any
existing files happening to start with that same sequence is
*extremely* small (the same as two GUIDs colliding). I suspect that
would actually be good enough, and the best you can do in the situation
you're in.
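
To put a rough number on "extremely small": if the 16-byte magic number is picked uniformly at random, any one given file starts with it with probability 2^-128, so for N legacy files the chance of at least one false positive is at most N / 2^128. A small Python sketch of that arithmetic; the figure of one million legacy files is an arbitrary assumption for illustration:

```python
from fractions import Fraction

MAGIC_BITS = 16 * 8  # a 16-byte magic number chosen uniformly at random


def false_positive_bound(legacy_file_count):
    """Upper bound on P(at least one legacy file starts with the magic)."""
    return min(Fraction(1), Fraction(legacy_file_count, 2 ** MAGIC_BITS))


# Assumed example: one million files written by the old application.
bound = false_positive_bound(10 ** 6)
print(float(bound))
```

The bound depends only on the magic number being random, not on what the old files happen to contain.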

> Also, how do I identify the magic bytes?


That's simple - by reading the first 8 bytes (or however long your
magic number is) and seeing whether or not they are the same as the
magic number.
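
In code the check is only a few lines. A Python sketch (not .NET) follows; the MAGIC value is a made-up example that would be generated once, for instance with os.urandom(16), and then hard-coded into the application:

```python
import io

# Hypothetical 16-byte magic number, generated once and then fixed forever.
MAGIC = bytes.fromhex("9f3c1e7ab25d48068e41d7c05a6bf312")


def starts_with_magic(src):
    """Return True if the stream begins with MAGIC; the position is restored."""
    start = src.tell()
    head = src.read(len(MAGIC))
    src.seek(start)
    return head == MAGIC


# Usage: a new-format file and an old-format file, both in memory.
is_new = starts_with_magic(io.BytesIO(MAGIC + b"header and payloads"))
is_old = not starts_with_magic(io.BytesIO(b"raw bytes from the old version"))
```

Only the first 16 bytes are read, so the size of the file has no effect on the cost of the check.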

--
Jon Skeet
Dec 28 '05 #18
> Well, what are these files? Many file formats already have a magic
> number at the start of the file.
>
> It seems to me that if you're only now considering how to deal with the
> problem, then you've got that problem whether you use extra headers or
> not. I don't see how your delimiter idea is any better (and it strikes
> me as more likely to be a lot worse).
>
> Where are these files coming from? Can you change existing ones when
> you upgrade to a new version of your software?

No. I will not even be able to tell whether a file is an older or a newer
version file.

> If you generate a random sequence of 16 bytes, the chances of any
> existing files happening to start with that same sequence is
> *extremely* small (the same as two GUIDs colliding). I suspect that
> would actually be good enough, and the best you can do in the situation
> you're in.

My previous-version files do not have these magic bytes written. But when I
deserialize them, I will not be in a position to tell whether a file is a
previous-version or a new-version file. So I will not be able to get the
length of the first file from the magic bytes.
Dec 28 '05 #19
Rohith <Ro****@discussions.microsoft.com> wrote:
> > Well, what are these files? Many file formats already have a magic
> > number at the start of the file.
> >
> > It seems to me that if you're only now considering how to deal with the
> > problem, then you've got that problem whether you use extra headers or
> > not. I don't see how your delimiter idea is any better (and it strikes
> > me as more likely to be a lot worse).
> >
> > Where are these files coming from? Can you change existing ones when
> > you upgrade to a new version of your software?
>
> No. I will not even be able to find whether this is an older or newer version
> file.


Okay. In that case, you would certainly have no chance with a single-
byte delimiter as you were planning, would you?

> > If you generate a random sequence of 16 bytes, the chances of any
> > existing files happening to start with that same sequence is
> > *extremely* small (the same as two GUIDs colliding). I suspect that
> > would actually be good enough, and the best you can do in the situation
> > you're in.
>
> My previous version files do not have these magic bytes written. But when I
> deserialize them I will not be in a position to tell whether this is a previous
> version or new version file. So I will not be able to get the length of the
> first file from the magic bytes.


You can find the length of a whole file very easily (eg use
FileStream.Length after opening it, or FileInfo.Length).

Basically, if you start deserializing and don't see the magic number,
the contents is just the whole of the file.

If you *do* see the magic number, you then read whatever header
information you've put into the new files, and deserialize
appropriately.
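
Put together, the whole scheme might look like the following Python sketch (again not .NET). The header layout, the fixed magic followed by an 8-byte little-endian length of the first file, is an assumption chosen for this example, as are the names. The data is held in memory for brevity; files of a gigabyte or more would be streamed.

```python
import struct

# Hypothetical 16-byte magic number, the same for every compound file.
MAGIC = bytes.fromhex("9f3c1e7ab25d48068e41d7c05a6bf312")
HEADER_SIZE = len(MAGIC) + 8  # magic + length of the first payload


def pack_pair(first, second):
    """Serialize two payloads into one compound blob."""
    return MAGIC + struct.pack("<Q", len(first)) + first + second


def unpack(data):
    """Return [first, second] for a compound blob, or [data] for an old file."""
    if len(data) < HEADER_SIZE or not data.startswith(MAGIC):
        # No magic number: treat the whole file as a single old-format payload.
        return [data]
    (first_len,) = struct.unpack("<Q", data[len(MAGIC):HEADER_SIZE])
    if first_len > len(data) - HEADER_SIZE:
        raise ValueError("header claims more bytes than the file contains")
    body = data[HEADER_SIZE:]
    # The second payload's length is implied by the total file length.
    return [body[:first_len], body[first_len:]]
```

Here `unpack(pack_pair(a, b))` gives back `[a, b]`, while bytes written by the old version come back unchanged as a single item.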

--
Jon Skeet
Dec 28 '05 #20
> Basically, if you start deserializing and don't see the magic number,
> the contents is just the whole of the file.
>
> If you *do* see the magic number, you then read whatever header
> information you've put into the new files, and deserialize
> appropriately.

Every time I serialize a file, I have to generate magic bytes (say 16 bytes)
and add them to the file. But at the time of deserializing I won't have that
magic number with me (since I am serializing with new magic bytes every time,
and I can deserialize any file in my application). So with the
previous-version files, when I take the first 16 bytes, I will not be able to
tell whether they are magic bytes or actual data.
Dec 28 '05 #21
But this is what I was saying before - if you generate a random set of
16 bytes to be your magic number (and that's the same for *every* file
you create) then the chances of another file starting with the exact
same 16 bytes are incredibly small.

There is absolutely no way round that kind of problem being present at
all, because however you decide to serialize your files, there's always
a possibility that there will be an old file with exactly the same
content as a serialized pair of files.

--
Jon Skeet
Dec 28 '05 #22
