Reading Binary Files

I need to split a large binary file into two binary files. I have a delimiter
(say NewLine) in the binary file. I need to split the binary file such that
the first file is up to the NewLine and the second file is from the NewLine to
the end of the file. Kindly let me know whether this is possible.

Thanks
Rohith
Dec 26 '05 #1

"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:BD**********************************@microsoft.com...
> I need to split a large binary file into two binary files. I have a
> delimiter (say NewLine) in the binary file. I need to split the binary
> file such that the first file is up to the NewLine and the second file is
> from the NewLine to the end of the file. Kindly let me know whether this
> is possible.


Just open a System.IO.FileStream against the file. Read it out in chunks
into a byte[] and examine the chunks for your delimiter. Write the chunks
to a first and then a second FileStream.

David
Dec 26 '05 #2
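A minimal sketch of the chunked approach David describes, assuming a single-byte delimiter that is dropped from both output files (the thread does not say which side should keep it). The names `BinarySplitter` and `SplitAtFirst` and the 64 KB buffer size are illustrative, not from the thread:

```csharp
using System;
using System.IO;

public static class BinarySplitter
{
    // Copies everything before the first delimiter byte to firstPath and
    // everything after it to secondPath. Returns the delimiter's offset,
    // or -1 if it never occurs (firstPath then holds the whole input).
    public static long SplitAtFirst(string sourcePath, byte delimiter,
                                    string firstPath, string secondPath)
    {
        var buffer = new byte[64 * 1024];   // assumed chunk size
        long offset = 0;                    // absolute position of buffer[0]
        long found = -1;

        using (var input = File.OpenRead(sourcePath))
        using (var first = File.Create(firstPath))
        using (var second = File.Create(secondPath))
        {
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            {
                if (found >= 0)
                {
                    // Delimiter already seen: the rest goes to the second file.
                    second.Write(buffer, 0, read);
                }
                else
                {
                    int index = Array.IndexOf(buffer, delimiter, 0, read);
                    if (index < 0)
                    {
                        first.Write(buffer, 0, read);
                    }
                    else
                    {
                        found = offset + index;
                        first.Write(buffer, 0, index);
                        // Skip the delimiter byte itself.
                        second.Write(buffer, index + 1, read - index - 1);
                    }
                }
                offset += read;
            }
        }
        return found;
    }
}
```

Only one buffer is held in memory at a time, so the size of the input file does not affect memory use.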
Yes, this will work. But I have a huge binary file, nearly 1 GB. Is there an
alternative way to find the delimiter position with checking every
chunk by looping through it?

"David Browne" wrote:
> Just open a System.IO.FileStream against the file. Read it out in chunks
> into a byte[] and examine the chunks for your delimiter. Write the chunks
> to a first and then a second FileStream.
Dec 26 '05 #3
Sorry, without looping through every chunk of data.
Dec 26 '05 #4
"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:86**********************************@microsoft.com...
> Sorry, without looping through every chunk of data.


I don't see how. How would you know where the delimiter is?

David
Dec 26 '05 #5
Thanks David. If I have to check every byte for a delimiter, it would
be a huge performance blow... I was actually wondering whether there is
an alternative solution for this?

"David Browne" wrote:
> I don't see how. How would you know where the delimiter is?
Dec 26 '05 #6

"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:2D**********************************@microsoft.com...
> Thanks David. If I have to check every byte for a delimiter, it would
> be a huge performance blow... I was actually wondering whether there is
> an alternative solution for this?


First:
Are you sure that the binary data CANNOT contain a newline?
If it can.....Oh well.

Are there any data length headers embedded in the Binary data?
If so, you can possibly seek right to the position you need.
Most binary files have header fields that provide data lengths or offsets.

Without knowing anything about the structure of this file, it is difficult to be more helpful.

Good luck
Bill
Dec 26 '05 #7
Hi Bill,

"Bill Butler" wrote:
> First:
> Are you sure that the binary data CANNOT contain a newline?
> If it can.....Oh well.

NewLine need not be the delimiter. Actually, my requirement is that I have to
serialize two binary files into a single binary file and then deserialize it.
The delimiter can be anything, for that matter. I just need a way to find the
position of the delimiter in that file.

> Are there any data length headers embedded in the Binary data?
> If so, you can possibly seek right to the position you need.
> Most binary files have header fields that provide data lengths or offsets.

The thing is that I will not know the actual position. I will
know only the delimiter.

> Without knowing anything about the structure of this file, it is difficult to be more helpful.

Regarding the structure, it's only a raw chunk of bytes.

Thanks
Rohith
Dec 26 '05 #8

"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:E2**********************************@microsoft.com...
> Hi Bill,
> "Bill Butler" wrote:
>> First:
>> Are you sure that the binary data CANNOT contain a newline?
>> If it can.....Oh well.
>
> NewLine need not be the delimiter. Actually, my requirement is that I have to
> serialize two binary files into a single binary file and then deserialize it.
> The delimiter can be anything, for that matter. I just need a way to find the
> position of the delimiter in that file.
>
>> Are there any data length headers embedded in the Binary data?
>> If so, you can possibly seek right to the position you need.
>> Most binary files have header fields that provide data lengths or offsets.
>
> The thing is that I will not know the actual position. I will
> know only the delimiter.
>
>> Without knowing anything about the structure of this file, it is
>> difficult to be more helpful.
>
> Regarding the structure, it's only a raw chunk of bytes.


Typically you would prepend a header onto the file indicating, say, the
number of files contained, their names and offsets. Then you can seek
to those offsets in the file to find each one.

David
Dec 26 '05 #9

"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:E2**********************************@microsoft.com...
<snip>
> NewLine need not be the delimiter. Actually, my requirement is that I have to
> serialize two binary files into a single binary file and then deserialize it.
> The delimiter can be anything, for that matter. I just need a way to find the
> position of the delimiter in that file.


If YOU are the one responsible for combining the data and then separating it, is there any reason
why you can't have a header in the file? If you could include a header, you could easily include the
sizes/offsets of the raw chunks. Then you would have no need of a delimiter.
If your hands are tied and you can do nothing more than a delimiter, then you have problems. You
need to choose a delimiter that CANNOT exist in the binary data, but ANY value can exist in binary
data. You would need to scan your data to make sure that the delimiter is acceptable, and then find
a way to keep track of what the delimiter was.
If your only option is to use a delimiter, you have no choice but to search for it linearly,
and you may need a multi-byte delimiter if every 8-bit combination exists in the data.

I personally would fight for the header.

Good luck,
Bill

Dec 26 '05 #10
Rohith <Ro****@discussions.microsoft.com> wrote:
> Thanks David. If I have to check every byte for a delimiter, it would
> be a huge performance blow... I was actually wondering whether there is
> an alternative solution for this?


The cost of looking through memory is likely to be much smaller than
the IO cost in the first place.

As Bill suggested though, if you're the one who gets to combine the
files, it's easy - just include the lengths of each file.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 27 '05 #11
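A minimal sketch of the "include the lengths" idea that Bill and Jon suggest, assuming exactly two files and a layout chosen here purely for illustration: an 8-byte length of the first file, then the first file's bytes, then the second file's bytes. The names `LengthPrefixed`, `Combine`, and `Split` are hypothetical:

```csharp
using System;
using System.IO;

public static class LengthPrefixed
{
    // Assumed layout: [8-byte length of first file][first file][second file].
    public static void Combine(string firstPath, string secondPath, string combinedPath)
    {
        using (var output = File.Create(combinedPath))
        using (var writer = new BinaryWriter(output))
        using (var first = File.OpenRead(firstPath))
        using (var second = File.OpenRead(secondPath))
        {
            writer.Write(first.Length);   // Int64, written little-endian
            writer.Flush();
            first.CopyTo(output);
            second.CopyTo(output);
        }
    }

    public static void Split(string combinedPath, string firstPath, string secondPath)
    {
        using (var input = File.OpenRead(combinedPath))
        using (var reader = new BinaryReader(input))
        using (var first = File.Create(firstPath))
        using (var second = File.Create(secondPath))
        {
            long remaining = reader.ReadInt64();
            var buffer = new byte[64 * 1024];
            while (remaining > 0)
            {
                int read = input.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0)
                    throw new EndOfStreamException("Combined file is truncated.");
                first.Write(buffer, 0, read);
                remaining -= read;
            }
            input.CopyTo(second);   // everything left belongs to the second file
        }
    }
}
```

With this layout nothing is searched for: the split point is known after reading eight bytes, and the payload may contain any byte values.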
You're gonna have to read the file to split it anyway. If there is no
header that tells where the delimiter is or if you cannot create one,
then you will have to read the file in manually.

Typically, you would read a certain amount at a time into a memory
buffer, for example 4K, then search that buffer for the delimiter.

The performance should not be too bad.

Dec 27 '05 #12
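The buffered search Chris describes can be sketched as follows for a multi-byte delimiter, which Bill mentions may be needed. The detail worth showing is that a delimiter can straddle two reads, so the last few bytes of each chunk are carried into the next one. `DelimiterSearch.IndexOf` is a hypothetical helper, the 4 KB default comes from the post above, and the inner comparison is a plain brute-force scan rather than an optimized search algorithm:

```csharp
using System;
using System.IO;

public static class DelimiterSearch
{
    // Returns the absolute offset of the first occurrence of pattern in the
    // stream, or -1 if it is absent. Reads up to bufferSize bytes at a time.
    public static long IndexOf(Stream input, byte[] pattern, int bufferSize = 4096)
    {
        if (pattern == null || pattern.Length == 0)
            throw new ArgumentException("Pattern must not be empty.", nameof(pattern));
        if (bufferSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(bufferSize));

        int keep = pattern.Length - 1;      // tail carried into the next chunk
        var buffer = new byte[bufferSize + keep];
        int carried = 0;                    // carried-over bytes at buffer start
        long bufferStart = 0;               // absolute offset of buffer[0]

        int read;
        while ((read = input.Read(buffer, carried, bufferSize)) > 0)
        {
            int valid = carried + read;
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return bufferStart + i;
            }

            // Keep the last (pattern.Length - 1) bytes so that a match
            // spanning two reads is still found on the next pass.
            int tail = Math.Min(keep, valid);
            Array.Copy(buffer, valid - tail, buffer, 0, tail);
            bufferStart += valid - tail;
            carried = tail;
        }
        return -1;
    }
}
```

Every byte is still examined once, which matches the point made in the thread: the scan is linear, and the disk reads are likely to dominate the cost.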
Thanks for the replies.

I would not be able to add a header to the files, as I have a set of binary
files from a previous version of my application that do not have a header. Now my new
requirement is to combine two separate binary files into a single binary file and
deserialize them. But if I add a header to the new files, I will not be able to
identify which files to split and which not.

"Chris Dunaway" wrote:
> You're gonna have to read the file to split it anyway. If there is no
> header that tells where the delimiter is or if you cannot create one,
> then you will have to read the file in manually.
>
> Typically, you would read a certain amount at a time into a memory
> buffer, for example 4K, then search that buffer for the delimiter.
>
> The performance should not be too bad.
Dec 28 '05 #13
"Rohith" <Ro****@discussions.microsoft.com> wrote in message
news:31**********************************@microsoft.com...
> Thanks for the replies.
>
> I would not be able to add a header to the files, as I have a set of binary
> files from a previous version of my application that do not have a header. Now my new
> requirement is to combine two separate binary files into a single binary file and
> deserialize them. But if I add a header to the new files, I will not be able to
> identify which files to split and which not.


Sure you can.

Add the header to the Compound files only.
Start it off with a MAGIC String of bytes that remains the same.
Although there is a tiny possibility of incorrectly identifying a Simple file as being Compound, you
can control how tiny by extending the MAGIC String length.

Also

If the binary data is not random there will be some sequences of bytes that are FAR more likely than
others. Careful selection of the MAGIC String can effectively eliminate a false positive.

Good Luck
Bill
Dec 28 '05 #14
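A minimal sketch of writing a compound file the way Bill outlines: a fixed MAGIC byte string first, then the header, then the data. The 16 byte values below are arbitrary placeholders invented for this example, and the header is assumed to hold nothing but the length of the first part. `CompoundWriter` is a hypothetical name:

```csharp
using System;
using System.IO;

public static class CompoundWriter
{
    // Placeholder signature (an assumption for this sketch). The point is
    // that it is hard-coded, so it is identical in every compound file.
    public static readonly byte[] Magic =
    {
        0x9C, 0x3E, 0x51, 0xA7, 0x0B, 0xD4, 0x4F, 0x82,
        0xB6, 0x1D, 0xE9, 0x70, 0x25, 0xCA, 0x68, 0xF3
    };

    // Assumed layout: [Magic][8-byte length of first part][first part][second part].
    public static void Write(string firstPath, string secondPath, string compoundPath)
    {
        using (var output = File.Create(compoundPath))
        using (var writer = new BinaryWriter(output))
        using (var first = File.OpenRead(firstPath))
        using (var second = File.OpenRead(secondPath))
        {
            writer.Write(Magic);          // raw bytes, no length prefix
            writer.Write(first.Length);   // Int64, written little-endian
            writer.Flush();
            first.CopyTo(output);
            second.CopyTo(output);
        }
    }
}
```

Simple (single) files are written exactly as before, with no header at all, so existing files stay valid.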
As it's a huge file (nearly 2 GB), it would not be easy to form magic bytes
that will be present only once. Also, the text present in the binary file will
not be the same. So to find the magic bytes, do I have to search through the
file every time before serializing?

"Bill Butler" wrote:
> Sure you can.
>
> Add the header to the Compound files only.
> Start it off with a MAGIC String of bytes that remains the same.
> Although there is a tiny possibility of incorrectly identifying a Simple file as being Compound, you
> can control how tiny by extending the MAGIC String length.
>
> Also
>
> If the binary data is not random there will be some sequences of bytes that are FAR more likely than
> others. Careful selection of the MAGIC String can effectively eliminate a false positive.
Dec 28 '05 #15
Rohith <Ro****@discussions.microsoft.com> wrote:
> As it's a huge file (nearly 2 GB), it would not be easy to form magic bytes
> that will be present only once.

They'd only have to not be present at the start of the old files.
Compare that with your delimiter idea which relies on the delimiter
*never* being present.

> Also, the text present in the binary file will
> not be the same. So to find the magic bytes, do I have to search through the
> file every time before serializing?


No, you'd look for the magic bytes at the start of the file when
*deserializing*.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 28 '05 #16

"Jon Skeet [C# MVP]" wrote:
> No, you'd look for the magic bytes at the start of the file when
> *deserializing*.

But how do I ensure that the files from the previous version of my application do not
have these magic bytes at the start of the file? Also, how do I identify the
magic bytes?
Dec 28 '05 #17
Rohith <Ro****@discussions.microsoft.com> wrote:
> "Jon Skeet [C# MVP]" wrote:
>> No, you'd look for the magic bytes at the start of the file when
>> *deserializing*.
>
> But how do I ensure that the files from the previous version of my application do not
> have these magic bytes at the start of the file?


Well, what are these files? Many file formats already have a magic
number at the start of the file.

It seems to me that if you're only now considering how to deal with the
problem, then you've got that problem whether you use extra headers or
not. I don't see how your delimiter idea is any better (and it strikes
me as more likely to be a lot worse).

Where are these files coming from? Can you change existing ones when
you upgrade to a new version of your software?

If you generate a random sequence of 16 bytes, the chances of any
existing files happening to start with that same sequence is
*extremely* small (the same as two GUIDs colliding). I suspect that
would actually be good enough, and the best you can do in the situation
you're in.
> Also, how do I identify the magic bytes?


That's simple - by reading the first 8 bytes (or however long your
magic number is) and seeing whether or not they are the same as the
magic number.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 28 '05 #18
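A minimal sketch of the two steps Jon describes: producing a random 16-byte sequence once, and later checking whether a file starts with it. `MagicNumber`, `GenerateLiteral`, and `StartsWith` are hypothetical names. The generator is meant to be run a single time during development, with its output pasted into the source as a constant:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class MagicNumber
{
    // One-time helper: returns 16 random bytes formatted as a C# array
    // initializer, ready to be hard-coded as the application's magic number.
    public static string GenerateLiteral()
    {
        var bytes = new byte[16];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return "{ 0x" + BitConverter.ToString(bytes).Replace("-", ", 0x") + " }";
    }

    // True if the stream begins with the given magic bytes. The stream's
    // position is left just after the bytes that were read.
    public static bool StartsWith(Stream input, byte[] magic)
    {
        var head = new byte[magic.Length];
        int total = 0;
        while (total < head.Length)
        {
            int read = input.Read(head, total, head.Length - total);
            if (read == 0) return false;    // file is shorter than the magic number
            total += read;
        }
        for (int i = 0; i < magic.Length; i++)
        {
            if (head[i] != magic[i]) return false;
        }
        return true;
    }
}
```

The check reads only the first few bytes of the file, so its cost does not depend on the file size.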
> Well, what are these files? Many file formats already have a magic
> number at the start of the file.
>
> It seems to me that if you're only now considering how to deal with the
> problem, then you've got that problem whether you use extra headers or
> not. I don't see how your delimiter idea is any better (and it strikes
> me as more likely to be a lot worse).
>
> Where are these files coming from? Can you change existing ones when
> you upgrade to a new version of your software?

No. I will not even be able to find out whether a file is an older-version or
newer-version file.

> If you generate a random sequence of 16 bytes, the chances of any
> existing files happening to start with that same sequence is
> *extremely* small (the same as two GUIDs colliding). I suspect that
> would actually be good enough, and the best you can do in the situation
> you're in.

My previous-version files do not have these magic bytes written. But when I
deserialize them, I will not be in a position to tell whether a file is a
previous-version or new-version file. So I will not be able to get the length of the
first file from the magic bytes.
Dec 28 '05 #19
Rohith <Ro****@discussions.microsoft.com> wrote:
>> Well, what are these files? Many file formats already have a magic
>> number at the start of the file.
>>
>> It seems to me that if you're only now considering how to deal with the
>> problem, then you've got that problem whether you use extra headers or
>> not. I don't see how your delimiter idea is any better (and it strikes
>> me as more likely to be a lot worse).
>>
>> Where are these files coming from? Can you change existing ones when
>> you upgrade to a new version of your software?
>
> No. I will not even be able to find out whether a file is an older-version or
> newer-version file.

Okay. In that case, you would certainly have no chance with a single-byte
delimiter as you were planning, would you?

>> If you generate a random sequence of 16 bytes, the chances of any
>> existing files happening to start with that same sequence is
>> *extremely* small (the same as two GUIDs colliding). I suspect that
>> would actually be good enough, and the best you can do in the situation
>> you're in.
>
> My previous-version files do not have these magic bytes written. But when I
> deserialize them, I will not be in a position to tell whether a file is a
> previous-version or new-version file. So I will not be able to get the length of the
> first file from the magic bytes.


You can find the length of a whole file very easily (eg use
FileStream.Length after opening it, or FileInfo.Length).

Basically, if you start deserializing and don't see the magic number,
the contents is just the whole of the file.

If you *do* see the magic number, you then read whatever header
information you've put into the new files, and deserialize
appropriately.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 28 '05 #20
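The branching Jon describes (no magic number means the content is the whole file, otherwise read the header and split) can be sketched as below. This assumes the same illustrative layout as a compound file made of a fixed 16-byte magic number, an 8-byte length of the first part, and then the two parts. The magic values are placeholders and `CompoundReader` is a hypothetical name:

```csharp
using System;
using System.IO;

public static class CompoundReader
{
    // Placeholder signature (an assumption); must match what the writer uses.
    public static readonly byte[] Magic =
    {
        0x9C, 0x3E, 0x51, 0xA7, 0x0B, 0xD4, 0x4F, 0x82,
        0xB6, 0x1D, 0xE9, 0x70, 0x25, 0xCA, 0x68, 0xF3
    };

    // Returns 1 for an old-style file (copied whole to firstPath) or
    // 2 for a compound file (split into firstPath and secondPath).
    public static int Extract(string path, string firstPath, string secondPath)
    {
        using (var input = File.OpenRead(path))
        using (var reader = new BinaryReader(input))
        {
            byte[] head = reader.ReadBytes(Magic.Length);
            if (!IsMagic(head))
            {
                // No magic number: the content is just the whole of the file.
                input.Position = 0;
                using (var whole = File.Create(firstPath))
                    input.CopyTo(whole);
                return 1;
            }

            long firstLength = reader.ReadInt64();
            using (var first = File.Create(firstPath))
                CopyBytes(input, first, firstLength);
            using (var second = File.Create(secondPath))
                input.CopyTo(second);
            return 2;
        }
    }

    private static bool IsMagic(byte[] head)
    {
        if (head.Length != Magic.Length) return false;   // file too short
        for (int i = 0; i < Magic.Length; i++)
            if (head[i] != Magic[i]) return false;
        return true;
    }

    private static void CopyBytes(Stream source, Stream target, long count)
    {
        var buffer = new byte[64 * 1024];
        while (count > 0)
        {
            int read = source.Read(buffer, 0, (int)Math.Min(buffer.Length, count));
            if (read == 0) throw new EndOfStreamException("Compound file is truncated.");
            target.Write(buffer, 0, read);
            count -= read;
        }
    }
}
```

The reader never needs to be told which kind of file it was given; the first 16 bytes decide.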
> Basically, if you start deserializing and don't see the magic number,
> the contents is just the whole of the file.
>
> If you *do* see the magic number, you then read whatever header
> information you've put into the new files, and deserialize
> appropriately.

Every time I serialize a file, I have to generate magic bytes (say 16 bytes)
and add them to the file. But at the time of deserializing I won't have that
magic number with me (since I am serializing with new magic bytes every
time, and I can deserialize any file in my application). So with the
previous-version files, if I take the first 16 bytes I will not be able to tell
whether they are magic bytes or actual data.
Dec 28 '05 #21
Rohith <Ro****@discussions.microsoft.com> wrote:
>> Basically, if you start deserializing and don't see the magic number,
>> the contents is just the whole of the file.
>>
>> If you *do* see the magic number, you then read whatever header
>> information you've put into the new files, and deserialize
>> appropriately.
>
> Every time I serialize a file, I have to generate magic bytes (say 16 bytes)
> and add them to the file. But at the time of deserializing I won't have that
> magic number with me (since I am serializing with new magic bytes every
> time, and I can deserialize any file in my application). So with the
> previous-version files, if I take the first 16 bytes I will not be able to tell
> whether they are magic bytes or actual data.


But this is what I was saying before - if you generate a random set of
16 bytes to be your magic number (and that's the same for *every* file
you create) then the chances of another file starting with the exact
same 16 bytes are incredibly small.

There is absolutely no way round that kind of problem being present at
all, because however you decide to serialize your files, there's always
a possibility that there will be an old file with exactly the same
content as a serialized pair of files.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Dec 28 '05 #22
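To put a rough number on "incredibly small": assuming, purely for illustration, that the first 16 bytes (128 bits) of an old file behave like uniformly random bits, the chance that one old file starts with the fixed magic number, and an upper bound for N old files, work out as below. Real files are not random, so this is an order-of-magnitude sketch only.

```latex
P_{\text{one file}} = 2^{-128} \approx 2.9 \times 10^{-39},
\qquad
P_{\text{any of } N \text{ files}} \le N \cdot 2^{-128}
```

Even for N = 10^9 old files, the bound is about 2.9 × 10^-30.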
