473,385 Members | 1,588 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

binary diff

Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL
Nov 15 '05 #1
9 6485
Binary diff is a tough problem. So far, the best paper I have seen on the
subject is Andrew Tridgell's paper on rsync. It actually tackles a slightly
more difficult problem that just binary diff, i.e. binary diff between local
and remote file, but the general idea (rolling checksum + strong checksum)
can be used to compute a binary diff between two local files in an efficient
way.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binary
diff. Seems like it is only going to make matters worse, but maybe you want
to do something slightly different than standard binary diff.

Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

Nov 15 '05 #2
Instead of serializing the byte[] diff, why can't you just write that to a
binary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

Nov 15 '05 #3
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL

-----Original Message-----
Binary diff is a tough problem. So far, the best paper I have seen on thesubject is Andrew Tridgell's paper on rsync. It actually tackles a slightlymore difficult problem that just binary diff, i.e. binary diff between localand remote file, but the general idea (rolling checksum + strong checksum)can be used to compute a binary diff between two local files in an efficientway.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binarydiff. Seems like it is only going to make matters worse, but maybe you wantto do something slightly different than standard binary diff.
Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message denews:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

.

Nov 15 '05 #4
Hi Will,

I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.

I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.

File A:
+-------------------+
| A | B | C | D | E |
+-------------------+

File B:
+-------------------+
| A | b | c | D | e |
+-------------------+

byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+

So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...

Isn't it too much work to do?

Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.

Any more idea?
-CL


-----Original Message-----
Instead of serializing the byte[] diff, why can't you just write that to abinary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in messagenews:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL


Nov 15 '05 #5
Your algorithm will work and produce a diff that is significantly smaller
than the complete files only in very special circumstances, i.e., if a few
isolated bytes get replaced by some other bytes in your file. But it does
not qualify for a general purpose binary diff utility.

If the file is reorganized in some way (bytes inserted, deleted or moved
around), your algo will not find any match and will produce an output which
is much larger than the files themselves.

And even if the file is not reorganized like that, your algo will perform
very poorly if sequences of bytes change between two versions, because your
are storing the diffs on individual bytes. You should at least try to find
differing sequences rather than individual bytes.

Finally, dumping a Hashtable is not an optimal way to save the difference. I
suggest you take a look at the GDIFF format for a good example on how a
binary diff could be stored: http://www.w3.org/TR/NOTE-gdiff-19970901.

Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL

-----Original Message-----
Binary diff is a tough problem. So far, the best paper I have seen on thesubject is Andrew Tridgell's paper on rsync. It actually tackles a slightlymore difficult problem that just binary diff, i.e. binary diff between localand remote file, but the general idea (rolling checksum + strong checksum)can be used to compute a binary diff between two local files in an efficientway.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binarydiff. Seems like it is only going to make matters worse, but maybe you wantto do something slightly different than standard binary diff.
Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message denews:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

.

Nov 15 '05 #6
To get a feel for what is being stored, serialize the class using XML
serializer and look at. As you will see, its not the data but the metadata
that takes a lot of space. Now imagin how you might store the same data and
metadata using a binary format. Its not xml (although there is bxml in the
works), but you see the issue with creating a general format to reconstruct
your object graph. Unless you come up with your own special method to
serialize your structure, binary is probably the smallest you will get. If
you must go more compact, I would probably look at a function to take your
hashtable and convert it to a byte[] or a comma delimited text file and back
again.

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:17****************************@phx.gbl...
Hi Will,

I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.

I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.

File A:
+-------------------+
| A | B | C | D | E |
+-------------------+

File B:
+-------------------+
| A | b | c | D | e |
+-------------------+

byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+

So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...

Isn't it too much work to do?

Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.

Any more idea?
-CL


-----Original Message-----
Instead of serializing the byte[] diff, why can't you

just write that to a
binary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in

message
news:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

Nov 15 '05 #7
To get a feel for what is being stored, serialize the class using XML
serializer and look at. As you will see, its not the data but the metadata
that takes a lot of space. Now imagin how you might store the same data and
metadata using a binary format. Its not xml (although there is bxml in the
works), but you see the issue with creating a general format to reconstruct
your object graph. Unless you come up with your own special method to
serialize your structure, binary is probably the smallest you will get. If
you must go more compact, I would probably look at a function to take your
hashtable and convert it to a byte[] or a comma delimited text file and back
again.

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:17****************************@phx.gbl...
Hi Will,

I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.

I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.

File A:
+-------------------+
| A | B | C | D | E |
+-------------------+

File B:
+-------------------+
| A | b | c | D | e |
+-------------------+

byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+

So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...

Isn't it too much work to do?

Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.

Any more idea?
-CL


-----Original Message-----
Instead of serializing the byte[] diff, why can't you

just write that to a
binary diff file (i.e. no serialization)?

--
William Stacey, DNS MVP

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in

message
news:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

Nov 15 '05 #8
Hi Ching-Lung,

I read your reply to William and saw that your algo is smart enough to find
sequences that differ, not just isolated bytes. So, my judgement was a bit
severe.

Anyway, you should still take a look at the GDIFF format, it is a smart
format for storing the diff and it is relatively easy to generate and
process (you can take some shortcuts and eliminate the byte/short/int
variants of the tags if you don't need an optimal size).

Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL

-----Original Message-----
Binary diff is a tough problem. So far, the best paper I have seen on thesubject is Andrew Tridgell's paper on rsync. It actually tackles a slightlymore difficult problem that just binary diff, i.e. binary diff between localand remote file, but the general idea (rolling checksum + strong checksum)can be used to compute a binary diff between two local files in an efficientway.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binarydiff. Seems like it is only going to make matters worse, but maybe you wantto do something slightly different than standard binary diff.
Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message denews:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

.

Nov 15 '05 #9
Hi Ching-Lung,

I read your reply to William and saw that your algo is smart enough to find
sequences that differ, not just isolated bytes. So, my judgement was a bit
severe.

Anyway, you should still take a look at the GDIFF format, it is a smart
format for storing the diff and it is relatively easy to generate and
process (you can take some shortcuts and eliminate the byte/short/int
variants of the tags if you don't need an optimal size).

Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,

"...I don't undestand how binary serialization would help
you implement binary diff..."

I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.

To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.

"...but maybe you want to do something slightly different
than standard binary diff..."

I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.

Thanks!
-CL

-----Original Message-----
Binary diff is a tough problem. So far, the best paper I have seen on thesubject is Andrew Tridgell's paper on rsync. It actually tackles a slightlymore difficult problem that just binary diff, i.e. binary diff between localand remote file, but the general idea (rolling checksum + strong checksum)can be used to compute a binary diff between two local files in an efficientway.

Andrew's paper is on http://samba.org/rsync/tech_report/

I don't undestand how binary serialization would help you implement binarydiff. Seems like it is only going to make matters worse, but maybe you wantto do something slightly different than standard binary diff.
Bruno.

"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message denews:10****************************@phx.gbl...
Hi all,

I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things
added to the delta file.

Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*

Please advice, thanks!
-CL

.

Nov 15 '05 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: André Rosendaal | last post by:
Hi, I need a test so I can determine whether a url points to a text file or to a binary file. Actually, I need to distinguish between streaming files and metafiles (e.g. asx files). I tried...
17
by: Guyon Morée | last post by:
what is the difference? if I open a text file in binary (rb) mode, it doesn't matter... the read() output is the same.
13
by: yaipa | last post by:
What would be the common sense way of finding a binary pattern in a ..bin file, say some 200 bytes, and replacing it with an updated pattern of the same length at the same offset? Also, the...
4
by: Viviana Vc | last post by:
Hi all, I am using Win2k, VS. NET 7.1 (MS development Environment 2003 7.1.3088) and I noticed that by building the exact same code twice the generated binaries are different (not much, but they...
0
by: christopher diggins | last post by:
// binary_aritmetic.hpp // Diggins PDP (Public Domain Post) #1.2 // by Christopher Diggins, May 24, 2005 // // Comment: // Provides several algorithms for unsigned binary arithmetic // //...
103
by: Steven T. Hatton | last post by:
§27.4.2.1.4 Type ios_base::openmode Says this about the std::ios::binary openmode flag: *binary*: perform input and output in binary mode (as opposed to text mode) And that is basically _all_ it...
1
by: Josh Carlisle | last post by:
I posted a question a few days ago concerning file differencing and I got some thought provoking answers. My original question dealt with identifying file differences for storage of multiple...
4
by: comp.lang.php | last post by:
I borrowed the following function from the PHP manual user notes: if (!function_exists('is_binary')) { /** * Determine if a file is binary. Useful for doing file content editing * *...
4
by: whitehatmiracle | last post by:
Hello I have written a program for a binary tree. On compiling one has to first chose option 1 and then delete or search. Im having some trouble with the balancing function. It seems to be going...
0
by: Charles Arthur | last post by:
How do i turn on java script on a villaon, callus and itel keypad mobile phone
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.