Hi all,
I try to create a tool to check the delta (diff) of 2
binaries and create the delta binary. I use binary
formatter (serialization) to create the delta binary. It
works fine but the delta binary is pretty huge in size. I
have 1 byte file and 2 bytes file, the delta should be 1
byte but somehow it turns out to be 249 bytes using binary
formatter. I guess serialization has some other things
added to the delta file.
Is there any better way to create delta binary from 2
given files, but the delta has to be able to be
constructed back to the original file? My current
solution, using serialization binary formatter is great
and easy but the size of the delta file... *sigh*
Please advice, thanks!
-CL 9 6380
Binary diff is a tough problem. So far, the best paper I have seen on the
subject is Andrew Tridgell's paper on rsync. It actually tackles a slightly
more difficult problem that just binary diff, i.e. binary diff between local
and remote file, but the general idea (rolling checksum + strong checksum)
can be used to compute a binary diff between two local files in an efficient
way.
Andrew's paper is on http://samba.org/rsync/tech_report/
I don't undestand how binary serialization would help you implement binary
diff. Seems like it is only going to make matters worse, but maybe you want
to do something slightly different than standard binary diff.
Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
Instead of serializing the byte[] diff, why can't you just write that to a
binary diff file (i.e. no serialization)?
--
William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
Hi Bruno,
"...I don't undestand how binary serialization would help
you implement binary diff..."
I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.
To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.
"...but maybe you want to do something slightly different
than standard binary diff..."
I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.
Thanks!
-CL -----Original Message----- Binary diff is a tough problem. So far, the best paper I
have seen on thesubject is Andrew Tridgell's paper on rsync. It actually
tackles a slightlymore difficult problem that just binary diff, i.e. binary
diff between localand remote file, but the general idea (rolling checksum +
strong checksum)can be used to compute a binary diff between two local
files in an efficientway.
Andrew's paper is on http://samba.org/rsync/tech_report/
I don't undestand how binary serialization would help you
implement binarydiff. Seems like it is only going to make matters worse,
but maybe you wantto do something slightly different than standard binary
diff. Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit
dans le message denews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary.
It works fine but the delta binary is pretty huge in
size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using
binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
.
Hi Will,
I actually store the byte diff in a hashtable (key =
offset, value = byte diff). With serialization, it's
easier to maintain the state of the hashtable.
I was once thought about put everything in one byte[] and,
like you suggested, no serialization-write it to a file.
But this way, for each byte diff, I have to store the
offset, length of the byte diff, and the byte itself. And
to reconstruct file B back, I have to have 3 readings to
get the offset, length, and read the next length byte from
get the byte diff.
File A:
+-------------------+
| A | B | C | D | E |
+-------------------+
File B:
+-------------------+
| A | b | c | D | e |
+-------------------+
byte[] diff:
+---------------------------+
| 1 | 2 | b | c | 4 | 1 | e |
+---------------------------+
So:
- diff[0] = 1 indicates that A[1] has changed
- diff[1] = 2 indicates that I have to read the next 2
bytes from diff[] to get 'b' and 'c' and stick them into
file A
- and this goes on and on...
Isn't it too much work to do?
Well, above is just one example and the size of file A and
B happen to be the same. One also needs to consider other
cases.
Any more idea?
-CL -----Original Message----- Instead of serializing the byte[] diff, why can't you
just write that to abinary diff file (i.e. no serialization)?
-- William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in
messagenews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary.
It works fine but the delta binary is pretty huge in
size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using
binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
Your algorithm will work and produce a diff that is significantly smaller
than the complete files only in very special circumstances, i.e., if a few
isolated bytes get replaced by some other bytes in your file. But it does
not qualify for a general purpose binary diff utility.
If the file is reorganized in some way (bytes inserted, deleted or moved
around), your algo will not find any match and will produce an output which
is much larger than the files themselves.
And even if the file is not reorganized like that, your algo will perform
very poorly if sequences of bytes change between two versions, because your
are storing the diffs on individual bytes. You should at least try to find
differing sequences rather than individual bytes.
Finally, dumping a Hashtable is not an optimal way to save the difference. I
suggest you take a look at the GDIFF format for a good example on how a
binary diff could be stored: http://www.w3.org/TR/NOTE-gdiff-19970901.
Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,
"...I don't undestand how binary serialization would help
you implement binary diff..."
I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.
To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.
"...but maybe you want to do something slightly different
than standard binary diff..."
I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.
Thanks!
-CL -----Original Message----- Binary diff is a tough problem. So far, the best paper I
have seen on thesubject is Andrew Tridgell's paper on rsync. It actually
tackles a slightlymore difficult problem that just binary diff, i.e. binary
diff between localand remote file, but the general idea (rolling checksum +
strong checksum)can be used to compute a binary diff between two local
files in an efficientway.
Andrew's paper is on http://samba.org/rsync/tech_report/
I don't undestand how binary serialization would help you
implement binarydiff. Seems like it is only going to make matters worse,
but maybe you wantto do something slightly different than standard binary
diff. Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit
dans le message denews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary.
It works fine but the delta binary is pretty huge in
size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using
binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
.
To get a feel for what is being stored, serialize the class using XML
serializer and look at. As you will see, its not the data but the metadata
that takes a lot of space. Now imagin how you might store the same data and
metadata using a binary format. Its not xml (although there is bxml in the
works), but you see the issue with creating a general format to reconstruct
your object graph. Unless you come up with your own special method to
serialize your structure, binary is probably the smallest you will get. If
you must go more compact, I would probably look at a function to take your
hashtable and convert it to a byte[] or a comma delimited text file and back
again.
--
William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:17****************************@phx.gbl... Hi Will,
I actually store the byte diff in a hashtable (key = offset, value = byte diff). With serialization, it's easier to maintain the state of the hashtable.
I was once thought about put everything in one byte[] and, like you suggested, no serialization-write it to a file. But this way, for each byte diff, I have to store the offset, length of the byte diff, and the byte itself. And to reconstruct file B back, I have to have 3 readings to get the offset, length, and read the next length byte from get the byte diff.
File A: +-------------------+ | A | B | C | D | E | +-------------------+
File B: +-------------------+ | A | b | c | D | e | +-------------------+
byte[] diff: +---------------------------+ | 1 | 2 | b | c | 4 | 1 | e | +---------------------------+
So: - diff[0] = 1 indicates that A[1] has changed - diff[1] = 2 indicates that I have to read the next 2 bytes from diff[] to get 'b' and 'c' and stick them into file A - and this goes on and on...
Isn't it too much work to do?
Well, above is just one example and the size of file A and B happen to be the same. One also needs to consider other cases.
Any more idea? -CL
-----Original Message----- Instead of serializing the byte[] diff, why can't you just write that to abinary diff file (i.e. no serialization)?
-- William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in messagenews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
To get a feel for what is being stored, serialize the class using XML
serializer and look at. As you will see, its not the data but the metadata
that takes a lot of space. Now imagin how you might store the same data and
metadata using a binary format. Its not xml (although there is bxml in the
works), but you see the issue with creating a general format to reconstruct
your object graph. Unless you come up with your own special method to
serialize your structure, binary is probably the smallest you will get. If
you must go more compact, I would probably look at a function to take your
hashtable and convert it to a byte[] or a comma delimited text file and back
again.
--
William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in message
news:17****************************@phx.gbl... Hi Will,
I actually store the byte diff in a hashtable (key = offset, value = byte diff). With serialization, it's easier to maintain the state of the hashtable.
I was once thought about put everything in one byte[] and, like you suggested, no serialization-write it to a file. But this way, for each byte diff, I have to store the offset, length of the byte diff, and the byte itself. And to reconstruct file B back, I have to have 3 readings to get the offset, length, and read the next length byte from get the byte diff.
File A: +-------------------+ | A | B | C | D | E | +-------------------+
File B: +-------------------+ | A | b | c | D | e | +-------------------+
byte[] diff: +---------------------------+ | 1 | 2 | b | c | 4 | 1 | e | +---------------------------+
So: - diff[0] = 1 indicates that A[1] has changed - diff[1] = 2 indicates that I have to read the next 2 bytes from diff[] to get 'b' and 'c' and stick them into file A - and this goes on and on...
Isn't it too much work to do?
Well, above is just one example and the size of file A and B happen to be the same. One also needs to consider other cases.
Any more idea? -CL
-----Original Message----- Instead of serializing the byte[] diff, why can't you just write that to abinary diff file (i.e. no serialization)?
-- William Stacey, DNS MVP
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> wrote in messagenews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary. It works fine but the delta binary is pretty huge in size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
Hi Ching-Lung,
I read your reply to William and saw that your algo is smart enough to find
sequences that differ, not just isolated bytes. So, my judgement was a bit
severe.
Anyway, you should still take a look at the GDIFF format, it is a smart
format for storing the diff and it is relatively easy to generate and
process (you can take some shortcuts and eliminate the byte/short/int
variants of the tags if you don't need an optimal size).
Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,
"...I don't undestand how binary serialization would help
you implement binary diff..."
I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.
To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.
"...but maybe you want to do something slightly different
than standard binary diff..."
I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.
Thanks!
-CL -----Original Message----- Binary diff is a tough problem. So far, the best paper I
have seen on thesubject is Andrew Tridgell's paper on rsync. It actually
tackles a slightlymore difficult problem that just binary diff, i.e. binary
diff between localand remote file, but the general idea (rolling checksum +
strong checksum)can be used to compute a binary diff between two local
files in an efficientway.
Andrew's paper is on http://samba.org/rsync/tech_report/
I don't undestand how binary serialization would help you
implement binarydiff. Seems like it is only going to make matters worse,
but maybe you wantto do something slightly different than standard binary
diff. Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit
dans le message denews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary.
It works fine but the delta binary is pretty huge in
size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using
binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
.
Hi Ching-Lung,
I read your reply to William and saw that your algo is smart enough to find
sequences that differ, not just isolated bytes. So, my judgement was a bit
severe.
Anyway, you should still take a look at the GDIFF format, it is a smart
format for storing the diff and it is relatively easy to generate and
process (you can take some shortcuts and eliminate the byte/short/int
variants of the tags if you don't need an optimal size).
Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit dans le message de
news:05****************************@phx.gbl...
Hi Bruno,
"...I don't undestand how binary serialization would help
you implement binary diff..."
I compare each byte from byte[] of file A and byte[] of
file B. If they are diff then I store the byte diff in a
hashtable (key = offset, value = byte diff). There are
some other cases to consider, i.e. if the size of those 2
files are not the same, etc. Once done, I serialize the
hashtable.
To reconstruct file B from file A with this delta binary,
I deserialize the hashtable, do byte comparison between
file A and the hashtable with one assumption that file A
is the same as file A I created the binary diff from.
"...but maybe you want to do something slightly different
than standard binary diff..."
I don't know much about standard binary diff, but I assume
that it's all described in Andrew's paper. I'll check it
out soon.
Thanks!
-CL -----Original Message----- Binary diff is a tough problem. So far, the best paper I
have seen on thesubject is Andrew Tridgell's paper on rsync. It actually
tackles a slightlymore difficult problem that just binary diff, i.e. binary
diff between localand remote file, but the general idea (rolling checksum +
strong checksum)can be used to compute a binary diff between two local
files in an efficientway.
Andrew's paper is on http://samba.org/rsync/tech_report/
I don't undestand how binary serialization would help you
implement binarydiff. Seems like it is only going to make matters worse,
but maybe you wantto do something slightly different than standard binary
diff. Bruno.
"Ching-Lung" <ch*******@alumni.cs.utexas.edu> a écrit
dans le message denews:10****************************@phx.gbl... Hi all,
I try to create a tool to check the delta (diff) of 2 binaries and create the delta binary. I use binary formatter (serialization) to create the delta binary.
It works fine but the delta binary is pretty huge in
size. I have 1 byte file and 2 bytes file, the delta should be 1 byte but somehow it turns out to be 249 bytes using
binary formatter. I guess serialization has some other things added to the delta file.
Is there any better way to create delta binary from 2 given files, but the delta has to be able to be constructed back to the original file? My current solution, using serialization binary formatter is great and easy but the size of the delta file... *sigh*
Please advice, thanks! -CL
. This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics
by: André Rosendaal |
last post by:
Hi,
I need a test so I can determine whether a url points to a text file or to a
binary file. Actually, I need to distinguish between streaming files and
metafiles (e.g. asx files). I tried...
|
by: Guyon Morée |
last post by:
what is the difference?
if I open a text file in binary (rb) mode, it doesn't matter... the read()
output is the same.
|
by: yaipa |
last post by:
What would be the common sense way of finding a binary pattern in a
..bin file, say some 200 bytes, and replacing it with an updated pattern
of the same length at the same offset?
Also, the...
|
by: Viviana Vc |
last post by:
Hi all,
I am using Win2k, VS. NET 7.1 (MS development Environment 2003 7.1.3088)
and I noticed that by building the exact same code twice the generated
binaries are different (not much, but they...
|
by: christopher diggins |
last post by:
// binary_aritmetic.hpp
// Diggins PDP (Public Domain Post) #1.2
// by Christopher Diggins, May 24, 2005
//
// Comment:
// Provides several algorithms for unsigned binary arithmetic
//
//...
|
by: Steven T. Hatton |
last post by:
§27.4.2.1.4 Type ios_base::openmode
Says this about the std::ios::binary openmode flag:
*binary*: perform input and output in binary mode (as opposed to text mode)
And that is basically _all_ it...
|
by: Josh Carlisle |
last post by:
I posted a question a few days ago concerning file differencing and I got
some thought provoking answers. My original question dealt with identifying
file differences for storage of multiple...
|
by: comp.lang.php |
last post by:
I borrowed the following function from the PHP manual user notes:
if (!function_exists('is_binary')) {
/**
* Determine if a file is binary. Useful for doing file content
editing
*
*...
|
by: whitehatmiracle |
last post by:
Hello
I have written a program for a binary tree. On compiling one has to
first chose option 1 and then delete or search. Im having some trouble
with the balancing function. It seems to be going...
|
by: Kemmylinns12 |
last post by:
Blockchain technology has emerged as a transformative force in the business world, offering unprecedented opportunities for innovation and efficiency. While initially associated with cryptocurrencies...
|
by: Naresh1 |
last post by:
What is WebLogic Admin Training?
WebLogic Admin Training is a specialized program designed to equip individuals with the skills and knowledge required to effectively administer and manage Oracle...
|
by: antdb |
last post by:
Ⅰ. Advantage of AntDB: hyper-convergence + streaming processing engine
In the overall architecture, a new "hyper-convergence" concept was proposed, which integrated multiple engines and...
|
by: Arjunsri |
last post by:
I have a Redshift database that I need to use as an import data source. I have configured the DSN connection using the server, port, database, and credentials and received a successful connection...
|
by: WisdomUfot |
last post by:
It's an interesting question you've got about how Gmail hides the HTTP referrer when a link in an email is clicked. While I don't have the specific technical details, Gmail likely implements measures...
|
by: Matthew3360 |
last post by:
Hi,
I have been trying to connect to a local host using php curl. But I am finding it hard to do this. I am doing the curl get request from my web server and have made sure to enable curl. I get a...
|
by: Carina712 |
last post by:
Setting background colors for Excel documents can help to improve the visual appeal of the document and make it easier to read and understand. Background colors can be used to highlight important...
|
by: Johno34 |
last post by:
I have this click event on my form. It speaks to a Datasheet Subform
Private Sub Command260_Click()
Dim r As DAO.Recordset
Set r = Form_frmABCD.Form.RecordsetClone
r.MoveFirst
Do
If...
|
by: jack2019x |
last post by:
hello, Is there code or static lib for hook swapchain present?
I wanna hook dxgi swapchain present for dx11 and dx9.
| |