We will soon have 3 copies, for testing purposes, of what should be about
4.5 terabytes of data.
Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block - with a sufficiently large
blocksize, of course.
My question then is, does python have a high-level API that would
facilitate this sort of thing, or should I just code something up based on
open and read?
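(The closest thing I've found so far is the stdlib's filecmp module, but it only compares two files at a time, so a 3-way check still means two passes over the first copy - roughly, with the function name being mine:)

```python
import filecmp

def compare_pairwise(a, b, c):
    # filecmp only does 2-way comparisons, so a 3-way check costs two
    # invocations; shallow=False forces an actual block-by-block read
    # instead of a mere os.stat() comparison.
    return filecmp.cmp(a, b, shallow=False) and filecmp.cmp(a, c, shallow=False)
```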
Thanks!
Dan Stromberg wrote: Rather than cmp'ing twice, to verify data integrity, I was thinking we could speed up the comparison a bit, by using a python script that does 3 reads, instead of 4 reads, per disk block
Use cmp. So what if you must run it twice? By the way, I
really doubt that you could speed up the process in Python
... you'll probably end up with a much slower version.
Istvan.
Istvan Albert <ia*****@mailblocks.com> wrote: Dan Stromberg wrote:
Rather than cmp'ing twice, to verify data integrity, I was thinking we could speed up the comparison a bit, by using a python script that does 3 reads, instead of 4 reads, per disk block
Use the cmp. So what if you must run it twice ... by the way I really doubt that you could speed up the process in python ... you'll probably end up with a much slower version
In this case you would be wrong. Comparing data on a processor is
trivial (and would be done using Python's C internals anyways if a
strict string equality is all that matters), but IO is expensive.
Reading Terabytes of data is going to be the bottleneck, so reducing IO
is /the/ optimization that can and should be done.
The code to do so is simple:
def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  # tune this as necessary
    p = -1
    good = 1
    while f1.tell() < p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close()  # I prefer to explicitly close my file handles
    f2.close()
    f3.close()
Note that it /may/ be faster to first convert the data into arrays
(module array) to get 2, 4 or 8 byte block comparisons.
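A rough sketch of that idea, untuned (frombytes is the Python 3 spelling of 2.x's fromstring):

```python
import array

def blocks_equal_as_arrays(b1, b2):
    # Reinterpret each block as 8-byte signed integers before comparing;
    # the block length must be a multiple of the 8-byte item size.
    a1 = array.array('q'); a1.frombytes(b1)
    a2 = array.array('q'); a2.frombytes(b2)
    return a1 == a2
```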
- Josiah
Dan Stromberg <st******@dcs.nac.uci.edu> writes: We will soon have 3 copies, for testing purposes, of what should be about 4.5 terabytes of data.
Rather than cmp'ing twice, to verify data integrity, I was thinking we could speed up the comparison a bit, by using a python script that does 3 reads, instead of 4 reads, per disk block - with a sufficiently large blocksize, of course.
My question then is, does python have a high-level API that would facilitate this sort of thing, or should I just code something up based on open and read?
Thanks!
Taking a checksum of each file and comparing them would probably be much
faster. A quick test with md5 versus cmp gave me a 10-times speedup. Though,
unexpectedly, running 2 md5 processes in parallel was slower than 2 in
sequence - could be the cheap'n'nasty HD in my desktop; normally I'd expect a
gain here, as one process uses the CPU while the other does I/O.
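In script form that checksum approach might look like this (hashlib here instead of the old md5 module; a sketch, not tuned):

```python
import hashlib

def file_md5(path, blocksize=2**20):
    # Stream the file through md5 in large blocks so memory stays flat
    # no matter how big the file is.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# The three copies match (up to md5 collisions) iff:
#   file_md5(a) == file_md5(b) == file_md5(c)
```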
Eddie
Josiah Carlson wrote: The code to do so is simple:
...
    p = -1
    good = 1
    while f1.tell() < p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
...
What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.
Another functional deficiency when compared to
the cmp diffs is that you don't know which
file has changed or which byte differs
... adding that brings about the potential
for another set of bugs. Then someone else comes along
who knows a little less about Python and adds
a little feature to the program that actually
silently breaks it ...
Whether or not it is actually faster remains
to be seen. And that was my whole point:
not to dismiss cmp too soon; see how it works,
test it, then armed with some real numbers one
can make better decisions.
Istvan.
Istvan Albert <ia*****@mailblocks.com> wrote: Josiah Carlson wrote: The code to do so is simple: ...
> p = -1
> good = 1
> while f1.tell() < p:
>     p = f1.tell()
>     if f1.read(b) == f2.read(b) == f3.read(b):
>         continue
... What is slightly amusing is that your *simple* solution is actually incorrect. You got the comparison backwards in the while loop.
My finger slipped.
Another functional deficiency when compared to the cmp diffs is that you don't know which file has changed or which byte differs ... adding that brings about the potential for another set of bugs. Then someone else comes along who knows a little less about Python and adds a little feature to the program that actually silently breaks it ...
You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", which was originally asked
about.
Whether or not it is actually faster remains to be seen. And that was my whole point: not to dismiss cmp too soon; see how it works, test it, then armed with some real numbers one can make better decisions.
Ok. Let me see. I am using the fastest computer in the apartment; my
wife's 2.26 GHz P4 without hyperthreading, with a gig of memory, running
two 40 GB ATA-100 drives. I have created 3 x 1.5 GB files on one of the
drives (I don't have 4.5tb free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.
cmp looks to be taking 0-12% processor, so I was correct in my statement
that it is likely disk bound (unless your disks are 8 times faster or
your processor is 1/10 as fast).
Oh, goodness, it is still going. I've been typing this email for over 8
minutes. It sure would be nice if cmp had a progress bar. At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.
*make little fix to code to give me a progress bar, time elapsed, and an
expected total time*
cmp: at least 8 1/2 minutes for the first cmp, was not finished, I
killed it. At least 17 minutes in total.
Python: 15 minutes, 3 seconds to finish it all.
Hrm, that is only around 5 megs/second. I think we can do better (even
if it is already better than cmp)...
New algorithm...
def compare_32(fn1, fn2, fn3):
    mn = md5.new
    st = time.time()
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20  # tune this as necessary
    p = -1
    good = 1
    total = 1.5*2**30
    digs = []
    m = mn()
    while f1.tell() > p:
        p = f1.tell()
        digs.append(m.update(f1.read(b)).digest())
        if not ((p>>20)&7) and p:
            a = 100.0*p/total/3
            d = time.time()-st
            print "%8.1f%% %8.0f\t\r"%(a,100*d/a),
    m = mn()
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()
            good = 0
            break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close()  # I prefer to explicitly close my file handles
    f2.close()
    f3.close()
That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast. It does use ~20% processor to
compute the md5s, which only makes a difference if your processor is as
fast as a 400 MHz P4.
I'd say that Python wins here. Is that concrete enough for you?
- Josiah
Josiah Carlson wrote:
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()
Oops, copy/paste bug. ^^^^^^^
Should be:
print "files 1 and 3 differ before", f3.tell()
- Josiah
Josiah Carlson <jc******@uci.edu> wrote:
Obviously this was running old code as the following lines don't work...
    digs.append(m.update(f1.read(b)).digest())
    if dig != m.update(f2.read(b)).digest():
    if dig != m.update(f3.read(b)).digest():
Those got split up into:
m.update(f1.read(b))
digs.append(m.digest())
m.update(f2.read(b))
if dig != m.digest():
m.update(f3.read(b))
if dig != m.digest():
respectively. Gah, I don't like mornings.
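For reference, with those splits folded back in (plus the "files 1 and 3" message fix from my other reply), the whole routine in Python 3 syntax would look roughly like:

```python
import hashlib

def compare_3_md5(fn1, fn2, fn3, b=2**20):
    # One sequential pass per file; the digests are cumulative, as in
    # the original, so a mismatch pinpoints the first differing block.
    with open(fn1, 'rb') as f1, open(fn2, 'rb') as f2, open(fn3, 'rb') as f3:
        digs = []
        m = hashlib.md5()
        while True:
            block = f1.read(b)
            if not block:
                break
            m.update(block)          # update() returns None, so the
            digs.append(m.digest())  # digest() call must be separate
        for f, name in ((f2, "2"), (f3, "3")):
            m = hashlib.md5()
            for dig in digs:
                m.update(f.read(b))
                if dig != m.digest():
                    return "files 1 and %s differ before %d" % (name, f.tell())
            if f.read(1) != b'':
                return "file %s is longer than file 1" % name
        return "files are identical"
```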
- Josiah
Josiah Carlson wrote: measure, if shy by around 3 orders of magnitude in terms of time.
That new one runs in 5 minutes 15 seconds total, because it exploits the
my point was never to say that it is not possible to write
a better way, nor to imply that you could not do it; I simply
said that there is no easy way around this problem.
Your solution, while short and nice, is not simple, and
requires quite a bit of knowledge to understand
why and how it works.
other topic ... sometimes I have a hard time visualizing how much
a terabyte of data actually is; this is a good example
for that: even this optimized algorithm would take over three
days to perform a simple identity check ...
Istvan.
Istvan Albert <ia*****@mailblocks.com> wrote: Josiah Carlson wrote:
measure, if shy by around 3 orders of magnitude in terms of time. That new one runs in 5 minutes 15 seconds total, because it exploits the
my point was never to say that it is not possible to write a better way, nor to imply that you could not do it, I simply said that there is no easy way around this problem.
Funny, I thought the implementation I offered was an easy way. No
easier than to use someone else's code, which took all of 5 minutes to
write.
Your solution while short and nice is not simple, and requires quite a bit of knowledge to understand why and how it works.
Anyone who has difficulty understanding the algorithm I offered that
used md5 should seriously consider switching hobbies/jobs. The algorithm
is trivial to understand.
other topic ... sometimes I have a hard time visualizing how much a terabyte of data actually is, this is a good example for that, even this optimized algorithm would take over three days to perform a simple identity check ...
It would only take that long if you could only get 15 megs/second from
your drives (it was the main boot drive and the files were fragmented).
At multi-TB data sizes, RAID arrays are common, so read speed is
generally much higher, which would then reduce the running time
significantly.
In the case of multiple drives in a non-RAID array, splitting up the
process could linearly scale to the limit of the drive and processor(s)
speed.
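A sketch of what that splitting might look like, one hashing worker per copy (a thread pool suffices here, since hashlib releases the GIL while digesting large buffers and the workers mostly wait on the disks anyway; names are illustrative):

```python
import hashlib
from multiprocessing.pool import ThreadPool

def file_md5(path, blocksize=2**20):
    # Stream one file through md5 in large sequential reads.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

def all_copies_match(paths):
    # Hash the copies concurrently - worthwhile when each copy sits on
    # its own spindle, so the reads don't contend for one disk head.
    with ThreadPool(len(paths)) as pool:
        digests = pool.map(file_md5, paths)
    return len(set(digests)) == 1
```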
- Josiah