
comparing multiple copies of terabytes of data?


We will soon have 3 copies, for testing purposes, of what should be about
4.5 terabytes of data.

Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block - with a sufficiently large
blocksize, of course.

My question then is, does python have a high-level API that would
facilitate this sort of thing, or should I just code something up based on
open and read?

Thanks!
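For reference: the closest high-level facility in the standard library is
the filecmp module, though it compares only two files at a time, so it
would not save the extra read. A minimal sketch, with hypothetical paths:

import filecmp

# shallow=False forces a byte-for-byte comparison rather than just
# comparing os.stat() signatures; the paths here are hypothetical.
same_12 = filecmp.cmp('copy1/data.img', 'copy2/data.img', shallow=False)
same_13 = filecmp.cmp('copy1/data.img', 'copy3/data.img', shallow=False)
print "all identical:", same_12 and same_13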

Jul 18 '05 #1
Dan Stromberg wrote:
Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block.


Use cmp. So what if you must run it twice... by the way, I really doubt
that you could speed up the process in Python... you'll probably end up
with a much slower version.

Istvan.
Jul 18 '05 #2

Istvan Albert <ia*****@mailblocks.com> wrote:

Dan Stromberg wrote:
Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block.


Use cmp. So what if you must run it twice... by the way, I really doubt
that you could speed up the process in Python... you'll probably end up
with a much slower version.


In this case you would be wrong. Comparing data on a processor is
trivial (and would be done using Python's C internals anyways if a
strict string equality is all that matters), but IO is expensive.
Reading Terabytes of data is going to be the bottleneck, so reducing IO
is /the/ optimization that can and should be done.

The code to do so is simple:

def compare_3(fn1, fn2, fn3):
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20 #tune this as necessary
    p = -1
    good = 1
    while f1.tell() < p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
        print "files differ"
        good = 0
        break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()
Note that it /may/ be faster to first convert the data into arrays
(module array) to get 2, 4 or 8 byte block comparisons.
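A sketch of that idea, assuming each block's length is a multiple of the
item size (f1, f2 and b are the names from the function above):

import array

def to_words(block):
    # Reinterpret a byte string as unsigned machine words; fromstring()
    # raises ValueError if len(block) is not a multiple of the item size.
    words = array.array('L')
    words.fromstring(block)  # frombytes() in Python 3
    return words

# then compare per block, e.g.:
#     to_words(f1.read(b)) == to_words(f2.read(b))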

- Josiah

Jul 18 '05 #3
Dan Stromberg <st******@dcs.nac.uci.edu> writes:

We will soon have 3 copies, for testing purposes, of what should be about
4.5 terabytes of data.

Rather than cmp'ing twice, to verify data integrity, I was thinking we
could speed up the comparison a bit, by using a python script that does 3
reads, instead of 4 reads, per disk block - with a sufficiently large
blocksize, of course.

My question then is, does python have a high-level API that would
facilitate this sort of thing, or should I just code something up based
on open and read? Thanks!


Taking a checksum of each file and comparing them would probably be much
faster. A quick test with md5 versus cmp gave me a 10 times speedup.
Though, unexpectedly, running 2 md5 processes in parallel was slower than
2 in sequence - could be the cheap'n'nasty HD in my desktop; normally I'd
expect a gain here, as one process uses the CPU while the other does IO.
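A minimal sketch of that approach, using the same md5 module the rest of
the thread uses (hashlib.md5 is the modern equivalent); the paths are
hypothetical:

import md5

def file_md5(path, blocksize=2**20):
    # Digest the file in fixed-size chunks so memory use stays flat.
    m = md5.new()
    f = open(path, 'rb')
    while 1:
        block = f.read(blocksize)
        if not block:
            break
        m.update(block)
    f.close()
    return m.hexdigest()

sums = [file_md5(p) for p in ('copy1.img', 'copy2.img', 'copy3.img')]
print "all identical:", sums[0] == sums[1] == sums[2]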

Eddie
Jul 18 '05 #4
Josiah Carlson wrote:
The code to do so is simple:
...
    p = -1
    good = 1
    while f1.tell() < p:
        p = f1.tell()
        if f1.read(b) == f2.read(b) == f3.read(b):
            continue
...

What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.

Another functional deficiency, compared to cmp, is that you don't know
which file has changed or which byte differs. Adding that brings about
the potential for another set of bugs. Then someone else comes along who
knows a little less about Python and adds a little feature to the program
that actually silently breaks it...

Whether or not it is actually faster remains to be seen. And that was my
whole point: don't dismiss cmp too soon. See how it works, test it, and
then, armed with some real numbers, one can make better decisions.
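A sketch of the kind of measurement being suggested: time the external
cmp against the Python version (compare_3 is assumed to be the function
from the previous post, with its loop condition fixed; the module name
and paths are hypothetical):

import os, time
from compare3 import compare_3  # hypothetical module holding the earlier function

def timed(label, func):
    # Crude wall-clock timing; good enough for an IO-bound comparison.
    t0 = time.time()
    func()
    print label, "took %.1f seconds" % (time.time() - t0)

timed("cmp 1 vs 2", lambda: os.system("cmp copy1.img copy2.img"))
timed("cmp 1 vs 3", lambda: os.system("cmp copy1.img copy3.img"))
timed("python 3-way", lambda: compare_3("copy1.img", "copy2.img", "copy3.img"))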

Istvan.
Jul 18 '05 #5

Istvan Albert <ia*****@mailblocks.com> wrote:

Josiah Carlson wrote:
The code to do so is simple:
...
>     p = -1
>     good = 1
>     while f1.tell() < p:
>         p = f1.tell()
>         if f1.read(b) == f2.read(b) == f3.read(b):
>             continue
...

What is slightly amusing is that your *simple*
solution is actually incorrect. You got the
comparison backwards in the while loop.


My finger slipped.

Another functional deficiency, compared to cmp, is that you don't know
which file has changed or which byte differs. Adding that brings about
the potential for another set of bugs. Then someone else comes along who
knows a little less about Python and adds a little feature to the program
that actually silently breaks it...
You are now talking about code maintenance, which is a separate issue
from "how fast can I compare three files", which was originally asked
about.

Whether or not it is actually faster remains to be seen. And that was my
whole point: don't dismiss cmp too soon. See how it works, test it, and
then, armed with some real numbers, one can make better decisions.


Ok. Let me see. I am using the fastest computer in the apartment; my
wife's 2.26 GHz P4 without hyperthreading, with a gig of memory, running
two 40GB ATA-100 drives. I have created 3x1.5GB files on one of the
drives (I don't have 4.5TB free), which should give us a reasonable
measure, if shy by around 3 orders of magnitude in terms of time.

cmp looks to be taking 0-12% processor, so I was correct in my statement
that it is likely disk bound (unless your disks are 8 times faster or
your processor is 1/10 as fast).

Oh, goodness, it is still going. I've been typing this email for over 8
minutes. It sure would be nice if cmp had a progress bar. At least
then I could know if I should kill it now or wait another few minutes.
To hell with it, I'm going to try the Python version.

*make little fix to code to give me a progress bar, time elapsed, and an
expected total time*

cmp: at least 8 1/2 minutes for the first cmp, was not finished, I
killed it. At least 17 minutes in total;
Python: 15 minutes, 3 seconds to finish it all.

Hrm, that is only around 5 megs/second. I think we can do better (even
if it is already better than cmp)...
New algorithm...
def compare_32(fn1, fn2, fn3):
    mn = md5.new
    st = time.time()
    f1, f2, f3 = [open(i, 'rb') for i in (fn1, fn2, fn3)]
    b = 2**20 #tune this as necessary
    p = -1
    good = 1
    total = 1.5*2**30
    digs = []
    m = mn()
    while f1.tell() > p:
        p = f1.tell()
        digs.append(m.update(f1.read(b)).digest())
        if not ((p>>20)&7) and p:
            a = 100.0*p/total/3
            d = time.time()-st
            print "%8.1f%% %8.0f\t\r" % (a, 100*d/a),
    m = mn()
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()
            good = 0
            break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close() #I prefer to explicitly close my file handles
    f2.close()
    f3.close()
That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast. It does use ~20% processor to
compute the md5s, which only makes a difference if your processor is as
slow as a 400 MHz P4.

I'd say that Python wins here. Is that concrete enough for you?
- Josiah

Jul 18 '05 #6
Josiah Carlson wrote:
    for dig in digs:
        if dig != m.update(f2.read(b)).digest():
            print "files 1 and 2 differ before", f2.tell()
            good = 0
            break
    m = mn()
    for dig in digs:
        if dig != m.update(f3.read(b)).digest():
            print "files 1 and 2 differ before", f3.tell()

Oops, copy/paste bug in that last print. It should be:

            print "files 1 and 3 differ before", f3.tell()

- Josiah

Jul 18 '05 #7

Josiah Carlson <jc******@uci.edu> wrote:

Obviously this was running old code as the following lines don't work...
    digs.append(m.update(f1.read(b)).digest())
    if dig != m.update(f2.read(b)).digest():
    if dig != m.update(f3.read(b)).digest():


Those got split up into:
    m.update(f1.read(b))
    digs.append(m.digest())

    m.update(f2.read(b))
    if dig != m.digest():

    m.update(f3.read(b))
    if dig != m.digest():

respectively. Gah, I don't like mornings.
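Folding those corrections back in, the digest-based routine might look
like this (a sketch under the same assumptions, not the code as actually
benchmarked):

import md5

def compare_3_md5(fn1, fn2, fn3, b=2**20):
    # Cumulative per-block digests of file 1, then replay them against
    # files 2 and 3; update() returns None, so digest() is a separate call.
    f1, f2, f3 = [open(fn, 'rb') for fn in (fn1, fn2, fn3)]
    digs = []
    m = md5.new()
    while 1:
        block = f1.read(b)
        if not block:
            break
        m.update(block)
        digs.append(m.digest())
    good = 1
    for f, name in ((f2, "2"), (f3, "3")):
        m = md5.new()
        for dig in digs:
            m.update(f.read(b))
            if dig != m.digest():
                print "files 1 and %s differ before" % name, f.tell()
                good = 0
                break
    if good and f1.read(1) == f2.read(1) == f3.read(1) == '':
        print "files are identical"
    f1.close(); f2.close(); f3.close()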

- Josiah

Jul 18 '05 #8
Josiah Carlson wrote:
measure, if shy by around 3 orders of magnitude in terms of time.

That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast.


My point was never to say that it is not possible to write a better way,
nor to imply that you could not do it; I simply said that there is no
easy way around this problem.

Your solution, while short and nice, is not simple, and requires quite a
bit of knowledge to understand why and how it works.

On another topic... sometimes I have a hard time visualizing how much a
terabyte of data actually is, and this is a good example of it: even this
optimized algorithm would take over three days to perform a simple
identity check...

Istvan.

Jul 18 '05 #9

Istvan Albert <ia*****@mailbl ocks.com> wrote:

Josiah Carlson wrote:
measure, if shy by around 3 orders of magnitude in terms of time.

That new one runs in 5 minutes 15 seconds total, because it exploits the
fact that sequential reads are fast.


My point was never to say that it is not possible to write a better way,
nor to imply that you could not do it; I simply said that there is no
easy way around this problem.


Funny, I thought the implementation I offered was an easy way. Nothing is
easier than using someone else's code, and it took all of 5 minutes to
write.

Your solution, while short and nice, is not simple, and requires quite a
bit of knowledge to understand why and how it works.
Anyone who has difficulty understanding the algorithm I offered that
used md5 should seriously consider switching hobbies/jobs. The algorithm
is trivial to understand.

On another topic... sometimes I have a hard time visualizing how much a
terabyte of data actually is, and this is a good example of it: even this
optimized algorithm would take over three days to perform a simple
identity check...


It would only take that long if you could only get 15 megs/second from
your drives (it was the main boot drive and the files were fragmented).
At multi-tb data sizes, RAID arrays are common, so read speed is
generally much higher, which would then reduce the running time
significantly.

In the case of multiple drives in a non-RAID array, splitting up the
process could scale linearly, up to the limit of the drive and processor
speeds.
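A sketch of that splitting idea, with one thread per copy and one copy
per physical drive so the reads do not compete for the same spindle
(paths are hypothetical; file reads release the GIL, so the IO genuinely
overlaps):

import md5, threading

def hash_file(path, results, blocksize=2**20):
    # Each worker digests one copy in fixed-size chunks and records
    # the result under its own key, so no locking is needed.
    m = md5.new()
    f = open(path, 'rb')
    while 1:
        block = f.read(blocksize)
        if not block:
            break
        m.update(block)
    f.close()
    results[path] = m.hexdigest()

paths = ['/mnt/disk1/data.img', '/mnt/disk2/data.img', '/mnt/disk3/data.img']
results = {}
workers = [threading.Thread(target=hash_file, args=(p, results)) for p in paths]
for w in workers:
    w.start()
for w in workers:
    w.join()
print "all identical:", len(set(results.values())) == 1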
- Josiah

Jul 18 '05 #10
