Bytes | Software Development & Data Engineering Community

hard disk activity

I have a task that involves knowing when a file has changed. But while
for small files this is an easy enough task, checking the modification
dates, or doing a compare on the contents, I need to be able to do this
for very large files.

Is there anything already available in Python that will allow me to
check the hard-disk itself, or that can make my routines aware when a
disk write has occurred?

Thanks for any help,

V

Feb 13 '06 #1
VSmirk:
I have a task that involves knowing when a file has changed. But while
for small files this is an easy enough task, checking the modification
dates,


Checking the modification time works the same way for large files. Why is
that not good enough?
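For what it's worth, a minimal cross-platform mtime check is only a few lines (a sketch; the tracking scheme and path are hypothetical):

```python
import os

def has_changed(path, last_mtime):
    """Return (changed, new_mtime) by comparing modification times.

    os.path.getmtime costs the same regardless of file size,
    so this works identically for small and very large files.
    """
    mtime = os.path.getmtime(path)
    return mtime != last_mtime, mtime
```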

What's your platform?

--
René Pijlman
Feb 13 '06 #2
I'm working primarily on Windows XP, but my solution needs to be cross
platform.

The problem is that I need more than the fact that a file has been
modified. I need to know what has been modified in that file.

I need to synchronize the file to a remote folder, and my current solution, which simply copies the whole file whenever a date comparison or a content comparison shows a difference, becomes a bit unmanageable for very large files. Some of the files I'm working with are hundreds of MB in size, or larger.

So I need to skip copying a hundred MB file that has had only a few
bytes changed and instead identify which few bytes have changed and
where those changes are. I was thinking having a module that worked
below the file system level, at the device level, might be a place to
look for a solution.

Feb 13 '06 #3
"VSmirk" <va*********@gmail.com> writes:
I need to synchronize the file to a remote folder, and my current solution, which simply copies the whole file whenever a date comparison or a content comparison shows a difference, becomes a bit unmanageable for very large files. Some of the files I'm working with are hundreds of MB in size, or larger.


Why don't you look at the rsync program:

http://samba.anu.edu.au/rsync/

but for that much data, just plopping it all in a huge file is not
a great approach if you can help it. Maybe you can use a database instead.
Feb 13 '06 #4
I agree with you wholeheartedly, but the large files are part of the business requirements.

Thanks for the link. I'll look into it.

V

Feb 13 '06 #5

VSmirk wrote:
I'm working primarily on Windows XP, but my solution needs to be cross
platform.

The problem is that I need more than the fact that a file has been
modified. I need to know what has been modified in that file.

I need to synchronize the file to a remote folder, and my current solution, which simply copies the whole file whenever a date comparison or a content comparison shows a difference, becomes a bit unmanageable for very large files. Some of the files I'm working with are hundreds of MB in size, or larger.

So I need to skip copying a hundred MB file that has had only a few
bytes changed and instead identify which few bytes have changed and
where those changes are. I was thinking having a module that worked
below the file system level, at the device level, might be a place to
look for a solution.


Sounds like diffing the files is the crux of it. Look at sequence-matching libs (I don't know if they'll handle strings this big):

http://docs.python.org/lib/module-difflib.html

for watching files' last-mod flags:
http://www.amk.ca/python/simple/dirwatch.html
http://aspn.activestate.com/ASPN/Coo.../Recipe/215418

http://python-fam.sourceforge.net/

http://pyinotify.sourceforge.net/

(there are a few recipes in the online cookbook, in fact)
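The recipes linked above mostly boil down to a polling loop over the files' modification times; a minimal sketch (the directory path and poll interval are arbitrary):

```python
import os
import time

def watch(path, interval=1.0):
    """Yield (filename, mtime) each time a file in `path` appears or changes."""
    seen = {}
    while True:
        for name in os.listdir(path):
            full = os.path.join(path, name)
            try:
                mtime = os.path.getmtime(full)
            except OSError:
                continue  # file disappeared between listdir() and stat()
            if seen.get(name) != mtime:
                seen[name] = mtime
                yield name, mtime
        time.sleep(interval)
```

FAM and inotify (linked above) push events to you instead of polling, but they are Unix-only, which matters for a cross-platform requirement.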

Feb 13 '06 #6
Paul Rubin wrote:
"VSmirk" <va*********@gmail.com> writes:
I am needing to synchronize the file on a remote folder, and my current
solution, which simply copies the file if a date comparison or a
content comparison, becomes a bit unmanageable for very large files.
Some of the files I'm working with are hundreds of MB in size, or
larger.


Why don't you look at the rsync program:

http://samba.anu.edu.au/rsync/

but for that much data, just plopping it all in a huge file is not
a great approach if you can help it. Maybe you can use a database instead.


Perhaps a CVS developer could also give some insight; you could check Subversion's mailing list (see their website for more info: http://subversion.tigris.org/).

--
mph
Feb 13 '06 #7
Pretty much, yeah. Except I need to diff a pair of files that exist on opposite ends of a network, without causing the entire contents of the file to be transferred over that network.

Now, I have the option of doing this: if I am able to determine that (for instance) bytes 10468 to 1473 in an 849308-byte file are the only segment that has changed, I can send that range over the network and insert it into the right place; and then, with downtime overnight, I can do a file-copy synchronization to ensure there were no errors during the day. (I'm reading this and wondering if it even makes sense, sorry if it doesn't.)
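Splicing a received byte range into the local copy needs no device-level access; seek() on a file opened read-write is enough (a sketch; the path, offset, and data are hypothetical, and this overwrites in place rather than inserting):

```python
def apply_patch(path, offset, data):
    """Overwrite len(data) bytes at `offset` in the file, in place.

    Opening with "r+b" preserves the rest of the file; only the
    patched range is touched.
    """
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
```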

But the trick in my mind is figuring out which specific bytes have been
written to disk. That's why I was thinking device level. Am I going
to have to work in C++ or Assembler for something like this?

Sorry if this sounds like a newbie question. I've been working with Python long enough to know that someone out there has already solved one or another really obscure problem. So I thought I'd take a stab at it.

Thanks everyone for the great links.

V

Feb 13 '06 #8
"VSmirk" <va*********@gmail.com> writes:
But the trick in my mind is figuring out which specific bytes have been
written to disk. That's why I was thinking device level. Am I going
to have to work in C++ or Assembler for something like this?


No, you can do it in Python. The basic idea is: locally compute a
separate checksum for (say) each 1% chunk of the file. Do the same
thing on the remote side. So for a 1GB file, you compute 100
checksums at each end, each checksum covering 10 MB. Then send the
100 checksums over the network, which is just a few kbytes. Compare
the checksums and you know which 10MB chunks have changed. For the
chunks that have changed, divide them into 100-kbyte sub-chunks and
checksum those, etc. The optimal number of chunks at each level
depends on network speed and various other things. Anyway this is
basically how rsync works.
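A sketch of that chunk-checksum idea in Python (the chunk size and the choice of MD5 are arbitrary; any strong-enough digest works):

```python
import hashlib

def chunk_checksums(path, chunk_size=10 * 1024 * 1024):
    """Return one digest per fixed-size chunk of the file."""
    sums = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sums.append(hashlib.md5(chunk).hexdigest())
    return sums

def changed_chunks(local_sums, remote_sums):
    """Indices of chunks that differ (or exist on only one side)."""
    n = max(len(local_sums), len(remote_sums))
    return [i for i in range(n)
            if i >= len(local_sums) or i >= len(remote_sums)
            or local_sums[i] != remote_sums[i]]
```

Only the digest lists cross the network: for a 1 GB file with 10 MB chunks, that's 100 hex digests, a few kilobytes.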

Doing anything device level will be highly OS dependent.
Feb 13 '06 #9
Awesome!!! I got as far as segmenting the large file on my own, and then I ran out of ideas. I kind of thought about checksums, but I never put the two together.

Thanks. You've helped a lot....

V

Feb 13 '06 #10
"VSmirk" <va*********@gmail.com> writes:
Awesome!!! I got as far as segmenting the large file on my own, and then I ran out of ideas. I kind of thought about checksums, but I never put the two together.

Thanks. You've helped a lot....


The checksum method I described works OK if bytes change in the middle of the file but don't get inserted (pieces of the file don't move around). If you insert one byte in the middle of a 1GB file (so it becomes 1GB + 1 byte), then all the checksums after the middle block change, which is no good for your purpose.

Rsync is a very clever program. Rather than re-implement its
algorithm maybe you should just install it and use it, either directly
(instead of writing a Python program) or under control of a Python
program, using os.system or the subprocess module.
Feb 13 '06 #11
Thanks for the heads-up. I was so giddy with the simplicity of the solution, I stopped trying to poke holes in it.

I agree with your philosophy of not "reinventing the wheel", but I did notice two things. First, the link you provided claims in the features section that rsync is for *nix systems, so I am assuming I'll need a port of it for Windows systems. Second, looking at a Python rsync module I found, it looks like it's just doing a file copy (which I have already solved).

So I'm wondering if you know off-hand which windows port does this
checksum validation you outlined.

Feb 13 '06 #12
Maybe an example will help.

file A

abef | 1938 | 4bac | 0def | 8675

file B

abef | 0083 | abfd | 3356 | 2465

File A is different from file B and you want to have File A look like
File B. So do the segmentation (I have chosen ' | ' as the divide
between segments).

After that, do checksums on each segment. For each pair of segment checksums that differs, there's a discrepancy between the two segments, so make the changes to have one segment look like the other.

In this example the first segment's checksum would be the same, whereas the checksums for segments 2, 3, 4, and 5 will be different. So modify the bits and bytes accordingly.

You may want to pursue this subject further by looking into various
error correction algorithms.
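The walked-through example might look like this in code (a sketch assuming equal-length files, fixed four-byte segments, and zlib.crc32 standing in for whichever checksum you choose):

```python
import zlib

SEG = 4  # segment size, matching the four-character segments above

def segments(data):
    """Split bytes into fixed-size segments."""
    return [data[i:i + SEG] for i in range(0, len(data), SEG)]

def patch(a, b):
    """Make `a` look like `b`, copying only segments whose checksums differ.

    Assumes the two inputs are the same length (no insertions/deletions).
    """
    out = []
    for seg_a, seg_b in zip(segments(a), segments(b)):
        if zlib.crc32(seg_a) == zlib.crc32(seg_b):
            out.append(seg_a)  # checksums match: keep the local segment
        else:
            out.append(seg_b)  # checksums differ: take the remote segment
    return b"".join(out)
```

In a real synchronizer only the differing segments (here, segments 2-5) would travel over the network, not the whole of file B.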

VSmirk wrote:
Awesome!!! I got as far as segmenting the large file on my own, and then I ran out of ideas. I kind of thought about checksums, but I never put the two together.

Thanks. You've helped a lot....

V


Feb 13 '06 #13
"VSmirk" <va*********@gmail.com> writes:
So I'm wondering if you know off-hand which windows port does this
checksum validation you outlined.


I think rsync has been ported to Windows but I don't know any details.
I don't use Windows.
Feb 13 '06 #14
> So I'm wondering if you know off-hand which windows port does this
checksum validation you outlined.


http://www.gaztronics.net/rsync.php is one source.

Just do a Google search for "windows rsync".
Feb 14 '06 #15
Of course that was the first thing I tried.

But what I meant to say was that at least one port, the Python one, didn't have the checksum validation that Paul was talking about, so I was wondering if he knew of one that was faithful to the Unix original.

Thanks much for the links, though, and all the help.

Feb 14 '06 #17
On 13 Feb 2006 13:13:51 -0800
Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:
"VSmirk" <va*********@gmail.com> writes:
Awesome!!! I got as far as segmenting the large file on
my own, and then I ran out of ideas. I kind of thought
about checksums, but I never put the two together.

Thanks. You've helped a lot....


The checksum method I described works OK if bytes change
in the middle of the file but don't get inserted (pieces
of the file don't move around). If you insert one byte in
the middle of a 1GB file (so it becomes 1GB + 1 byte),
then all the checksums after the middle block change,
which is no good for your purpose.


But of course, the OS will (I hope) give you the exact
length of the file, so you *could* assume that the beginning
and end are the same, then work towards the middle.
Somewhere in between, when you hit the insertion point, both
will disagree, and you've found it. Same for deletion.

Of course, if *many* changes have been made to the file,
then this will break down. But then, if that's the case,
you're going to have to do an expensive transfer anyway, so
expensive analysis is justified.

In fact, you could proceed by analyzing the top and bottom
checksum lists at the point of failure -- download that
frame, do a byte by byte compare and see if you can derive
the frameshift. Then compensate, and go back to checksums
until they fail again. Actually, that will work just coming
from the beginning, too.

If instead the region continues to be unrecognizable to
the end of the frame, then you need the next frame anyway.

Seems like it could get pretty close to optimal (but we
probably are re-inventing rsync).
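The "frameshift" recovery described above is exactly what rsync's weak rolling checksum makes cheap: after checksumming one window, sliding it a single byte costs O(1) instead of a rescan. A sketch modeled on rsync's two-component weak checksum (the 16-bit modulus and packing are conventions, not requirements):

```python
MOD = 1 << 16

def weak_checksum(data):
    """rsync-style weak checksum of a window, packed as (b << 16) | a."""
    a = sum(data) % MOD
    b = sum((len(data) - i) * byte for i, byte in enumerate(data)) % MOD
    return (b << 16) | a

def roll(checksum, old_byte, new_byte, window_len):
    """Slide the window one byte to the right in O(1)."""
    a = checksum & 0xFFFF
    b = checksum >> 16
    a = (a - old_byte + new_byte) % MOD           # drop old byte, add new
    b = (b - window_len * old_byte + a) % MOD     # update weighted sum
    return (b << 16) | a
```

Scanning every byte offset this way is how rsync finds matching blocks again after an insertion or deletion shifts the rest of the file.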

Cheers,
Terry

--
Terry Hancock (ha*****@AnansiSpaceworks.com)
Anansi Spaceworks http://www.AnansiSpaceworks.com

Feb 14 '06 #18
Terry,

Yeah, I was sketching out a scenario much like that. It breaks things down pretty well, and that gets my file-sync scenario up to much larger files. Even if many changes are made to a file, if you keep track of the number of bytes and re-checksum by shifting the window one byte at a time (that is, [abcd]ef, a[bcde]f, ab[cdef]) until a checksum matches again, you should be able to find some point where the checksums resynchronize and continue up (or down) doing only the checksums without all the overhead.

The question in my mind that I will have to test is how much overhead
this causes.

One of the business rules underlying this task is to work with files that are being continuously written to, say by logging systems or database servers. This brings with it some obvious problems of file access, but even in cases where there are no file-access issues, I am very concerned about race conditions where one of the already-handled blocks of data is written to. The synced copy on the remote system then no longer represents a true image of the local file.

This is one of the reasons I was looking into a device-level solution that would let me know when a hard-disk write had occurred. One colleague suggested I was going to have to write assembler to do this, and I may ultimately just have to use the solutions described here for files that don't have locking and race-condition issues.

Regardless, it's a fun project, and I have to say this list is one of
the more polite lists I've been involved with. Thanks!

V

Feb 14 '06 #19
