
Determining when a file has finished copying

Hi all,

I'm writing some code that monitors a directory for the appearance of
files from a workflow. When those files appear I write a command file
to a device that tells the device how to process the file. The
appearance of the command file triggers the device to grab the
original file. My problem is I don't want to write the command file to
the device until the original file from the workflow has been copied
completely. Since these files are large, my program has a good chance
of scanning the directory while they are mid-copy, so I need to
determine which files are finished being copied and which are still
mid-copy.

I haven't seen anything on Google talking about this, and I don't see
an obvious way of doing this using the os.stat() method on the
filepath. Anyone have any ideas about how I might accomplish this?

Thanks in advance!
Doug
Jul 9 '08 #1
13 Replies


The best way to do this is to have the program that copies the files copy them
to a temporarily named file and rename it when it is completed. That way you
know when it is done by scanning for files with a specific mask.
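
A minimal sketch of that temporary-name-then-rename idea, assuming the writer can be changed. The ".part" suffix, the "*.pdf" mask and both function names are made up for illustration:

```python
import fnmatch
import os


def copy_then_rename(data, final_path, temp_suffix=".part"):
    """Writer side: write under a temporary name, then rename into place."""
    temp_path = final_path + temp_suffix  # hypothetical naming scheme
    with open(temp_path, "wb") as out:
        out.write(data)
    # The rename is atomic when both names are on the same filesystem.
    os.replace(temp_path, final_path)


def finished_files(directory, mask="*.pdf"):
    """Watcher side: list only the names that match the final mask."""
    return sorted(
        name for name in os.listdir(directory)
        if fnmatch.fnmatch(name, mask)
    )
```

The watcher never sees a half-written file under its final name, because that name only appears once the rename has happened.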

If that is not possible you might be able to use pyinotify
(http://pyinotify.sourceforge.net/) to watch for IN_CLOSE_WRITE events on the
directory and then process the files.

-Larry

Jul 9 '08 #2

This seems to be a synchronization problem. A scenario description could clear
things up so we can help:

Program W (the workflow) copies file F to directory B
Program D (the dog) polls directory B to find out if there's any new file F

In this scenario, program D does not know whether F has been fully
copied, but W does.

Solution:
Create a custom lock mechanism. Program W writes a file B/F.lock to
indicate that file F is not complete, and removes it when F is fully copied.
If program W crashes mid-copy, both F and F.lock are kept, so program D
does not bother to process F. Recovery from a crash in W would be another
issue to tackle.
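
A rough sketch of the lock-file scheme, assuming W can be modified. The ".lock" suffix follows the description above; the function names are invented for illustration:

```python
import os


def write_with_lock(data, path):
    """Writer side (program W): create F.lock, write F, then remove F.lock."""
    lock = path + ".lock"
    open(lock, "w").close()  # the lock must exist before F appears
    with open(path, "wb") as out:
        out.write(data)
    os.remove(lock)  # only reached if the write did not raise


def ready_files(directory):
    """Watcher side (program D): names in B that have no matching .lock file."""
    names = set(os.listdir(directory))
    return sorted(
        name for name in names
        if not name.endswith(".lock") and name + ".lock" not in names
    )
```

If W dies mid-copy the lock stays behind, so D keeps skipping F until someone cleans up.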

Best regards,
Manuel.

Jul 9 '08 #3

Another option:
pgm-W copies (or creates and fills) the data into B/dummy
when done, pgm-W renames B/dummy to B/F
pgm-D only looks for B/F and does its thing when found

Steve
no******@hughes.net
Jul 9 '08 #4

Guys,

Thanks for your replies, they are helpful. I should have included in
my initial question that I don't have as much control over the program
that writes (pgm-W) as I'd like. Otherwise, the write to a different
filename and then rename solution would work great. Is there no way to
tell from os.stat() when the file is finished
being copied? I ran some test programs, one of which continuously
copies big files from one directory to another, and another that
continuously does a glob.glob("*.pdf") on those files and looks at the
st_atime and st_mtime parts of the return value of os.stat(filename).
From that experiment it looks like st_atime and st_mtime equal each
other until the file has finished being copied. Nothing in the
documentation about st_atime or st_mtime leads me to think this is
true; it's just my observation from the two test programs I've
described.
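
For anyone who wants to repeat the experiment, the watching side could look roughly like this. The function name is made up, and the atime/mtime pattern is only an observation, not documented behaviour; whether st_atime is updated at all depends on how the filesystem is mounted (noatime, relatime and so on):

```python
import glob
import os


def stat_times(pattern="*.pdf"):
    """Return {filename: (st_atime, st_mtime)} for every file matching pattern."""
    times = {}
    for name in glob.glob(pattern):
        st = os.stat(name)
        times[name] = (st.st_atime, st.st_mtime)
    return times
```

Calling it in a loop while a copy is running shows how the two timestamps move relative to each other on a given system.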

Any thoughts? Thanks!
Doug
Jul 9 '08 #5

I guess the problem is "What is the definition of 'finished copying'?". There
is no explicit operating system command that says "I'm done copying to this file
and I won't add anything on to the end of it".

If I could not control the sending application, I would make an estimation of
how long the longest file could possibly take to copy, double it and then only
look at files where the st_ctime was at least that far in the past. What you
suggest could work as well.
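
Something along these lines, as a sketch only. The function name and the `now` parameter are invented for illustration, and the age threshold is whatever estimate you come up with. Note that st_ctime is the last metadata change on Unix but the creation time on Windows:

```python
import os
import time


def old_enough(directory, min_age_seconds, now=None):
    """List files whose st_ctime is at least min_age_seconds in the past."""
    if now is None:
        now = time.time()
    result = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        if now - os.stat(path).st_ctime >= min_age_seconds:
            result.append(name)
    return result
```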

-Larry
Jul 9 '08 #6

Could you maybe use the os module to call out to lsof to see if anyone
still has the target file open? I am assuming that when the write process
finishes writing it would close the file.
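
A rough sketch using subprocess rather than os. It assumes lsof is installed and that "lsof -t FILE" prints the PIDs of processes holding the file open; the function name and the `lsof_cmd` parameter are made up. lsof only sees processes on the local machine, and without root it may not show other users' processes:

```python
import os
import subprocess


def is_open_somewhere(path, lsof_cmd="lsof"):
    """True/False if lsof reports the file open or not; None if lsof can't be run."""
    try:
        proc = subprocess.run(
            [lsof_cmd, "-t", os.path.abspath(path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return None  # lsof missing or not executable
    # -t prints one PID per line; no output means nobody has the file open.
    return bool(proc.stdout.strip())
```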

Check "man lsof"
Jul 9 '08 #7

The solution my team has used is to monitor the file size. If the file
has stopped growing for x amount of time (we use 45 seconds) the file is
done copying. Not elegant, but it works.
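
In code the idea is roughly the following. This is a sketch, not our production code; the function and parameter names are made up, and 45 seconds is just the value that happens to work for us:

```python
import os
import time


def wait_until_size_stable(path, quiet_seconds=45.0, poll_seconds=1.0):
    """Block until the file size has not changed for quiet_seconds; return it."""
    last_size = os.path.getsize(path)
    last_change = time.monotonic()
    while time.monotonic() - last_change < quiet_seconds:
        time.sleep(poll_seconds)
        size = os.path.getsize(path)
        if size != last_size:
            # Still growing: remember the new size and restart the clock.
            last_size = size
            last_change = time.monotonic()
    return last_size
```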
--
Ethan
Jul 9 '08 #8

Also I think that matching the md5sums may work. Just set up so that it
checks the copy's md5sum every couple of seconds (or whatever time
interval you want) and matches against the original's. When they match
copying's done. I haven't actually tried this but think it may work.
Any more experienced programmers out there let me know if this is
unworkable please.
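
Roughly what I have in mind, untested against a real workflow. It assumes the watcher can read the original file as well as the copy, and the function names are made up:

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """MD5 hex digest of a file, read in chunks so big files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches_original(original, copy):
    """True once the copy has exactly the same content as the original."""
    return md5_of(original) == md5_of(copy)
```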
K
Jul 10 '08 #9

If the files are large this could consume a lot of CPU and I/O to recalculate
the checksum over and over. I would try the "hasn't been
modified/accessed/created" in some amount of time first.

-Larry
Jul 10 '08 #10

I use a combination of both os.stat() on the file size and md5.
Checking md5s works, but it can take a long time on big files. To fix
that, I wrote a simple sparse md5 sum generator. It takes a small
number of bytes from various areas of the file, and creates an md5 by
combining all the sections. This is, in fact, the only solution I have
come up with for watching a folder for Windows copies.

The filesize solution doesn't work when a user copies into the watch
folder using drag and drop on Windows because it allocates all the
attributes of the file before any data is written. The filesize will
always show the full size of the file.
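
This is not my actual generator, just a sketch of the sparse checksum idea. The sample count, chunk size and the choice to mix the file size into the hash are all assumptions made for illustration:

```python
import hashlib
import os


def sparse_md5(path, samples=8, chunk_size=4096):
    """MD5 over a few evenly spaced chunks of the file (samples must be >= 2)."""
    size = os.path.getsize(path)
    digest = hashlib.md5(str(size).encode("ascii"))
    with open(path, "rb") as f:
        if size <= samples * chunk_size:
            digest.update(f.read())  # small file: just hash all of it
        else:
            last_start = size - chunk_size
            for i in range(samples):
                # Offsets run from 0 up to the final chunk of the file.
                f.seek(i * last_start // (samples - 1))
                digest.update(f.read(chunk_size))
    return digest.hexdigest()
```

Comparing the sparse sum of the copy against the sparse sum of the source is much cheaper than hashing everything, at the cost of missing differences that fall between the sampled chunks.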

~Sean
Jul 11 '08 #11

While a lot depends on HOW the copying program does its copy, I've recently been
able to get pyinotify to watch folders. By watching for IN_CLOSE_WRITE events I
can see when files are closed by the writer and then process them instantly
after they have been written. Now if the writer does something like:

open
write
close
open append
write
close
..
..
..

This won't work as well.

FYI,
Larry
Jul 13 '08 #12

Good info, Sean, thanks. One more option may be to attempt to rename
the file -- if it's still open for copying, that will fail (at least on
Windows, where a file being written is normally locked); success
indicates the copy is done. Of course, as Larry Bates pointed out, this
could fail if the copy is followed by a re-open and appending.
Hopefully that's not an issue for the OP.
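
A sketch of the rename test; the ".probe" suffix and the function name are made up. Keep in mind that on POSIX systems a rename succeeds even while another process has the file open, so the result only tells you something on platforms that lock open files:

```python
import os


def rename_probe(path, suffix=".probe"):
    """Try renaming the file away and back again; True if both renames succeed."""
    probe = path + suffix
    try:
        os.rename(path, probe)
    except OSError:
        return False  # missing, locked, or otherwise not renameable
    os.rename(probe, path)  # put the original name back
    return True
```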
--
Ethan
Jul 14 '08 #13

You could also copy to a different name on the same disk, and when the copying
has been finished just 'move' (mv) the file to the filename the other
application expects. E.g. QMail works this way, writing incoming mails in
folders.

Kind regards,
Wilbert Berendsen

--
http://www.wilbertberendsen.nl/
"You must be the change you wish to see in the world."
-- Mahatma Gandhi
Jul 19 '08 #14
