On Sun, 17 Oct 2004 03:13:46 GMT, User <1@2.3> wrote:
Anyone have ideas which os command could be used to get the size of a
file without actually opening it? My intention is to write a script
that identifies duplicate files with different names. I have no
trouble getting the names of all the files in the directory using the
os.listdir() command, but that doesn't return the file size. In order
to be identical, files must be the same size, so I want to use file
size as the first criterion, then, if they are the same size, actually
open them up and compare the contents.
I have written such a script in the past, but had to resort to
something like:
os.system('dir *.* >> trash.txt')
The next step was then to open up 'trash.txt', and piece together the
information I need to compare file sizes. The problems with this
approach are that it is very platform dependent (it worked on Win 95, but
I don't know what else it will work on) and that it is subject to the 8.3
filename limitations of that environment. That is the reason I'm looking
for some other command to obtain file size before the files are ever
opened.
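For the narrow question of getting a size without opening the file, os.path.getsize() (or os.stat(path).st_size) reads it from the directory entry. Here is a minimal sketch of the "size first, contents second" check for a single pair of files. It is written for a current Python 3, and the helper name same_contents is made up for illustration:

```python
import filecmp
import os


def same_contents(path_a, path_b):
    """Return True if the two files hold identical bytes.

    Hypothetical helper, not part of any library.
    """
    # os.path.getsize() asks the filesystem for the size (via os.stat);
    # neither file is opened at this point.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Same size: only now read both files and compare the actual bytes.
    # shallow=False forces a content comparison instead of trusting stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Comparing pairs like this gets slow with many same-sized files, which is why grouping by a digest tends to scale better for a whole directory.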
This should list duplicate files in the specified directory:
You can hack to suit. Not very tested. Just what you see ;-)
------------------------------------------------
# get_dupes.py
import os, md5

def get_dupes(thedir):
    # Group the plain files in thedir by size (no file is opened here).
    finfo = {}
    for f in os.listdir(thedir):
        # os.listdir() returns bare names, so join them with thedir;
        # otherwise this only works when thedir is the current directory.
        path = os.path.join(thedir, f)
        if os.path.isfile(path):
            finfo.setdefault(os.path.getsize(path), []).append(f)
    result = []
    for size, flist in finfo.items():
        if len(flist)>1:
            # Only files of equal size can be identical, so hash just those.
            dupes = {}
            for name in flist:
                data = open(os.path.join(thedir, name), 'rb').read()
                dupes.setdefault(md5.new(data).hexdigest(),[]).append(name)
            for digest, names in dupes.items():
                if len(names)>1: result.append((size, digest, names))
    return result

if __name__ == '__main__':
    import sys
    try:
        dupes = get_dupes(sys.argv[1])
        if dupes:
            print
            print '%8s %32s %s' % ('size','md5 digest','files with the given size, digest')
            print '%8s %32s %s' % ('----','-'*32 ,'---------------------------------')
            for duped in dupes:
                print '%8s %32s %s' % duped
        else:
            print 'No duplicate files in %r' % sys.argv[1]
    except IndexError:
        # No directory argument given; other errors keep their traceback.
        raise SystemExit, 'Usage: python get_dupes.py directory'
-------------------------------------------
(I was surprised at the amount of duplicated stuff ;-)
[23:23] C:\pywk\clp>python get_dupes.py .
    size                       md5 digest files with the given size, digest
    ---- -------------------------------- ---------------------------------
       0 d41d8cd98f00b204e9800998ecf8427e ['z3', 'zero_len.py']
     111 ea70a0f814917ef8861bebc085e5e7d0 ['MyConsts.py', 'MyConsts.py~']
     163 f8e4add20e45bb253bd46963f25a7057 ['ramb.txt', 'rambxx.txt']
    4096 d96633a4b58522ce5787ef80a18e9c7b ['yyy2', 'yyy3']
     786 05956208d5185259b47362afcf1812fd ['startmore.py', 'startmore.py~']
     851 3845f161fa93cbb9119c16fc43e7b62a ['quadratic.py', 'quadratic.py~']
    1536 72f5c05b7ea8dd6059bf59f50b22df33 ['virtest.txt', '~DF30EC.tmp']
    1028 fbedc511f9556a8a1dc2ecfa3d859621 ['PaulMoore.py', 'PaulMoore.py~']
    1515 568f9732866a9de698732616ae4f9c3b ['loopbreak.py', 'loopbreak.py~']
    1662 f54414637ed420fe61b78eeba59737b7 ['for_grodrigues.py', 'for_grodrigues.r1.py']
    1702 23fa57926e7fcf2487943acb10db7e2a ['bitfield.py', 'bitfield.py~', 'packbits.py']
    3765 e69bf6b018ba305cc3e190378f93e421 ['pythonHi.gif', 'showgif.gif']
    5874 bae87bbed53c1e6908bb5c37db9c4292 ['testyenc.py', 'testyenc.py~']
    3990 4a5096efaf136f901603a2e1be850eb3 ['pns.py', 'pns.r1.py']
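In case anyone wants to run this on a current Python 3, where the md5 module no longer exists, here is a rough sketch of the same size-then-digest idea using hashlib. The helper file_digest and its chunk_size parameter are my own additions for illustration; reading in chunks just avoids pulling a large file into memory at once. Treat it as a starting point, not a tested tool:

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 16):
    """MD5 hex digest of a file, read in chunks (hypothetical helper)."""
    h = hashlib.md5()
    with open(path, 'rb') as fh:
        # iter() with a sentinel keeps calling read() until it returns b''.
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()


def get_dupes(thedir):
    """Return a sorted list of (size, digest, names) for duplicated files."""
    # Pass 1: group plain files by size without opening any of them.
    by_size = defaultdict(list)
    for name in os.listdir(thedir):
        path = os.path.join(thedir, name)
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(name)

    # Pass 2: hash only the files that share a size with another file.
    result = []
    for size, names in by_size.items():
        if len(names) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for name in names:
            by_digest[file_digest(os.path.join(thedir, name))].append(name)
        for digest, same in by_digest.items():
            if len(same) > 1:
                result.append((size, digest, sorted(same)))
    return sorted(result)
```

Calling get_dupes('.') returns the same kind of (size, digest, names) tuples as the script above, so the printing loop carries over once the print statements become print() calls. If MD5 collisions are a concern, hashlib.sha256 can be swapped in at the cost of a longer digest column.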
Regards,
Bengt Richter