big file hashing

I am implementing a duplicate file finder that compares a hash (MD5 or SHA1) of
each file. I know how to compute the hash, but my problem is that the files are
big. Really big (~3-4 GB each). That is a lot of reading from the HDD, even if I
am only matching 20 files.

My current solution is to hash only a few MB of each file (e.g. the first 10 MB).
At least that cuts the reading and the time considerably. It does the job, but it
is not a sure-shot way, because it is not really the true hash of the file. So I
am not happy with this solution.

Can anyone think of a better way?

Thank you,
--
Po
Nov 17 '05 #1
The only way to be really sure is to hash the whole file. You could probably be
fairly sure by checking the file size, the modified date/time, and the first (and
maybe last) x MB, as you said. If the files are "yours", you could write a hash
into the file whenever it is created or modified, and then just read that stored
hash back from the file as your comparer.
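
A minimal Python sketch of that staged idea, under stated assumptions: files are first grouped by size, then by a SHA1 of the first and last few MB, and only the files that still match get a full hash. The helper names (`partial_digest`, `full_digest`, `group_by`, `find_duplicates`) and the 10 MB sample size are illustrative, not from this thread. The modified date/time check is left out here, because two copies of the same content can carry different timestamps.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1024 * 1024  # read 1 MiB at a time so a multi-GB file never sits in memory


def partial_digest(path, sample_bytes):
    """SHA1 of the first and last sample_bytes of a file (hypothetical helper)."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(sample_bytes))
        if size > sample_bytes:
            # Never re-read the head when the file is shorter than two samples.
            f.seek(max(size - sample_bytes, sample_bytes))
            h.update(f.read(sample_bytes))
    return h.hexdigest()


def full_digest(path):
    """SHA1 of the whole file, streamed in CHUNK-sized reads."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def group_by(paths, key):
    """Return only the groups of paths that share a key with at least one other path."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths, sample_bytes=10 * 1024 * 1024):
    """Cheap checks first; the expensive full hash runs only on remaining candidates."""
    result = []
    for same_size in group_by(paths, os.path.getsize):
        for same_sample in group_by(same_size, lambda p: partial_digest(p, sample_bytes)):
            result.extend(group_by(same_sample, full_digest))
    return result
```

With this ordering, a file whose size or sampled bytes match no other file is never read in full, so the full pass over 3-4 GB is paid only for likely duplicates.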

--
William Stacey [MVP]

Nov 17 '05 #2
Thanks.

Nov 17 '05 #3
You might try overlapped I/O with I/O completion ports on the large files. That
would allow multiple threads to read different sections of the file and generate
the hash for each section asynchronously. I realize that every read still hits
the same HDD, so it depends on how much overhead the hashing adds, but you could
pick up some speed there.
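
A rough Python sketch of the per-section idea, with a plain thread pool swapped in for Windows overlapped I/O and completion ports. The function names, the 64 MB section size, and the worker count are assumptions made for illustration. Note that the combined value is a hash of the section hashes, not the ordinary SHA1 of the file, so it is only comparable between files fingerprinted with the same section size.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 1024 * 1024  # per-read size inside a section


def hash_section(path, offset, length):
    """SHA1 digest of `length` bytes starting at `offset` (hypothetical helper)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:  # each worker uses its own file handle
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(BLOCK, remaining))
            if not data:  # reached end of file inside the last section
                break
            h.update(data)
            remaining -= len(data)
    return h.digest()


def sectioned_fingerprint(path, section_bytes=64 * 1024 * 1024, workers=4):
    """Hash fixed-size sections in parallel, then hash the section digests in order."""
    size = os.path.getsize(path)
    offsets = range(0, size, section_bytes)
    combined = hashlib.sha1()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in offset order, whatever order the workers finish in.
        for digest in pool.map(lambda off: hash_section(path, off, section_bytes), offsets):
            combined.update(digest)
    return combined.hexdigest()
```

Whether this beats a single sequential read depends on the hardware: on one spinning disk the extra seeks between sections may cancel out whatever is gained by hashing in parallel.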

jim

Nov 17 '05 #4
