I am implementing a duplicate file finder that compares hashes (MD5 or SHA1)
of files. I know how to compute the hash, but my problem is that the files
are big. Really big (~3-4 GB each). That is a lot of reading from the HDD,
even if I am only matching 20 files.
My solution is to hash only a few MB of each file (e.g. the first 10 MB). At
least that cuts down the reading and the time a lot. It does the job, but it
is not a sure-shot way because it is not really the true hash of the file,
so I am not happy with this solution.
Can anyone think of a better way?
Thank you,
--
Po
The only way to be really sure is to hash the whole file. You could probably
be fairly sure by checking the file size, the modified date/time, and the
first (and maybe last) x MB, as you said. If the files are "yours", you could
store a hash in the file when it is created or modified, then read just that
stored hash to use as your comparer.
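A minimal sketch of that staged check, in Python for illustration only (the thread itself names no language). It assumes the cheap tests are used purely to rule files out: files are grouped by size, then by a hash of the first and last few MB, and only the survivors get a full hash. The function names, the 1 MiB read size and the 10 MiB default edge size are all hypothetical choices. The modified date/time is left out here, since copies of the same content can carry different timestamps.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1024 * 1024  # 1 MiB read size; an assumed value, tune as needed


def edge_hash(path, size, edge_bytes):
    """Hash the first and last edge_bytes of a file (the whole file if it is small)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(edge_bytes))
        if size > 2 * edge_bytes:
            f.seek(size - edge_bytes)
            h.update(f.read(edge_bytes))
        else:
            h.update(f.read())  # small file: hash whatever is left
    return h.hexdigest()


def full_hash(path):
    """Hash the whole file, reading it in CHUNK-sized blocks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            h.update(block)
    return h.hexdigest()


def _group(paths, key):
    """Group paths by key(path) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths, edge_bytes=10 * 1024 * 1024):
    """Return lists of paths whose full hashes match, checking cheap properties first."""
    result = []
    for same_size in _group(paths, os.path.getsize):
        size = os.path.getsize(same_size[0])
        for same_edges in _group(same_size, lambda p: edge_hash(p, size, edge_bytes)):
            # Only files that agree on size and on both edges are read in full.
            result.extend(_group(same_edges, full_hash))
    return result
```

With this ordering, a file whose size is unique is never opened at all, and a multi-GB file is read in full only when another file matches it on size and on both edges.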
--
William Stacey [MVP]
Thanks.
You might try overlapped I/O and I/O completion on the large files. That
would allow multiple threads to read different sections of the file and
generate the hashes for each section asynchronously. I realize that each read
is still happening on the same HDD, so the gain depends on how much overhead
the hashing adds, but you could pick up some speed there.
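A rough sketch of the section-by-section idea, in Python for illustration only. It uses an ordinary thread pool instead of Win32 overlapped I/O or completion ports, so it shows the structure, not that API. The names `hash_section` and `sectioned_hash`, the 64 MiB section size and the worker count are all assumptions. Note that the result is a hash of the per-section hashes, not the standard SHA-1 of the file, so it only compares equal between files hashed with the same section size.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SECTION = 64 * 1024 * 1024  # assumed section size handed to each worker
BLOCK = 1024 * 1024         # assumed read size within a section


def hash_section(path, offset, length):
    """Return the SHA-1 digest of up to `length` bytes starting at `offset`."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            data = f.read(min(BLOCK, remaining))
            if not data:  # reached end of file inside the last section
                break
            h.update(data)
            remaining -= len(data)
    return h.digest()


def sectioned_hash(path, section=SECTION, workers=4):
    """Hash each section on a worker thread, then hash the ordered section digests."""
    size = os.path.getsize(path)
    offsets = range(0, size, section)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of `offsets`, whichever finishes first.
        digests = list(pool.map(lambda off: hash_section(path, off, section), offsets))
    combined = hashlib.sha1()
    for d in digests:
        combined.update(d)
    return combined.hexdigest()
```

Whether this is faster depends on the hardware: on a single spinning disk, concurrent reads at different offsets may cost extra seeks, so it is worth measuring against a plain sequential read before adopting it.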
jim