Bytes IT Community

calculate a unique id for a file

Hi group,

I need a way of calculating a unique ID for a file.

I've seen things like CRC32, CRC64, checksums... there's a list here:
http://en.wikipedia.org/wiki/List_of_hash_functions

What is the best option for me? I have to identify large files and
small files. I need something fast if possible. How can I be sure that
the ID is unique?

Thanks for any advice
Oct 17 '08 #1
6 Replies


ti*********@gmail.com wrote:
> [original question snipped]
I don't think you can guarantee that a hash code identifies a file
uniquely. Say you calculate a one-byte hash. It can hold 256 distinct
values, so by the time you hash file #257 you *will* have produced a
duplicate hash code (the pigeonhole principle). A 32-bit hash pushes
that limit into the billions, and larger sizes reach astronomical
numbers, but you still cannot guarantee that there will be no
duplicates.
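The pigeonhole argument can be demonstrated directly; a minimal sketch in Python (used here purely for illustration), squeezing 257 distinct inputs through a one-byte hash:

```python
import hashlib

# 257 distinct inputs, but only 256 possible one-byte hash values:
# the pigeonhole principle guarantees at least one collision.
seen = {}
collision = None
for i in range(257):
    data = f"file-{i}".encode()
    digest = hashlib.sha256(data).digest()[0]  # keep only the first byte
    if digest in seen:
        collision = (seen[digest], i)
        break
    seen[digest] = i

print(f"inputs {collision[0]} and {collision[1]} share the same 1-byte hash")
```

The same logic applies at any hash width; a wider hash only pushes the guaranteed-collision point further out, it never removes it.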

As for speed, large files will require more time, as every byte has to
be read and processed.

Hans Kesting
Oct 17 '08 #2

Based only on the data, or on the filename too?

If you have the same filename, same size, and same hash code, it is
likely the same file. However, you don't say whether or not the
filename is important. If it isn't a factor, then I would probably make
the ID based on

FileSize/Hash1/Hash2

Where Hash1 and Hash2 are the results of two different hash algorithms.
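For illustration, that scheme could be sketched in Python, with CRC32 and MD5 as arbitrary stand-ins for Hash1 and Hash2 (any pair of independent algorithms would do):

```python
import hashlib
import os
import zlib

def file_id(path):
    """Return an ID of the form FileSize/Hash1/Hash2."""
    size = os.path.getsize(path)
    crc = 0
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            crc = zlib.crc32(chunk, crc)
            md5.update(chunk)
    return f"{size}/{crc:08x}/{md5.hexdigest()}"
```

Combining the size with two different hashes makes an accidental collision far less likely than any single hash alone, though still not impossible.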
Pete
====
http://mrpmorris.blogspot.com
http://www.capableobjects.com

Oct 17 '08 #3

On Fri, 17 Oct 2008 05:23:05 -0700 (PDT), ti*********@gmail.com wrote:
> [original question snipped]
No hash function can guarantee uniqueness; a CRC32 has a collision
probability of somewhere between 1 and 1 in 2^32. The probability of 1
covers the cases where you should have used a cryptographically secure
hash function, i.e. where someone is deliberately trying to break your
system or data-protection law requires a reasonable level of security.

For non-cryptographic use go for a CRC with sufficient size to reduce
the collision probability to an acceptably low level.

For cryptographic purposes use SHA-256 or SHA-512. The MD series is
broken, and SHA-384 is essentially a SHA-512 computation with the
result truncated, so you might as well go for SHA-512.
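For the cryptographic case, a streaming sketch in Python (for illustration) that handles both small and huge files without loading them whole:

```python
import hashlib

def sha256_of_file(path, chunk_size=64 * 1024):
    """Stream the file through SHA-256 so even huge files use little memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```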

rossum

Oct 17 '08 #4

On 17 oct, 19:18, rossum <rossu...@coldmail.com> wrote:
> [rossum's reply snipped]
Thanks for your answer,
I think I don't need SHA.

In fact, I need to know whether a file has been modified between two
accesses. So if I use a CRC64, it should be enough to tell that the
file has been modified, shouldn't it?

Do you think this class is good? http://damieng.com/blog/2007/11/19/c...4-in-c-and-net
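The change-detection idea could be sketched like this in Python (for illustration only; the standard library has no CRC64, so CRC32 stands in, and the logic is identical either way):

```python
import zlib

def crc_of_file(path):
    """CRC of the whole file, computed in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

# First access: remember the CRC ("data.bin" is just a placeholder name).
# old_crc = crc_of_file("data.bin")
# Later access: a different CRC proves the file changed;
# an equal CRC only means "probably unchanged".
# changed_for_sure = crc_of_file("data.bin") != old_crc
```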

Best regards
Oct 17 '08 #5

On Fri, 17 Oct 2008 13:39:16 -0700, <ti*********@gmail.com> wrote:
> [...]
> In fact, I need to know whether a file has been modified between two
> accesses. So if I use a CRC64, it should be enough to tell that the
> file has been modified, shouldn't it?
That depends on what your criterion is. If the CRC is different, then
yes: you know for sure that the file has been modified. But if the CRC
is the same, there is a small but non-zero probability that the file
has changed in a way that maps to the same CRC.

If you need to know with 100% certainty whether the file has changed,
then you have to keep a copy of it and compare byte by byte.

For many applications, a CRC plus a feature to allow the user to force
notification of a change is sufficient. Though, for that matter, for many
applications simply using the "Modified" timestamp provided by the OS is
sufficient. It really depends on your reliability requirements.
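The timestamp approach is the cheapest of all, since nothing is read from the file. A minimal sketch in Python (for illustration):

```python
import os

def modified_since(path, last_mtime):
    """Compare the OS 'Modified' timestamp against a remembered value.
    Very fast, since the file contents are never read, but a tool that
    rewrites a file and then restores its timestamp will fool it."""
    return os.path.getmtime(path) != last_mtime
```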

Pete
Oct 17 '08 #6

ti*********@gmail.com wrote:
> [original question snipped]
Any CRC or hash should do.

A hash will have a lower probability of collisions than a CRC.

If you need to guarantee no collisions, then you cannot use any of
them.

If you can live with a small probability of collisions, then anything
is possible.

The important question is then whether you need to worry about files
with identical starts but different ends. If not, you could speed up
the process a lot by checksumming only the first 100 KB or so of the
data.
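A sketch of that shortcut in Python (for illustration): hash only the first 100 KB, trading certainty for speed, since files that differ only after the prefix will share an ID.

```python
import hashlib

def quick_id(path, prefix=100 * 1024):
    """Hash only the first ~100 KB of the file.
    Fast for large files, but any two files with identical
    starts and different ends will get the same ID."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        h.update(f.read(prefix))
    return h.hexdigest()
```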

Arne
Oct 18 '08 #7

This discussion thread is closed; replies have been disabled.