hash() yields different results for different platforms

Qiangning Hong

I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?

If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique
key?

I will do some benchmarking to find it out. But while making my hands
dirty, I would like to hear some advice from experts here. :)

Jul 11 '06 #1

Subscribe Post Reply

2407

Paul Rubin

"Qiangning Hong" <ho****@gmail.comwrites:

However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?

The hash function is unspecified and can depend on anything the
implementers feel like. It may(?) even be permitted to differ from
one run of the interpreter to another (I haven't checked the spec for
this). Don't count on it being consistent from machine to machine.

If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique key?

If you're going to accept the overhead of an SQL database you might as
well enjoy the use of the abstraction it gives you, instead of trying
to implement what amounts to your own form of indexing instead of
letting the db take care of it. But md5(url) is certainly very fast
compared with processing the outgoing http connection that you
presumably plan to open for each url.

I will do some benchmarking to find it out.

That's the right way to answer questions like this.

Jul 12 '06 #2

Grant Edwards

On 2006-07-11, Qiangning Hong <ho****@gmail.comwrote:

I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).

However, when I come to Python's builtin hash() function, I
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?

Apparently. :)

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>hex(12416037344)

'0x2E40DB1E0L'

>>hex(-468864544 & 0xffffffffffffffff)

'0xFFFFFFFFE40DB1E0L'

>>hex(12416037344 & 0xffffffff)

'0xE40DB1E0L'

>>hex(-468864544 & 0xffffffff)

'0xE40DB1E0L'

--
Grant Edwards grante Yow! Uh-oh!! I forgot
at to submit to COMPULSORY
visi.com URINALYSIS!

Jul 12 '06 #3

Carl Banks

Grant Edwards wrote:

On 2006-07-11, Qiangning Hong <ho****@gmail.comwrote:

I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).

Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen every
search. Conceivably, comparing hash strings (which is O(1)) could
result in a big savings compared to comparing regular strings; but I
expect most decent sql implementations already hash data internally, so
rolling your own hash would be useless at best.

If the OP's database is lacking, md5 is probably fine. Perhaps using a
subset of the md5 (the low 32 bits, say) could speed up comparisons at
risk of more collisions. Probably a good trade off unless the DB is
humungous.
Carl Banks

Jul 12 '06 #4

Qiangning Hong

Grant Edwards wrote:

On 2006-07-11, Qiangning Hong <ho****@gmail.comwrote:
However, when I come to Python's builtin hash() function, I
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?

Apparently. :)

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>hex(12416037344)

'0x2E40DB1E0L'

>hex(-468864544 & 0xffffffffffffffff)

'0xFFFFFFFFE40DB1E0L'

>hex(12416037344 & 0xffffffff)

'0xE40DB1E0L'

>hex(-468864544 & 0xffffffff)

'0xE40DB1E0L'

Is this relationship (same low 32 bits) guaranteed? Will it change in
the future version?

Jul 12 '06 #5

Tim Peters

[Grant Edwards]

>...
The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>hex(12416037344)
'0x2E40DB1E0L'
>>hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'

>>hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'

[Qiangning Hong]

Is this relationship (same low 32 bits) guaranteed?

No. Nothing about hashes is guaranteed, except that when x and y are
of a hashable type, and x == y, then hash(x) == hash(y) too.

Will it change in the future version?

That's possible, but not planned. Note that the guts of string
hashing in CPython today is implemented via

while (--len >= 0)
x = (1000003*x) ^ *p++;

where x is C type "long", and the C language doesn't even define what
that does (behavior when signed multiplication overflows isn't defined
in C).

Jul 12 '06 #6

Fredrik Lundh

Qiangning Hong wrote:

/.../ add a "hash" column in the table, make it a unique key

at this point, you should have slapped yourself on the forehead, and gone
back to the drawing board.

</F>

Jul 12 '06 #7

Nick Vatamaniuc

Using Python's hash as column in the table might not be a good idea.
You just found out why. So you could instead just use the base url and
create an index based on that so next time just quickly get all urls
from same base address then do a linear search for a specific one, or
even easier, implement your own hashes without using any of the
Python's built-in hash() functions. For example, transform each
character to an int and multply them all mod 2^32-1 or something like
that. Even better I think someone already posted the Python's way of
generating hashes for string, well, just re-implement it in Python such
that your version will yield the same hash on any platform.

Hope this helps,
Nick V.

Qiangning Hong wrote:

I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

However, when I come to Python's builtin hash() function, I found it
produces different values in my two computers! In a pentium4,
hash('a') --468864544; in a amd64, hash('a') -12416037344. Does
hash function depend on machine's word length?

If it does, I must consider another hash algorithm because the spider
will run concurrently in several computers, some are 32-bit, some are
64-bit. Is md5 a good choice? Will it be too slow that I have no
performance gain than using the "url" column directly as the unique
key?

I will do some benchmarking to find it out. But while making my hands
dirty, I would like to hear some advice from experts here. :)

Jul 12 '06 #8

Piet van Oostrum

>>>>Grant Edwards <gr****@visi.com(GE) wrote:

>GEThe low 32 bits match, so perhaps you should just use that
GEportion of the returned hash?

If the hashed should be unique, 32 bits is much too low if you have
millions of entries.
--
Piet van Oostrum <pi**@cs.uu.nl>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: pi**@vanoostrum.org

Jul 12 '06 #9

Grant Edwards

On 2006-07-12, Carl Banks <in**********@aerojockey.comwrote:

Grant Edwards wrote:
>On 2006-07-11, Qiangning Hong <ho****@gmail.comwrote:

I'm writing a spider. I have millions of urls in a table (mysql) to
check if a url has already been fetched. To check fast, I am
considering to add a "hash" column in the table, make it a unique key,
and use the following sql statement:
insert ignore into urls (url, hash) values (newurl, hash_of_newurl)
to add new url.

I believe this will be faster than making the "url" column unique key
and doing string comparation. Right?

I doubt it will be significantly faster. Comparing two strings
and hashing a string are both O(N).

Playing Devil's Advocate: The hash would be a one-time operation during
database insertion, whereas string comparison would happen every
search.

Good point.

Conceivably, comparing hash strings (which is O(1)) could
result in a big savings compared to comparing regular strings;

Still, I doubt that the URLs are long enough so that there's a
significant difference.

but I expect most decent sql implementations already hash data
internally, so rolling your own hash would be useless at best.

Precisely. DB designers and implementers have been working on
this problem for 30 years. I doubt the OP is going to be able
to best them with a few minutes work.

If the OP's database is lacking, md5 is probably fine. Perhaps
using a subset of the md5 (the low 32 bits, say) could speed
up comparisons at risk of more collisions. Probably a good
trade off unless the DB is humungous.

My advice: do it the simple way first (let the DB handle it).
Don't try to fix a problem until you know it exists.

Premature optimization....

--
Grant Edwards grante Yow! It's strange, but I'm
at only TRULY ALIVE when I'm
visi.com covered in POLKA DOTS and
TACO SAUCE...

Jul 12 '06 #10

Grant Edwards

On 2006-07-12, Qiangning Hong <ho****@gmail.comwrote:

Grant Edwards wrote:
>On 2006-07-11, Qiangning Hong <ho****@gmail.comwrote:
However, when I come to Python's builtin hash() function, I
found it produces different values in my two computers! In a
pentium4, hash('a') --468864544; in a amd64, hash('a') ->
12416037344. Does hash function depend on machine's word
length?

Apparently. :)

The low 32 bits match, so perhaps you should just use that
portion of the returned hash?

>>hex(12416037344)
'0x2E40DB1E0L'
>>hex(-468864544 & 0xffffffffffffffff)
'0xFFFFFFFFE40DB1E0L'

>>hex(12416037344 & 0xffffffff)
'0xE40DB1E0L'
>>hex(-468864544 & 0xffffffff)
'0xE40DB1E0L'

Is this relationship (same low 32 bits) guaranteed?

No, I don't believe so.

Will it change in the future version?

It may.

--
Grant Edwards grante Yow! Is this an out-take
at from the "BRADY BUNCH"?
visi.com

Jul 12 '06 #11

by: VK | last post by:

Or why I just did myArray = "Computers" but myArray.length is showing 0. What a hey? There is a new trend to treat arrays and hashes as they were some variations of the same thing. But they...

Javascript

Flash calling window.location.hash causes IE6 crash

by: Simula | last post by:

I am developing an HTML javascript application and I want to preserve state in a way that can be book-marked. I chose HTML anchors as a means of preserving state. When the application changes...

Javascript

Better crypto hash functions, long, with code

by: Bryan Olson | last post by:

The current Python standard library provides two cryptographic hash functions: MD5 and SHA-1 . The authors of MD5 originally stated: It is conjectured that it is computationally infeasible to...

Python

Multiplication for types smaller than int yields an int

by: Javier Estrada | last post by:

When multiplying two sbyte, byte, short, ushort, the result is an int. It seems to me that the language should leave the result alone, otherwise let the user decide whether to use a checked...

C# / C Sharp

VB.NET CStr Function Yields Different Results With SQL Decimal Value

by: Bill L. | last post by:

We recently noticed that the vb.net CStr function yields different results than the vb6 version when converting SQL decimal data. If, for example, the data is defined in SQL as decimal(19,10), the...

Visual Basic .NET

General Hash Functions In Python

by: Arash Partow | last post by:

Hi all, I've ported various hash functions to python if anyone is interested: def RSHash(key): a = 378551 b = 63689 hash = 0

Python

My hash table is in need of peer review

by: Johan Tibell | last post by:

I would be grateful if someone had a minute or two to review my hash table implementation. It's not yet commented but hopefully it's short and idiomatic enough to be readable. Some of the code...

C / C++

Hash function for float.

by: shaanxxx | last post by:

I wanted to write hash function float or double. Any suggestion would be appreciated.

C / C++

hash

by: Douglas Dude | last post by:

how much time does it take to lok for a value - key in a hash_map ? Thanks

C / C++

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Looking to do Android software development, any suggestions? Is flutter better?

by: nemocccc | last post by:

hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

General

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

How to build RAID in BIOS?

by: Hystou | last post by:

There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...

Computer Hardware

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...

Career Advice

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

Microsoft Access / VBA

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

C# / C Sharp

hash() yields different results for different platforms

Similar topics