Bytes | Developer Community
How to implement a Hash Table in C

Hi,
can anybody tell me which data structure is best suited to implement a
hash table in C/C++?

Thanks in advance.

Aug 11 '07
Malcolm McLean said:
"Mark McIntyre" <ma**********@spamcop.netwrote in message
news:k1********************************@4ax.com...
>On Mon, 13 Aug 2007 21:40:52 +0100, in comp.lang.c , "Malcolm McLean"
<re*******@btinternet.com> wrote:
>>>"Ben Bacarisse" <be********@bsb.me.ukwrote in message
news:87************@bsb.me.uk...

I am sorry that my report was not clear. I am not reacting to this
topic rationally.

It's surprising how often people react like that. I get similar
accusations
all the time about 12 Common Atheist Arguments (refuted).

<OT>
What, you mean as opposed to the irrationality of the book itself?
</ot>
As you might expect, a lot of atheists are very hostile to my religious
books.
Someone who claims that your Refutations book is irrational need not
necessarily be an atheist. I haven't read it myself, so I am commenting
only generally, but it seems to me that there are many possibilities:

Your book is rational, your critic is an atheist, your critic is wrong;
your book is irrational, your critic is an atheist, your critic is
right; your book is rational, your critic is of another faith, your
critic is wrong; your book is irrational, your critic is of another
faith, your critic is right; your book is rational, your critic is a
Christian, your critic is wrong; or your book is irrational, your
critic is a Christian, and your critic is right.

But these boil down to just two truly distinct possibilities - either
the book is irrational (in which case the criticism is correct) or it
is not (in which case the criticism is incorrect), and the faith or
otherwise of the critic is of no particular relevance.
What is interesting is how similar the response is to Basic
Algorithms. They are quite different subjects, and there is no reason
why someone who agrees with me on Christianity should see eye to eye
on programming matters.
Indeed - and in fact other Christians might not even agree with you on
Christianity. Or they might. It's a broad church (if I may use that
expression in this context!).
But the things said are almost identical - I
regularly get demands to withdraw the book because it doesn't contain
a definitive proof of God's existence, for instance
Neither does the Bible, but I don't see anyone clamouring for it to be
withdrawn.
(it doesn't claim
to, it refutes 12 Common Atheist Arguments, not the same thing as
proving Christianity to be true).
Right. The book should be judged on its merits. If the refutations are
of poor quality, or are easily refuted themselves, or attack the wrong
arguments (e.g. arguments that are not commonly used by atheists), then
the book is a poor book. That doesn't mean it should be withdrawn,
however. The world needs horrible warnings just as much as it needs
good examples.
In case of Basic Algorithms the
pretext is technical, of course, but I think the basic motive is the
same.
The motive is technical. If your purpose is to illustrate algorithms, I
suggest that you use either pseudocode or a language you know far
better than you know C.
People see a book as something socially unacceptable.
Not in comp.lang.c they don't.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 15 '07 #51
On Aug 15, 12:25 am, "Malcolm McLean" <regniz...@btinternet.com>
wrote:
However they have been magnified into an assertion that I am
"unqualified to write" a book on algorithms. The technical objections,
though real issues, are obviously a pretext for something a bit deeper.
That was not the assertion; it was that you are unqualified to use
C as a language to describe the algorithms you are describing in your
books.

C is a strictly defined language which has been around for years; it
has not only standards but conventions. Yet you ignore all of those and
choose to do it your own way.

If you are using C you should do so rigorously. Anyone who finds a
book with C code in it is going to try to use that code; when they find
it doesn't work, it's not going to teach them anything, except that they
just wasted money on a poor book.

If you choose to use your own ideas and conventions on how C should be
written, then your algorithms are lost in your personal ideas about
how C is or isn't a good language. Instead of looking at your
algorithms and thinking about why they work or how they work, you look
at the code and wonder why the standards and conventions were not
followed and why the code doesn't work.

If you don't follow any C standards or conventions, then why use C
at all? Use pseudocode, because any advantage of using C goes out
the window when you ignore the standards and conventions that are used.

Aug 15 '07 #52
[comp.lang.c] Richard Heathfield <rj*@see.sig.invalid> wrote:
Malcolm McLean said:
>(a bunch of stuff about atheism, all OT)
(a bunch of stuff about atheism, all OT)
Can we possibly take discussion of Malcolm's anti-atheism book (or
whatever it's about) someplace where it's on topic? Thanks.
>People see a book as something socially unacceptable.
I'm trying to imagine a context where this statement is reasonable,
without success.

--
C. Benson Manica | I appreciate all corrections, polite or otherwise.
cbmanica(at)gmail.com |
----------------------| I do not currently read any posts posted through
sdf.lonestar.org | Google groups, due to rampant unchecked spam.
Aug 15 '07 #53
Christopher Benson-Manica said:
[comp.lang.c] Richard Heathfield <rj*@see.sig.invalid> wrote:
>Malcolm McLean said:
>>(a bunch of stuff about atheism, all OT)
>(a bunch of stuff about atheism, all OT)
Er, I didn't actually say anything about atheism. But yes, I know what
you mean.
Can we possibly take discussion of Malcolm's anti-atheism book (or
whatever it's about) someplace where it's on topic? Thanks.
It was more of a meta-discussion really - the intent was to encourage
Malcolm to fold my points back onto the discussion of his algorithms
book. But your point is nevertheless well-taken.
>>People see a book as something socially unacceptable.

I'm trying to imagine a context where this statement is reasonable,
without success.
Likewise. Especially here in comp.lang.c, which is one of the more
literate groups on Usenet.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 15 '07 #54

"Richard Heathfield" <rj*@see.sig.invalidwrote in message
news:Qv******************************@bt.com...
It was more of a meta-discussion really - the intent was to encourage
Malcolm to fold my points back onto the discussion of his algorithms
book. But your point is nevertheless well-taken.
I observed that the reaction to Basic Algorithms was almost identical to the
reaction to 12 Common Atheist Arguments (refuted). I wouldn't really want to
pursue the point further, but it is telling.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 15 '07 #55
Malcolm McLean said:
>
"Richard Heathfield" <rj*@see.sig.invalidwrote in message
news:Qv******************************@bt.com...
>It was more of a meta-discussion really - the intent was to encourage
Malcolm to fold my points back onto the discussion of his algorithms
book. But your point is nevertheless well-taken.
I observed that the reaction to Basic Algorithms was almost identical
to the reaction to 12 Common Atheist Arguments (refuted). I wouldn't
really want to pursue the point further, but it is telling.
Indeed, but not necessarily for the reason you imagine.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 15 '07 #56
"Malcolm McLean" <re*******@btinternet.comwrites:
"Richard Heathfield" <rj*@see.sig.invalidwrote in message
news:Qv******************************@bt.com...
>It was more of a meta-discussion really - the intent was to encourage
Malcolm to fold my points back onto the discussion of his algorithms
book. But your point is nevertheless well-taken.

I observed that the reaction to Basic Algorithms was almost identical
to the reaction to 12 Common Atheist Arguments (refuted). I wouldn't
really want to pursue the point further, but it is telling.
Telling? How exactly is it telling?

The reasons for the reaction to your Basic Algorithms book have been
discussed at great length. They have nothing to do with the author.
I, for one, would have been delighted if you had written and published
a *good* Basic Algorithms book, one that presented well-written code
that both presented the algorithms clearly and demonstrated a strong
command of whatever implementation language you chose (it happened to
be C). Minor errors in such a work are nearly inevitable; time
permitting, I would have been glad to review the book and point them
out so they can be corrected.

I can't claim to speak for anyone else, but I'm sure that many other
people here feel the same way.

But when several people read a sample chapter and find numerous
blatant errors (code that doesn't even compile, algorithms that
exhibit undefined behavior, stubborn refusal to use the features of
the language), blaming others for their reaction is absurd.

I might be interested in discussing your atheism book, but I won't do
so here, except to suggest that if two books by the same author
receive similar reactions, you should consider the possibility that
the common factor is the author and his writing style, not some
conspiracy by others to denigrate the author personally.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 15 '07 #57
"CBFalconer" <cb********@yahoo.comwrote in message
news:46***************@yahoo.com...
Eric Sosman wrote:
>Malcolm McLean wrote:
... snip ...
>>>
The objection is to a regular publisher publishing such a book.
I have written a good book.

You have done no such thing. I have read good books, and I know.
... snip all further elucidation ...

I suspect you need go no further to make an enemy.
No, I'm happy to have my works condemned.
Even with Ivor Rockbrain it's more a case of "this post is pure abuse, I'd
better not tolerate it" than anything personal.

Keith's point that "the general consensus on clc is that the book is no
good, therefore it is no good" was false, and naturally I had to reply to
that, by giving the real explanation. But I don't blame people for acting as
they do. Kenny McCormack is essentially right when he sees most of the
bandwidth on clc as dominance games, but where he is wrong is in imagining
that life could be any different.

Life is a game, and the vast majority of people are opponents, not enemies.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 15 '07 #58
"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
Keith's point that "the general consensus on clc is that the book is
no good, therefore it is no good" was false, and naturally I had to
reply to that, by giving the real explanation.
[...]

I don't recall saying that. But if the general consensus on clc is
that the book is no good, you should seriously consider the
possibility that the general consensus may have some truth behind it.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 15 '07 #59
"Malcolm McLean" <re*******@btinternet.comwrites:
"CBFalconer" <cb********@yahoo.comwrote in message
news:46***************@yahoo.com...
>Eric Sosman wrote:
>>Malcolm McLean wrote:
... snip ...
>>>>
The objection is to a regular publisher publishing such a book.
I have written a good book.

You have done no such thing. I have read good books, and I know.
... snip all further elucidation ...

I suspect you need go no further to make an enemy.
No, I'm happy to have my works condemned.
"Alas, to wear the mantle of Galileo it is not enough that you be
persecuted by an unkind establishment, you must also be right."
Bob Park

This new twist you have thrown in (that because two things you have
written have drawn criticism, there might be something biased about the
criticism) has conveniently side-tracked the thread.

Since you complained about what you see as "trivial" criticisms, I
raised a bigger issue which you have not commented on, so I'll ask
yet again: what use is your (non) queue implementation? How could it
do anything but baffle a student? If you are worried about c.l.c
topicality, why not re-post on comp.programming where the wider issues
can be discussed?

--
Ben.
Aug 15 '07 #60
Malcolm McLean wrote:
"CBFalconer" <cb********@yahoo.comwrote in message
>Eric Sosman wrote:
>>Malcolm McLean wrote:
... snip ...
>>>>
The objection is to a regular publisher publishing such a book.
I have written a good book.

You have done no such thing. I have read good books, and I know.
... snip all further elucidation ...

I suspect you need go no further to make an enemy.

No, I'm happy to have my works condemned. Even with Ivor Rockbrain
it's more a case of "this post is pure abuse, I'd better not
tolerate it" than anything personal.

Keith's point that "the general consensus on clc is that the book
is no good, therefore it is no good" was false, and naturally I had
to reply to that, by giving the real explanation. But I don't blame
people for acting as they do. Kenny McCormack is essentially right
when he sees most of the bandwidth on clc as dominance games, but
where he is wrong is in imagining that life could be any different.
McCormack is a troll and a joke. Plonk it. Keith is generally
respected, and accurate, as is Eric Sosman.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Aug 16 '07 #61
Kelsey Bjarnason wrote, On 16/08/07 17:21:

<snip>
So I can't type - compilers generally complain about mistyped identifiers. :)
Not if you mistype them consistently :-)
--
Flash Gordon
Dyslexic SW Developer.
At least the compiler ensures I spell identifiers consistently wrong.
Aug 16 '07 #62
On Thu, 16 Aug 2007 18:17:36 +0100, Flash Gordon wrote:
Kelsey Bjarnason wrote, On 16/08/07 17:21:

<snip>
>So I can't type - compilers generally complain about mistyped identifiers. :)

Not if you mistype them consistently :-)
If I could do that, I'd be happy. As it stands, I live in constant fear
that even the automated spell checkers around me will at some point rise
up against me for cruel and unusual punishment.

Well, not quite, but good goat, my non-formal writing does tend to contain
more than its fair share of typos.
Aug 16 '07 #63

"Ben Bacarisse" <be********@bsb.me.ukwrote in message
news:87************@bsb.me.uk...
"Malcolm McLean" <re*******@btinternet.comwrites:

Since you complained about what you see as "trivial" criticisms, I
raised a bigger issue which you have not commented on, so I'll ask
yet again: what use is your (non) queue implementation? How could it
do anything but baffle a student? If you are worried about c.l.c
topicality, why not re-post on comp.programming where the wider issues
can be discussed?
I didn't complain about trivial criticisms. If there is a typo that means
a closing parenthesis is a bracket, that's worth knowing and pointing out.
What I dispute is that several trivial criticisms of this sort add up to a
major criticism. I suppose there is a point at which they do, but I don't
think we're anywhere near there.

A lot of algorithm books make a big meal of the basic data structures. I
checked the Java queue classes before writing this post, and they were
pretty complicated. It takes two interfaces, Collection and Iterable,
and there are nine implementations - linked list queues, delay queues, and
so on. Then you've got add methods, which throw exceptions if the queue is
full, and offer methods, which return an error code, so that you can decide
whether the queue filling up is or is not part of normal operation. It's
quite an impressive interface.

However, the structures chapter is only chapter two. I want to describe the
simplicity of the structure instead of its complexity. I could have provided
something similar to the Java classes, using void pointers and access
functions. But it is unlikely that anyone would use such a function suite.
If you want a trivial queue you have a buffer, a circular buffer if you
don't want the inefficiency of shifting everything up on every remove
operation. Just as you build linked lists by incorporating a "next" pointer
in your structure, trees are built as you go. Whether this says something
good or something bad about C is a moot point. The fact is that no generic
basic data structure libraries have gained acceptance. If I were writing the
book in Java it would be different.

The whole chapter says "these are the basic data structures you will need,
here are some code fragments which show how to implement them". The reader
should be able to extend the expandable array to produce a dynamically
expanding queue, add functions to check for length and spare capacity, and
so on. An "empty" function is pretty patronising. The priority queue,
implemented efficiently, is the red-black tree, but it is appropriate to
hold that back. Actually it is possible to speed up the deletion, I
believe - a web search on "priority queue" and "efficiency" turns up a slew
of algorithms. The book is Basic Algorithms, covering a wide range of topics
in outline, not a massively detailed description of any one.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 16 '07 #64
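The "trivial queue as a circular buffer" idea described in the post above can be sketched in a few lines of C. This is an illustrative sketch, not code from any book under discussion; all names (`queue_push`, `queue_pop`, `QUEUE_CAP`) are invented for the example. Keeping a head index plus a count avoids the classic full-vs-empty ambiguity, and removal never shifts elements:

```c
#include <stddef.h>

/* Minimal fixed-capacity circular queue of ints: a buffer plus
   head/count indices.  Removal advances the head; nothing is shifted. */
#define QUEUE_CAP 8

struct queue {
    int buf[QUEUE_CAP];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of stored elements */
};

static void queue_init(struct queue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Returns 1 on success, 0 if the queue is full. */
static int queue_push(struct queue *q, int x)
{
    if (q->count == QUEUE_CAP)
        return 0;
    q->buf[(q->head + q->count) % QUEUE_CAP] = x;
    q->count++;
    return 1;
}

/* Returns 1 on success, 0 if empty; *out receives the oldest element. */
static int queue_pop(struct queue *q, int *out)
{
    if (q->count == 0)
        return 0;
    *out = q->buf[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return 1;
}
```

With a power-of-two capacity the `%` reductions could become single AND operations, which ties in with the table-size discussion elsewhere in the thread.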

"Ben Pfaff" <bl*@cs.stanford.eduwrote in message
news:87************@blp.benpfaff.org...
"Malcolm McLean" <re*******@btinternet.comwrites:
>I could have provided something similar to the Java classes,
using void pointers and access functions. But it is unlikely
that anyone would use such a function suite. If you want a
trivial queue you have a buffer, a circular buffer if you don't
want the inefficiency of shifting everything up on every remove
operation.

A circular buffer can be encapsulated in C and still have some
benefits for code clarity, without reducing efficiency. At
least, that's my own evaluation of my own implementation of
circular queues in C:

http://cvs.savannah.gnu.org/viewvc/*...e=text%2Fplain

http://cvs.savannah.gnu.org/viewvc/*...e=text%2Fplain
Commentary is welcome of course.
>>
/* Initializes DEQUE as an empty deque of elements ELEM_SIZE
bytes in size, with an initial capacity of at least
CAPACITY. Returns the initial deque data array. */
void *
deque_init (struct deque *deque, size_t capacity, size_t elem_size)
{
void *data = NULL;
deque_init_null (deque);
if (capacity > 0)
{
deque->capacity = 1;
while (deque->capacity < capacity)
deque->capacity <<= 1;
data = xnmalloc (deque->capacity, elem_size);
}
return data;
}

You need to check that capacity is less than 1 << (size_t_bits -1) or the
code will go into an infinite loop.
It's not a practical problem, of course, but it is typical of the sorts of
bugs that do creep into code. I was going to launch into an anti-size_t
tirade and the way in which it gives false promises of security, but I'm too
tired and I've got to prepare a mighty defence of another of my books which
is under attack on another ng.
--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm
Aug 16 '07 #65
while (deque->capacity < capacity)
deque->capacity <<= 1;
You need to check that capacity is less than 1 << (size_t_bits -1) or
the code will go into an infinite loop.
This is a good point. Thank you. However, there's no way to
solve the problem that doesn't violate the postcondition stated
in the function header, so I think I'll "solve" it by adding an
assertion.
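The hazard pointed out above is that the `capacity <<= 1` doubling loop wraps to zero once it overflows `size_t`, after which the loop never terminates. One way to guard against this is to reject requests before the loop runs. This is a sketch of that idea only, not Ben Pfaff's actual fix (he opted for an assertion); `round_up_pow2` is a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds `want` up to the next power of two, storing it in *result.
   Returns 0 (without looping) if no power of two >= want fits in
   size_t, which is the overflow case that would otherwise spin. */
static int round_up_pow2(size_t want, size_t *result)
{
    /* Largest power of two representable in size_t. */
    const size_t max_pow2 = (SIZE_MAX >> 1) + 1;
    size_t cap = 1;

    if (want > max_pow2)
        return 0;              /* doubling would overflow to 0 */
    while (cap < want)
        cap <<= 1;
    *result = cap;
    return 1;
}
```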

--
char a[]="\n .CJacehknorstu";int putchar(int);int main(void){unsigned long b[]
={0x67dffdff,0x9aa9aa6a,0xa77ffda9,0x7da6aa6a,0xa67f6aaa,0xaa9aa9f6,0x11f6},*p
=b,i=24;for(;p+=!*p;*p/=4)switch(0[p]&3)case 0:{return 0;for(p--;i--;i--)case+
2:{i++;if(i)break;else default:continue;if(0)case 1:putchar(a[i&15]);break;}}}
Aug 16 '07 #66
Malcolm McLean wrote:
>
.... snip ...
>
/* Initializes DEQUE as an empty deque of elements ELEM_SIZE
bytes in size, with an initial capacity of at least
CAPACITY. Returns the initial deque data array. */
void *
deque_init (struct deque *deque, size_t capacity, size_t elem_size)
{
void *data = NULL;

deque_init_null (deque);
if (capacity > 0) {
deque->capacity = 1;
while (deque->capacity < capacity)
deque->capacity <<= 1;
data = xnmalloc (deque->capacity, elem_size);
}
return data;
}
Makes no sense, lacking descriptions of struct deque,
deque_init_null, xnmalloc.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Aug 17 '07 #67
On Aug 13, 8:50 pm, Richard Heathfield <r...@see.sig.invalid> wrote:
Two bucks in one day? Hmmm. I'm not talking to
you any more without my lawyer.
But I thought you had a huge supply of rounding-error
halfpennies from your days hacking bank computers :-)
But seriously, there are also advantages in making the table size an
integer power of 2 (which, in all but one case, fails the primality
test with a vengeance), since the hash value can then be reduced to the
table size more rapidly.
Gak! You're not coding ala Falconer are you?
(Doing a division on every reprobe, instead
of a simple add-compare-subtract.) Most hashers
will "want to" divide by a prime for scrambling
anyway, so the index reduction is free.

(BTW, when memory is at a premium one may wish
to expand tables, when required, by less than 100%,
precluding any guarantee of power-of-two.)

Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

In my posting I mentioned "interesting" recent
methods like Cuckoo and Compact Hash and got
little response. Is this because the
"Google Groups" kill-files me? (Or, because
people have taken to kill-filing
"James Dow Allen" specifically?)
I'm not talking to
you any more without my lawyer.
Well, have a nice day anyway!
James

Aug 17 '07 #68
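The method attributed to Falconer above (linear probing with a second hash value used as the probe increment, i.e. double hashing) depends on the increment being relatively prime to the table size, and a prime table size guarantees that for any increment in 1..size-1, so the probe sequence visits every slot. The following is a minimal sketch under that assumption - hypothetical names and illustrative hash functions, not CBFalconer's actual hashlib code:

```c
#include <stddef.h>

#define TABLE_SIZE 101           /* prime, so any step 1..100 works */
#define EMPTY      (-1)

static size_t hash1(unsigned key) { return key % TABLE_SIZE; }

/* Second hash supplies the probe increment; never returns 0. */
static size_t hash2(unsigned key) { return 1 + key % (TABLE_SIZE - 1); }

/* Returns the index holding `key`, or the free slot where it would
   go; -1 only if the table is completely full. */
static int find_slot(const int table[], unsigned key)
{
    size_t i = hash1(key);
    size_t step = hash2(key);
    size_t probes;

    for (probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[i] == EMPTY || (unsigned)table[i] == key)
            return (int)i;
        i = (i + step) % TABLE_SIZE;  /* double-hashing reprobe */
    }
    return -1;
}
```

As Eric Sosman notes later in the thread, primality is sufficient but not necessary: a power-of-two table size also works provided the increment is always odd.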
On Aug 16, 11:58 pm, James Dow Allen <jdallen2...@yahoo.com> wrote:
On Aug 13, 8:50 pm, Richard Heathfield <r...@see.sig.invalid> wrote:
Two bucks in one day? Hmmm. I'm not talking to
you any more without my lawyer.

But I thought you had a huge supply of rounding-error
halfpennies from your days hacking bank computers :-)
But seriously, there are also advantages in making the table size an
integer power of 2 (which, in all but one case, fails the primality
test with a vengeance), since the hash value can then be reduced to the
table size more rapidly.

Gak! You're not coding ala Falconer are you?
(Doing a division on every reprobe, instead
of a simple add-compare-subtract.) Most hashers
will "want to" divide by a prime for scrambling
anyway, so the index reduction is free.
When table size is an integral power of 2, the address can be
calculated from the hash with a single AND operation.
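That single AND works because, for a power-of-two `size`, `size - 1` has every low bit set, so `hash & (size - 1)` equals `hash % size` with no division. A sketch (function name hypothetical):

```c
#include <stddef.h>

/* Reduce a hash value to a table index when the table size is a
   power of two: size - 1 is an all-ones mask over the low bits. */
static size_t index_for(size_t hash, size_t size /* power of 2 */)
{
    return hash & (size - 1);
}
```

This is also why doubling the table on growth is convenient: the mask stays a mask.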
(BTW, when memory is at a premium one may wish
to expand tables, when required, by less than 100%,
precluding any guarantee of power-of-two.)

Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

In my posting I mentioned "interesting" recent
methods like Cuckoo and Compact Hash and got
little response. Is this because the
"Google Groups" kill-files me? (Or, because
people have taken to kill-filing
"James Dow Allen" specifically?)
I mentioned cuckoo hash also.
This one is easy to understand:
http://www.mpi-inf.mpg.de/~sanders/programs/cuckoo/
It is written in C++ but would be trivial to rewrite in C.

I do not know of any useful implementation of compact hashing that has
a license that would allow me to use it, and I have not read the paper
on compact hashing yet.
I'm not talking to
you any more without my lawyer.

Well, have a nice day anyway!
James

Aug 17 '07 #69
/* Here is that cuckoo hash function converted to C */
/* Dann Corbit did the quick & dirty C conversion. */
/* */
/* K-ary cuckoo hashing (c) 2002-2003 by Peter Sanders */
/* this is a simple implementation for the purposes */
/* of measuring the performance of cuckoo hashing in abstract */
/* terms (number of probes per insert, storage efficiency) */
/* usage: compile using g++ or another ISO C++ compiler */
/* a.out <K> <n> <r> <repeat> <seed for rng> */
/* there is also a constant tMax that defines the maximum number */
/* of trials for an insertion */
/* allocates space for n elements in K subtables, and repeats */
/* the following measurements repeat times: */
/* - find a hash function by filling full lookup tables */
/* with pseudo-random numbers. */
/* - insert elements i=0..n-1 into the table until this fails */
/* for the first time. (The cost of these insertions is not */
/* measured.) */
/* Every n/r successfully inserted elements, the following */
/* measurement is repeated n/r times: */
/* * a random element i2 is deleted */
/* * the hash table entries for i2 are filled with new random values
*/
/* * i2 is reinserted. */
/* Note that this is equivalent to removing a random element */
/* inserting a new element that */
/* has never been in the table before. */
/* The output is a table that gives */
/* - x the number of elements in the table at each measurement interval
*/
/* - the average number of probes for an insertion during the
measurements */
/* for the given number of inserted elements */
/* - K */
/* - n */
/* - repeat */
/* - seed */
#define DEBUGLEVEL 1
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <time.h>
#include "util.h"
#include "mt-real.c"

#define split 1
const int tMax = 10000; /* max number of random walk trials */
int K; /* number of probes */
int n; /* number of elements */
int m; /* number of elements per table */
int **hash; /* K times n array */
int **table; /* K times m array */
double limit(double x, double bound)
{
if (x > bound) {
return bound;
} else if (x < -bound) {
return -bound;
} else
return x;
}
double cpuTime()
{
return clock() * 1e-6;
}

/* generate random int in 0..x-1 */
int rand0K(int x)
{
return (int) (genrand() * x);
}

/* insert element i into table */
/* return value: */
/* -1 failure */
/* otherwise number of hash function evaluation */
int insert(int i)
{
int forbidden = -1;
int j = rand0K(K);
int t;
for (t = 1; t <= tMax; t++) {
int p = hash[j][i];
int newI = table[j][p];
table[j][p] = i; /* play the cuckoo */
if (newI == -1)
return t; /* done */
forbidden = j;
i = newI; /* find new home for cuckoo victim */
j = rand0K(K - 1);
if (j == forbidden)
j = K - 1;
}
return tMax + 1; /* insertion failed */
}

/* delete element i from the table */
void delete(int i)
{
int j;
for (j = 0; j < K; j++) {
int p = hash[j][i];
if (table[j][p] == i) {
table[j][p] = -1;
return;
}
}
}

int main(int argc, char **argv)
{
int i,
j;
int r;
int step;
int repeat;

int seed;
double *sumT;
int *cf;
int rep;

assert(argc == 6);
K = atoi(argv[1]); /* number of probes */
n = atoi(argv[2]); /* number of elements */
r = atoi(argv[3]); /* number of measured densities */
repeat = atoi(argv[4]); /* how often to start from scratch */
seed = atoi(argv[5]);
m = (int) (n / K + 0.5);
step = n / r;
sgenrand(seed);
puts("# x tAvg(x) K N repeat seed");

/* allocate hash function and table */
/* and an empty table */
hash = calloc(K, sizeof(int *));
table = calloc(K, sizeof(int *));
for (j = 0; j < K; j++) {
hash[j] = calloc(n, sizeof(int));
table[j] = calloc(m, sizeof(int));
}

/* initialize statistics */
/* sumT[i] is the average time for size i*step */
sumT = calloc(r + 1, sizeof(double));
cf = calloc(r + 1, sizeof(int));
for (i = 0; i < r; i++) {
sumT[i] = 0;
cf[i] = 0;
}

/* main loop */
for (rep = 0; rep < repeat; rep++) {
/* init hash function and empty table */
for (j = 0; j < K; j++) {
for (i = 0; i < n; i++) {
hash[j][i] = rand0K(m);
}
for (i = 0; i < m; i++) {
table[j][i] = -1;
}
}

/* fill table and keep measuring from time to time */
for (i = 0; i < n; i++) {
if (insert(i) > tMax)
break; /* table is full */
/* measure in detail here */
if (((i + 1) % step) == 0) {
int i1;
for (i1 = 0; i1 < step; i1++) {
int t;
/* delete & reinsert a random element */
int i2 = rand0K(i);
delete(i2);
for (j = 0; j < K; j++)
hash[j][i2] = rand0K(m);
t = insert(i2);
cf[i / step] += (t > tMax);
/* cout << t << endl; */
sumT[i / step] += t;
}
}
}
}

for (rep = 0; rep < r; rep++) {
printf("%d %f %d %d %d %d %d\n",
rep * step + step,
sumT[rep] / step / repeat,
K,
n,
repeat,
seed,
cf[rep]);
}
return 0;
}

/*
mt-real.c follows:
*/
/* A C-program for MT19937B: Real number version */
/* genrand() generate one pseudorandom number with double precision
*/
/* which is uniformly distributed on [0,1]-interval for each call.
*/
/* sgenrand(seed) set initial values to the working area of 624
words.*/
/* sgenrand(seed) must be called once before calling genrand()
*/
/* (seed is any integer except 0).
*/

/*
LICENCE CONDITIONS:

Matsumoto and Nishimura consent to GNU General
Public Licence

NOTE:
When you use it in your program, please let Matsumoto
<ma******@math.keio.ac.jp> know it.

Because of a machine-trouble, Matsumoto lost emails
which arrived during May 28-29.
*/
#include<stdio.h>

/* Period parameters */
#define N 624
#define M 397
#define MATRIX_A 0x9908b0df /* constant vector a */
#define UPPER_MASK 0x80000000 /* most significant w-r bits */
#define LOWER_MASK 0x7fffffff /* least significant r bits */

/* for tempering */
#define TEMPERING_MASK_B 0x9d2c5680
#define TEMPERING_MASK_C 0xefc60000
#define TEMPERING_SHIFT_U(y) (y >> 11)
#define TEMPERING_SHIFT_S(y) (y << 7)
#define TEMPERING_SHIFT_T(y) (y << 15)
#define TEMPERING_SHIFT_L(y) (y >> 18)

static unsigned long ptgfsr[N]; /* set initial seeds: N = 624 words */

void
sgenrand(unsigned long seed) /* seed should not be 0 */
{
int k;

/* setting initial seeds to ptgfsr[N] using */
/* the generator Line 25 of Table 1 in */
/* [KNUTH 1981, The Art of Computer Programming */
/* Vol. 2 (2nd Ed.), pp102] */

ptgfsr[0]= seed & 0xffffffff;
for (k=1; k<N; k++)
ptgfsr[k] = (69069 * ptgfsr[k-1]) & 0xffffffff;
}

double
genrand()
{
unsigned long y;
static int k = 1;
static unsigned long mag01[2]={0x0, MATRIX_A};
/* mag01[x] = x * MATRIX_A for x=0,1 */

if(k == N){ /* generate N words at one time */
int kk;
for (kk=0;kk<N-M;kk++) {
y = (ptgfsr[kk]&UPPER_MASK)|(ptgfsr[kk+1]&LOWER_MASK);
ptgfsr[kk] = ptgfsr[kk+M] ^ (y >> 1) ^ mag01[y & 0x1];
}
for (;kk<N-1;kk++) {
y = (ptgfsr[kk]&UPPER_MASK)|(ptgfsr[kk+1]&LOWER_MASK);
ptgfsr[kk] = ptgfsr[kk+(M-N)] ^ (y >> 1) ^ mag01[y & 0x1];
}
y = (ptgfsr[N-1]&UPPER_MASK)|(ptgfsr[0]&LOWER_MASK);
ptgfsr[N-1] = ptgfsr[M-1] ^ (y >> 1) ^ mag01[y & 0x1];

k = 0;
}

y = ptgfsr[k++];
y ^= TEMPERING_SHIFT_U(y);
y ^= TEMPERING_SHIFT_S(y) & TEMPERING_MASK_B;
y ^= TEMPERING_SHIFT_T(y) & TEMPERING_MASK_C;
y &= 0xffffffff; /* you may delete this line if word size = 32 */
y ^= TEMPERING_SHIFT_L(y);

return ( (double)y / (unsigned long)0xffffffff );
}

/*
util.h follows:
*/
/*
// this files contains all the application independent little
// functions and macros used for the optimizer.
// In particular Peters debug macros and Dags stuff
// from dbasic.h cdefs, random,...

//////////////// stuff originally from debug.h ///////////////////////////////
// (c) 1997 Peter Sanders
// some little utilities for debugging adapted
// to the paros conventions
*/

#ifndef UTIL
#define UTIL
#ifndef Max
#define Max(x,y) ((x)>=(y)?(x):(y))
#endif

#ifndef Min
#define Min(x,y) ((x)<=(y)?(x):(y))
#endif

#ifndef Abs
#define Abs(x) ((x) < 0 ? -(x) : (x))
#endif

#ifndef PI
#define PI 3.1415926535
#endif

#endif

Aug 17 '07 #70
James Dow Allen said:
On Aug 13, 8:50 pm, Richard Heathfield <r...@see.sig.invalid> wrote:
>But seriously, there are also advantages in making the table size an
integer power of 2 (which, in all but one case, fails the primality
test with a vengeance), since the hash value can then be reduced to
the table size more rapidly.

Gak! You're not coding ala Falconer are you?
No, I don't think so. I don't actually use hash tables a great deal
(which is not to say that I never use them), but when I *do* use them,
I don't probe at all. I just use stretchy buckets instead (using trees
to handle the stretchy bit).

<snip>

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 17 '07 #71
James Dow Allen wrote:
[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.
Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)
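That claim is easy to demonstrate. A minimal sketch (the table size and starting values are purely illustrative, not from anyone's posted code): with a power-of-two table, any odd increment is co-prime to the size and so walks every slot before repeating, while an even increment cycles through a proper subset.

```c
#define TSIZE 16  /* power of two, so TSIZE - 1 is an all-ones mask */

/* Count the distinct slots a fixed-increment probe sequence visits.
 * An odd step is co-prime to any power of two, so the walk covers
 * all TSIZE slots; an even step can only reach a subset of them. */
static int slots_covered(unsigned start, unsigned step)
{
    int seen[TSIZE] = {0};
    int count = 0;
    unsigned h = start & (TSIZE - 1);
    int i;

    for (i = 0; i < TSIZE; i++) {
        if (!seen[h]) {
            seen[h] = 1;
            count++;
        }
        h = (h + step) & (TSIZE - 1);
    }
    return count;
}
```

Forcing the increment odd (e.g. `step | 1`) is therefore enough to guarantee full coverage in a power-of-two table.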

Some people seem very interested in the relative speed
of AND vs MOD for this purpose, and on most machines AND
is faster or at the very least no slower. I think, though,
that the time savings must be viewed as just one component
of the search cost -- another piece (sometimes rather large)
is computing the key's hash value in the first place, not
to mention the cost of equality comparisons on the table
items encountered along the way.
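For reference, the AND-versus-MOD question only arises because the two reductions agree when the table size is a power of two: for unsigned h, h % N equals h & (N - 1). A tiny illustration (function names are mine):

```c
/* Two ways to reduce a hash value to an index in a table whose
 * size n is a power of two: remainder, and masking the low bits. */
static unsigned reduce_mod(unsigned h, unsigned n)
{
    return h % n;
}

static unsigned reduce_and(unsigned h, unsigned n)
{
    return h & (n - 1);  /* valid only when n is a power of two */
}
```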

(Slight digression): It's often suggested that those
equality comparisons can be made cheaper by storing hash
codes in the table entries along with the payload. Before
making a possibly expensive strcmp() or whatever, one can
first do a (fast) arithmetic comparison of the hashes: if
the hashes don't match, the objects must be unequal and
it is unnecessary to make the expensive comparison. But this
stratagem doesn't seem to me to be a "Well, doh!" proposition;
at least two counter-arguments exist:

- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.

- Storing the hash codes saves time whenever a probe hits
a slot occupied by the "wrong" key (it saves no time when
the probe finds the "right" key). If the time savings is
significant, it must mean that the key comparisons are
very expensive or that the probe sequences are very long.
If we've taken care to construct good hash functions and
have reason to believe that searches take two-point-mumble
probes, on the average, the savings is only one-point-mumble
comparisons per search.
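For concreteness, the stratagem under discussion amounts to something like this (the struct layout and names are my own illustration, not taken from any particular library):

```c
#include <string.h>

struct entry {
    unsigned long hash;  /* full hash code, saved when the key was inserted */
    const char *key;     /* the stored key itself */
};

/* Nonzero iff the slot holds the key being sought.  The cheap integer
 * comparison rejects most "wrong" slots without calling strcmp(). */
static int entry_matches(const struct entry *e,
                         unsigned long h, const char *key)
{
    return e->hash == h && strcmp(e->key, key) == 0;
}
```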

As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"

--
Eric Sosman
es*****@ieee-dot-org.invalid
Aug 17 '07 #72

<OT>
On Fri, 17 Aug 2007 08:43:06 -0400, Eric Sosman
<es*****@ieee-dot-org.invalid> wrote:
>James Dow Allen wrote:
>[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)
In a thread in comp.programming (IIRC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.
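As a rough sketch of that idea (my own reconstruction of the scheme, not Hsieh's actual code): the low bits of the hash pick the home slot, and otherwise-unused high bits pick a prime increment.

```c
/* Illustrative only: a small table of primes, indexed by the top four
 * bits of a 32-bit hash -- bits a power-of-two mask would discard. */
static const unsigned step_primes[16] = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59
};

static unsigned probe_step(unsigned long hash)
{
    return step_primes[(hash >> 28) & 0xf];
}
```

Keys that collide on the masked index but differ in their high bits then get different probe increments, breaking up the clusters.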

[snip]

re: counterargument to storing the hash code.
- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.
This doesn't sound right - ordinarily a table has to hold a
key/value pair. Adding a hash would increase the load factor by
50% - for a given amount of memory.
>
- Storing the hash codes saves time whenever a probe hits
a slot occupied by the "wrong" key (it saves no time when
the probe finds the "right" key). If the time savings is
significant, it must mean that the key comparisons are
very expensive or that the probe sequences are very long.
If we've taken care to construct good hash functions and
have reason to believe that searches take two-point-mumble
probes, on the average, the savings is only one-point-mumble
comparisons per search.

As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"
</OT>
Aug 17 '07 #73
On Fri, 17 Aug 2007 01:01:27 -0700, user923005
<dc*****@connx.com> wrote:
>/* Here is that cuckoo hash function converted to C */
/* Dann Corbit did the quick & dirty C conversion. */
<snip code>

Thanks for posting the code - much appreciated.
Aug 17 '07 #74
Richard Harter wrote:
Eric Sosman <es*****@ieee-dot-org.invalid> wrote:
>James Dow Allen wrote:
>>[...]
Speaking of Falconer, his method (linear probing with second
hash value used for probe increment) *requires* (even more so
than quadratic probing, see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

In a thread in comp.programming (IIRC) Paul Hsieh pointed out
that always using the same probe increment in a power of two
table size also has bad properties. His solution is to use the
unused bits of the full hash to select an increment size from a
table of primes.
That is dangerous. The purpose of the multiple hash is to have
different search patterns for different objects that have the same
primary hash pattern. This avoids clustering, and tends to keep
the search chain short. The very iffy thing in hashlib is the
reduction of the range of the secondary hash, which encourages
light clustering in nearly empty tables. The system attempts to
maintain tables at between 44 and 88% of capacity.

Also, I want the table action to be independent of the hash
functions (apart from speed), so the functions are computed every
time needed. The system functions with all the hash functions
replaced by the constant 1 (unity) (tested in the test suite) at a
considerable speed penalty. So the action is hash function
independent, but the speed is highly dependent.

The system is intended to isolate all these decisions in the hash
table code, so the user need not worry about them. Similarly the
algorithms for table expansion, etc. The result is a storage and
retrieval black box.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Aug 17 '07 #75

"CBFalconer" <cb********@yahoo.comwrote in message
news:46***************@yahoo.com...
Malcolm McLean wrote:
>>
... snip ...
>>
/* Initializes DEQUE as an empty deque of elements ELEM_SIZE
bytes in size, with an initial capacity of at least
CAPACITY. Returns the initial deque data array. */
void *
deque_init (struct deque *deque, size_t capacity, size_t elem_size)
{
void *data = NULL;

deque_init_null (deque);
if (capacity > 0) {
deque->capacity = 1;
while (deque->capacity < capacity)
deque->capacity <<= 1;
data = xnmalloc (deque->capacity, elem_size);
}
return data;
}

Makes no sense, lacking descriptions of struct deque,
deque_init_null, xnmalloc.
For it to be OK, deque->capacity would have to be of a type guaranteed to be
wider than size_t. There is no such type in standard C, so you can see
that the code is incorrect.
Actually, calling with ~0 is a mistake a calling programmer is not too
unlikely to make. If he knows he needs N - 1 spaces in his queue for a
dental surgery, one patient in the chair and thus not in the queue, then he
might pass in N - 1 and forget to check for N == 0.
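The failure mode is the rounding loop: once `capacity <<= 1` wraps a size_t to zero, the loop never terminates. A guarded version of just the round-up step might look like this (a sketch of my own, not Ben Pfaff's code):

```c
#include <stddef.h>

/* Round n up to a power of two, returning 0 if no power of two
 * large enough fits in a size_t -- the near-~0 request described
 * above, which the unguarded loop turns into an infinite loop. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    while (p < n) {
        if (p > (size_t)-1 / 2)  /* doubling would wrap to zero */
            return 0;
        p <<= 1;
    }
    return p;
}
```

The caller can then treat a zero return as an allocation-failure case rather than hanging.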

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 17 '07 #76
On Aug 17, 2:00 pm, "Malcolm McLean" <regniz...@btinternet.comwrote:
[snip]
There are many generic queue libraries. None has gained wide acceptance.
Therefore the bar for "usable" is in fact very high. This is not true of
other languages, but it is true of C.
These queues have a large following:
http://www.microsoft.com/windowsserv...q/default.mspx
http://www-306.ibm.com/software/integration/wmq/

Here is a list of queues:
http://sourceforge.net/search/?type_...ft&words=queue

I like this one:
http://sourceforge.net/projects/safmq/
It's C++ though, and is a lot more than most people probably need for
a simple fifo.

For those with simple needs, here is a thread-safe fifo:
http://fifoembed.sourceforge.net/
It's POSIX oriented, but it should be easy enough to convert to
generic use (POSIX threads for Windows could be used for that OS, for
instance).

Aug 17 '07 #77

"Richard Harter" <cr*@tiac.netwrote in message
news:46***************@news.sbtc.net...
This doesn't sound right - ordinarily a table has to hold a
key/value pair. Adding a hash would increase the load factor by
50% - for a given amount of memory.
Normally you can derive the keys from the data item. So in C you would pass
in a function pointer to extract a key from a datum, or if you want to be a
little faster but less flexible, pass in a length and offset.
Unfortunately it make the table harder to use.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 17 '07 #78
Ben Bacarisse said:

<snip>
I'd have add() and remove() do these tests, but
if you don't do that they need to be provided as queue operations.
I pointed out the name clash with <stdio.h>'s remove() some days - or
possibly even weeks - ago. That should have been plenty of time to
correct the electronic version of the book (given that MM seems to have
plenty of time to discuss it, so I presume he has enough time to fix it
too - the more important task).

So I went to have a look, to see if it had been fixed yet. (This is the
first time I've looked at the chapter.)

MM's A123 XYZ / B123 CDE example is wrong, in that it suggests checking
Slot 134. If that's empty, MM says, B123 CDE goes in Slot 124. I
can see no justification whatsoever for checking slot (S + 10) before
deciding whether to use Slot S.

I was puzzled to find that hash table entries were not abstracted as the
table itself was. "Divide and conquer" is too useful a strategy to be
ignored lightly, especially in teaching.

Oddly, MM seems to be using int rather than size_t for recording lengths
and sizes (neither of which can be negative). Presumably this is a
typo, since no experienced C programmer would deliberately be that
daft.

His malloc is not ideal - answer = malloc(sizeof *answer); would be
preferable. He repeats this mistake later in the (poorly named)
hashtable() function.

Good functions perform a task. They are doers. And they need "doing
words" (which we sometimes call "verbs") in their names. For example,
stdio's "remove" and "rename" and "fopen" and "printf" - all reasonably
good names (given their creator's habit of terseness).

The indentation is appalling, and led me to believe at first that
hashtable() always returns 0. I had to double back to check. Blech.

There's a reference to a strdup call, but as far as I can see there is
neither a definition of a strdup function nor indeed a call to one.
There is a call to a mystrdup function, but that isn't the same thing.
(Nor is it defined, so we have no idea what it does.)

I wonder what "menas" means.

MM seems to be thinking that "quadratic probing" means "square the hash
value" (and that's a direct quote). Either he's misunderstood quadratic
probing completely, or I have. As far as I'm concerned, quadratic
probing has nothing whatsoever to do with the hash value except as a
place to start looking for a slot. My understanding is that, if the
hash value h has so far resulted in n collisions in a table of size s,
then the place to look next is slot (h + n*n) % s - and most certainly
NOT (h * h) % s!
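In outline, a search under that reading of quadratic probing goes like this (the table layout is illustrative; a prime size suits quadratic probing):

```c
#include <string.h>

#define TABSIZE 11  /* illustrative prime table size */

/* Probe for key starting at hash value h, trying (h + n*n) % TABSIZE
 * after the n-th collision.  Returns the slot index, or -1 if an
 * empty slot (or a full sweep) shows the key is absent. */
static int quad_find(const char *table[], unsigned h, const char *key)
{
    unsigned n;

    for (n = 0; n < TABSIZE; n++) {
        unsigned slot = (h + n * n) % TABSIZE;
        if (table[slot] == NULL)
            return -1;
        if (strcmp(table[slot], key) == 0)
            return (int)slot;
    }
    return -1;
}
```

The hash value only chooses where the sequence starts; it is never squared.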

At this point, I started to lose interest, since it seems evident that
the chapter is so badly broken as to be beyond recovery - and I started
to skim. At random, I discovered an invasion of implementation
namespace, a correct claim that the primality test was interesting (but
alas, it is interesting for all the wrong reasons), an incorrect claim
that trial division is the only guaranteed way to determine whether a
number is prime, and a prime number function that wrongly decides 2 is
not a prime number.

Overall, I found the chapter to be of disappointingly low quality, and I
could not possibly recommend it. If that chapter is the advertisement,
the product itself must be dire indeed.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 18 '07 #79

"Richard Heathfield" <rj*@see.sig.invalidwrote in message
news:34*********************@bt.com...
Ben Bacarisse said:

<snip>
>I'd have add() and remove() do these tests, but
if you don't do that they need to be provided as queue operations.

I pointed out the name clash with <stdio.h>'s remove() some days - or
possibly even weeks - ago. That should have been plenty of time to
correct the electronic version of the book (given that MM seems to have
plenty of time to discuss it, so I presume he has enough time to fix it
too - the more important task).
Corrections are being stored up for an edition 2, which will be released
shortly.
MM's A123 XYZ / B123 CDE example is wrong, in that it suggests
checking Slot 134. If that's empty, MM says, B123 CDE goes in Slot
124. I can see no justification whatsoever for checking slot (S + 10)
before
deciding whether to use Slot S.

I was puzzled to find that hash table entries were not abstracted as the
table itself was. "Divide and conquer" is too useful a strategy to be
ignored lightly, especially in teaching.
There are two tables. One stores data items of any size, one stores
pointers.
>
Oddly, MM seems to be using int rather than size_t for recording lengths
and sizes (neither of which can be negative). Presumably this is a
typo, since no experienced C programmer would deliberately be that
daft.
It's an algorithms book. The algorithms work on integers. The hash function
then has to return a size_t, which is confusing to people using the book
to understand how to port to another language.
>
His malloc is not ideal - answer = malloc(sizeof *answer); would be
preferable. He repeats this mistake later in the (poorly named)
hashtable() function.
Too hard to read. malloc(sizeof *ptr) is a bad clc-ism that you will seldom
see in another environment. The reader has to perform a mental dereference,
which makes it harder to see if the size is correct.
>
Good functions perform a task. They are doers. And they need "doing
words" (which we sometimes call "verbs") in their names. For example,
stdio's "remove" and "rename" and "fopen" and "printf" - all reasonably
good names (given their creator's habit of terseness).
My convention is to have an opaque structure, and a constructor with the
name of the structure in lower case. This is analogous to C++. As you say,
create_xxx would have the advantage of being a verb.
>
The indentation is appalling, and led me to believe at first that
hashtable() always returns 0. I had to double back to check. Blech.
Read the actual book - a preview is available. The html is dumped from the
Open Office book files; unfortunately the dumper doesn't understand code formatting.
I really ought to be able to find a way around that, I know.
>
There's a reference to a strdup call, but as far as I can see there is
neither a definition of a strdup function nor indeed a call to one.
There is a call to a mystrdup function, but that isn't the same thing.
(Nor is it defined, so we have no idea what it does.)
strdup() is discussed in chapter 1.
>
I wonder what "menas" means.
typo.
>
MM seems to be thinking that "quadratic probing" means "square the hash
value" (and that's a direct quote). Either he's misunderstood quadratic
probing completely, or I have. As far as I'm concerned, quadratic
probing has nothing whatsoever to do with the hash value except as a
place to start looking for a slot. My understanding is that, if the
hash value h has so far resulted in n collisions in a table of size s,
then the place to look next is slot (h + n*n) % s - and most certainly
NOT (h * h) % s!
I'll have to check on this one.
>
At this point, I started to lose interest, since it seems evident that
the chapter is so badly broken as to be beyond recovery - and I started
to skim. At random, I discovered an invasion of implementation
namespace, a correct claim that the primality test was interesting (but
alas, it is interesting for all the wrong reasons), an incorrect claim
that trial division is the only guaranteed way to determine whether a
number is prime, and a prime number function that wrongly decides 2 is
not a prime number.

Overall, I found the chapter to be of disappointingly low quality, and I
could not possibly recommend it. If that chapter is the advertisement,
the product itself must be dire indeed.
You seem determined to pick faults. The problem is that it then becomes
distorting, like Ben Bacarisse's comments. It is then impossible to know
whether the criticisms actually represent anything legitimate or are just fussing.

abstracted - is an entry of arbitrary size abstract enough? The keys, yes, I
admit it might have been better to go for non-strings.

size_t - yes, there is an obvious case, but you don't seem to have
considered the implications.

malloc(sizeof *ptr) - that's not the convention in most places. The argument
for it is based on the needs of the compiler rather than the human reader.

verbs for function names. There is a case, but it is not obvious that you
are right. C++ doesn't agree.

indentations. Yes, it's my fault for not getting a decent HTML dumper. If I
edit the html

strdup - discussed in chapter 1.

menas - unambiguously valid criticism.

quadratic probing - you might be right here.

prime stuff - yes, that needs a bit of fixing, I will agree.

So actually most of the criticism dissolves. I am not saying that there are
not things you can say against the book, or that it couldn't have been done
better. However can you point to a better introductory web page on hash
tables? That's the real test.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 18 '07 #80
On Aug 18, 1:15 am, "Malcolm McLean" <regniz...@btinternet.comwrote:
[snip]
So actually most of the criticism dissolves. I am not saying that there are
not things you can say against the book, or that it couldn't have been done
better.
I don't understand how the criticism dissolves. I think that the
major issue is really the reaction to the criticism even more so than
the defects themselves.
The worst possible reaction is "The emperor's new clothes" reaction --
pretending that problems do not exist.
The best possible reaction is "Yes, I see the problem. I'll have it
fixed right away." -- and then actually fixing it right away.
Everyone makes mistakes, no matter how careful we are. But
recognizing our mistakes and correcting them is very, very important.
When creating a reference work, special effort should be made to
ensure correctness on the first pass. And diligent effort to improve
it is also highly important.
However can you point to a better introductory web page on hash
tables? That's the real test.
I suspect that something in this tangle will turn up worthwhile:
http://www.google.com/search?hl=en&q...to+hash+tables

Personally, I like this:
http://eternallyconfuzzled.com/tuts/...hashtable.aspx
When the twisted minds of Julienne Walker and The EC Team get together
and write a book, I will buy a copy for sure.

Aug 18 '07 #81
"user923005" <dc*****@connx.comwrote in message
news:11**********************@g4g2000hsf.googlegro ups.com...
On Aug 18, 1:15 am, "Malcolm McLean" <regniz...@btinternet.comwrote:
[snip]
>So actually most of the criticism dissolves. I am not saying that there
are
not things you can say against the book, or that it couldn't have been
done
better.

I don't understand how the criticism dissolves. I think that the
major issue is really the reaction to the criticism even more so than
the defects themselves.
The worst possible reaction is "The emperor's new clothes" reaction --
pretending that problems do not exist.
The best possible reaction is "Yes, I see the problem. I'll have it
fixed right away." -- and then actually fixing it right away.
Everyone makes mistakes, no matter how careful we are. But
recognizing our mistakes and correcting them is very, very important.
When creating a reference work, special effort should be made to
ensure correctness on the first pass. And diligent effort to improve
it is also highly important.
I don't see how else I could react. Criticisms have been made, some of them
indisputable bug reports, some of them very good points, some of them
defensible but not really good criticisms, some criticisms which fail to
understand the purpose of the code or the book, and some which are
objectively false.
That's what you would expect. However there is also a sort of assumption that
the book is no good. In fact I found a bug in Ben Pfaff's circular buffer
code within a couple of minutes, but no one was demanding that the code be
taken down. Nor did I report it in a silly way, like "this implementation
exhibits catastrophic and resource-hogging behaviour if size_t is used
beyond the range of a typical signed integer type". I just pointed it out,
as any sensible, experienced person does when making bug reports.
>
>However can you point to a better introductory web page on hash
tables? That's the real test.

I suspect that something in this tangle will turn up worthwhile:
http://www.google.com/search?hl=en&q...to+hash+tables

Personally, I like this:
http://eternallyconfuzzled.com/tuts/...hashtable.aspx
When the twisted minds of Julienne Walker and The EC Team get together
and write a book, I will buy a copy for sure.
It is a good description. The writing is rather dry and my car park metaphor
is more interesting. Also I describe how to do deletions with linear
probing. However it is a very good explanation of hashing, it would be wrong
to say it is not.

The code works on global variables. Whilst this is acceptable for
demonstration code, it also tends to encourage bad programming habits.

This looks like a bug to me

(Quote recommended hash table website)

Insertion and deletion are equally simple relative to linear probing:

void jsw_insert ( void *key, int len )
{
    unsigned h = hash ( key, len ) % N;
    unsigned step;

    for ( step = 1; table[h] != NULL; ++step )
        h = ( h + ( step * step - step ) / 2 ) % N;

    table[h] = key;
}

void jsw_remove ( void *key, int len )
{
    unsigned h = hash ( key, len ) % N;
    unsigned step;

    for ( step = 1; table[h] != NULL; ++step )
        h = ( h + ( step * step - step ) / 2 ) % N;

    table[h] = DELETED;
}

You need to call the comparison function and break if equal. As it is the
code marks a null at the end of the cluster as deleted.
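A corrected loop along those lines would compare keys and stop on a match, so only a genuinely matching slot gets the tombstone. A self-contained sketch (string keys and a caller-supplied hash value to keep it short; `DELETED` is a sentinel of my own, not from the quoted site):

```c
#include <string.h>

#define TSLOTS 8
static const char *DELETED = "<deleted>";  /* tombstone sentinel */

/* Same probe sequence as the quoted jsw_remove, but the slot is only
 * marked when its stored key actually equals the one being removed;
 * the quoted version marks the NULL that terminates the cluster. */
static void jsw_remove_fixed(const char *table[], unsigned h,
                             const char *key)
{
    unsigned step;

    h %= TSLOTS;
    for (step = 1; table[h] != NULL; ++step) {
        if (table[h] != DELETED && strcmp(table[h], key) == 0) {
            table[h] = DELETED;
            return;
        }
        h = (h + (step * step - step) / 2) % TSLOTS;
    }
    /* key not present: nothing to mark */
}
```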

There are always problems. As you say, it depends how you react to them.
There are errors and weaknesses in Basic Algorithms, of course. However you
can tell by the tone that mostly these are used as a pretext for envy.
--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 18 '07 #82
On Aug 17, 10:32 am, CBFalconer <cbfalco...@yahoo.com> wrote:
Richard Harter wrote:
Eric Sosman <esos...@ieee-dot-org.invalid> wrote:
James Dow Allen wrote:
[...]
Speaking of Falconer, his method (linear probing with
second hash value used for probe increment) *requires*
(even more so than quadratic probing, see upthread)
prime table size.

Well, not exactly: It requires only that the
increment between probes be relatively prime to the table
size. A prime table size makes this criterion easy to
satisfy, but it's not the only way. A table size equal
to a power of two, for example, works just fine if the
probe increment is odd. (Degenerate special case: pure
linear probing with an increment of unity works with any
table size at all, prime or composite.)
In a thread in comp.programming (IIRC) Paul Hsieh pointed
out [...]
Yeah it was probably me, since I am unable to find a citation from
anyone else that seems to be aware of this. It's as if people think
there is no need to think past Knuth.
[...] that always using the same probe increment in a
power of two table size also has bad properties. His
solution is to use the unused bits of the full hash to
select an increment size from a table of primes.

That is dangerous. The purpose of the multiple hash is to
have different search patterns for different objects that
have the same primary hash pattern.
No, it's to have a different search pattern for different objects that
have the same primary hash *INDEX* (i.e., post masking). This is
obvious since, for any good hash function, index collisions will be
many many times more common than full hash function collisions.

Mathematically speaking, on average a single 32-bit (good) hash
function collision will occur roughly when 65536 entries appear, while
a hash index collision will occur roughly when the square root of the
table size number of entries occur (so for a 9 million entry table,
that's just 3000 entries).
This avoids clustering, and tends to keep
the search chain short. The very iffy thing in hashlib is
the reduction of the range of the secondary hash, which
encourages light clustering in nearly empty tables.
You are ignoring the fact that skip values of 2, 4 and 6, for example,
for indexes starting on even offsets near each other will tend to
collide with each other as the table fills up.

Same skip-value, and commensurate offset collisions must be minimized
by increasing the chance they miss each other. This happens most
easily if the skip values are predominantly co-prime, which is easiest
to construct when they are simply taken from a list of prime numbers.

Now if we imagine a rough worst-case scenario of about a 17-sequence
longest chain, then that's at most 16 collisions from other chains. A
sample size of 16 from a random set will cause about 1 collision if
the random set is 256 entries in size. So a prime table of 256, 512
or 1024 entries should be fine (notice how this calculation is
independent of table size). And obviously a quick way of getting at
those tables is through the high bits of a full 32 bit hash function.
[...] The
system attempts to maintain tables at between 44 and 88% of
capacity.

Also, I want the table action to be independent of the hash
functions (apart from speed), so the functions are computed
every time needed. The system functions with all the hash
functions replaced by the constant 1 (unity) (tested in the
test suite) at a considerable speed penalty. So the action
is hash function independent, but the speed is highly
dependent.
I don't understand the point you are making. This is an interesting
robustness test that any well defined hash table should be able to
pass. It's a good idea, but independent of what's being discussed.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 18 '07 #83
Malcolm McLean wrote, On 18/08/07 10:58:
"user923005" <dc*****@connx.comwrote in message
news:11**********************@g4g2000hsf.googlegro ups.com...
>On Aug 18, 1:15 am, "Malcolm McLean" <regniz...@btinternet.com> wrote:
[snip]
>>So actually most of the criticism dissolves. I am not saying that
there are
not things you can say against the book, or that it couldn't have
been done
better.

I don't understand how the criticism dissolves. I think that the
major issue is really the reaction to the criticism even more so than
the defects themselves.
The worst possible reaction is "The emperor's new clothes" reaction --
pretending that problems do not exist.
The best possible reaction is "Yes, I see the problem. I'll have it
fixed right away." -- and then actually fixing it right away.
Everyone makes mistakes, no matter how careful we are. But
recognizing our mistakes and correcting them is very, very important.
When creating a reference work, special effort should be made to
ensure correctness on the first pass. And diligent effort to improve
it is also highly important.
I don't see how else I could react.
Had you started reacting by either fixing the valid criticisms or
publishing an errata for them the reaction to your book would not
steadily get worse.
Criticisms have been made, some of
them indisuptable bug reports,
For which you have not published an errata so that your audience know
about them.
some of them very good points,
For which you have not published an errata so that your audience knows
about them.
some of
them defensible but not really good criticisms,
If it is a defensible criticism then it still needs to be dealt with.
some criticisms which
fail to understand the purpose of the code or the book,
It is to present and teach about basic algorithms. As far as I can see
everyone understands that, however as this group is about C rather than
algorithms a significant amount of the criticism has focused on the C
code, and this has been pointed out to you.
and some which
are objectively false.
This is no reason to dismiss all the other criticisms and not publish an
errata for them.
That's what you would expect.
So why does it make you get defensive and start attacking those
criticising it rather than publishing an errata for the things you
accept are either wrong or could be improved?
However there is also a sort of assuption
that the book is no good.
No, it is not an assumption, it is an opinion formed by first reading
the material you present and then seeing your reaction to criticism of it.
In fact I found a bug in Ben Pfaff's circular
buffer code within a couple of minutes, but no one was demanding that
the code be taken down.
You noted in your criticism that it was a failure that you would be
unlikely to hit in the real world, since in the real world you would not
be requesting a buffer large enough to hit the problem, and people did
not attack you for criticising it.
Nor did I report it in a silly way, like "this
implementation exhibits catastrophic and resource-hogging behaviour if
size_t is used beyond the range of a typical signed integer type". I
just pointed it out, as any sensible, experienced person does when
making bug reports.
In my opinion people's reaction to you has become more extreme because
you refuse to accept any criticism of anything you post.
>>However can you point to a better introductory web page on hash
tables? That's the real test.

I suspect that something in this tangle will turn up worthwhile:
http://www.google.com/search?hl=en&q...to+hash+tables

Personally, I like this:
http://eternallyconfuzzled.com/tuts/...hashtable.aspx
When the twisted minds of Julienne Walker and The EC Team get together
and write a book, I will buy a copy for sure.
It is a good description. The writing is rather dry and my car park
metaphor is more interesting. Also I describe how to do deletions with
linear probing. However, it is a very good explanation of hashing; it
would be wrong to say it is not.

The code works on global variables. Whilst this is acceptable for
demonstration code, it also tends to encourage bad programming habits.

This looks like a bug to me
A valid criticism, so when will you fix this particular bug in your
stack and queue implementations? You are the one saying using globals is
a bug, so you cannot now claim it isn't.
(Quote recommended hash table website)

Insertion and deletion are equally simple relative to linear probing:
That is only one of several methods of doing insertion and deletion
presented in the text, and not even the first method of dealing with
clashes that is presented! Before mentioning the linear probing, it even
states there are several methods, and after the linear probing it goes
on to quadratic probing. So your criticism here is misleading at best.
void jsw_insert ( void *key, int len )
{
    unsigned h = hash ( key, len ) % N;
    unsigned step;

    for ( step = 1; table[h] != NULL; ++step )
        h = ( h + ( step * step - step ) / 2 ) % N;

    table[h] = key;
}

void jsw_remove ( void *key, int len )
{
    unsigned h = hash ( key, len ) % N;
    unsigned step;

    for ( step = 1; table[h] != NULL; ++step )
        h = ( h + ( step * step - step ) / 2 ) % N;

    table[h] = DELETED;
}

You need to call the comparison function and break if equal. As it is
the code marks a null at the end of the cluster as deleted.
Immediately after that code it says, "Keep in mind that the insertion
algorithm must also be modified..." so this is clearly presented as NOT
being a complete implementation of linear probing.
There are always problems.
Yes, but at least some of your criticisms of that text look like they
were deliberately trying to make it look bad by, for example, saying
that it uses linear probing whilst ignoring that it does this just after
saying there are several methods and when it goes on to quadratic
immediately after.
As you say, it depends how you react to them.
Also how you react to the criticisms and whether they are valid. You
have just presented invalid criticisms in an apparent attempt to make
your work look better.
There are errors and weaknesses in Basic Algorithms, of course.
Yet you don't publish an errata for those you know about and have
admitted.
However
you can tell by the tone that mostly these are used as a pretext for envy.
That is your opinion of people's motives, but it certainly is not my
motive and I don't believe it is the motive of the other criticisms. I
just don't like bad code being presented or bad advice.
--
Flash Gordon
Aug 18 '07 #84
Flash Gordon wrote:
>
Malcolm McLean wrote, On 18/08/07 10:58:
"user923005" <dc*****@connx.comwrote in message
In fact I found a bug in Ben Pfaff's circular
buffer code within a couple of minutes,
but no one was demanding that the code be taken down.

You noted in your criticism that it was a failure that you would be
unlikely to hit in the real world,
since in the real world you would not
be requesting a buffer large enough to hit the problem, and people did
not attack you for criticising it.
Nor did I report it in a silly way, like "this
implementation exhibits catastrophic
and resource-hogging behaviour if
size_t is used beyond the range of a typical signed integer type". I
just pointed it out, as any sensible, experienced person does when
making bug reports.
I took a look at that,
but I don't hash and I didn't understand the code.

What was it that was being counted
by the size_t type object?

If it was the number of allocations,
then size_t is the wrong type.
The concept of what size_t is supposed to be,
is not related to how many allocated
objects can exist at the same time.
There's nothing in the standard which suggests
that there should be a type which is capable of holding
the number of allocated objects that can exist at one time.
The best type for counting allocations, is the largest unsigned type.

Or was it something else that was being counted
by the size_t type in Ben's circular buffer code?

--
pete
Aug 18 '07 #85
"Malcolm McLean" <re*******@btinternet.comwrites:
"user923005" <dc*****@connx.comwrote in message
news:11**********************@g4g2000hsf.googlegro ups.com...
<snip>
>Personally, I like this:
http://eternallyconfuzzled.com/tuts/...hashtable.aspx
<snip>
It is a good description. The writing is rather dry and my car park
metaphor is more interesting. Also I describe how to do deletions with
linear probing. However it is a very good explanation of hashing, it
would be wrong to say it is not.

The code works on global variables. Whilst this is acceptable for
demonstration code, it also tends to encourage bad programming
habits.
How are we to read this? You say it is acceptable for demonstration
code (which this clearly is) but you seem to want to tack on a
criticism, just in case. You can't have it both ways. You do this for
your stack and your queue.
However you can tell by the tone that mostly these are used as
a pretext for envy.
If the cap fits...

--
Ben.
Aug 18 '07 #86
Malcolm McLean said (in reply to me):

<snip>
Corrrections are being stored up for an edition 2, which will be
released shortly.
I hope you will forgive us for continuing to breathe in the interim.
>MM's A123 XYZ / B123 CDE example is wrong, in that it suggests
checking Slot 134. If that's empty, MM says, B123 CDE goes in Slot
124. I can see no justification whatsoever for checking slot (S + 10)
before deciding whether to use Slot S.
[no comment from MM]

You appear to have ignored this item.
>I was puzzled to find that hash table entries were not abstracted as
the table itself was. "Divide and conquer" is too useful a strategy
to be ignored lightly, especially in teaching.
There are two tables. One stores data items of any size, one stores
pointers.
Okay, fair point - let's try that one again, editing as appropriate:

I was puzzled to find that hash table entries were not abstracted as the
tables themselves were. "Divide and conquer" is too useful a strategy
to be ignored lightly, especially in teaching.
>Oddly, MM seems to be using int rather than size_t for recording
lengths and sizes (neither of which can be negative). Presumably this
is a typo, since no experienced C programmer would deliberately be
that daft.
It's an algorithms book. The algorithms work on integers.
I think you'll find that hashing algorithms generally work on unsigned
integers.
The hash
function then has to return a size_t, which is confusing to people
using the book to understand how to port to another language.
Fine, so use unsigned long int instead. Anyone can work out what that
means, and it's cross-portable between C90 and C99.
>His malloc is not ideal - answer = malloc(sizeof *answer); would be
preferable. He repeats this mistake later in the (poorly named)
hashtable() function.
Too hard to read.
No, it isn't.
malloc( sizeof *ptr) is a bad clc ism
No, it's a good clc-ism.
that you will seldom see in another environment.
Right. As Sturgeon says, at least 90% of everything is crud, so you'd
expect to find malloc(sizeof *ptr) in only 10% of cases (or fewer).
That doesn't mean it's bad. It just means most people haven't come
across it yet.
The reader has to perform a mental dereference which makes it harder
to see if the size is correct.
If the reader has to perform a mental dereference, he (or she) doesn't
understand sizeof. No dereferencing occurs, because sizeof does not
evaluate its argument. Furthermore, the size is *bound* to be correct
for any object pointer type.
>Good functions perform a task. They are doers. And they need "doing
words" (which we sometimes call "verbs") in their names. For example,
stdio's "remove" and "rename" and "fopen" and "printf" - all
reasonably good names (given their creator's habit of terseness).
My convention is to have an opaque structure, and a constructor with
the name of the structure in lower case. This is analogous to C++. As
you say, create_xxx would have the advantage of being a verb.
Better: xxx_create or xxx_Create or something like that.
>The indentation is appalling, and led me to believe at first that
hashtable() always returns 0. I had to double back to check. Blech.
Read the actual book - preview is available.
I was reading the preview (presumably). Previews allow one to assess the
quality of the book prior to deciding whether to purchase it. I have no
intention of paying money for a book whose preview is of such low
quality.
The html is dumped from
the Open Office book files, unfortunately it doesn't understand code
formatting. I really ought to be able to find a way around that, I
know.
Yes, you ought to, really. I do sympathise about formatting issues -
deciding on a file format in the first place is no easy decision, since
at that stage it's quite probable that you don't know how you're going
to market the book, and you don't know whether other companies will be
involved - will they demand Word format? LaTeX? HTML? Impossible to
guess.

Nevertheless, if I can produce properly formatted code in HTML with
ease, I'm sure you can too, if you try. (And yes, I can.)
>There's a reference to a strdup call, but as far as I can see there
is neither a definition of a strdup function nor indeed a call to
one. There is a call to a mystrdup function, but that isn't the same
thing. (Nor is it defined, so we have no idea what it does.)
strdup() is discussed in chapter 1.
So the reference in the hash table chapter is to the strdup discussed in
chapter 1? Okay, fine - but *why*? After all, as I have already pointed
out above, the hash table code doesn't call strdup. It calls mystrdup.
These are different names.
>>
I wonder what "menas" means.
typo.
Fixed yet?
>>
MM seems to be thinking that "quadratic probing" means "square the
hash value" (and that's a direct quote). Either he's misunderstood
quadratic probing completely, or I have. As far as I'm concerned,
quadratic probing has nothing whatsoever to do with the hash value
except as a place to start looking for a slot. My understanding is
that, if the hash value h has so far resulted in n collisions in a
table of size s, then the place to look next is slot (h + n*n) % s -
and most certainly NOT (h * h) % s!
I'll have to check on this one.
You're supposed to check on the basics of how basic algorithms work
*before* attempting to write a book about them. You have done your best
to deflect many criticisms about the quality of the code and the typos
and design decisions and so on, by saying that that's not what the book
is about - it's about explaining basic algorithms. Well, *this*
criticism drives a coach and horses through your strategy, because it
shows that you don't actually understand the algorithm you're trying to
teach.
>At this point, I started to lose interest, since it seems evident
that the chapter is so badly broken as to be beyond recovery - and I
started to skim. At random, I discovered an invasion of
implementation namespace, a correct claim that the primality test was
interesting (but alas, it is interesting for all the wrong reasons),
an incorrect claim that trial division is the only guaranteed way to
determine whether a number is prime, and a prime number function that
wrongly decides 2 is not a prime number.

Overall, I found the chapter to be of disappointingly low quality,
and I could not possibly recommend it. If that chapter is the
advertisement, the product itself must be dire indeed.
You seem determined to pick faults.
If I were determined to pick faults, I would have gone over the chapter
with a fine-toothed comb. No, Malcolm - these faults jumped off the
screen and spat in my eye.
The problem is it then becomes
distorting, like Ben Bacarisse's comments. It is then impossible to
know whether the criticisms actually represent anything legitimate or
are just fussing.
If a criticism that you don't understand the very algorithm you are
explaining is dismissed as "just fussing", then I see no hope for you.
A technical author must not just expect criticism, but be able to
analyse it, determine whether the criticism is proper (and as far as I
can tell, most and quite possibly all of the criticisms levelled at the
chapter are in fact proper and valid), and - wherever it *is* proper -
be prepared to take reasonable and prompt corrective action.
abstracted - is an entry of arbitrary size abstract enough? The keys,
yes, I admit it might have been better to go for non-strings.
I lack the perseverance to explain this to you. Let's see whether you
bother to correct any of the myriad more serious problems first.
size_t - yes, there is an obvious case, but you don't seem to have
considered the implications.
The obvious case is so powerful that the implications would have to be
drastic - and they aren't. You have tried hard to explain them to
several experts here, and convinced none of them. It may not have
occurred to you, but your inability to convince them might just be a
consequence of your arguments being wafer-thin.
malloc(sizeof *ptr) - that's not the convention in most places. The
argument for it is based on the needs of the compiler rather than the
human reader.
On the contrary. The compiler doesn't give a damn how much you malloc.
The clc-conforming style is based on the needs of the programmer, for
whom it gives a simple and easy-to-use pattern.
verbs for function names. There is a case, but it is not obvious that
you are right. C++ doesn't agree.
C++ is irrelevant in comp.lang.c, and I await with interest your sound
logical reasons for giving a function a noun name.
indentations. Yes, it's my fault for not getting a decent HTML dumper.
If I edit the html

strdup - discussed in chapter 1.

menas - unambiguously valid criticism.
And the least important of many valid criticisms.
>
quadratic probing - you might be right here.
YOU SHOULD KNOW! Otherwise you have no business taking money off people
in return for teaching them how quadratic probing works! Okay, some
mistakes are inevitable - typos and stuff, even brainos - they happen.
But this is fundamental!
>
prime stuff - yes, that needs a bit of fixing, I will agree.
Indeed it does.
>
So actually most of the criticism dissolves.
No, it doesn't. You can't dissolve criticism by adding acid. You have
four choices:

1) pretend the criticisms are invalid - which fools nobody but yourself.
2) accept that they are valid, but do nothing to fix them.
3) accept that they are valid, and fix them.
4) show that they are not valid.

4) is best, but you haven't managed it here. 3) is next best, and
certainly good enough; 2) is not good, but at least it's honest; 1) is
a fake and a farce, and it seems to be the option you're choosing.
I am not saying that
there are not things you can say against the book, or that it couldn't
have been done better.
Just as well, since I've already said them, and it could. And should.
However can you point to a better introductory
web page on hash tables? That's the real test.
I think I'd struggle to find a worse one.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 18 '07 #87
Malcolm McLean said:
"user923005" <dc*****@connx.comwrote in message
news:11**********************@g4g2000hsf.googlegro ups.com...
<snip>
>When creating a reference work, special effort should be made to
ensure correctness on the first pass. And diligent effort to improve
it is also highly important.
I don't see how else I could react.
By fixing the chapter.
Criticisms have been made, some of
them indisputable bug reports, some of them very good points, some of
them defensible but not really good criticisms, some criticisms which
fail to understand the purpose of the code or the book, and some which
are objectively false.
I think you have failed to understand the purpose of the criticisms.
That's what you would expect. However there is also a sort of
assumption that the book is no good.
It seems to be a reasonable assumption, based on what I've seen of the
book.
In fact I found a bug in Ben
Pfaff's circular buffer code within a couple of minutes, but no one
was demanding that the code be taken down.
Did he agree it was a bug? If not, did he explain why? If he did agree,
did he fix the problem and publish the new code in place of the old
code? These are acceptable reactions to bug reports.

<snip>
It is a good description. The writing is rather dry and my car park
metaphor is more interesting. Also I describe how to do deletions with
linear probing. However it is a very good explanation of hashing, it
would be wrong to say it is not.
No, it isn't a good explanation, and no, it isn't wrong to say so. That
isn't how people park cars, for one thing. For another, you don't
understand the algorithm you're trying to explain.
The code works on global variables. Whilst this is acceptable for
demonstration code,
No, it isn't...
it also tends to encourage bad programming habits.
....and that's why.

<snip>
There are always problems. As you say, it depends how you react to
them. There are errors and weaknesses in Basic Algorithms, of course.
However you can tell by the tone that mostly these are used as a
pretext for envy.
Pfui. What, precisely, are we supposed to be envying? The broken code,
the broken design, or the broken explanation?

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Aug 18 '07 #88
Richard Heathfield <rj*@see.sig.invalidwrites:
(and as far as I
can tell, most and quite possibly all of the criticisms levelled at the
chapter are in fact proper and valid)
In the interest of scrupulous honesty, I made one erroneous
criticism -- which I have acknowledged as such. The hash table
*relies* on the fixed allocator running out of memory (OK, that then
threw up another bug, but that is beside the point). I did not spot
this linkage, and complained about what might happen as the table
fills up. The table can't fill up. Instead, the allocator returns
NULL but, unfortunately, dereferences it first.

--
Ben.
Aug 18 '07 #89
"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
That is a bit more useful than "the code will not compile under C99
and not link under C89, therefore the book is shoddily written",
because prototypes aren't included, sort of comment I get most of the
time.
[...]

Omitting required '#include' directives is one of the most common
errors made by inexperienced C programmers. Reinforcing this error is
not a good idea. If you present code that depends on a predefined
header, you should present the required '#include' directive as well.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 18 '07 #90
On Aug 17, 5:43 am, Eric Sosman <esos...@ieee-dot-org.invalidwrote:
James Dow Allen wrote:
[...]
Speaking of Falconer, his method (linear probing
with second hash value used for probe increment)
*requires* (even more so than quadratic probing,
see upthread) prime table size.

Well, not exactly: It requires only that the increment
between probes be relatively prime to the table size. A prime
table size makes this criterion easy to satisfy, but it's not
the only way. A table size equal to a power of two, for example,
works just fine if the probe increment is odd. (Degenerate
special case: pure linear probing with an increment of unity
works with any table size at all, prime or composite.)

Some people seem very interested in the relative speed
of AND vs MOD for this purpose, and on most machines AND
is faster or at the very least no slower.
This would include anyone who actually believes in mantras such as
"measure measure measure". I'm not aware of any architecture for
which this difference in speed is not at least 10 times in favor of
AND.

Oh and if there are any down sides to using power of 2 tables I am
unaware of them. Using prime sized tables ... it just looks like a
wrong decision on its face.
[...] I think, though,
that the time savings must be viewed as just one component
of the search cost -- another piece (sometimes rather large)
is computing the key's hash value in the first place, not
to mention the cost of equality comparisons on the table
items encountered along the way.
It's not the largest cost, but my timings show that for certain
simplistic scenarios (using a hash table to do word counts for a text
word sequence) it has a definite measurable secondary effect.
(Slight digression): It's often suggested that those
equality comparisons can be made cheaper by storing hash
codes in the table entries along with the payload. Before
making a possibly expensive strcmp() or whatever, one can
first do a (fast) arithmetic comparison of the hashes: if
the hashes don't match, the objects must be unequal and
it is unnecessary to make the expensive comparison. But this
stratagem doesn't seem to me to be a "Well, doh!" proposition;
at least two counter-arguments exist:

- If the table stores pointers to items rather than item
instances, storing hash values in the table roughly
doubles the amount of memory required. If the same
amount of memory were used for a pointers-only table,
the number of table slots would double and the load
factor would be halved. This should lead to shorter
searches -- and it is not at all clear _a priori_ that
more fast probes will outrace fewer slow probes.
Ridiculous. If you have n items, then it's the equivalent of
increasing your memory consumption by n pointers (or perhaps ~ 2.5*n
if you want to harp on the typical worst case of average scenario
consumption). But you already know that you have memory for n items
in the first place. This complaint is only valid if you think adding
an extra pointer per item is an unacceptable memory cost. This is
unlikely to be the case for most types of items that you would care to
store in a hash table.
- Storing the hash codes saves time whenever a probe hits
a slot occupied by the "wrong" key (it saves no time when
the probe finds the "right" key). If the time savings is
significant, it must mean that the key comparisons are
very expensive or that the probe sequences are very long.
If we've taken care to construct good hash functions and
have reason to believe that searches take two-point-mumble
probes, on the average, the savings is only one-point-mumble
comparisons per search.
For medium complexity items (such as a string), a properly implemented
hash table has absolutely no perceivable overhead on top of the hash
computation and comparison functions.

So if you go by a table doubling resizing strategy, and you tend not
to delete a lot, then on at least some scenarios I found the average
seek length to be about 1.4 slots per item. By not having a stored
hash value, you simply make 40% more comparison calls. This is a huge
cost for any well designed hash table, of non-trivial items.
As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"
This comment is so ironic coming from you. All my claims are
empirically backed, yet you are just repeating these same false claims
you've always made, which of course, you never would if you ever
bothered to measure any of this stuff for yourself.

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Aug 18 '07 #91
"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
There are always problems. As you say, it depends how you react to them.
There are errors and weaknesses in Basic Algorithms, of
course. However you can tell by the tone that mostly these are used as
a pretext for envy.
Envy?? That's completely delusional. Nobody here is envious of you
or your book. Why on Earth would we be?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 18 '07 #92
we******@gmail.com wrote:
On Aug 17, 5:43 am, Eric Sosman <esos...@ieee-dot-org.invalidwrote:
> As always when one is interested in speed, the gold standard
is the familiar "Measure, measure, measure!"

This comment is so ironic coming from you. All my claims are
empirically backed, yet you are just repeating these same false claims
you've always made, which of course, you never would if you ever
bothered to measure any of this stuff for yourself.
Thank you for informing me what I do and don't measure.
I'm always glad to hear from someone who knows more about
my activities than I do myself.

--
Eric Sosman
es*****@ieee-dot-org.invalid
Aug 18 '07 #93

"Keith Thompson" <ks***@mib.orgwrote in message
news:ln************@nuthaus.mib.org...
"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
>That is a bit more useful than "the code will not compile under C99
and not link under C89, therefore the book is shoddily written",
because prototypes aren't included, sort of comment I get most of the
time.
[...]

Omitting required '#include' directives is one of the most common
errors made by inexperienced C programmers. Reinforcing this error is
not a good idea. If you present code that depends on a predefined
header, you should present the required '#include' directive as well.
The problem is the code is interspersed with the text. It might be an idea
to #include everything I use in a chapter right at the beginning, however.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 18 '07 #94

"Keith Thompson" <ks***@mib.orgwrote in message
news:ln************@nuthaus.mib.org...
"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
>There are always problems. As you say, it depends how you react to them.
There are errors and weaknesses in Basic Algorithms, of
course. However you can tell by the tone that mostly these are used as
a pretext for envy.

Envy?? That's completely delusional. Nobody here is envious of you
or your book. Why on Earth would we be?
I don't include you in that.

--
Free games and programming goodies.
http://www.personal.leeds.ac.uk/~bgy1mm

Aug 18 '07 #95
we******@gmail.com wrote:
This would include anyone who actually believes in mantras such as
"measure measure measure". I'm not aware of any architecture for
which this difference is speed is not at least 10 times in favor of
AND.
The question is not whether AND is faster than MOD; it's whether
hash-tables implemented with AND as the reduction method are faster
than broadly-similar hashtables using MOD.

If I recall correctly, MOD is said to be better than AND because
more of the bits of the hash contribute to the index. If the hash
function is really good, this won't matter -- but we don't always
get to use "really good" hash functions.

I'm not saying that tables-using-AND are no faster than
tables-using-MOD; I'm saying that "AND is faster than MOD" doesn't
justify tuA faster than tuM.

--
Measure For Measure Hedgehog
"Never ask that question!" Ambassador Kosh, /Babylon 5/

Aug 18 '07 #96
"Malcolm McLean" <re*******@btinternet.comwrites:
"Keith Thompson" <ks***@mib.orgwrote in message
news:ln************@nuthaus.mib.org...
>"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
>>That is a bit more useful than "the code will not compile under C99
and not link under C89, therefore the book is shoddily written",
because prototypes aren't included, sort of comment I get most of the
time.
[...]

Omitting required '#include' directives is one of the most common
errors made by inexperienced C programmers. Reinforcing this error is
not a good idea. If you present code that depends on a predefined
header, you should present the required '#include' directive as well.
The problem is the code is interspersed with the text. It might be an
idea to #include everything I use in a chapter right at the beginning,
however.
Or you show the required '#include' directives at the top of each code
sample. For example:

#include <stdio.h>
...
void hello(void)
{
printf("Hello, world\n");
}

where the "..." indicates that the directive needs to be there, but it
doesn't necessarily immediately precede the function definition. You
could explain this convention in the text.

Adding 2 or 3 lines to each code sample doesn't seem excessive.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 18 '07 #97
"Malcolm McLean" <re*******@btinternet.comwrites:
"Keith Thompson" <ks***@mib.orgwrote in message
news:ln************@nuthaus.mib.org...
>"Malcolm McLean" <re*******@btinternet.comwrites:
[...]
>>There are always problems. As you say, it depends how you react to them.
There are errors and weaknesses in Basic Algorithms, of
course. However you can tell by the tone that mostly these are used as
a pretext for envy.

Envy?? That's completely delusional. Nobody here is envious of you
or your book. Why on Earth would we be?
I don't include you in that.
That's a good start. Don't include anyone else, and you'll be
approaching the truth.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <* <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Aug 18 '07 #98
Flash Gordon wrote:
Malcolm McLean wrote, On 18/08/07 10:58:
.... snip ...
>
>Criticisms have been made, some of them indisuptable bug reports,

For which you have not published an errata so that your audience
know about them.
For contrast, look at my hashlib. It was published complete with
the test suite and test programs that evaluated it during and after
development. I have had two and a half bug reports since the
initial publication, and no more for years.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Aug 18 '07 #99
we******@gmail.com wrote:
CBFalconer <cbfalco...@yahoo.comwrote:
>Richard Harter wrote:
.... snip ...
>
>>[...] that always using the same probe increment in a
power of two table size also has bad properties. His
solution is to use the unused bits of the full hash to
select an increment size from a table of primes.

That is dangerous. The purpose of the multiple hash is to
have different search patterns for different objects that
have the same primary hash pattern.

No, it's to have a different search pattern for different objects
that have the same primary hash *INDEX* (i.e., post-masking).
This is obvious since, for any good hash function, index
collisions will be many times more common than full hash
function collisions.

Mathematically speaking, on average a single 32-bit (good) hash
function collision will occur once roughly 65536 entries appear,
while a hash index collision will occur once the number of entries
reaches roughly the square root of the table size (so for a
9-million-slot table, that's just 3000 entries).
No, you are ignoring the size of the hash table itself. For
example, the starting size for hashlib is 17, so no more than 17
hash values (modulo 17) are useful. This changes as the table
enlarges itself.
>
>This avoids clustering, and tends to keep
the search chain short. The very iffy thing in hashlib is
the reduction of the range of the secondary hash, which
encourages light clustering in nearly empty tables.

You are ignoring the fact that skip values of 2, 4 and 6, for
example, for indexes starting on even offsets near each other
will tend to collide with each other as the table fills up.

Collisions between entries with the same skip value, or with
commensurate offsets, must be minimized by increasing the chance
that their probe sequences miss each other. This happens most
easily if the skip values are predominantly co-prime, which is
easiest to arrange by simply taking them from a list of prime
numbers.
Again, you are ignoring the effect of varying table size.

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>


Aug 18 '07 #100

This discussion thread is closed

Replies have been disabled for this discussion.

By using this site, you agree to our Privacy Policy and Terms of Use.