Benchmarks

Keith Thompson wrote:

You represent your set of words as a linked list. You compare each
new word to every word already in the set. The C++ solution uses a
std::set which, if I recall correctly, can do searches and insertions
in O(n log n).

That would be worse than linear-time, and is of course false.

You meant: O(lg n).

Nov 6 '08 #6

s0****@gmail.com wrote:

Any ideas as to what causes the big slowdown?

Why do you expect that searching a linked list could be even close to
the speed of searching a balanced binary tree?

Nov 6 '08 #7

s0suk3

On Nov 6, 11:22*am, Obnoxious User <OU@127.0.0.1wrote:

On Thu, 06 Nov 2008 07:53:18 -0800, s0suk3 wrote:
The task: Write a program that reads a set of words from standard input
and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task:http://unthought.net/c++/c_vs_c++.html(ignore all the language
flaming :-), and thought that all of them did unnecessary operations, so
I wrote my own. But for some reason, my version turned out slower that
ALL of the versions in the website, even though it seems to perform less
operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>

int main(int argc, char **argv)
{
* * * * // Declare and Initialize some variables std::string word;
* * * * std::set<std::stringwordcount;
* * * * // Read words and insert in rb-tree
* * * * while (std::cin >word) wordcount.insert(word); // Print the
* * * * result
* * * * std::cout << "Words: " << wordcount.size() << std::endl; return
* * * * 0;
}

My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:

[snip]

So, since your version uses "lower-level constructs" you assume
it would be automatically faster?

Well, in this case it did seem to have a couple of advantages, like
for example, InsertWord() is given a pointer directly to the malloc()-
allocated string, which it simply copies into the .word member of the
struct; it doesn't need to allocate new memory and copy the string
from one place to the other, whereas std::set::insert() does need to
do make a copy.

Also, I'm not sure, but is the set's destructor invoked when main()
goes out of scope (causing memory cleanup)? (Currently the other
version has the DeleteSet() call commented-out.)

But, as others have pointed out, it's clear that the reason for the
difference in performance is the searching mechanism.

Sebastian

Nov 6 '08 #8

Pawel Dziepak

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

s0****@gmail.com wrote:

Well, in this case it did seem to have a couple of advantages, like
for example, InsertWord() is given a pointer directly to the malloc()-
allocated string, which it simply copies into the .word member of the
struct; it doesn't need to allocate new memory and copy the string
from one place to the other, whereas std::set::insert() does need to
do make a copy.

Also, I'm not sure, but is the set's destructor invoked when main()
goes out of scope (causing memory cleanup)? (Currently the other
version has the DeleteSet() call commented-out.)

But, as others have pointed out, it's clear that the reason for the
difference in performance is the searching mechanism.

Indeed it is. All that low-level stuff is faster, but it means nothing
when the problem is an algorithm and/or used data structure.

Pawel Dziepak
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iEYEARECAAYFAkkTNCAACgkQPFW+cUiIHNo9IQCgr6NC76/yXnouBhUYLn3fx3Rn
2HQAn3h19yS62OZqMv+jo0wSsr6M576O
=WTrr
-----END PGP SIGNATURE-----

Nov 6 '08 #9

Keith Thompson

Juha Nieminen <no****@thanks.invalidwrites:

Keith Thompson wrote:
>You represent your set of words as a linked list. You compare each
new word to every word already in the set. The C++ solution uses a
std::set which, if I recall correctly, can do searches and insertions
in O(n log n).

That would be worse than linear-time, and is of course false.

You meant: O(lg n).

Oops, you're correct of course; thanks for the correction.

Actually I meant O(log n), but it's the same thing.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Nov 6 '08 #10

Keith Thompson

Longjun <Lo***********@gmail.comwrites:

On Nov 6, 11:53*pm, s0s...@gmail.com wrote:
>The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task:http://unthought.net/c++/c_vs_c++.html(ignore all the language
flaming :-), and thought that all of them did unnecessary operations,
so I wrote my own. But for some reason, my version turned out slower
that ALL of the versions in the website, even though it seems to
perform less operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>

[...]

>>
My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

[...]

>>
Any ideas as to what causes the big slowdown?

Noticed that you've implemented your own mechanism of scanning words
from standard input and insert a new elements in your "sets". I don't
know why you implement it by yourself. Are you clear the principle of
the class cin/cout and set? Are you sure that your own function have a
better performance to the standard one?

This discussion is cross-posted to comp.lang.c and comp.lang.c++. His
solution is written in C, which doesn't have sets built into the
standard library.

But certainly the equivalent functionality could be written in C.

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Nov 6 '08 #11

On Nov 6, 7:53*am, s0s...@gmail.com wrote:

The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

[snip]

I am surprised chuck has not chimed in here.
A hash table is *ideal* for this.

P.S. to those analyzing performance...
Since we have to examine every word in the list, the performance of
the algorithm *cannot* be faster than O(n).
The hash table solution is O(n).

Using a btree, skiplist, avltree, etc. will be O(n log n) because:
For each word, we must collect it.
For this word, we must check for duplicity. With a hash table the
check is O(1). With a logarithmic search structure, the check is
O(log n). (There is a multiplicative constant less than 1, but that
does not alter the O(log n) behavior.
Hence: O(log n) * O(n) = O(n log n) for most ordered list variants.

There is another structure that would be competitive. I guess that a
ternary search tree might beat a hash table just because of the
excellence in memory access pattern. At least for lists of less than
a million items (and it would be hard to come up with more than a
million correctly spelled real words).
http://www.cs.princeton.edu/~rs/strings/

Nov 6 '08 #12

user923005 wrote:

The hash table solution is O(n).

You would have hard time proving that.

Nov 6 '08 #13

On Nov 6, 1:21*pm, Juha Nieminen <nos...@thanks.invalidwrote:

user923005 wrote:
The hash table solution is O(n).

* You would have hard time proving that.

Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

My solution would also do the counting at insert time (IOW, the hash
insert function will return 0 if the item is already in the table and
return 1 if the item was inserted).
In that way, there is no need to scan the table and you can make it as
large as you like.

IOW:

unsigned long long count = 0;

while (item = fgets(string, sizeof string, stdin))
count += cuckoo_hash_insert(item);

Nov 6 '08 #14

Paul Hsieh

On Nov 6, 1:51*pm, c...@tiac.net (Richard Harter) wrote:

On Thu, 6 Nov 2008 12:53:55 -0800 (PST), user923005
<dcor...@connx.comwrote:
On Nov 6, 7:53=A0am, s0s...@gmail.com wrote:
The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.
[snip]

[...] Using a btree, skiplist, avltree, etc. will be O(n log n) because:[...]

hash table * * *- O(log n)
comparison tree - O((log n)^2)
radix trees * * - O(log n)

[etc]

I don't have any idea what anyone here is talking about. This is
clearly a "trie" problem. The performance is O(n), where n is the
length of the input (in characters). If your performance is any
different from that your implementation is just wrong. Here is my
solution, which is simpler/shorter than anything given so far or on
that website:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct trieNode {
int canTerminateHere;
struct trieNode * letter[26];
};

static int * insertTrie (struct trieNode * tn, const char * w) {
if ('\0' == *w) return &tn->canTerminateHere;
if (NULL == tn->letter[*w-'a']) {
if (NULL == (tn->letter[*w-'a'] = (struct trieNode *) calloc
(1, sizeof (struct trieNode)))) return NULL;
}
return insertTrie (tn->letter[*w-'a'], w+1);
}

int main () {
struct trieNode start = {0};
char buff[2048], *s, *t;
int count = 0, *ret;
while (buff == fgets (buff, 2048, stdin)) {
for (t = buff; *t;) {
s = t + strspn (t, "abcdefghijklmnopqrstuvwxyz");
if (s != t) {
char c = *s;
*s = '\0';
if (NULL == (ret = insertTrie (&start, t))) exit (-1);
*s = c;
count += 1 ^ *ret;
*ret = 1;
}
t = s + strcspn (s, "abcdefghijklmnopqrstuvwxyz");
}
}
printf ("Count: %d\n", count);
return 0;
}

This makes the assumption that all inputs are continguous words in
lines no longer than 2048 characters separated by white space or line
feeds or whatever. It also assumes that you have enough memory to
hold all the words input, at a rate of (26 * sizeof (void *) + sizeof
(int)) * (size of the dictionary of the input in characters), roughly
speaking. The program doesn't clean up after itself, but I don't
think that was in the requirements.

As for the *real* performance of this thing, it will come down to the
calloc() speed. I could waste a lot more lines of code to massively
improve the performance here. It might be worth it if, say, the
benchmark were trying to measure performance per line of code. So the
score would be, say, lines of code times time taken to output the
correct count or something, and it would then probably be worth it to
implement a calloc pooler (it would double the size at least, but
probably make the performance at least 3 times higher, if the words
had a high enough uniqueness rate).

--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/

Nov 7 '08 #15

Nick Keighley

On 6 Nov, 15:53, s0s...@gmail.com wrote:

The task: Write a program that reads a set of words from standard
input and prints the number of distinct words.

I came across a website that listed a few programs to accomplish this
task:http://unthought.net/c++/c_vs_c++.html(ignore all the language
flaming :-), and thought that all of them did unnecessary operations,
so I wrote my own. But for some reason, my version turned out slower
that ALL of the versions in the website, even though it seems to
perform less operations (yes, I benchmarked them on my own computer).

According to the website, the slowest version is:

#include <set>
#include <string>
#include <iostream>

int main(int argc, char **argv)
{
* * * * // Declare and Initialize some variables
* * * * std::string word;
* * * * std::set<std::stringwordcount;
* * * * // Read words and insert in rb-tree
* * * * while (std::cin >word) wordcount.insert(word);
* * * * // Print the result
* * * * std::cout << "Words: " << wordcount.size() << std::endl;
* * * * return 0;

}

the above uses an rb tree (or equivalent) in the set class

My version is about 12 times slower than that. It uses lower-level
constructs. Here it is:

[snip version using linear search]

Any ideas as to what causes the big slowdown?

this is a very important lesson. Print it out in big letters
and post it on your wall.

CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

--
Nick Keighley

Nov 7 '08 #16

Juha Nieminen wrote:

s0****@gmail.com wrote:

>Any ideas as to what causes the big slowdown?

Why do you expect that searching a linked list could be even
close to the speed of searching a balanced binary tree?

Which is O(log N), as compared to O(N). However a hashtable is
even faster, being O(1) for suitable organization.

F'ups set to c.l.c. only.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

Nov 7 '08 #17

user923005 wrote:

s0s...@gmail.com wrote:

>The task: Write a program that reads a set of words from
standard input and prints the number of distinct words.

[snip]

I am surprised chuck has not chimed in here.
A hash table is *ideal* for this.

I just did, a few minutes ago. Been having problems with the
news-server.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

Nov 7 '08 #18

CBFalconer wrote:

Juha Nieminen wrote:
>s0****@gmail.com wrote:

>>Any ideas as to what causes the big slowdown?
Why do you expect that searching a linked list could be even
close to the speed of searching a balanced binary tree?

Which is O(log N), as compared to O(N). However a hashtable is
even faster, being O(1) for suitable organization.

F'ups set to c.l.c. only.

Why? Because computational complexities only apply to C?

Nov 7 '08 #19

Juha Nieminen <no****@thanks.invalidwrites:

user923005 wrote:
>The hash table solution is O(n).

You would have hard time proving that.

He shouldn't. It should be pretty clear. No hash table needs
to examine each element more than once.

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic

Nov 7 '08 #20

Nick Keighley wrote:

CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

That's so true. However, we shouldn't completely dismiss
"micro-optimization" and "hacker-optimization" once we have come
up with the fastest possible overall algorithm. Even though two
implementations might both by eg. O(n lg n) with a rather small
constant factor, one of them might still be 10 times faster than
the other if it performs clever low-level tricks inside the
innermost loop (which might be run eg. millions of times).

Of course when dealing with hacker optimization, there's always
a point where it's too detrimental to the readability and
maintainability of the code in relation to the speed benefit it
gives. If the code becomes significantly more obfuscated while
giving a speed boost of 1%, it might not be worth the trouble.
Personally I prefer to write 100 lines of clear, readable and
maintainable code, even if it's 1% slower than 1000 lines of code
worthy of the IOCCC.

Nov 7 '08 #21

Phil Carmody <th*****************@yahoo.co.ukwrites:

Juha Nieminen <no****@thanks.invalidwrites:
>user923005 wrote:
>>The hash table solution is O(n).

You would have hard time proving that.

He shouldn't. It should be pretty clear. No hash table needs
to examine each element more than once.

Oooops, suffering from Falconeritis - but you did over-snip rather.
The whole task is indeed o(n^2), but omega(n).

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic

Nov 7 '08 #22

Hendrik Schober

Juha Nieminen wrote:

[...]
Personally I prefer to write 100 lines of clear, readable and
maintainable code, even if it's 1% slower than 1000 lines of code
worthy of the IOCCC.

It's always easier to make a bug-free program fast than to
bug-fix a fast program. :)

Schobi

Nov 7 '08 #23

user923005 wrote:

Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

Wikipedia says: "Lookup requires inspection of just two locations in
the hash table, which takes constant time in the worst case".

I don't understand how that is possible. Suppose an element is
inserted into the table. Then later a new element displaces this first
element to a second location. Then later another new element, to be put
in this second location, displaces this first element to a third
location. How exactly do you find this first element anymore? It's not
in either of the two alternative locations defined by the two hashing
functions for that element.

Nov 7 '08 #24

James Kanze

On Nov 7, 10:25 am, Juha Nieminen <nos...@thanks.invalidwrote:

Nick Keighley wrote:
CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

That's so true. However, we shouldn't completely dismiss
"micro-optimization" and "hacker-optimization" once we have
come up with the fastest possible overall algorithm.

Yes, but this should only be undertaken when necessary, under
the control of a profiler---what optimizes on one platform may
slow down on another. And don't forget that a good compiler
will do many of these optimizations for you. A long time ago, I
tried replacing "h = 127 * h + *p" (in a hashing algorithm) with
"h = (h << 7 - h) + *p", knowing that my hardware didn't have
hardware multiply. The results were slower than the original;
the compiler also converted the multiplication into a shift and
subtract, and since it knew the final purpose in those
operations, was actually able to save one machine instruction
over an explicit shift and subtract.

Even though two implementations might both by eg. O(n lg n)
with a rather small constant factor, one of them might still
be 10 times faster than the other if it performs clever
low-level tricks inside the innermost loop (which might be run
eg. millions of times).

I've never seen a factor of 10 here. Even a factor of 2 is
exceptional (and usually indicative of a very poor optimizer in
the compiler).

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Nov 7 '08 #25

James Dow Allen

On Nov 7, 4:51*am, c...@tiac.net (Richard Harter) wrote:

Just as a note, the common claim that hash table are O(1) per
access whereas search trees are O(log n) per access is
misleading, ...computing the hash code ...
is necessarily greater
than log2(n)....
hash table * * *- O(log n)

Note that, for sufficiently pedantic O(.)
estimates, many of the usual O(.) measures are wrong.
A single arithmetic op may be considered to require
O(log N) time since log N bits are needed to represent
an index on a set of size N.

Returning to OP's problem, I wonder: are Compact
Hash Trees are in wide use? If there's interest
I'll post my implementation.

James Dow Allen

Nov 7 '08 #26

James Kanze wrote:

>Even though two implementations might both by eg. O(n lg n)
with a rather small constant factor, one of them might still
be 10 times faster than the other if it performs clever
low-level tricks inside the innermost loop (which might be run
eg. millions of times).

I've never seen a factor of 10 here. Even a factor of 2 is
exceptional (and usually indicative of a very poor optimizer in
the compiler).

There's just so much a compiler can do. There are cases where only the
programmer can understand what's going on (rather than the compiler) and
be able to optimize something.

For example, comparing two strings for inequality is O(1) if the
maximum size of the string is fixed (eg. you know that no string has
more than 30 characters), but still much slower than comparing two
integers. If you can somehow change the string comparison to an integer
comparison, and these comparisons are performed millions of times in the
innermost loop of your code, the program may speed up considerably. This
is a kind of optimization which is completely out or the realm of
compiler optimizations.

Nov 7 '08 #27

In article <79b10b2f-6abf-4440-9d04-f731e6b51bf2
@f40g2000pri.googlegroups.com>, ja*********@gmail.com says...

[ ... ]

Even though two implementations might both by eg. O(n lg n)
with a rather small constant factor, one of them might still
be 10 times faster than the other if it performs clever
low-level tricks inside the innermost loop (which might be run
eg. millions of times).

I've never seen a factor of 10 here. Even a factor of 2 is
exceptional (and usually indicative of a very poor optimizer in
the compiler).

I've seen a factor of 10, and even a _little_ bit more -- but it was
_extremely_ hardware specific, depending almost entirely on the cache
organization of the processor. As the data was originally placed in
memory, two crucial items ended up contending for the same cache lines.
Rearranging the data allowed those items to live in different cache
lines.

In this case, the CPU involved wasn't even made when the compiler was
written, so I have a hard time blaming the compiler for the poor
optimization. Unfortunately, I don't know of any newer compilers that
would do much better (and when they do, it's mostly by accident...)

--
Later,
Jerry.

The universe is a figment of its own imagination.

Nov 7 '08 #28

In article <gf**********@hoshi.visyn.net>, sp******@gmx.de says...

Juha Nieminen wrote:
[...]
Personally I prefer to write 100 lines of clear, readable and
maintainable code, even if it's 1% slower than 1000 lines of code
worthy of the IOCCC.

It's always easier to make a bug-free program fast than to
bug-fix a fast program. :)

Oft repeated, but not really true. Sometimes a fast program still
contains a trivial bug, but I can remember one program in particular
that was bug-free as far as I know, but a horrible mess, to the point
that by _far_ the easiest way to improve it was write a set of
specifications for what it did, and start over. By the time I realized
that, however, I'd spent more time trying to optimize one small part
than it eventually took to rewrite the whole thing...

--
Later,
Jerry.

The universe is a figment of its own imagination.

Nov 7 '08 #29

Richard Harter

On Fri, 07 Nov 2008 09:41:08 GMT, Juha Nieminen
<no****@thanks.invalidwrote:

>user923005 wrote:
>Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

Wikipedia says: "Lookup requires inspection of just two locations in
the hash table, which takes constant time in the worst case".

I don't understand how that is possible. Suppose an element is
inserted into the table. Then later a new element displaces this first
element to a second location. Then later another new element, to be put
in this second location, displaces this first element to a third
location. How exactly do you find this first element anymore? It's not
in either of the two alternative locations defined by the two hashing
functions for that element.

It doesn't work like that, though you had me perplexed for a
moment. The first element goes back to its original location,
displacing the second. Thus, suppose that h0, h1, etc are
locations and the first three element are e0, e1, e2, and that
their address pairs are:

e0: h0,h1
e1: h0,h2
e2: h1,h3

Initially e0 is inserted in location h0. When e1 is inserted, e0
is moved to h1, and e1 is inserted in h0. The contents of the
table look like this:

h0: e1
h1: e0
h2: -
h3: -

Now e2 is inserted; it goes into location h1, displacing e0. In
turn e0 goes back to its alternate location h0, displacing e1,
which has to go to its alternate location, h2. The table now
looks like this:

h0: e0
h1: e2
h2: e1
h3: -

Richard Harter, cr*@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.

Nov 7 '08 #30

Juha Nieminen <no****@thanks.invalidwrites:

user923005 wrote:
>Cuckoo hashing has guaranteed O(1) lookup and delete and amortized
O(1) insert.

Wikipedia says: "Lookup requires inspection of just two locations in
the hash table, which takes constant time in the worst case".

I don't understand how that is possible. Suppose an element is
inserted into the table. Then later a new element displaces this first
element to a second location. Then later another new element, to be put
in this second location, displaces this first element to a third
location. How exactly do you find this first element anymore? It's not
in either of the two alternative locations defined by the two hashing
functions for that element.

The insert can be quite a complicated procedure. It might be
amortized O(1), but worst-case, it's O(epic fail).

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic

Nov 7 '08 #31

In article <87************@nonospaz.fatphil.org>,
th*****************@yahoo.co.uk says...

[ ... hash table operations ]

The insert can be quite a complicated procedure. It might be
amortized O(1), but worst-case, it's O(epic fail).

That depends on how the hash table is designed. It's entirely possible
(for one example) to create a hash table that uses balanced trees for
collision resolution. This gives a worst case of O(log N), which isn't
really all that awful at all.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Nov 7 '08 #32

On Nov 7, 10:54*am, Jerry Coffin <jcof...@taeus.comwrote:

In article <87zlkbph9z....@nonospaz.fatphil.org>,
thefatphil_demun...@yahoo.co.uk says...

[ ... hash table operations ]

The insert can be quite a complicated procedure. It might be
amortized O(1), but worst-case, it's O(epic fail).

That depends on how the hash table is designed. It's entirely possible
(for one example) to create a hash table that uses balanced trees for
collision resolution. This gives a worst case of O(log N), which isn't
really all that awful at all.

I like this scheme:
Say that you want a hash table of 2^20th elements...
Create a hash table of 2^20th elements,
Create a shadow hash table of 2^16th elements
Create a shadow hash table of 2^12th elements
Create a shadow hash table of 2^8th elements, each of which is a
skiplist holding the data type.

You form your 32 bit hash using some nice algorithm like Bob Jenkin's
hash or Goulburn hash.
You mask off the lower 20 bits and this is the address for the large
table.
If there is a collision, you mask off 4 more bits and use the second
largest table.
If there is a collision, you mask off 4 more bits and use the third
largest table.
If there is a collision you take the low byte of the hash and insert
into that skiplist.

If you get a lot more data than you expect, you construct a new larger
table of 2^24th elements and put it in the front as the next new
layer.

Of course, if you want, you can rehash the data each time, but I think
it is better to simply mask. That way, if you suddenly start growing
your skiplists and the first level tables are lightly populated, then
you can detect what is obviously a denial of service attack.

Nov 7 '08 #33

Jerry Coffin <jc*****@taeus.comwrites:

In article <87************@nonospaz.fatphil.org>,
th*****************@yahoo.co.uk says...

[ ... hash table operations ]

uhuh - specifically cuckoo hash operations.

>The insert can be quite a complicated procedure. It might be
amortized O(1), but worst-case, it's O(epic fail).

That depends on how the hash table is designed. It's entirely possible

[... to not use a cuckoo hash ...]

Yes, but then it wouldn't be the hash design whose Big-Ohs were
being debated.

Phil
--
We must respect the other fellow's religion, but only in the sense and to the
extent that we respect his theory that his wife is beautiful and his children
smart. -- Henry Louis Mencken (1880-1956), American editor and critic

Nov 7 '08 #34

Richard Harter wrote:

It doesn't work like that, though you had me perplexed for a
moment. The first element goes back to its original location,
displacing the second. Thus, suppose that h0, h1, etc are
locations and the first three element are e0, e1, e2, and that
their address pairs are:

e0: h0,h1
e1: h0,h2
e2: h1,h3

Initially e0 is inserted in location h0. When e1 is inserted, e0
is moved to h1, and e1 is inserted in h0. The contents of the
table look like this:

h0: e1
h1: e0
h2: -
h3: -

Now e2 is inserted; it goes into location h1, displacing e0. In
turn e0 goes back to its alternate location h0, displacing e1,
which has to go to its alternate location, h2. The table now
looks like this:

h0: e0
h1: e2
h2: e1
h3: -

I'm not 100% sure I fully understand it, but anyways, it sound to me
like insertion is O(n) like that (each displaced element could
potentially displace an existing element, which could potentially
displace an existing element, etc, potentially going through O(n)
elements without it going into an infinite loop).

I don't even see how insertion is amortized constant-time. It sounds
to me like insertion is not only O(n), but also could potentially be
amortized linear-time, especially if poor hashing functions are chosen.

There was no conditional ("with properly designed hashing functions")
in the description given by wikipedia for the amortized constant time
insertion, but it made it sound like it's *always* amortized constant
time, regardless of the chosen hashing functions... I can't understand
how it's possible.

Heck, in fact I think it's rather trivial to demonstrate that
insertion can be O(infinite), by simply using these two hashing functions:

f1(x) = 0
f2(x) = 1

With these hashing functions it's impossible to insert more than two
elements in the hash table. Trying to insert a third one is impossible.

Of course these two functions are absolutely extreme, but I think they
demonstrate that the hash table can *not* have constant-time insertion
(or even linear-time, for that matter) for all possible hashing
functions. At most this may be possible for *some* hashing functions.

Nov 7 '08 #35

Phil Carmody wrote:

Jerry Coffin <jc*****@taeus.comwrites:
>th*****************@yahoo.co.uk says...

[ ... hash table operations ]

uhuh - specifically cuckoo hash operations.

>>The insert can be quite a complicated procedure. It might be
amortized O(1), but worst-case, it's O(epic fail).

That depends on how the hash table is designed. It's entirely
possible ...

[... to not use a cuckoo hash ...]

Yes, but then it wouldn't be the hash design whose Big-Ohs were
being debated.

I gather this sub-thread applies specifically to the cuckoo hash,
with which I am not familiar. However note that the design of
hashlib specifically prevents the hashtable being more than about
88% full, and normally is in the 44 to 88% range. This
specifically prevents the occurence of long trails with any
reasonable hash function. The organization does specifically
requires table reorganization at intervals. The result is O(1)
operation. The reorganization is considerably cheaper than the
initial construction of that table, because it is not necessary to
check equality in the rehashing operation (we know that the
existing table has no duplicates). At any rate, see:

<http://cbfalconer.home.att.net/download/hashlib.zip>

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

Nov 7 '08 #36

On Nov 7, 2:47*pm, Juha Nieminen <nos...@thanks.invalidwrote:

Richard Harter wrote:
It doesn't work like that, though you had me perplexed for a
moment. *The first element goes back to its original location,
displacing the second. *Thus, suppose that h0, h1, etc are
locations and the first three element are e0, e1, e2, and that
their address pairs are:

e0: h0,h1
e1: h0,h2
e2: h1,h3

Initially e0 is inserted in location h0. *When e1 is inserted, e0
is moved to h1, and e1 is inserted in h0. *The contents of the
table look like this:

* *h0: e1
* *h1: e0
* *h2: -
* *h3: -

Now e2 is inserted; it goes into location h1, displacing e0. *In
turn e0 goes back to its alternate location h0, displacing e1,
which has to go to its alternate location, h2. *The table now
looks like this:

* *h0: e0
* *h1: e2
* *h2: e1
* *h3: -

* I'm not 100% sure I fully understand it, but anyways, it sound to me
like insertion is O(n) like that (each displaced element could
potentially displace an existing element, which could potentially
displace an existing element, etc, potentially going through O(n)
elements without it going into an infinite loop).

* I don't even see how insertion is amortized constant-time. It sounds
to me like insertion is not only O(n), but also could potentially be
amortized linear-time, especially if poor hashing functions are chosen.

* There was no conditional ("with properly designed hashing functions")
in the description given by wikipedia for the amortized constant time
insertion, but it made it sound like it's *always* amortized constant
time, regardless of the chosen hashing functions... I can't understand
how it's possible.

* Heck, in fact I think it's rather trivial to demonstrate that
insertion can be O(infinite), by simply using these two hashing functions:

f1(x) = 0
f2(x) = 1

* With these hashing functions it's impossible to insert more than two
elements in the hash table. Trying to insert a third one is impossible.

* Of course these two functions are absolutely extreme, but I think they
demonstrate that the hash table can *not* have constant-time insertion
(or even linear-time, for that matter) for all possible hashing
functions. At most this may be possible for *some* hashing functions.

Sure. If hash(x) == return 1; then bad behavior is to be expected.
The assumption is of maximally cascade free hash functions like UMAC
or Bob Jenkin's hash.

There are also little tweaky things done to make the "real-life"
behavior of Cuckoo hash better (there is an article with a title
something like "Hash with a Stash" that explains it in detail.)

Big-O analysis is an examination of average case behavior. If your
hash table is too small and you do not make provision to enlarge it,
then it can suffer from attack by an opponent, and so we should also
carefully examine worst-case behavior.

For example, suppose you have a small hash table with 2000 entries
that overflows into link-lists.
Someone tests one trillion strings and finds three million strings
that map to 4 distinct hash codes (because they have somehow learned
your hash algorithm).
The opponent sends these three million strings to your hash table.
Obviously, we are going to get some long chains, and we will have to
linear search them to discover if an item is in the table or not.
These attacks are realistic and do happen in real-life (it turns out
that the world has a surprisingly large population of whinging twits).

Now, if we had made the table a lot larger (e.g. 64000 entries) it
would be a lot harder to find strings that map to the same value.
And if we had made the spill strategy more robust (e.g. a skiplist
instead of an ordinary linked list) then the behavior never goes
catastrophic.

It's too bad that we have to plan for whinging twits when designing
data structures, but bad or skewed sequences do happen from time to
time in real life also, so it is not entirely wasted effort.

Nov 7 '08 #37

Nick Keighley said:

<snip>

CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

If you do, fist get it spell-checked.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Nov 8 '08 #38

Jerry Coffin said:

In article <gf**********@hoshi.visyn.net>, sp******@gmx.de says...
>Juha Nieminen wrote:
[...]
Personally I prefer to write 100 lines of clear, readable and
maintainable code, even if it's 1% slower than 1000 lines of code
worthy of the IOCCC.

It's always easier to make a bug-free program fast than to
bug-fix a fast program. :)

Oft repeated, but not really true.

All generalisations are false...

Omitting nail-you-to-the-wall words like "always", I would argue that it's
normally easier to make a *clear* program correct than it is to make a
correct program clear, and it's normally easier to make a correct program
fast than to make a fast program correct. Therefore, my priorities when
writing code are:

[1] Clarity
[2] Correctness
[3] Performance

These are NOT mutually exclusive! The job isn't done until the program is
sufficiently clear *and* sufficiently correct *and* sufficiently fast.

<snip>

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Nov 8 '08 #39

user923005 wrote:

Big-O analysis is an examination of average case behavior.

Not true. Big-O denotes the asymptotic worst case behavior.

One can say "the algorithm is O(n lg n) *in average*", but AFAIK
that's a rather shaky definition. The big-O notation always denotes the
upper bound of the asymptotic behavior of the algorithm. In other words,
any asymptotic behavior of the algorithm with any given data will always
fall below the big-O function curve. If with some data the algorithm
behaves asymptotically slower than the given function, then that's not
the true big-O function for that algorithm.

For example quicksort is O(n^2). (Don't believe what people say. It
*is* O(n^2), period.)

Nov 8 '08 #40

Kai-Uwe Bux

Juha Nieminen wrote:

user923005 wrote:
>Big-O analysis is an examination of average case behavior.

Not true. Big-O denotes the asymptotic worst case behavior.

One can say "the algorithm is O(n lg n) *in average*", but AFAIK
that's a rather shaky definition. The big-O notation always denotes the
upper bound of the asymptotic behavior of the algorithm. In other words,
any asymptotic behavior of the algorithm with any given data will always
fall below the big-O function curve. If with some data the algorithm
behaves asymptotically slower than the given function, then that's not
the true big-O function for that algorithm.

For example quicksort is O(n^2). (Don't believe what people say. It
*is* O(n^2), period.)

Big-O notation is oblivious to whether it is used to say something about the
worst case or about the average case. It also does not care whether you say
something about time or space complexity. In fact, Big-O is a purely
mathematical concept: Given two functions f and g, we say that f is a
O(g) if there is a constant C such that f(x) <= C g(x) for all x.

For computer science, f is often the worst case runtime of an algorithm
depending on some input length n (in some measure). So, we say that this
algorithm has worst case time complexity O(n^2) if the worst case run time
is bounded from above by C n^2 for some C. However, f could be the
average runtime or it could be the worst case space complexity, or some
other function. Thus, you can use Big-O notation also to say something
about worst case space complexity, average case time complexity, or other
complexities. The Big-O won't care.
Best

Kai-Uwe Bux

Nov 9 '08 #41

Juha Nieminen wrote:

user923005 wrote:

>Big-O analysis is an examination of average case behavior.

Not true. Big-O denotes the asymptotic worst case behavior.

One can say "the algorithm is O(n lg n) *in average*", but AFAIK
that's a rather shaky definition. The big-O notation always
denotes the upper bound of the asymptotic behavior of the
algorithm. In other words, any asymptotic behavior of the
algorithm with any given data will always fall below the big-O
function curve. If with some data the algorithm behaves
asymptotically slower than the given function, then that's not
the true big-O function for that algorithm.

For example quicksort is O(n^2). (Don't believe what people say.
It *is* O(n^2), period.)

I suspect you know what you are talking about, but you are leaving
the wrong impression. The sort of sequence that produces O(n*n)
quicksort performance is extremely rare, and very slight controls
make it even rarer (such as selecting the median of 3 items as the
value to partition about).

The result is that for most cases quicksort will be the fastest
sort. Right behind it are other O(nLOGn) methods. I prefer
mergesort and data input in lists, because then I don't have to
worry about the size to be sorted. In addition, mergesort is
stable (quicksort is not).

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

Nov 9 '08 #42

Richard Heathfield wrote:

Nick Keighley said:

<snip>

>CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

If you do, fist get it spell-checked.

^^^^
?? :-) F'ups set.

--
[mail]: Chuck F (cbfalconer at maineline dot net)
[page]: <http://cbfalconer.home.att.net>
Try the download section.

Nov 9 '08 #43

CBFalconer <cb********@yahoo.comwrites:

Richard Heathfield wrote:
>Nick Keighley said:

<snip>

>>CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

If you do, fist get it spell-checked.
^^^^
?? :-) F'ups set.

Set, followed, and had fun with.

extern volatile poster Chuck;

void Richard(post fromNick)
{
string s("W");
while(!Chuck.posted()) { s+="o"; }
s+="sh";
Chuck.respond(s);
}

Phil
--
I tried the Vista speech recognition by running the tutorial. I was
amazed, it was awesome, recognised every word I said. Then I said the
wrong word ... and it typed the right one. It was actually just
detecting a sound and printing the expected word! -- pbhj on /.

Nov 9 '08 #44

CBFalconer said:

Richard Heathfield wrote:
>Nick Keighley said:

<snip>

>>CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

If you do, fist get it spell-checked.
^^^^
?? :-)

Since all spelling flames are supposed to include spelling errors, if one
doesn't arise naturally it must be inserted, as on this occasion.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Nov 9 '08 #45

Phil Carmody said:

CBFalconer <cb********@yahoo.comwrites:
>Richard Heathfield wrote:
>>Nick Keighley said:

<snip>

CAREFUL ALGORITH SELECTION CAN CREAM MICRO-OPTIMISATION

I may get this printed on a tee shirt

If you do, fist get it spell-checked.
^^^^
?? :-) F'ups set.

Set, followed, and had fun with.

extern volatile poster Chuck;

void Richard(post fromNick)
{
string s("W");
while(!Chuck.posted()) { s+="o"; }
s+="sh";
Chuck.respond(s);
}

Ah, I wish I'd seen this before posting my own reply. Your response is far
more eloquent than mine.

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999

Nov 9 '08 #46

William Pursell

On 9 Nov, 00:11, Kai-Uwe Bux <jkherci...@gmx.netwrote:

Big-O notation is oblivious to whether it is used to say something about the
worst case or about the average case. It also does not care whether you say
something about time or space complexity. In fact, Big-O is a purely
mathematical concept: Given two functions *f *and *g, we say that *f *is a
O(g) *if there is a constant *C *such that *f(x) <= C g(x) *for all *x.

Almost. f is O(g) if there exists C and N such that f(x) < C*g(x) for
all x N. (ie, we only care about the behavior for sufficiently
large values of x.)

Nov 9 '08 #47

James Kanze

On Nov 8, 12:27*am, user923005 <dcor...@connx.comwrote:

On Nov 7, 2:47*pm, Juha Nieminen <nos...@thanks.invalidwrote:
Sure. *If hash(x) == return 1; then bad behavior is to be
expected. The assumption is of maximally cascade free hash
functions like UMAC or Bob Jenkin's hash.

Do you have any references on those, particularly with an
implementation? All I found for UMAC was a reference to book,
and Bob Jenkin's seems more concerned with cryptographic hashing
than anything else. If these are standard and widely used
algorithms for look-up hashing, I'd like to add them to my set
of hashing benchmarks.

(Note that cryptographic hashing and look-up hashing have very
different constraints, and a good hash function for one isn't
necessarily a good hash function for the other. Also, in
look-up hashing, speed counts. Taking 10 times more time to get
1% better distribution on the average is a loss.)

--
James Kanze (GABI Software) email:ja*********@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Nov 9 '08 #48

Richard Harter

On Sat, 08 Nov 2008 22:59:08 GMT, Juha Nieminen
<no****@thanks.invalidwrote:

>user923005 wrote:
>Big-O analysis is an examination of average case behavior.

Not true. Big-O denotes the asymptotic worst case behavior.

One can say "the algorithm is O(n lg n) *in average*", but AFAIK
that's a rather shaky definition. The big-O notation always denotes the
upper bound of the asymptotic behavior of the algorithm. In other words,
any asymptotic behavior of the algorithm with any given data will always
fall below the big-O function curve. If with some data the algorithm
behaves asymptotically slower than the given function, then that's not
the true big-O function for that algorithm.

For example quicksort is O(n^2). (Don't believe what people say. It
*is* O(n^2), period.)

As others have pointed out, this is seriously misinformed. This
bit of folklore pops up every once in a while. Where does it
come from?

Nov 9 '08 #49