Bytes | Software Development & Data Engineering Community
"hat" container class [C++]

I have created a new container class, called the hat.

It provides random selection of user objects, both with and without
replacement, with non-uniform probabilities. Uniform probabilities are a
degenerate case, and can also be stored.

Here's the web-page, with complete source, illustration, and example usage.
Take a look-see, and post your critique!

http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #1
AngleWyrm wrote:
I have created a new container class, called the hat.

It provides random selection of user objects, both with and without
replacement, with non-uniform probabilities. Uniform probabilities are a
degenerate case, and can also be stored.

Here's the web-page, with complete source, illustration, and example
usage. Take a look-see, and post your critique!

http://home.comcast.net/~anglewyrm/hat.html

I only had a very brief look at what is a pet subject of mine: the random
number generator:

template <class T>
int hat<T>::random(int range)
{
    // make sure srand has been called before using rand
    if( !rand_seeded )
    {
        srand( time(NULL) );
        rand();
        rand_seeded = true;
    }
    int random_number = rand() % range;

    return random_number;
};
I would leave the choice of a random number generator to the client, as is
done in std::random_shuffle. A second template parameter can be used
for that. You could also provide a constructor that takes an instance of an
RNG as an argument. You would have to provide a reasonable default, of
course. There are good reasons for not making this choice for the user. For
example:

a) There is a *huge* variety of random number generators out there. It is
impossible to predict for a library designer which of these will be good
for a given client problem.

b) It is a big advantage if the user can have independent instances of
random number generators.
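A minimal sketch of that suggestion, assuming hypothetical names (DefaultRNG and this hat interface are illustrative, not the actual class):

```cpp
#include <cstdlib>

// Hypothetical default generator: any function object with this
// call signature could be substituted by the client.
struct DefaultRNG {
    int operator()(int range) { return std::rand() % range; }
};

// The RNG type becomes a second template parameter with a sensible
// default, so each hat can own an independent generator instance.
template <class T, class RNG = DefaultRNG>
class hat {
public:
    hat() {}
    explicit hat(const RNG& r) : rng(r) {} // client supplies an instance
    int random(int range) { return rng(range); }
private:
    RNG rng; // independent per-container generator state
};
```

With this shape, two hats never share generator state, and a client can drop in any generator type without touching the container code.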
Now for the one that you have chosen: relying on std::rand() from <cstdlib>
puts you at the mercy of the operating system. Your code assumes that the
low order bits of rand() are reasonably well distributed. Please note that
some random number generators (e.g. straight forward linear congruence
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."

Now, on Linux, this advice is somewhat outdated. And it is to be hoped that
weak RNGs will eventually die out. However, I would recommend providing a
reasonable default random number generator of your choosing in the first
place, for which the quality of low order bits is established.
BTW, nice job.
Best

Kai-Uwe
Jul 22 '05 #2
Kai-Uwe Bux wrote:
AngleWyrm wrote:
I have created a new container class, called the hat.

It provides random selection of user objects, both with and without
replacement, with non-uniform probabilities. Uniform probabilities are a
degenerate case, and can also be stored.

Here's the web-page, with complete source, illustration, and example
usage. Take a look-see, and post your critique!

http://home.comcast.net/~anglewyrm/hat.html

I only had a very brief look at what is a pet subject of mine: the random
number generator:

template <class T>
int hat<T>::random(int range)
{
// make sure srand has been called before using rand
if( !rand_seeded )
{
srand( time(NULL) );
rand();
rand_seeded = true;
}
int random_number = rand() % range;

return random_number;
};
I would leave the choice of a random number generator to the client, like
it is done in std::random_shuffle. A second template parameter can be used
for that. You could also provide a constructor that takes an instance of an
RNG as an argument. You would have to provide a reasonable default, of
course. There are good reasons for not making this choice for the user.
For example:

a) There is a *huge* variety of random number generators out there. It is
impossible to predict for a library designer which of these will be good
for a given client problem.

b) It is a big advantage if the user can have independent instances of
random number generators.
Now for the one that you have chosen: relying on std::rand() from
<cstdlib> puts you at the mercy of the operating system. Your code assumes

^^^^^^^^^^^^^^^^^
Oops: it is of course the vendor of your cstdlib. Still it's a risk!
that the low order bits of rand() are reasonably well distributed. Please
note that some random number generators (e.g. straight forward linear
congruence RNGs) have some troubles with the low order bytes. On some
operating systems those might still be in use. Thus it is good practice to
use the high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."

Now, on Linux, this advice is somewhat outdated. And it is to be hoped
     ^^^^^^^^

Same oops.
that weak RNGs will eventually die out. However, I would recommend
providing a reasonable default random number generator of your choosing in
the first place, for which the quality of low order bits is established.
BTW, nice job.
Best

Kai-Uwe


Jul 22 '05 #3
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."

Now, on Linux, this advice is somewhat outdated. And it is to be hoped

that

Thanks for the input. I'm now working on a way to template-ize a
user-assigned function, which would have a default. Unfortunately
random_shuffle uses a rather coarse method: it simply defines two
overloads, one with and one without the generator argument. This seems so
un-C++ to me that I can't bring myself to recode a duplicate in the header.
There's got to be a better way. Still researching that one.

As to the advice that was given about low-order bits...I've done some
research on this (in the form of executing code, and observing results) and
as it turns out std::rand()'s first return value is a sequential number
that steps by threes through RAND_MAX. My version anyway.

What I got was that if srand( time(NULL) ) is used as a seed, then it will
take about nine hours to cycle through the range.
Jul 22 '05 #4
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."

...
As to the advice that was given about low-order bits...I've done some
research on this (in the form of executing code, and observing results)
and as it turns out std::rand()'s first return value is a sequential
number that steps by threes through RAND_MAX. My version anyway.

What I got was that if srand( time(NULL) ) is used as a seed, then it will
take about nine hours to cycle through the range.


Hi,

experiments are good. They will, however, only be telling you about your
particular platform. Other clients of your code might be less lucky.

Moreover, keep in mind that a long period is not the only thing that
matters in pseudo random number generation. You will also want subsequent
calls to random() to appear independent. For instance, it would be bad if a
low return from random() strongly indicated that a high number would
result from the next call. There are various statistical tests that you
should perform before trusting an RNG. I very much recommend Volume 2 of
"The Art of Computer Programming" by D.E. Knuth. It has an in depth account
on how to measure the quality of RNGs. You might also want to have a look
at the random number generators from the Boost libraries. They represent
pretty much the state of the art.
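As a first-pass sanity check in the spirit of the tests mentioned above, one can compute a chi-square statistic over equally likely bins. A real test battery (Knuth, TAOCP vol. 2) goes much further; this only catches gross bias:

```cpp
#include <cstdlib>
#include <vector>

// Chi-square statistic for rolls draws over equally likely bins.
double chi_square(int bins, int rolls) {
    std::vector<int> count(bins, 0);
    for (int i = 0; i < rolls; ++i)
        ++count[std::rand() % bins];          // observed frequencies
    double expected = static_cast<double>(rolls) / bins;
    double chi2 = 0.0;
    for (int c : count) {
        double d = c - expected;
        chi2 += d * d / expected;             // sum of (O - E)^2 / E
    }
    return chi2;  // compare against a table at bins-1 degrees of freedom
}
```

A value far above the tabulated critical value for bins-1 degrees of freedom suggests the generator is biased; a value suspiciously close to zero is also a warning sign.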

Another property that is sometimes requested is that the RNG may not leak
information about its internal state. This is important if a random number
generator is used in a cryptoscheme for instance as a counter measure to
timing attacks. You assume that the adversary knows the algorithm of the
RNG and can observe the returns from the calls. It shall nevertheless be
computationally infeasible to deduce the internal state of the RNG or
predict successfully its next return.

Upshot: RNGs are pretty cool stuff, darn interesting, and easy to get
wrong. It is very important to provide a reasonable general purpose RNG as
the default but to leave the ultimate choice to client code.
Best

Kai-Uwe
Jul 22 '05 #5
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message news:caeb75
Now for the one that you have chosen: relying on std::rand() from <cstdlib> puts you at the mercy of the operating system. Your code assumes that the
low order bits of rand() are reasonably well distributed. Please note that
some random number generators (e.g. straight forward linear congruence
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):


One of the tests says the green dice rolls 1 half of the time. So to test,
roll 6000 times, and about 3000 should be 1. Is this test sufficient? (I
imagine there's more to it though).

BTW, I wrote this class once. It might be useful. Don't know how good it
really is, but it seems to work.
// class crandom
// congruential random number generator
// idea from Wannier: Numerical Methods for Scientists and Engineers
// crandom<3> generates random numbers in [0,8)
template <unsigned LogRange>
class crandom
{
public:
    crandom() : number(1) { ++(*this); }

    unsigned lower() const { return 0; }
    unsigned upper() const { return (1<<LogRange)-1; }

    crandom& operator++()
    {
        number=(number*37)&((4<<LogRange)-1);
        return *this;
    }

    const crandom operator++(int)
    {
        crandom tmp(*this);
        ++(*this);
        return tmp;
    }

    unsigned operator*() const { return number>>2; }

private:
    unsigned number;
};

Jul 22 '05 #6
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:a59zc.30817
What I got was that if srand( time(NULL) ) is used as a seed, then it will
take about nine hours to cycle through the range.


What is the significance of this finding?
Jul 22 '05 #7
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:eTxyc.34942
I have created a new container class, called the hat.

It provides random selection of user objects, both with and without
replacement, with non-uniform probabilities. Uniform probabilities are a
degenerate case, and can also be stored.

Here's the web-page, with complete source, illustration, and example usage. Take a look-see, and post your critique!

http://home.comcast.net/~anglewyrm/hat.html


The code looks good and well written, and well commented too, and the web
page is good too.

Kai's suggestion of separating the random number generation from the logic
of the hat class is good too.

One of your tests says the green dice rolls 1 half of the time. So create a
new array of 10 elements,

{ 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }

Then pick a random number in the range [0, 10) and that's your weighted
random number, with 1 showing up about 50% of the time.

In the actual implementation you don't need to create a new array of 10
elements. Instead you can just pick a number in the range [0, 10) where 10
is the sum of weights. The original array is { 1, 2, 3, 4, 5, 6 }. Numbers
[0,4] map to index 0, numbers [5,5] map to index 1, etc.
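The mapping just described can be sketched directly (weighted_index is a hypothetical helper, not code from the hat class):

```cpp
#include <vector>

// Walk the weights, subtracting, until r lands inside a slot.
// r must be drawn uniformly from [0, sum of weights).
int weighted_index(const std::vector<int>& weights, int r) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (r < weights[i])
            return static_cast<int>(i);
        r -= weights[i];
    }
    return -1;  // r was out of range
}
```

For the die above, the weights are { 5, 1, 1, 1, 1, 1 }: draws 0 through 4 land on index 0, draw 5 on index 1, and so on.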

What benefit does the tree structure give you?
Jul 22 '05 #8
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
Upshot: RNGs are pretty cool stuff, darn interesting, and easy to get
wrong. It is very important to provide a reasonable general purpose RNG as
the default but to leave the ultimate choice to client code.


For random numbers can they use something in nature, like the number of
neutrons hitting the earth's atmosphere or something, which in principle may
not be random, but is practically so for us?
Jul 22 '05 #9
Siemel Naran wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
Upshot: RNGs are pretty cool stuff, darn interesting, and easy to get
wrong. It is very important to provide a reasonable general purpose RNG
as the default but to leave the ultimate choice to client code.


For random numbers can they use something in nature, like the number of
neutrons hitting the earth's atmosphere or something, which in principle
may not be random, but is practically so for us?


Sure,

actually, I have the vague recollection that there are some hardware devices
out there that provide "true" random numbers. Some computers even have
physical RNGs built in. Usually the problems are that the raw random bits
are a little skewed (that it easy to corrected in software) and that
subsequent calls are tricky--you will have to wait for a certain amount of
time so that the device accumulates enough entropy (from heat or some
quantum yadda yadda) in order to ensure that the random bits it generates
are independent.

If you have such a device, your operating system needs to provide a driver
and a system call. Then you write a simple wrapper around it to integrate
it with C++. I do not have such a device, but I think it should be fun.

A less fancy way of obtaining more or less true random numbers is for the
operating system to watch various events that are seemingly unpredictable,
like packets hitting IO ports, user interaction with mouse and keyboard,
etc. Linux turns that into /dev/random, which you can read from.
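A minimal sketch of tapping that entropy pool, reading from /dev/urandom (the non-blocking variant of the /dev/random device mentioned above); this is Linux/Unix-specific and omits error handling:

```cpp
#include <fstream>

// Pull one machine word of OS-provided randomness.
unsigned long os_random() {
    std::ifstream dev("/dev/urandom", std::ios::binary);
    unsigned long v = 0;
    dev.read(reinterpret_cast<char*>(&v), sizeof v);
    return v;
}
```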
There is one drawback to "true" random numbers though that can be serious:
you will not be able to reproduce identical behavior of your program from
one run to the next. Sometimes you will want identical yet random-like
behavior. Then you should go for pseudo random numbers. Moreover, because
of non-reproducibility, true random number applications are harder to
debug. Thus in writing drafts of code, I would always prefer pseudo random
numbers.
Best

Kai-Uwe
Jul 22 '05 #10
Siemel Naran wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message news:caeb75
Now for the one that you have chosen: relying on std::rand() from

<cstdlib>
puts you at the mercy of the operating system. Your code assumes that the
low order bits of rand() are reasonably well distributed. Please note
that some random number generators (e.g. straight forward linear
congruence RNGs) have some troubles with the low order bytes. On some
operating systems those might still be in use. Thus it is good practice
to use the high order bytes instead like so (quoted from the man page):


One of the tests says the green dice rolls 1 half of the time. So to
test,
roll 6000 times, and about 3000 should be 1. Is this test sufficient? (I
imagine there's more to it though).

BTW, I wrote this class once. It might be useful. Don't know how good it
really is, but it seems to work.
// class crandom
// congruential random number generator
// idea from Wannier: Numerical Methods for Scientists and Engineers
// crandom<3> generates random numbers in [0,8) inclusive
template <unsigned LogRange>
class crandom
{
public:
crandom() : number(1) { ++(*this); }

unsigned lower() const { return 0 ; }
unsigned upper() const { return (1<<LogRange)-1; }

crandom& operator++()
{
number=(number*37)&((4<<LogRange)-1);
return *this;
}

const crandom operator++(int)
{
crandom tmp(*this);
++(*this);
return tmp;
}

unsigned operator*() const { return number>>2; }

private:
unsigned number;
};


Hi,
thanks a lot for this code. It provides an excellent illustration of the
weakness of low order bits. Let us just test the lowest one:

// class crandom
// congruential random number generator
// idea from Wannier: Numerical Methods for Scientists and Engineers
// crandom<3> generates random numbers in [0,8) inclusive
template <unsigned LogRange>
class crandom
{
public:
crandom() : number(1) { ++(*this); }

unsigned lower() const { return 0 ; }
unsigned upper() const { return (1<<LogRange)-1; }

crandom& operator++()
{
number=(number*37)&((4<<LogRange)-1);
return *this;
}

const crandom operator++(int)
{
crandom tmp(*this);
++(*this);
return tmp;
}

unsigned operator*() const { return number>>2; }

private:
unsigned number;
};

#include <iostream>

int main ( void ) {
    crandom<16> R;
    for ( unsigned int i = 0; i < 1000; ++i ) {
        std::cout << ( *R % 2 );
        ++R;
    }
}

output: 10101010101010...

It sure gives 1 half of the time, but it does not feel very random,
does it?
And now for the most significant bit:

int main ( void ) {
    crandom<17> R;
    for ( unsigned int i = 0; i < 1000; ++i ) {
        std::cout << ( *R / 65536 );
        ++R;
    }
}

output: 000101011000111001110001100101001010000010010100100010111010110...

BTW, what is the meaning of lower and upper, and why is there no
constructor that allows to seed? Or am I missing something?
Thanks again

Kai-Uwe
Jul 22 '05 #11
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message news:cajmj3
BTW, what is the meaning of lower and upper, and why is there no
constructor that allows to seed? Or am I missing something?


The random number generator produces numbers in the range [0, 2^N),
depending on the template argument N; lower() returns 0 always, and
upper() returns 2^N - 1. There is no seed (at present).
Jul 22 '05 #12
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message
news:1h*******************@bgtnsc04-news.ops.worldnet.att.net...
The code looks good and well written, and well commented too, and the web
page is good too.
Thanks :)
Kai's suggestion of separating the random number generation from the logic of the hat class is good too.
Agreed; I'll post a revision soon
One of your tests says the green dice rolls 1 half of the time. So create a new array of 10 elements,

{ 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }

Then pick a random number in the range [0, 10) and that's your weighted
random number, with 1 showing up about 50% of the time.

In the actual implementation you don't need to create a new array of 10
elements. Instead you can just pick a number in the range [0, 10) where 10 is the sum of weights. The original array is { 1, 2, 3, 4, 5, 6 }. Numbers [0,4] map to index 0, numbers [5,5] map to index 1, etc.

What benefit does the tree structure give you?


If the objects being chosen are small (like ints in the above example), and
we are only going to randomly select with replacement (like rolling dice),
then the above implementation is fine. If the objects are larger (like
personnel records, or a drawing program's shapes, or DNA strands) then
having multiple copies becomes a time-space trade off. More importantly
though, is random selection _without_ replacement, such as drawing cards
from a deck, or lottery balls from a cage. This requires being able to
efficiently delete items from the set.

If the above representation were cards instead of die faces, and we draw
the #1 card, how then shall we remove it so that it isn't drawn again the
second time around? This would require the search for and removal of all
instances of #1 from the set. Even if we give the premise that the objects
are sorted while being inserted, we still wouldn't know how many were a
given item, and the worst-case performance would be examining all records
in order to delete, then deleting many records.

The balanced tree structure guarantees that at most log_2(n) records will
be examined: this is the number of bits it takes to represent the total
number of records. Thus if we have 32,768 items in the container, then the
worst-case performance would be examining 15 records, and then deleting
one.
Jul 22 '05 #13
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:OIrzc.92402$3x.50182@attbi_s54...
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message
news:1h*******************@bgtnsc04-news.ops.worldnet.att.net...

In the actual implementation you don't need to create a new array of 10
elements. Instead you can just pick a number in the range [0, 10)
where 10 is the sum of weights. The original array is { 1, 2, 3, 4, 5, 6 }.
Numbers [0,4] map to index 0, numbers [5,5] map to index 1, etc.

What benefit does the tree structure give you?


Apologies, yes, that solves the time-space trade off. Sorry, didn't have my
coffee yet. Useful for selection with replacement.

The map would still require a search and possibly multiple deletes to
remove a record, if used for selection without replacement.
Jul 22 '05 #14

"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
AngleWyrm wrote:
I have created a new container class, called the hat.

It provides random selection of user objects, both with and without
replacement, with non-uniform probabilities. Uniform probabilities are a degenerate case, and can also be stored.

Here's the web-page, with complete source, illustration, and example
usage. Take a look-see, and post your critique!

http://home.comcast.net/~anglewyrm/hat.html

I only had a very brief look at what is a pet subject of mine: the random
number generator:

template <class T>
int hat<T>::random(int range)
{
// make sure srand has been called before using rand
if( !rand_seeded )
{
srand( time(NULL) );
rand();
rand_seeded = true;
}
int random_number = rand() % range;

return random_number;
};
I would leave the choice of a random number generator to the client, like
it is done in std::random_shuffle. A second template parameter can be

used for that. You could also provide a constructor that takes an instance of an
RNG as an argument. You would have to provide a reasonable default, of
course. There are good reasons for not making this choice for the user. For example:

a) There is a *huge* variety of random number generators out there. It is
impossible to predict for a library designer which of these will be good
for a given client problem.

b) It is a big advantage if the user can have independent instances of
random number generators.


Good idea. The new version of the hat container class now supports
user-specifiable random number generators during construction of a hat
object. If none is specified, std::rand() is used as a default.

USAGE:
hat<my_object_type> my_hat;             // uses std::rand()
hat<my_object_type> my_hat2( my_rng );  // uses my_rng() for random number generation

Any function that takes no arguments and returns an int may be used.

http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #15
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

Now for the one that you have chosen: relying on std::rand() from <cstdlib> puts you at the mercy of the operating system. Your code assumes that the
low order bits of rand() are reasonably well distributed. Please note that some random number generators (e.g. straight forward linear congruence
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."


Question: Is it reasonable to assume that a user-specified random number
generator will redefine RAND_MAX? If so, then the above formula is
do-able; otherwise it will rely on an inconsistent constant.

Question #2: Is it appropriate to make this a constraint of user-specified
random number generators? "If you specify a random number generator, then
you must redefine RAND_MAX as necessary."
Jul 22 '05 #16
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:OIrzc.92402$3x.50182@attbi_s54...
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message
Then pick a random number in the range [0, 10) and that's your weighted
random number, with 1 showing up about 50% of the time.

In the actual implementation you don't need to create a new array of 10
elements. Instead you can just pick a number in the range [0, 10) where

10
is the sum of weights. The original array is { 1, 2, 3, 4, 5, 6 }.

Numbers
[0,4] map to index 0, numbers [5,5] map to index 1, etc.

What benefit does the tree structure give you?


If the objects being chosen are small (like ints in the above example),

and we are only going to randomly select with replacement (like rolling dice),
then the above implementation is fine. If the objects are larger (like
personnel records, or a drawing program's shapes, or DNA strands) then
having multiple copies becomes a time-space trade off. More importantly
though, is random selection _without_ replacement, such as drawing cards
from a deck, or lottery balls from a cage. This requires being able to
efficiently delete items from the set.
In the original algorithm, you don't have to copy objects. If you have to
copy anything, copy pointers (ie. make your implementation use pointers).
Moreover I think you can handle replacement too.

The original array is { 1, 2, 3, 4, 5, 6 } with indices 0 to 5. We make a
virtual array { 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 } with indices 0 to 10. We
don't actually have to construct this array, but instead pretend it's there,
so there's no copying. When we pick a random number in the range 0 to 10,
say 3, this means the 3rd element of the virtual array which is 1. There's
an algorithm to determine which element to choose from the real array
(basically subtract numbers in a loop). If we need to remove this element
from the array we just change the virtual array to { 2, 3, 4, 5, 6 }.
Doubtless we may need an array of storage space to make this work
efficiently, but at worst we could just have an array of pointers, where
each pointer points to an object in the original array.
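For the uniform-weight case, the no-copy removal idea above can be sketched as a swap-to-end draw (draw is a hypothetical helper; weighted removal needs the bookkeeping discussed elsewhere in the thread):

```cpp
#include <cstdlib>
#include <utility>
#include <vector>

// Draw one item uniformly and retire it: swap the drawn element to
// the end of the live range and shrink the range. No searching or
// shifting of records is needed; items (or pointers to them) stay put.
template <class T>
T draw(std::vector<T>& items, std::size_t& live) {
    std::size_t k = std::rand() % live;   // pick among still-live items
    std::swap(items[k], items[--live]);   // retire the drawn item
    return items[live];
}
```

To avoid copying large records, the vector can hold pointers into the original array, exactly as suggested above.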

Numbers
[0,4] map to index 0, numbers [5,5] map to index 1, etc.
If the above representation were cards instead of die faces, and we draw
the #1 card, how then shall we remove it so that it isn't drawn again the
second time around? This would require the search for and removal of all
instances of #1 from the set. Even if we give the premise that the objects
are sorted while being inserted, we still wouldn't know how many were a
given item, and the worst-case performance would be examining all records
in order to delete, then deleting many records.
OK, so the element #1 may occur several times? That doesn't make sense. If
you wanted that you could just increase the weight of #1.
The balanced tree structure guarantees that at most log_2 records will be
examined. This is the number of bits it takes to represent the total number of records. Thus if we have 32,768 items in the container, then the
worst-case performance would be examining 15 records, and then deleting
one.


OK.
Jul 22 '05 #17
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

Now for the one that you have chosen: relying on std::rand() from

<cstdlib>
puts you at the mercy of the operating system. Your code assumes that the
low order bits of rand() are reasonably well distributed. Please note

that
some random number generators (e.g. straight forward linear congruence
RNGs) have some troubles with the low order bytes. On some operating
systems those might still be in use. Thus it is good practice to use the
high order bytes instead like so (quoted from the man page):

In Numerical Recipes in C: The Art of Scientific Computing
(William H. Press, Brian P. Flannery, Saul A. Teukolsky,
William T. Vetterling; New York: Cambridge University
Press, 1992 (2nd ed., p. 277)), the following comments are
made:
"If you want to generate a random integer between 1
and 10, you should always do it by using high-order
bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits)."


Question: Is it reasonable to assume that a user-specified random number
generator will redefine RAND_MAX? If so, then the above formula is
do-able; otherwise it will rely on an inconsistent constant.

Question #2: Is it appropriate to make this a constraint of user-specified
random number generators? "If you specify a random number generator, then
you must redefine RAND_MAX as necessary."

Hi,
I think RAND_MAX is neither a reasonable expectation nor an appropriate
constraint. Any definition of this identifier will be a global constant and
thus not be tied to a particular instance of an RNG.

Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user-supplied RNG will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std 25.2.11
[lib.alg.random.shuffle].
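A hypothetical adapter in that std::random_shuffle style, built on std::rand() using the high-order bits (RandAdapter is an illustrative name, not code from the thread):

```cpp
#include <cstdlib>

// Function object where R(n) yields a value in [0, n).
struct RandAdapter {
    unsigned long operator()(unsigned long n) {
        // scale the full [0, RAND_MAX] output rather than using %
        return static_cast<unsigned long>(
            n * (std::rand() / (RAND_MAX + 1.0)));
    }
};
```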

The random number generators from the Boost libraries can all be used in
this way, as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG or the RNGs from that library.
Best

Kai-Uwe
Jul 22 '05 #18
Kai-Uwe Bux wrote:
Hi,
I think, RAND_MAX is neither a reasonable expectation nor an appropriate
constraint. Any definition of this identifier will be a global constant
and thus not be tied to a particular instance of an RNG.

Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user-supplied RNG will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way, as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG or the RNGs from that library.
Best

Kai-Uwe


To expand on this a little, I have drafted some code that wraps around the
two RNGs proposed in this thread: here are two classes, RNG_cstdlib and
RNG_crandom. Both implement a constructor that does the seeding, and define

unsigned long operator() ( unsigned long );

The implementation of this operator uses higher-order bits. This is
especially important in the case of RNG_crandom, as the lower-order bits
of that RNG are particularly weak.

The sample code below uses these classes to simulate a coin. It then
proceeds to measure the average length of runs. The expected result is 2.

It should be noted that RNG_crandom consistently has runs that are a little
too long. The reason is that the multiplier 37 is way too small.

#include <cstdlib>

class RNG_cstdlib {
private:

static
const unsigned long rand_max = RAND_MAX;

public:

RNG_cstdlib ( void ) {}

RNG_cstdlib ( unsigned long seed ) {
std::srand( seed );
}

unsigned long operator() ( unsigned long bound ) {
/*
| We will use the higher order bits of rand().
| Moreover, we avoid floating point arithmetic.
| The cost is that we might have to call rand()
| multiple times.
*/
unsigned long int int_width = rand_max / bound;
unsigned long int max_valid = int_width * bound;
/*
| max_valid is at least rand_max / 2.
*/
/*
| The following loop will be traversed
| rand_max/max_valid < 2 times on the average:
*/
while ( true ) {
// discard out-of-range draws and try again
unsigned long int raw_random ( std::rand() );
if ( raw_random < max_valid ) {
return( raw_random / int_width );
}
}
}

}; // RNG_cstdlib

class RNG_crandom {
private:

static
const unsigned long internal_max = 4ul<<30; // 2^32 on 64-bit; wraps to a full-width mask on 32-bit

static
const unsigned long rand_max = 4ul<<28; // 2^30

unsigned long number;

public:

RNG_crandom ( void ) :
number( std::rand() )
{}

RNG_crandom (unsigned long int seed ) :
number( seed + 1 )
{}

unsigned long operator() ( unsigned long bound ) {
unsigned long int int_width = rand_max / bound;
unsigned long int max_valid = int_width * bound;
/*
| max_valid is at least rand_max / 2.
*/
/*
| The following loop will be traversed
| rand_max/max_valid < 2 times on the average:
*/
while ( true ) {
number=(number*37)&( internal_max - 1 );
unsigned long raw_random ( number >>2 );
if ( raw_random < max_valid ) {
return( raw_random / int_width );
}
}
}

}; // RNG_crandom
template < typename RNG >
double average_length_of_runs ( unsigned long tries,
RNG R ) {
unsigned int count = 1;
unsigned int parity = R(2);
for ( unsigned int i = 0; i < tries; ++i ) {
unsigned int next = R(2);
while ( parity == next ) {
++ count;
next = R(2);
}
parity = next;
++count;
}
return( double( count ) / double( tries ) );
}

#include <iostream>

int main ( void ) {
for ( unsigned long seed = 0; seed < 1000000; seed += 12334 ) {
std::cout << average_length_of_runs( 1000000, RNG_cstdlib( seed ) ) <<
" ";
std::cout << average_length_of_runs( 1000000, RNG_crandom( seed ) ) <<
std::endl;
}
}
Jul 22 '05 #19
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message
news:04******************@bgtnsc05-news.ops.worldnet.att.net...
If the above representation were cards instead of die faces, and we draw the
#1 card, how then shall we remove it so that it isn't drawn again the second
time around? This would require the search for and removal of all instances
of #1 from the set. Even if we give the premise that the objects are sorted
while being inserted, we still wouldn't know how many were a given item, and
the worst-case performance would be examining all records in order to
delete, then deleting many records.
OK, so the element #1 may occur several times? That doesn't make sense.

If you wanted that you could just increase the weight of #1.


Sorry, maybe not very clear. It might help if we used chars for the items,
in order to distinguish the items from a virtual lookup table. Let's
redefine the set of objects as {'a', 'b', 'c', 'd', 'e', 'f' }, and say
that 'a' has a 50% chance of being selected. The proposed virtual array
might be { 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }, which refer to item numbers. And
further, let's say that we rolled a three out of ten.

Can item #1 be removed from the virtual mapping array in fewer
examinations/calculations than is linearly proportional to the number of
records?
Jul 22 '05 #20
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:9OJzc.55391
Sorry, maybe not very clear. It might help if we used chars for the items,
in order to distinguish the items from a virtual lookup table. Let's
redefine the set of objects as {'a', 'b', 'c', 'd', 'e', 'f' }, and say
that 'a' has a 50% chance of being selected. The proposed virtual array
might be { 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }, which refer to item numbers. And further, let's say that we rolled a three out of ten.

Can item #1 be removed from the virtual mapping array in fewer
examinations/calculations than is linearly proportional to the number of
records?


If you use a special data structure, perhaps yes. You could use a standard
heap.
Jul 22 '05 #21
"Siemel Naran" <Si*********@REMOVE.att.net> wrote in message
news:4T*******************@bgtnsc05-news.ops.worldnet.att.net...
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:9OJzc.55391
Sorry, maybe not very clear. It might help if we used chars for the items, in order to distinguish the items from a virtual lookup table. Let's
redefine the set of objects as {'a', 'b', 'c', 'd', 'e', 'f' }, and say
that 'a' has a 50% chance of being selected. The proposed virtual array
might be { 1, 1, 1, 1, 1, 2, 3, 4, 5, 6 }, which refer to item numbers. And
further, let's say that we rolled a three out of ten.

Can item #1 be removed from the virtual mapping array in fewer
examinations/calculations than is linearly proportional to the number of records?


If you use a special data structure, perhaps yes. You could use a

standard heap.


And this is what I have been working on: a special kind of data structure,
in the form of a binary tree that never needs rebalancing. It's called a
male-female binary tree, and it can do inserts, deletes, and searches that
each require at most a number of examinations/calculations that is
logarithmically proportional to the number of records.

http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #22
In studying your program, and my container, it has become apparent to me
that the constant RAND_MAX is important, as it defines the valid domain of
the random functions.

It also seems inappropriate to me to declare a function that requires
additional pre-existing outside values; the interface to the function
should specify everything necessary, without any assumptions. The scenario
of a library user trying to use the function, but without (re)defining
RAND_MAX could introduce erroneous assumptions about the actual range of
values returned by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<typename T>
T random_function( T range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able to specify
RAND_MAX to this function, but with a default value. Do you know how this
might be done?
Jul 22 '05 #23
AngleWyrm wrote:
In studying your program, and my container, it has become apparent to me
that the constant RAND_MAX is important, as it defines the valid domain of
the random functions.

It also seems inappropriate to me to declare a function that requires
additional pre-existing outside values; the interface to the function
should specify everything necessary, without any assumptions. The scenario
of a library user trying to use the function, but without (re)defining
RAND_MAX could introduce erroneous assumptions about the actual range of
values returned by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<typename T>
T random_function( T range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able to
specify RAND_MAX to this function, but with a default value. Do you know
how this might be done?


Hi,
am I assuming correctly that you are trying to solve the following problem:

The user shall be allowed to provide a RNG of his choosing. But this RNG is
supposed to be a raw RNG, i.e.:
a) the RNG provides random numbers in [0,N) for some N but
b) this RNG does not support the signature

unsigned long R ( unsigned long upper_bound );

returning a number in [0,upper_bound)

Thus, you want to rescale the raw random numbers in [0,N) into the needed
range yourself by some code of yours. I am not sure if that is a good
design decision. The Standard Library sets a different precedent in
std::random_shuffle, and I think for a good engineering reason:

The caution about low-order bits in the RNG is a good rule of thumb. It
applies foremost to linear congruence RNGs which are quite common. However,
in general some knowledge about the underlying RNG is required to code the
appropriate method of rescaling from [0,RAND_MAX) to [0,upper). Since you
cannot predict the RNG that the client will use, it is probably best to
leave the rescaling problem to the client too.
As for the technical C++ question about templates, what about passing
RAND_MAX as a second parameter:

template < typename T >
T random_function ( T range,
T max_range = static_cast<T>( RAND_MAX ) ) {
// ... rescale a raw draw from [0,max_range] into [0,range) here ...
}

But, maybe I am totally misunderstanding what you are trying to do.
Best

Kai-Uwe
Jul 22 '05 #24
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
am I assuming correctly that you are trying to solve the following:
The user shall be allowed to provide a RNG of his choosing.
But this RNG is supposed to be a raw RNG, i.e.: a) the RNG provides random numbers in [0,N) for some N but
b) this RNG does not support the signature

unsigned long R ( unsigned long upper_bound );

returning a number in [0,upper_bound)

Thus, you want to rescale the raw random numbers in [0,N) into the needed
range yourself by some code of yours. I am not sure if that is a good
design decision. The Standard Library sets a different precedent in
std::random_shuffle, and I think for a good engineering reason:


Hi,
That is not quite what I meant to express. I agree that if an rng is
designed to be given a range, then it should be used that way. I want to
support two signatures:

a). unsigned long R ( unsigned long upper_bound ) // as in the stl
b). int R ( void ) // as in std::rand()

In the first case, the user's rng can be expected to handle its own
domain, but in the second case a scaling function is required, which needs
RAND_MAX (or its equivalent for user-defined functions with that
signature). Thus RAND_MAX is only relevant in the second case. Perhaps it
would be best to deprecate the second case, even though std::rand() is the
default if no random number generator is specified.

I can create a wrapper function for std::rand(), so that there is
essentially only one function signature to support. Then using the two
would be easier:

#include <iostream>
#include <cstdlib>
using namespace std;

// wrapper redefines the function signature for std::rand
unsigned long wrapper( unsigned long range ){
return (unsigned long) /* use std::rand() */ 100;
}

// a custom generator
unsigned long custom( unsigned long range ){
return (unsigned long) /* use custom method */ 200;
}

// templatized container that optionally takes an RNG function
template < class T, unsigned long (*rng_func)(unsigned long range) = wrapper >
class container{
public:
unsigned long (*random)(unsigned long r);
container(void):random(rng_func){}
};

int main()
{
container<int> my_container;
cout << my_container.random(1);

container<int, custom> my_other_container;
cout << endl << my_other_container.random(2);

system("pause");
}
Jul 22 '05 #25
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user-supplied RNG will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way, as the library comes with an appropriate adapter. Most clients of
your class will probably use the default RNG, or RNGs from that library.

New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;

--
AngleWyrm
hat at http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #26
AngleWyrm wrote:
In studying your program, and my container, it has become
apparent to me that the constant RAND_MAX is important, as it
defines the valid domain of the random functions.

It also seems inappropriate to me to declare a function that
requires additional pre-existing outside values; the interface
to the function should specify everything necessary, without any
assumptions. The scenario of a library user trying to use the
function, but without (re)defining RAND_MAX could introduce
erroneous assumptions about the actual range of values returned
by that function.

But an interface should also be as clean and clear as possible.

This functor can be called with or without a range parameter:

template<T>
T random_function( range = (T)RAND_MAX );

but it suffers from the use of RAND_MAX. I would like to be able
to specify RAND_MAX to this function, but with a default value.
Does you know how this might be done?


Do I understand correctly that you want to provide a library of RNG
functions that can be used with your hat? If so, why do you want to
pull in RAND_MAX? RAND_MAX pertains to a particular generator, namely
std::rand(), and you're not bound to it for your own generators. I'd
even think you don't want to be since it would constrain the maximal
sum of weights of hat containees once and for all. Another
possibility is to make your RNGs functors and have them provide
their respective maxima. Let me illustrate what I mean:

#include <limits>
#include <climits>
#include <algorithm>
#include <cstdlib>

struct NonRNG {
static const unsigned long Max = ULONG_MAX; // an enum can't hold numeric_limits<>::max()

NonRNG() : state(0) {} // state must be initialized before use

unsigned long operator() (unsigned long range = Max)
{
return std::min(++state, range);
}

private:
unsigned long state;
};

struct StdRNG {
enum { Max = RAND_MAX };

StdRNG() : state(0) {}
StdRNG(unsigned int seed) : state(seed) {}
StdRNG(const StdRNG& rhs) : state(rhs.state) {}

unsigned long operator() (unsigned long range = Max)
{
std::srand(state);
state = std::rand();
return state % (range + 1);
} // Yes, I have read that % is bad here...

private:
unsigned int state;
};

template<typename T, typename Random = StdRNG>
class hat {
public:
hat(const Random& useRandom = Random()) : random(useRandom) {}

void put(T item, unsigned long weight)
{
if(tree[0].family_weight > Random::Max - weight)
throw "Sum of weights too large!";
// ...
}

T& peek()
{
unsigned long random_weight = random( tree[0].family_weight );
return tree[ find_index(0, random_weight) ].self;
}

private:
Random random;
};

Finally, this might answer your literal question:

template<typename T, T Max = T(RAND_MAX)>
T random_function(T range = Max);
Martin

--
Quidquid latine dictum sit, altum viditur.
Jul 22 '05 #27
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user-supplied RNG will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way, as the library comes with an appropriate adapter. Most clients
of your class will probably use the default RNG, or RNGs from that library.

New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;

--
AngleWyrm
hat at http://home.comcast.net/~anglewyrm/hat.html


Hi,
I had a look. Looks pretty cool.

As for the O(log N) performance: You might want to have a look at

Y. Matias, J.S. Vitter, W.-C. Ni:
Dynamic Generation of Discrete Random Variates
Symposium on Discrete Algorithms 4 (1993) 361-370

They describe a more complicated method that yields better asymptotic
performance, expected amortized O(log* N) to be precise. Moreover, in the
degenerate but important special case of a uniform distribution, their
method is expected amortized constant time. I have not seen any
implementation of their method, so I cannot say whether the overhead
generated by their more involved approach will eat away the asymptotic
advantage for all practical purposes. In particular, I do *not* suggest
changing your implementation. But I thought you might be interested anyway.
Best

Kai-Uwe
Jul 22 '05 #28
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

Personally, I prefer to go with the precedent established by
std::random_shuffle, and assume that the user-supplied RNG will be a
function object R that provides the signature

unsigned long R ( unsigned long );

where R(n) returns a pseudo random number in [0,n). Compare std:25.2.11
[lib.alg.random.shuffle].

The random number generators from the Boost libraries can all be used in
this way, as the library comes with an appropriate adapter. Most clients
of your class will probably use the default RNG, or RNGs from that library.

New revision posted. The hat class now supports optional user-defined
random number generator at construction time, with the signature:

unsigned long rand_func( unsigned long range )

Example: a hat full of strings, using the default rand():
hat<string> hat_instance;

Example#2: a hat full of strings, using a custom random number generator
hat<string, my_random_gen> hat_instance;

--
AngleWyrm
hat at http://home.comcast.net/~anglewyrm/hat.html


Hi,
I have thought a little about the interface:

void put( ... ) // insert an object
T& peek() // obtain a random object with replacement
T pull() // obtain a random object without replacement

Note that the last method has to return the object by value. This could be
inefficient. Maybe a stack container like interface provides a slightly
better solution, like so:

void put() // insert an object
T& peek() // obtain a random object
void pop() // remove the object that was obtained by the last peek()

Now, the client code need not know in advance whether the object
shall be replaced or not. This can be determined based upon which object was
drawn from the hat. If you wanted to do that with the current interface,
you would need a pull() returning the object by copy followed maybe by a
put(), also copying the object.

On second thought, the pull() method is still a very common special case
and probably should be left in the interface. Internally, it could be
realized as:

T pull ( void ) {
T result ( peek() );
pop();
return( result );
}

What do you think?
Best

Kai-Uwe
Jul 22 '05 #29

"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

I have thought a little about the interface:

void put( ... ) // insert an object
T& peek() // obtain a random object with replacement
T pull() // obtain a random object without replacement

Note that the last method has to return the object by value. This could be
inefficient. Maybe a stack container like interface provides a slightly
better solution, like so:

void put() // insert an object
T& peek() // obtain a random object
void pop() // remove the object that was obtained by the last peek()

Now, the client code need not know in advance whether the object
shall be replaced or not. This can be determined based upon which object was
drawn from the hat. If you wanted to do that with the current interface,
you would need a pull() returning the object by copy followed maybe by a
put(), also copying the object.


Your definition of pop() is very limiting. It makes it impossible to look
at several elements of the hat then remove one of them according to some
criterion (for instance, pick three elements at random and remove the median
value).

One possible solution is to have peek() return some kind of iterator into
the hat, and let pop() take an iterator as a parameter. On second thought,
a pop() function taking an iterator parameter is more aptly called erase(),
for consistency with the STL.

Joe Gottman

Jul 22 '05 #30
Joe Gottman wrote:

"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...

I have thought a little about the interface:

void put( ... ) // insert an object
T& peek() // obtain a random object with replacement
T pull() // obtain a random object without replacement

Note that the last method has to return the object by value. This could
be inefficient. Maybe a stack container like interface provides a
slightly better solution, like so:

void put() // insert an object
T& peek() // obtain a random object
void pop() // remove the object that was obtained by the last peek()

Now, the client code need not know in advance whether the object
shall be replaced or not. This can be determined based upon which object was
drawn from the hat. If you wanted to do that with the current interface,
you would need a pull() returning the object by copy followed maybe by a
put(), also copying the object.


Your definition of pop() is very limiting. It makes it impossible to
look
at several elements of the hat then remove one of them according to some
criterion (for instance, pick three elements at random and remove the
median value).

One possible solution is to have peek() return some kind of iterator
into
the hat, and let pop() take an iterator as a parameter. On second
thought, a pop() function taking an iterator parameter is more aptly
called erase(), for consistency with the STL.

Joe Gottman


You are right.

So what about this interface:

class hat {
public:

a) constructor/destructor/...

b) provide an iterator type

iterator insert ( const Object & obj, unsigned long weight = 1 );
// enter an object and return the position where it was entered

iterator get ( void ) const;
// get an iterator to a random element
// according to the current weights

unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter, unsigned long weight );
// my bad, now I do not like this weight anymore.

void erase ( const iterator & iter );
// or maybe, I should get rid of this one altogether?

void clear ( void );
// damn, the whole thing is completely messed up.
// let's get rid of it all.

unsigned long size ( void ) const;
// does what you think it does

iterator begin ( void ) const;
// now, this is a container after all.

iterator end ( void ) const;
// see above.

}; // hat
Best

Kai-Uwe Bux
Jul 22 '05 #31
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
Joe Gottman wrote:
So what about this interface:

class hat {
public:

a) constructor/destructor/...

b) provide an iterator type

iterator insert ( const Object & obj, unsigned long weight = 1 );
// enter an object and return the position where it was entered

iterator get ( void ) const;
// get an iterator to a random element
// according to the current weights

unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter, unsigned long weight );
// my bad, now I do not like this weight anymore.

void erase ( const iterator & iter );
// or maybe, I should get rid of this one altogether?


Revision 1.14:

Implemented the erase(iterator) functionality, and removed the return-by-value
constraint from the pull() function. There is considerable merit to dynamically
changing weight values through a set_weight() function, so it will probably
be next.

Iterators currently do something like the way associative containers return
pairs, although I'm not sure that this is the best way to go. For instance:

hat<int>::iterator iter = some_hat.begin();
cout << "item: " << iter->item << " weight: " << iter->weight;

This offers a way to rapidly access item and weight values, but unfortunately
suggests that it might be possible to directly modify weight through an
iterator. That cannot be the case, just as is true of keys in an associative
container.
--
AngleWyrm
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #32
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:ca**********@news01.cit.cornell.edu...
Joe Gottman wrote:
So what about this interface:

class hat {
public:

a) constructor/destructor/...

b) provide an iterator type

iterator insert ( const Object & obj, unsigned long weight = 1 );
// enter an object and return the position where it was entered

iterator get ( void ) const;
// get an iterator to a random element
// according to the current weights

unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter, unsigned long weight );
// my bad, now I do not like this weight anymore.

void erase ( const iterator & iter );
// or maybe, I should get rid of this one altogether?


Revision 1.14:

Implemented the erase(iterator) functionality, and removed the returns by
value constraint from the pull() function. There is considerable merit to
dynamically changing weight values through the set_weight() functions, so
it will probably be next.

Iterators currently do something like the way associative containers
return pairs, although I'm not sure that this is the best way to go. For
instance:

hat<int>::iterator iter = some_hat.begin();
cout << "item: " << iter->item << " weight: " << iter->weight;

This offers a way to rapidly access item and weight values, but
unfortunately suggests that it might be possible to directly modify weight
through an iterator. That cannot be the case, just as is true of keys in
an associative container.


I found the set/get_weight methods quite useful: for instance, you can
simulate a hat containing 70000 As and 50000 Bs just like so:

CharHat h;
h.insert( 'A', 70000 );
h.insert( 'B', 50000 );
while ( ! h.empty() ) {
CharHat::iterator iter = h.grab();
std::cout << *iter << std::endl;
CharHat::weight_type weight = h.get_weight( iter );
--weight;
if ( weight == 0 ) {
h.erase( iter );
} else {
h.set_weight( iter, weight );
}
}
h.clear();

So, here I am assuming that iterators just point to the objects and not to
some struct containing the object and its weight. As you pointed out, once
you settle on the pair semantics for iterators, it is natural to wish that
something like

iter->weight = 5

changes the weight. If you feel so inclined, you could achieve this using
reference proxies: the iterator would define its operator* and operator->
methods to return reference and pointer proxies which in turn need to
cleverly define the assignment operator= and a conversion operator so that
they can be used as rvalues too. In my opinion, this is not quite worth the
effort. Thus, I would suggest:

template < typename Object,
typename unsigned_type = unsigned long,
typename RNG = CstdlibRandomNumberGenerator >
class hat {
private:
typedef std::vector< Object > ObjectVector;

public:

typedef Object value_type;
typedef typename ObjectVector::iterator iterator;
typedef typename ObjectVector::const_iterator const_iterator;
typedef typename ObjectVector::reference reference;
typedef typename ObjectVector::const_reference const_reference;
typedef typename ObjectVector::pointer pointer;
typedef typename ObjectVector::size_type size_type;
typedef typename ObjectVector::difference_type difference_type;
typedef unsigned_type weight_type;

hat ( void );

hat ( const RNG & rng );

hat ( const hat & other );

~hat ( void );

const hat & operator= ( const hat & other );

bool operator< ( const hat & other );

bool operator<= ( const hat & other );

bool operator> ( const hat & other );

bool operator>= ( const hat & other );

bool operator== ( const hat & other );

bool operator!= ( const hat & other );

void reserve ( size_type cap );

iterator insert ( const Object & obj,
unsigned_type weight = 1 );

const_iterator grab ( void ) const;

template < typename Another_RNG >
const_iterator grab ( Another_RNG & rng ) const;

iterator grab ( void );

template < typename Another_RNG >
iterator grab ( Another_RNG & rng );

weight_type get_weight ( const iterator & iter ) const;

void set_weight ( const iterator & iter,
const unsigned_type & new_weight );

void erase ( const iterator & iter );

void clear ( void );

size_type size ( void ) const;

bool empty ( void ) const;

const_iterator begin ( void ) const;

iterator begin ( void );

const_iterator end ( void ) const;

iterator end ( void );
/*
| The following are just some shortcuts that
| might be useful for standard problems. They also
| serve for compatibility with the hat container by
| AngleWyrm.
*/

const value_type & peek ( void ) const;

template < typename Another_RNG >
const value_type & peek ( Another_RNG & rng ) const;

value_type pull ( void );

template < typename Another_RNG >
value_type pull ( Another_RNG & rng );

}; // hat

This interface is what I have implemented based on your ideas. As you can
see, I prefer somewhat fat interfaces for library containers.
Jul 22 '05 #33
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
AngleWyrm wrote:

So, here I am assuming that iterators just point to the objects and not to
some struct containing the object and its weight. As you pointed out, once
you settle on the pair semantics for iterators, it is natural to wish that
something like

iter->weight = 5

changes the weight. If you feel so inclined, you could achieve this using
reference proxies: the iterator would define its operator* and operator->
methods to return reference and pointer proxies which in turn need to
cleverly define the assignment operator= and a conversion operator so that
they can be used as rvalues too. In my opinion, this is not quite worth the
effort.


This shines some light on an issue about the interface that has been bothering
me for a while now: Ideally an iterator should dereference to the contained
objects, increment over contained objects, not some in-between node. The current
arrangement feels a bit like exposing the implementation, a thing I would
rather not do. In pursuing iterator proxies, it occurred to me that I might just
be glossing over the issue, and with that thought I suddenly felt like I was
working on something partway between a kludge and outright misrepresentation of
the functionality.

So now I am desirous to attain a clean iterator, free from such burdens of the
current implementation. My approach is going to be: rather than storing the data
within a node structure, it will be stored within its own array. It will then
become possible to properly use iterator de-referencing for the items, and
weights will be through a get_weight/set_weight interface. Were the iterators to
point directly to the objects, then comparison functions on objects would be
handled by those objects, as they should be.

I've also been working on a variant of something we pursued in an earlier
thread -- a different set of performance characteristics. In situations where the
container is loaded once, and then drawn upon many times for random items, it
becomes desirable to have the fastest possible access, even at the expense of
a little setup time. It is possible to create a fast_peek() function that has
constant-time access to the contained items, if one pays in advance with an
initialize_fast_peek(). The method behind this scheme is to arrange a lookup
table of iterators, which are repeated according to their weights. The weights
are scaled down by the greatest common divisor of the set of weights, so the
number of times a given object is pointed to by repeated iterators in the table
is minimized.
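Sketched in code (hypothetical names, not the actual hat source; it assumes all
weights are positive, and builds a table of indices rather than iterators for
brevity):

```cpp
#include <vector>
#include <cstddef>

// Euclid's algorithm for the greatest common divisor.
unsigned long gcd(unsigned long a, unsigned long b) {
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Index i is repeated weights[i]/g times, where g is the gcd of all
// weights, so the table is as small as the weights allow. A uniform
// random index into the table then selects item i with probability
// weights[i] / sum(weights), in constant time.
std::vector<std::size_t>
make_fast_peek_table(const std::vector<unsigned long>& weights) {
    if (weights.empty()) return std::vector<std::size_t>();
    unsigned long g = 0;
    for (std::size_t i = 0; i < weights.size(); ++i)
        g = gcd(g, weights[i]);  // gcd(0, w) == w
    std::vector<std::size_t> table;
    for (std::size_t i = 0; i < weights.size(); ++i)
        table.insert(table.end(), weights[i] / g, i);
    return table;
}
```

For weights { 4, 6, 2 } the gcd is 2 and the table is { 0, 0, 1, 1, 1, 2 } --
six entries instead of twelve.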
--
AngleWyrm
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #34
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
unsigned long get_weight ( const iterator & iter ) const;
// I forgot, what was the weight of this one.

void set_weight ( const iterator & iter, unsigned long weight );
// my bad, now I do not like this weight anymore.


Revision 1.20 is up
It now supports the get_weight/set_weight functionality.
The storage mechanism has been changed, so that iterators no longer point to
nodes, but into an actual container of items. It is now possible to dereference
an iterator and get exactly what is expected from such behavior.

--
AngleWyrm
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #35
AngleWyrm wrote:
I've also been working on a variant of something we pursued in an earlier
thread--a different set of performance characteristics. In situations
where the container will be loaded once, and then drawn upon many times
for random items, it becomes desirable to have the fastest possible
access, even at the expense of a little setup time. It is possible to
create a fast_peek() function that has constant-time access to the
contained items, if one pays in advance with an initialize_fast_peek().
The method behind this scheme is to arrange a lookup table of iterators,
which are repeated according to their weights. The weights are scaled by
the greatest common divisor of the set of weights, and thus the number of
times a given object is pointed to by repeated iterators in the table is
minimized.


I am not sure that I understand the method you are proposing. Could you
explain the structure of this lookup table? Also, what would be its size,
and what would be the time needed to set it up? I am deeply confused by
this gcd bit. I hope the table size does not explode if someone specifies
weights like:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.
Best

Kai-Uwe Bux
Jul 22 '05 #36
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
I am not sure that I understand the method you are proposing. Could you
explain the structure of this lookup table? Also, what would be its size,
and what would be the time needed to set it up? I am deeply confused by
this gcd bit. I hope the table size does not explode if someone specifies
weights like:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.


You have hit it on the head as far as the performance cost of initializing.
The size of the table of pointers would be a function of the weights; the
largest table size is the sum of the weights -- a worst-case scenario that
occurs when the weights share no common divisor (unique primes, for example).
Setup time is also atrocious, being a function of the weights times N,
significantly worse than linear. It's probably not worth the cost, in both
space and time, of a setup routine in exchange for constant-time read-only
access later.

--
AngleWyrm
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #37
AngleWyrm wrote:
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
I am not sure that I understand the method you are proposing. Could you
explain the structure of this lookup table? Also, what would be its size,
and what would be the time needed to set it up? I am deeply confused by
this gcd bit. I hope the table size does not explode if someone specifies
weights like:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.


You have hit it on the head as far as the performance cost of
initializing. The size of the table of pointers would be a function of the
weights; the largest table size is the sum of the weights -- a worst-case
scenario that occurs when the weights share no common divisor (unique
primes, for example). Setup time is also atrocious, being a function of
the weights times N, significantly worse than linear. It's probably not
worth the cost, in both space and time, of a setup routine in exchange
for constant-time read-only access later.


Hi,
from this reply, I can actually guess what you had in mind. Now, that
lookup table would allow you to draw items with replacement. Removing an
object from the table would mean to remove all the copies of iterators
pointing to it, which comes with prohibitive costs.

Thus, the problem that you really want to solve is this:

Given weights on a finite set, realize a data structure that allows for
constant-time generation of a random element of the set with the
probability distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/26, b ....

In the 1970s, A.J. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:

=================
| left|right
| a: 15 | b
| b: 25 | c
| c: 26 | x
=================

To generate a random letter, proceed as follows:

1) choose a letter L in { a, b, c } with *equal* probability.
2) generate a random integer X uniformly in [0,26).
3) if X < left(L) return L.
4) otherwise, return right(L).

It is a fun exercise to verify that this actually generates the
distribution we wanted. It is also an interesting exercise to check that
such a magic table always can be found, and that the time needed to
construct it is actually linear in the number of letters. You might want to
read Knuth: TAOCP Vol 2 page 120f.

Here is a piece of code that might be useful to you. It implements
Walker's method.

Check:

a.out | sort -n | uniq -c

produced:

4929 0
12032 1
9039 2

So, you can see that this really models the distribution that was
specified.

I am sorry, but I did not have the time to document the code properly.

// walker.cc (C) Kai-Uwe Bux [2004]
// ================================

#include <vector>
#include <utility>

class WalkerAlias {
/*
| This class crudely implements J.A. Walker's alias method.
| See:
|
| D.E. Knuth: "The Art of Computer Programming",
| Vol. 2, p. 120f
*/
public:

typedef unsigned long WeightType;
typedef std::vector< unsigned long > WeightVector;
typedef WeightVector::size_type IndexType;
typedef std::pair< WeightType, IndexType > EntryType;

private:

std::vector< EntryType > table;
WeightType total;

// ==================================================
// | accessor methods |
// ==================================================

inline
const IndexType & the_index ( const IndexType & pos ) const {
return( table[pos].second );
}

inline
IndexType & the_index ( const IndexType & pos ) {
return( table[pos].second );
}

inline
const WeightType & the_weight ( const IndexType & pos ) const {
return( table[pos].first );
}

inline
WeightType & the_weight ( const IndexType & pos ) {
return( table[pos].first );
}

inline
IndexType the_size ( void ) const {
return( table.size() );
}

// ==================================================
// | init data |
// ==================================================
/*
| This init routine is a bad hack: it is hard to
| understand and only conjectured to be O(n).
*/

inline
void setup_table ( const WeightVector & weights ) {
table.reserve( weights.size() );
for ( IndexType pos = 0; pos < weights.size(); ++pos ) {
table.push_back( EntryType( weights[pos]*weights.size(),
weights.size() ) );
total += weights[pos];
}
IndexType big_begin = 0;
for ( IndexType pos = 0; pos < the_size(); ++pos ) {
fix_slot( pos, big_begin );
}
}

inline
void fix_slot ( IndexType slot,
IndexType & begin_big ) {
// pre: all weights in [0,begin_big) are at most average.
// all slots in [0,slot) are fine.
// post: all weights in [0,begin_big) are at most average.
// all slots in [0,slot] are fine.
// side: begin_big does not decrease but may increase.
while( ( the_index( slot ) == the_size() )
&&
( the_weight( slot ) < total ) ) {
// not yet fixed and in need of a complement
while ( the_weight( begin_big ) <= total ) {
++ begin_big;
}
the_index( slot ) = begin_big;
the_weight( begin_big ) -= ( total - the_weight( slot ) );
slot = begin_big;
}
}

// ==================================================
// | random choice |
// ==================================================

template < typename RNG >
inline
IndexType grab ( RNG & rng ) const {
/*
| * Pick a random index according to the
| originally specified weights.
| * This clearly is a constant time routine: No looping at all.
| However, we need two calls to the RNG.
*/
IndexType slot ( rng( the_size() ) );
if ( rng( total ) < the_weight( slot ) ) {
return( slot );
} else {
return( the_index( slot ) );
}
}

public:

WalkerAlias ( const WeightVector & weights ) :
total ( 0 )
{
setup_table( weights );
}

~WalkerAlias ( void ) {
}

template < typename RNG >
inline
IndexType operator() ( RNG & rng ) const {
return( grab( rng ) );
}

}; // WalkerAlias
#include <iostream>
#include <cstdlib>

class CstdlibRandomNumberGenerator {
private:

static
const unsigned long rand_max = RAND_MAX;

public:

CstdlibRandomNumberGenerator ( void ) {}

CstdlibRandomNumberGenerator ( unsigned long seed ) {
std::srand( seed );
}

unsigned long operator() ( unsigned long bound ) {
/*
| We will use the higher order bits of rand().
| Moreover, we avoid floating point arithmetic.
| The cost is that we might have to call rand()
| multiple times.
*/
unsigned long int int_width = rand_max / bound;
unsigned long int max_valid = int_width * bound;
/*
| max_valid is at least rand_max / 2.
|
| The following loop will be traversed
| rand_max/max_valid < 2 times on the average:
*/
while ( true ) {
unsigned long int raw_random ( std::rand() );
if ( raw_random < max_valid ) {
return( raw_random / int_width );
}
}
}

}; // CstdlibRandomNumberGenerator

int main ( void ) {
CstdlibRandomNumberGenerator rng ( 129342 );
WalkerAlias::WeightVector weights;
weights.push_back( 5 );
weights.push_back( 12 );
weights.push_back( 9 );
WalkerAlias alias ( weights );
for ( unsigned long i = 0; i < 26000; ++i ) {
std::cout << alias( rng ) << std::endl;
}
}

// end of file
Best

Kai-Uwe
Jul 22 '05 #38
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
Thus, the problem that you really want to solve is this:

Given weights on a finite set, realize a data structure that allows for
constant-time generation of a random element of the set with the
probability distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/26, b ....

In the 1970s, A.J. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:

=================
| left|right
| a: 15 | b
| b: 25 | c
| c: 26 | x
=================

To generate a random letter, proceed as follows:

1) choose a letter L in { a, b, c } with *equal* probability.
2) generate a random integer X uniformly in [0,26).
3) if X < left(L) return L.
4) otherwise, return right(L).

It is a fun exercise to verify that this actually generates the
distribution we wanted. It is also an interesting exercise to check that
such a magic table always can be found, and that the time needed to
construct it is actually linear in the number of letters. You might want to
read Knuth: TAOCP Vol 2 page 120f.

Here is a piece of code that might be useful to you. It implements
Walker's method.

Check:

a.out | sort -n | uniq -c

produced:

4929 0
12032 1
9039 2

So, you can see that this really models the distribution that was
specified.


UNIX :)
Wound up downloading a uniq port from sourceforge just to perform the check in
DOS. If anyone is interested in such command-line utils for DOS, they can be had
at:
http://sourceforge.net/project/showf...ckage_id=26492

I just got a copy of "Random Variables With General Distributions" by Alastair
J. Walker, but the original OCR scans at the ACM are buggy, and the paper is
written in FORTRAN, with single-letter variables. I'm currently translating it
to C++ so that I can see what he did first-hand, because I have to decipher the
FORTRAN rather than read it. From what I've seen so far, it appears that his
algorithm will make it quite possible to use floating-point weights, which has
been requested as a feature.

I haven't gotten to the magic-table part yet, but it's very interesting to me.

Don't yet have a copy of TAOCP (I know, sinner that I am) but I'm gonna have to
pick one up soon.
Jul 22 '05 #39
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:YwLHc.60533$Oq2.12929@attbi_s52...
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
Thus, the problem that you really want to solve is this:

Given weights on a finite set, realize a data structure that allows for
constant-time generation of a random element of the set with the
probability distribution defined by the given weights.

Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.

We want a fast algorithm that yields a with probability 5/26, b ....

In the 1970s, A.J. Walker came up with the following idea. I will explain
it for the example above. Consider the following table:


After some study of both A.J. Walker's alias method (ACM, "An Efficient
Method For Generating Discrete Random Variables With General Distributions",
p. 253) and Donald Knuth's implementation of the table generation method
(TAOCP vol 2, Seminumerical Algorithms, p. 550), it seems that this technique
is very efficient when the probability distribution is fixed before selection.
This addresses a large subset of uses, but leaves out a particular class of
problem...

In the case where the distribution changes dynamically as choices are added to
or removed from the set, the above technique must recalculate a lookup table for
each insert/delete operation, a linear-time operation. Examples of this problem
space are: drawing weighted lottery balls, random selection from a changing
client list, or estimation by successive approximation.

--
AngleWyrm
The C++ hat random selection container:
http://home.comcast.net/~anglewyrm/hat.html
Jul 22 '05 #41
AngleWyrm wrote:
"AngleWyrm" <no***************@hotmail.com> wrote in message
news:YwLHc.60533$Oq2.12929@attbi_s52...
"Kai-Uwe Bux" <jk********@gmx.net> wrote in message
news:cb**********@news01.cit.cornell.edu...
> Thus, the problem that you really want to solve is this:
> Given weights on a finite set, realize a data structure that allows
> for constant-time generation of a random element of the set with the
> probability distribution defined by the given weights.

> Example: Set { a, b, c } weights: w(a) = 5, w(b)=12, w(c)=9.
>
> We want a fast algorithm that yields a with probability 5/26, b ....

> In the 1970s, A.J. Walker came up with the following idea. I will
> explain it for the example above. Consider the following table:


After some study of both A.J. Walker's alias method (ACM, "An Efficient
Method For Generating Discrete Random Variables With General
Distributions", p. 253) and Donald Knuth's implementation of the table
generation method (TAOCP vol 2, Seminumerical Algorithms, p. 550), it
seems that this technique is very efficient when the probability
distribution is fixed before selection. This addresses a large subset of
uses, but leaves out a particular class of problem...

In the case where the distribution changes dynamically as choices are
added to or removed from the set, the above technique must recalculate a
lookup table for each insert/delete operation, a linear-time operation.
Examples of this problem space are: drawing weighted lottery balls, random
selection from a changing client list, or estimation by successive
approximation.


Yup, updating the magic table for Walker's method is unfortunately linear
time. On the other hand, for the more general problem, you already have a
solution where all operations are O(log(n)) time: generating a random
variable, adding / deleting an option, and changing a weight. It is tricky
to do better than O(log(n)), and it might not be worth the effort since the
set of options has to be *really* big to take advantage of the asymptotic
improvement.
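For completeness, here is one standard way to get those O(log(n)) bounds (just
a sketch with hypothetical names, and not necessarily how your hat does it
internally): keep the weights in a Fenwick tree, so that updating a weight and
searching the cumulative distribution are both logarithmic.

```cpp
#include <vector>
#include <cstddef>

class WeightedPicker {
    std::vector<long> tree;  // 1-based Fenwick tree over the weights
    long total;
public:
    explicit WeightedPicker(std::size_t n) : tree(n + 1, 0), total(0) {}

    // Add delta (possibly negative, e.g. to delete an option) to the
    // weight of item i; O(log n).
    void add(std::size_t i, long delta) {
        total += delta;
        for (std::size_t j = i + 1; j < tree.size(); j += j & (~j + 1))
            tree[j] += delta;
    }

    // Given r uniform in [0, total), return the item whose interval of
    // cumulative weight contains r; O(log n).
    std::size_t pick(long r) const {
        std::size_t pos = 0;
        std::size_t mask = 1;
        while (mask * 2 < tree.size())
            mask *= 2;
        for (; mask > 0; mask /= 2) {
            if (pos + mask < tree.size() && tree[pos + mask] <= r) {
                pos += mask;
                r -= tree[pos];
            }
        }
        return pos;  // 0-based index of the selected item
    }

    long sum() const { return total; }  // feed rng(sum()) into pick()
};
```

With weights 5, 12, 9, pick(r) maps r in [0,5) to item 0, [5,17) to item 1,
and [17,26) to item 2; changing a weight is a single add() call.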

Somewhere up in this thread, I got the impression that you were now
searching for an add-on that would provide increased performance for the
special case of a fixed set of options with non-changing weights.
Best

Kai-Uwe Bux
Jul 22 '05 #42
