Does shuffle() produce uniform result ?

Hi,

I have read the source code of the built-in random module, random.py.
After also reading Wiki article on Knuth Shuffle algorithm, I wonder if
the shuffle method implemented in random.py produces results with modulo
bias.

The reasoning is as follows: Because the method random() only produces
finitely many possible results, we get modulo bias when the number of
possible results is not divisible by the size of the shuffled list.

1. Does shuffle() produce uniform result ?

2. If not, is there a fast and uniform shuffle() available somewhere ?

Thanks !

-tooru honda
Aug 24 '07 #1
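The concern raised in the question can be seen at toy scale. The sketch below is only an illustration: it assumes a hypothetical generator with 8 equally likely outputs (nothing like the real random.py generator) and reduces those outputs modulo 3. Because 8 is not divisible by 3, one residue is hit less often than the others.

```python
from collections import Counter

def residue_counts(num_outputs, size):
    """Count how many of num_outputs equally likely values land on each residue mod size."""
    return Counter(n % size for n in range(num_outputs))

# 8 outputs reduced mod 3: residues 0 and 1 are hit three times, residue 2 only twice.
print(sorted(residue_counts(8, 3).items()))  # [(0, 3), (1, 3), (2, 2)]

# 9 outputs reduced mod 3: every residue is hit exactly three times, so no bias.
print(sorted(residue_counts(9, 3).items()))  # [(0, 3), (1, 3), (2, 3)]
```

The relative size of the bias shrinks as the number of generator outputs grows compared with the list size, which is the point the replies below make.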
tooru honda wrote:
> Hi,
>
> I have read the source code of the built-in random module, random.py.
> After also reading Wiki article on Knuth Shuffle algorithm, I wonder if
> the shuffle method implemented in random.py produces results with modulo
> bias.
>
> The reasoning is as follows: Because the method random() only produces
> finitely many possible results, we get modulo bias when the number of
> possible results is not divisible by the size of the shuffled list.
>
> 1. Does shuffle() produce uniform result ?

Given the cycle length of the Mersenne Twister algorithm that generates
the underlying random numbers, it would have to be a pretty long list to
see a significant bias, don't you think? Have you done any calculations
on the length of list you propose to use?

> 2. If not, is there a fast and uniform shuffle() available somewhere ?

Frankly I don't think you need to worry.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
--------------- Asciimercial ------------------
Get on the web: Blog, lens and tag the Internet
Many services currently offer free registration
----------- Thank You for Reading -------------

Aug 24 '07 #2
tooru honda <to*********@fast-mail.org> writes:
> I have read the source code of the built-in random module,
> random.py. After also reading Wiki article on Knuth Shuffle
> algorithm, I wonder if the shuffle method implemented in random.py
> produces results with modulo bias.

It doesn't have modulo bias because it doesn't use modulo to produce a
random index; it multiplies the floating point value with the desired
range. I'm not sure if that method produces any measurable bias.
Aug 24 '07 #3
On Aug 24, 8:54 am, Hrvoje Niksic <hnik...@xemacs.org> wrote:
> tooru honda <tooru_ho...@fast-mail.org> writes:
>> I have read the source code of the built-in random module,
>> random.py. After also reading Wiki article on Knuth Shuffle
>> algorithm, I wonder if the shuffle method implemented in random.py
>> produces results with modulo bias.
>
> It doesn't have modulo bias because it doesn't use modulo to produce a
> random index; it multiplies the floating point value with the desired
> range. I'm not sure if that method produces any measurable bias.

It produces exactly the same level of bias as the 'modulo bias'
obtained by reducing a random integer in [0, 2**53). For example,
suppose we're trying to produce an integer x in the range 0 through 6
inclusive. If n is a random variable whose values are uniformly
distributed across range(0, 2**53) then:

x = n % 7

will produce 0, 1, 2 and 3 with probability (2**53//7+1)/2**53, and 4,
5 and 6 with probability (2**53//7)/2**53, while

x = floor((n/2**53)*7)

will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
4 and 6 with probability (2**53//7)/2**53.

Either way, you'd have a very hard time detecting such a tiny bias.
At the other end of the scale, if you're trying to produce a value in
[0, 2**53-2] (for example) then it looks worse: with either method,
one of the values occurs exactly twice as often as all of the others.
But since there are now so many values, you'd again have problems
detecting any bias.

Steve Holden wrote:
> Frankly I don't think you need to worry.

What he said.

Mark

Aug 25 '07 #4
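The counts in the post above can be checked with exact integer arithmetic. This is a sketch, not anything taken from random.py: the helper names are made up for illustration, and the scaling variant deliberately ignores floating-point rounding by computing (n * k) // total on integers (the follow-up post deals with the rounding).

```python
def modulo_counts(total, k):
    """How many n in range(total) produce each value of n % k."""
    q, r = divmod(total, k)
    return [q + 1 if v < r else q for v in range(k)]

def scaling_counts(total, k):
    """How many n in range(total) produce each value of (n * k) // total (exact integers)."""
    def first(v):
        # smallest n with n * k >= v * total, i.e. ceil(v * total / k)
        return -(-v * total // k)
    return [first(v + 1) - first(v) for v in range(k)]

N = 2**53
q = N // 7
# 1 marks the values that occur q + 1 times, 0 those that occur q times.
print([c - q for c in modulo_counts(N, 7)])   # [1, 1, 1, 1, 0, 0, 0]
print([c - q for c in scaling_counts(N, 7)])  # [1, 1, 0, 1, 0, 1, 0]

# Range of size 2**53 - 1: the quotient and remainder are both 1,
# so exactly one value is produced twice and all the others once.
print(divmod(N, N - 1))                       # (1, 1)
```

Either way the favoured values differ from the others by one count out of roughly 1.3e15, which is why the bias is so hard to detect.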
tooru honda <to*********@fast-mail.org> writes:
> The reasoning is as follows: Because the method random() only produces
> finitely many possible results, we get modulo bias when the number of
> possible results is not divisible by the size of the shuffled list.
>
> 1. Does shuffle() produce uniform result ?

The nonuniformity is too small to matter. But what is the
application? If you are doing something like implementing online
poker for real money, you shouldn't use the built-in RNG. It is not
designed for what we call adversarial indistinguishability from true
randomness. Instead, use the random byte stream available from
os.urandom() and derive your random numbers from that.
Aug 25 '07 #5
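As a rough illustration of "derive your random numbers from that", the sketch below draws an unbiased integer below n from os.urandom by masking to the smallest sufficient number of bits and rejecting out-of-range candidates. The function name urandom_below is hypothetical. In current Python 3 the standard library already offers random.SystemRandom and secrets.randbelow for this purpose, so hand-rolled code like this is mainly useful for seeing the idea.

```python
import os

def urandom_below(n):
    """Return a uniform integer in range(n) built from os.urandom bytes."""
    if n <= 0:
        raise ValueError("n must be positive")
    bits = (n - 1).bit_length()        # bits needed to represent n - 1
    nbytes = (bits + 7) // 8 or 1      # read at least one byte
    mask = (1 << bits) - 1
    while True:
        candidate = int.from_bytes(os.urandom(nbytes), "big") & mask
        if candidate < n:              # reject values outside range(n)
            return candidate

print(urandom_below(416))  # some integer in 0..415
```

Because candidates are masked to the next power of two at or above n, each attempt is accepted with probability greater than one half.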
On Aug 24, 9:30 pm, Mark Dickinson <dicki...@gmail.com> wrote:
> x = floor((n/2**53)*7)
>
> will produce 0, 1, 3 and 5 with probability (2**53//7+1)/2**53, and 2,
> 4 and 6 with probability (2**53//7)/2**53.

Oops---I lied; I forgot to take into account the rounding implicit in
the (n/2**53)*7 multiplication. A bit of experimentation shows that
it's 0, 2, 4 and 6 that occur more often, with 1, 3 and 5 less likely
by a minuscule amount (at least on an IEEE-754 system).

Mark

Aug 25 '07 #6
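The effect of the rounding can be checked without enumerating all 2**53 inputs. The sketch below relies on an observation that is an assumption of this example rather than something stated in the thread: int((n / 2**53) * 7) never decreases as n grows, so the first n that reaches each output value can be found by binary search. It assumes IEEE-754 double precision, as the post does.

```python
def first_reaching(v, k=7, bits=53):
    """Smallest n in [0, 2**bits] with int((n / 2**bits) * k) >= v, using float arithmetic."""
    lo, hi = 0, 2**bits
    while lo < hi:
        mid = (lo + hi) // 2
        if int((mid / 2**bits) * k) >= v:
            hi = mid
        else:
            lo = mid + 1
    return lo

def float_scaling_counts(k=7, bits=53):
    """How many n in range(2**bits) produce each value 0..k-1 under float scaling."""
    bounds = [first_reaching(v, k, bits) for v in range(k + 1)]
    return [bounds[v + 1] - bounds[v] for v in range(k)]

q = 2**53 // 7
# 1 marks values occurring q + 1 times, 0 those occurring q times.
print([c - q for c in float_scaling_counts()])  # [1, 0, 1, 0, 1, 0, 1]
```

This agrees with the correction above: with rounding included, 0, 2, 4 and 6 each occur one more time than 1, 3 and 5.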
Hi, First of all, my thanks to all of you who replied.

I am writing a gambling simulation to convince my friend that his "winning
strategy" doesn't work. I use the shuffle method from a random.SystemRandom
instance to shuffle 8 decks of cards.

As the number of cards is quite small (number of cards is 416), the
nonuniformity doesn't matter as most of you have already said. Just to
avoid argument from my friend, I am considering writing my own randint
and shuffle methods based on os.urandom() though.

-tooru honda
Aug 25 '07 #7
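For reference, the setup described above might look roughly like this in Python 3. The card encoding and the function name shuffled_shoe are invented for the example; only the use of random.SystemRandom().shuffle on 8 decks comes from the post.

```python
import random

RANKS = "A23456789TJQK"
SUITS = "SHDC"

def shuffled_shoe(num_decks=8, rng=None):
    """Build num_decks standard 52-card decks and shuffle them with an OS-backed RNG."""
    if rng is None:
        rng = random.SystemRandom()   # draws its randomness from os.urandom
    shoe = [rank + suit
            for _ in range(num_decks)
            for suit in SUITS
            for rank in RANKS]
    rng.shuffle(shoe)
    return shoe

shoe = shuffled_shoe()
print(len(shoe))  # 416
```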
In the end, I think it is worthwhile to implement my own shuffle and
random methods based on os.urandom. Not only does the resulting code
get rid of the minuscule bias, but the program also runs much faster.

When using random.SystemRandom.shuffle, posix.open and posix.close from
calling os.urandom account for almost half of the total execution time
for my program. By implementing my own random and getting a much larger
chunk of random bytes from os.urandom each time, I am able to reduce the
total execution time by half.

-tooru honda

P.S. I use python 2.5.1 on MacOSX 10.4.10 (PowerPC).
Aug 25 '07 #8
tooru honda <to*********@fast-mail.org> wrote:
> In the end, I think it is worthwhile to implement my own shuffle and
> random methods based on os.urandom. Not only does the resulting code
> get rid of the minuscule bias, but the program also runs much faster.
>
> When using random.SystemRandom.shuffle, posix.open and posix.close from
> calling os.urandom account for almost half of the total execution time
> for my program. By implementing my own random and getting a much larger
> chunk of random bytes from os.urandom each time, I am able to reduce the
> total execution time by half.

If I were in your shoes, I would optimize by subclassing
random.SystemRandom and overriding the random method to use os.urandom
with some large block size and then parcel it out, instead of the
_urandom(7) that it now uses. E.g., something like:

    import binascii
    import os
    import random

    class SystemBlockRandom(random.SystemRandom):

        def __init__(self):
            random.SystemRandom.__init__(self)
            def rand7():
                # fetch 7 KiB per os.urandom call and hand it out 7 bytes at a time
                while True:
                    randata = os.urandom(7*1024)
                    for i in xrange(0, 7*1024, 7):
                        yield long(binascii.hexlify(randata[i:i+7]), 16)
            self.rand7 = rand7().next

        def random(self):
            """Get the next random number in the range [0.0, 1.0)."""
            # keep the top 53 of the 56 random bits, then scale into [0.0, 1.0)
            return (self.rand7() >> 3) * random.RECIP_BPF

(untested code). No need to reimplement anything else, it seems to me.
Alex
Aug 25 '07 #9
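The code above is Python 2 (xrange, long, generator .next). A Python 3 rendering of the same idea might look like the sketch below; the block_count parameter and the private names are additions for illustration, and the scaling constant 2.0**-53 is written out instead of relying on a module-level constant.

```python
import os
import random

class SystemBlockRandom(random.SystemRandom):
    """SystemRandom variant that reads os.urandom in large blocks (illustrative sketch)."""

    def __init__(self, block_count=1024):
        super().__init__()
        self._chunks = self._rand7(block_count)

    @staticmethod
    def _rand7(block_count):
        # Fetch 7 * block_count bytes per os.urandom call, then yield 7 bytes at a time.
        while True:
            data = os.urandom(7 * block_count)
            for i in range(0, len(data), 7):
                yield int.from_bytes(data[i:i + 7], "big")

    def random(self):
        """Get the next random number in the range [0.0, 1.0)."""
        # 56 random bits shifted down to 53, scaled by 2**-53.
        return (next(self._chunks) >> 3) * 2.0 ** -53

rng = SystemBlockRandom()
deck = list(range(52))
rng.shuffle(deck)
print(len(deck))  # 52
```

Whether shuffle() and randrange() actually route through the overridden random() or through the inherited getrandbits() depends on the Python version, so it is worth profiling before assuming the block reads are what gets used.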
By incorporating Alex's code, I got another performance boost of 20%.
It is mostly due to Alex's more efficient implementation of block random
than my own version.

-tooru honda

Below is the code I have now:
    import random
    from binascii import hexlify
    from os import urandom

    class rcRandomC(random.SystemRandom):

        def __init__(self):
            random.SystemRandom.__init__(self)

            def rand2():
                # read 2 KiB per urandom call; yield one integer in [0, 65535] per byte pair
                while True:
                    randata = urandom(2*1024)
                    for i in xrange(0, 2*1024, 2):
                        yield int(hexlify(randata[i:i+2]), 16)

            self.rand2_M = rand2().next

        # modified from random._randbelow
        def randrange(self, startN, stopN):
            """Choose a random integer from range(startN, stopN).

            Requires widthN = stopN - startN <= 65536.
            """
            widthN = stopN - startN
            left_over_N = 65536 % widthN
            upper_bound_N = 65535 - left_over_N

            # redraw anything above the last full multiple of widthN,
            # so that the modulo below is unbiased
            random_number = self.rand2_M()
            while random_number > upper_bound_N:
                random_number = self.rand2_M()

            r = random_number % widthN
            return startN + r

        def shuffle(self, x):
            """Shuffle list x in place; return None."""
            randrange = self.randrange

            for i in reversed(xrange(1, len(x))):
                # pick an element in x[:i+1] with which to exchange x[i]
                j = randrange(0, i+1)
                x[i], x[j] = x[j], x[i]
Aug 26 '07 #10
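A Python 3 sketch of the same rejection scheme follows. It is a loose port rather than the poster's code: the function names are invented, and the source of 16-bit words is passed in as an iterator so that the uniformity of the reduction can be checked by feeding it every possible word exactly once.

```python
import os
from collections import Counter

def urandom_words(block=1024):
    """Yield integers in [0, 65535] decoded from block-sized os.urandom reads."""
    while True:
        data = os.urandom(2 * block)
        for i in range(0, len(data), 2):
            yield int.from_bytes(data[i:i + 2], "big")

def randrange16(words, start, stop):
    """Uniform integer in range(start, stop); requires 1 <= stop - start <= 65536."""
    width = stop - start
    if not 1 <= width <= 65536:
        raise ValueError("width must be in 1..65536")
    limit = 65536 - 65536 % width      # largest multiple of width that fits in 16 bits
    for word in words:
        if word < limit:               # reject words in the incomplete final group
            return start + word % width
    raise RuntimeError("word source exhausted")

def shuffle16(x, words=None):
    """Fisher-Yates shuffle of list x in place (len(x) <= 65536)."""
    words = urandom_words() if words is None else words
    for i in reversed(range(1, len(x))):
        j = randrange16(words, 0, i + 1)   # pick a position in x[:i+1]
        x[i], x[j] = x[j], x[i]

# Feed every 16-bit word once: each of the 416 results appears the same number of times.
source = iter(range(65536))
counts = Counter()
try:
    while True:
        counts[randrange16(source, 0, 416)] += 1
except RuntimeError:
    pass
print(len(counts), set(counts.values()))  # 416 {157}
```

For a width of 416 the scheme discards 65536 % 416 = 224 of the 65536 possible words, so about 0.3% of draws are repeated.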
tooru honda <to*********@fast-mail.org> wrote:
> ...
>     def rand2():
>         while True:
>             randata = urandom(2*1024)
>             for i in xrange(0, 2*1024, 2):
>                 yield int(hexlify(randata[i:i+2]),16) # integer in [0,65535]

Another equivalent possibility, which might be faster:

    import array
    ...
    def rand2():
        while True:
            x = array.array("H")
            x.fromstring(urandom(2*4000))
            for i in x: yield i
Alex
Aug 26 '07 #11
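In Python 3 the array-based variant needs a small change, because array.fromstring was removed in Python 3.9 in favour of frombytes. A sketch, with a block parameter added for illustration and assuming the platform's "H" type code is 2 bytes wide (the usual case):

```python
import os
from array import array

def rand2(block=4000):
    """Yield unsigned 16-bit integers decoded from block-sized os.urandom reads."""
    if array("H").itemsize != 2:
        raise RuntimeError("this sketch assumes a 2-byte unsigned short")
    while True:
        chunk = array("H")
        chunk.frombytes(os.urandom(2 * block))   # bulk-decode in native byte order
        yield from chunk

words = rand2()
print(0 <= next(words) <= 65535)  # True
```

Native byte order is fine here because the bytes are random either way.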
On 2007-08-26, tooru honda <to*********@fast-mail.org> wrote:
> By incorporating Alex's code, I got another performance boost of 20%.
> It is mostly due to Alex's more efficient implementation of block random
> than my own version.

If I understand correctly that you are using urandom as a random
generator I wouldn't rely too much on this performance. Urandom
uses the systemwide entropy-pool. If other programs need this pool
too, your performance can drop spectacularly. If you are using a Linux
machine just try to execute "od -x /dev/random" at the same time
as your program.

--
Antoon Pardon
Sep 3 '07 #12
Antoon Pardon <ap*****@forel.vub.ac.be> writes:
> If I understand correctly that you are using urandom as a random
> generator I wouldn't rely too much on this performance. Urandom
> uses the systemwide entropy-pool. If other programs need this pool
> too, your performance can drop spectacularly.

No, the idea is that once there's enough entropy in the pool to make
one encryption key (say 128 bits), the output of /dev/urandom is
computationally indistinguishable from random output no matter how
much data you read from it.
Sep 3 '07 #13
On 2007-09-03, Paul Rubin wrote:
> Antoon Pardon <ap*****@forel.vub.ac.be> writes:
>> If I understand correctly that you are using urandom as a random
>> generator I wouldn't rely too much on this performance. Urandom
>> uses the systemwide entropy-pool. If other programs need this pool
>> too, your performance can drop spectacularly.
>
> No, the idea is that once there's enough entropy in the pool to make
> one encryption key (say 128 bits), the output of /dev/urandom is
> computationally indistinguishable from random output no matter how
> much data you read from it.

If you were talking about /dev/random I would agree. But this is what
the man page on my system says about /dev/urandom:

    A read from the /dev/urandom device will not block waiting for
    more entropy. As a result, if there is not sufficient entropy in
    the entropy pool, the returned values are theoretically vulnerable
    to a cryptographic attack on the algorithms used by the driver.
    Knowledge of how to do this is not available in the current
    non-classified literature, but it is theoretically possible that
    such an attack may exist. If this is a concern in your
    application, use /dev/random instead.

And reading from /dev/random can block if there is not enough entropy.

--
Antoon Pardon

Sep 4 '07 #14
Antoon Pardon <ap*****@forel.vub.ac.be> writes:
>> No, the idea is that once there's enough entropy in the pool to make
>> one encryption key (say 128 bits), the output of /dev/urandom is
>> computationally indistinguishable from random output no matter how
>> much data you read from it.
>
> If you were talking about /dev/random I would agree. But this is what
> the man page on my system says about /dev/urandom. ...
> the returned values are theoretically vulnerable to a
> cryptographic attack on the algorithms used by the driver.

Right. The idea is that those attacks don't exist and therefore the
output is computationally indistinguishable from random. Of course,
whether the idea is correct is an unproven conjecture, but it looks
pretty good; certainly finding any problem with the specific
algorithms in urandom would be a significant research discovery and
not likely to affect the application being discussed. Finding a
general attack that couldn't be fixed with some simple tweak would be
a major crypto breakthrough that would probably reshape a lot of
complexity theory. This is unlike the situation with Mersenne
Twister, which was not designed for indistinguishability against an
opponent who knew what to look for.

In short, using /dev/random is fairly silly once you know there's
enough entropy in the randomness pool to make a good key. If
/dev/urandom's algorithms are broken then whatever you're doing with
the /dev/random output is probably also broken.
Sep 4 '07 #15
On Mon, 03 Sep 2007 23:42:56 -0700, Paul Rubin wrote:
> Antoon Pardon <ap*****@forel.vub.ac.be> writes:
>>> No, the idea is that once there's enough entropy in the pool to make
>>> one encryption key (say 128 bits), the output of /dev/urandom is
>>> computationally indistinguishable from random output no matter how
>>> much data you read from it.
>>
>> If you were talking about /dev/random I would agree. But this is what
>> the man page on my system says about /dev/urandom. ...
>> the returned values are theoretically vulnerable to a
>> cryptographic attack on the algorithms used by the driver.
>
> Right. The idea is that those attacks don't exist and therefore the
> output is computationally indistinguishable from random.

It is a huge leap from what the man page says, that they don't exist in
the unclassified literature at the time the docs were written, to what
you're saying, that they don't exist.

The man page is clear: there is a possible vulnerability in /dev/urandom.
Any cryptographer worth his salt (pun intended) would be looking to close
that vulnerability BEFORE an attack is made public, and not just wait for
the attack to trickle down from the NSA to the script kiddies. The time
to close the stable door is _before_ the horse gets away.

> Of course, whether the idea is correct is an unproven conjecture, but it
> looks pretty good; certainly finding any problem with the specific
> algorithms in urandom would be a significant research discovery and not
> likely to affect the application being discussed.

I agree that this flaw doesn't sound like it will affect the application
being discussed, but a problem has already been found and a solution is
already known: block until there's enough entropy. That's what
/dev/random does.

[snip]
> In short, using /dev/random is fairly silly once you know there's enough
> entropy in the randomness pool to make a good key. If /dev/urandom's
> algorithms are broken then whatever you're doing with the /dev/random
> output is probably also broken.

That doesn't follow. Antoon is specifically warning that /dev/urandom is
non-blocking. If you knew there was enough entropy available, you
wouldn't need /dev/random -- but how do you know there's enough entropy?

(I suppose you could look in /proc/sys/kernel/random/entropy_avail.)

For this specific application, it probably doesn't matter -- using
/dev/urandom is surely overkill, and on a single-user Linux desktop you're
unlikely to have vast numbers of applications reading /dev/urandom
without your knowledge. But why not use /dev/random? What's the downside?

--
Steven.
Sep 4 '07 #16
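The parenthetical suggestion about /proc/sys/kernel/random/entropy_avail can be tried with a few lines of Python 3. This is a Linux-only sketch with an invented function name; on systems without that file it simply returns None, and how meaningful the reported number is depends on the kernel version.

```python
from pathlib import Path

ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")

def entropy_available(path=ENTROPY_PATH):
    """Return the kernel's entropy estimate in bits, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

print(entropy_available())  # an integer on Linux, None elsewhere
```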
On Tue, 04 Sep 2007 22:01:47 -0700, Paul Rubin wrote:
> OK. /dev/random vs /dev/urandom is a perennial topic in sci.crypt and
> there are endless long threads about it there, so I tried to give you
> the short version, but will give a somewhat longer version here.

Thank you. Your points are taken, in particular:

> 4) The man page is fairly seriously bogus because it doesn't explain
> the real situation with either /dev/urandom or /dev/random.


--
Steven.
Sep 6 '07 #17
In message <13*************@corp.supernews.com>, Steven D'Aprano wrote:
> Any cryptographer worth his salt (pun intended) would be looking to close
> that vulnerability BEFORE an attack is made public, and not just wait for
> the attack to trickle down from the NSA to the script kiddies.

Except that the NSA's reputation has taken a dent since they failed to
anticipate the attacks on MD5 and SHA-1.
Sep 9 '07 #18
Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
> Except that the NSA's reputation has taken a dent since they failed to
> anticipate the attacks on MD5 and SHA-1.

NSA had nothing to do with MD5, and it's to NSA's credit that SHA-1
held up for as long as it did.
Sep 9 '07 #19
Paul Rubin wrote:
> Lawrence D'Oliveiro writes:
>> Except that the NSA's reputation has taken a dent since they failed to
>> anticipate the attacks on MD5 and SHA-1.
>
> NSA had nothing to do with MD5, and it's to NSA's credit that SHA-1
> held up for as long as it did.

I haven't kept up. Has anyone exhibited a SHA-1 collision?
--
--Bryan
Sep 9 '07 #20
In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:
> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>> Except that the NSA's reputation has taken a dent since they failed to
>> anticipate the attacks on MD5 and SHA-1.
>
> NSA had nothing to do with MD5 ...

Nevertheless, it was their job to anticipate attacks on it. After all, they
call themselves the "National _Security_ Agency", don't they?

> ... and it's to NSA's credit that SHA-1 held up for as long as it did.

But they have no convincing proposal for a successor. That means the gap
between the classified and non-classified state of the art has shrunk down
to insignificance.

Sep 9 '07 #21
Bryan Olson <fa*********@nowhere.org> writes:
> I haven't kept up. Has anyone exhibited a SHA-1 collision?

I don't think anyone has shown an actual collision, but apparently
there is now a known way to find them in around 2**63 operations. I
don't know if it parallelizes as well as a brute force attack does.
If it does, then it's presumably within reach of the distributed
attacks like the ones used against DES in the late 1990's, given the
hardware speedups that have occurred since then. NIST is trying to
phase out SHA-1 by 2010.

http://en.wikipedia.org/wiki/SHA1#Cr...lysis_of_SHA-1
http://csrc.nist.gov/hash_standards_comments.pdf
Sep 9 '07 #22
Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
> But they have no convincing proposal for a successor. That means the gap
> between the classified and non-classified state of the art has shrunk down
> to insignificance.

The successor is SHA-2.
Sep 9 '07 #23
In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:
> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
>> But they have no convincing proposal for a successor. That means the gap
>> between the classified and non-classified state of the art has shrunk
>> down to insignificance.
>
> The successor is SHA-2.

According to this <http://en.wikipedia.org/wiki/SHA-1>, the family of
algorithms collectively described as "SHA-2" is by no means a definitive
successor to SHA-1.
Sep 10 '07 #24
Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
> According to this <http://en.wikipedia.org/wiki/SHA-1>, the family of
> algorithms collectively described as "SHA-2" is by no means a definitive
> successor to SHA-1.

See <http://csrc.nist.gov/hash_standards_comments.pdf>:

However, due to advances in technology, NIST plans to phase out of
SHA-1 in favor of the larger and stronger hash functions (SHA-224,
SHA-256, SHA-384 and SHA-512) by 2010. SHA-1 and the larger hash
functions are specified in FIPS 180-2. For planning purposes by
Federal agencies and others, note also that the use of other
cryptographic algorithms of similar strength to SHA-1 will also be
phased out in 2010. SHA-1 and the stronger hash functions in FIPS
180-2 are all NIST approved.

This may also be of interest:

http://www.csrc.nist.gov/pki/HashWorkshop/index.html
Sep 10 '07 #25
In message <13*************@corp.supernews.com>, Steven D'Aprano wrote:
> On Sun, 09 Sep 2007 18:53:32 +1200, Lawrence D'Oliveiro wrote:
>> In message <7x************@ruckus.brouhaha.com>, Paul Rubin wrote:
>>> Lawrence D'Oliveiro <ld*@geek-central.gen.new_zealand> writes:
>>>> Except that the NSA's reputation has taken a dent since they failed to
>>>> anticipate the attacks on MD5 and SHA-1.
>>>
>>> NSA had nothing to do with MD5 ...
>>
>> Nevertheless, it was their job to anticipate attacks on it. After all,
>> they call themselves the "National _Security_ Agency", don't they?
>
> The NSA has many jobs, and doing public research in crypto is only one of
> them -- and a particularly small one at that. For all we know, they had
> an attack on MD5 ten years before anyone else and didn't tell anyone
> because keeping it secret made it useful for one of their other jobs.

Yes, but they're supposed to look after US _National_ security, not their
own security. Since people in strategic jobs make so much use of hash
functions in crypto, that means it is most certainly an important part of
the NSA's function to ensure that there are good hash functions available.
They've fallen down on that job.

>>> ... and it's to NSA's credit that SHA-1 held up for as long as it did.
>>
>> But they have no convincing proposal for a successor. That means the gap
>> between the classified and non-classified state of the art has shrunk
>> down to insignificance.
>
> I don't see how that follows.

Because previously, the NSA has done things that it took open researchers
years, even decades, to figure out. But not any more.
Sep 10 '07 #26
