
testing for uniqueness in a large list

Hi All,
I'm looking for some help in developing a way to test for the uniqueness
of an item in a large list. To illustrate what I mean, the code below is an
indicator of what I want to do, but I'm hoping for a faster way than the
one shown. Basically, I have a list of 20 integers and I want to create
another list of 200000 unique subsets of 12 integers from this list. What I
have done here is to use the sample() function from the random module
and then compare the result to every item in the ints list to check for
uniqueness - as you can guess, this takes an interminable amount of time to
grind through. Can someone please enlighten me as to how I can do this and
keep the amount of time to do it to a minimum?

Thanks for taking the time to read this and doubly so if you're able to
help.
Cheers,
Lol McBride

#!/usr/bin/python
from random import *
# seq is the pool to choose from
seq = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
# this is how many times to choose a unique selection
rlen = 200000
counter = 100
ints = [0]
while 1:
    # if this is the last value for rlen then we stop
    if rlen == 0:
        break
    intLen = len(ints)
    cnt = 0
    testVal = sample(seq, 12)
    testVal.sort()
    if len(ints) >= counter:
        print len(ints)
        counter = counter + 100
    while 1:
        if ints.count(testVal) == 1:
            cnt = cnt + 1
            break
        intLen = intLen - 1
        if intLen == 0:
            break
        if cnt == 1:
            continue
        else:
            ints.append(testVal)
            rlen = rlen - 1

Jul 18 '05 #1
On Wed, 2004-10-20 at 11:01 +0100, Lol McBride wrote:
> Hi All,
> I'm looking for some help in developing a way to test for the uniqueness
> of an item in a large list. [...] Can someone please enlighten me as to how
> I can do this and keep the amount of time to do it to a minimum?
>
> [code snipped]

Disregarding more efficient methods of generating your list (of which
I'm sure there are many), part of the "interminable time" is probably
taken up in the infinite loop you have ;) Set rlen to a much smaller
number and you'll see that the program still never terminates.

You'll notice that:

ints = [0]
while 1:
    ...
    intLen = len(ints)        # aka 1
    ...
    while 1:
        ...
        intLen = intLen - 1   # aka 0
        if intLen == 0:       # always true
            break
    ...

Basically this means that nothing is ever appended to ints nor is rlen
ever decremented (and these are the tests by which the loops terminate).
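
For what it's worth, a minimal repair (untested) would straighten out the
nesting and let a plain `in` test do the membership scan instead of
counting by hand:

ints = []
while rlen > 0:
    testVal = sample(seq, 12)
    testVal.sort()
    if testVal not in ints:   # still a linear scan, so still slow
        ints.append(testVal)
        rlen = rlen - 1

That at least terminates; the linear scan is then what's left to speed up.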

I hear that Python 3000 will complete infinite loops in half the time
however ;)

Regards,
Cliff

--
Cliff Wells <cl************@comcast.net>

Jul 18 '05 #2
Lol McBride <ne******@lolmc.com> wrote:
> I'm looking for some help in developing a way to test for the uniqueness
> of an item in a large list. [...] Can someone please enlighten me as to how
> I can do this and keep the amount of time to do it to a minimum?


One word: dictionaries!

Untested, but should work...:
import random   # don't use from...import *, on general grounds

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    results = {}
    while len(results) < rlen:
        pick = random.sample(seq, picklen)
        pick.sort()
        pick = tuple(pick)
        results[pick] = 1
    return results.keys()

this returns a list of tuples; if you need a list of lists,

    return [list(pick) for pick in results]

In 2.4, you could use "results = set()" instead of {}, results.add(pick) to
maybe add a new pick, and possibly "return list(results)" if you need a
list of tuples as the function's overall return value (the list
comprehension for the case in which you need a list of lists stays OK).
But that's just an issue of readability, not speed. Slight speedups can
be had by hoisting some lookups out of the while loop, but nothing
major, I think -- e.g. sample = random.sample just before the while, and
use sample(seq, picklen) within the while. Putting all the code in a
function and using local variables IS a major speed win -- local
variable setting and access is WAY faster than globals'.
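
Concretely, the hoisted variant would be something like this (again
untested) -- same algorithm, just binding the lookup to a local first:

import random

def picks(seq=xrange(1, 21), rlen=200000, picklen=12):
    sample = random.sample   # hoist the attribute lookup out of the loop
    results = {}
    while len(results) < rlen:
        pick = sample(seq, picklen)
        pick.sort()
        results[tuple(pick)] = 1
    return results.keys()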
Alex
Jul 18 '05 #3
On Wed, 20 Oct 2004 04:07:20 -0700, Cliff Wells wrote:
On Wed, 2004-10-20 at 11:01 +0100, Lol McBride wrote:
snip snip <<


Doh! And once again, with feeling: Doh!
Thank you for pointing that out to me.
Lol

Jul 18 '05 #4
On Wed, 20 Oct 2004 13:37:32 +0200, Alex Martelli wrote:
> Lol McBride <ne******@lolmc.com> wrote:
> > I'm looking for some help in developing a way to test for the uniqueness
> > of an item in a large list. [...]
>
> One word: dictionaries!
>
> [rest of Alex's reply snipped]

Thank you Alex - I'll try your method and see how I get on.
Lol

Jul 18 '05 #5
Alex Martelli's solution is *very* good, but there is a sampling
problem. Putting a simple print inside his program:

    if not (len(results) % 5000):
        print len(results)

you can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed-up of
this code cannot help; a different algorithm can be useful.
I don't know... direct sequence generation doesn't seem possible.
Maybe a partially direct generation can be okay.

Hugs,
bearophile
Jul 18 '05 #6

be************@lycos.com (bearophile) wrote:
> You can see that it slows down a lot when the dictionary contains
> about 100000-120000 different sequences, because there are many
> collisions, and it keeps slowing down. [...]


There are only 125,970 unique sequences of 12 items from 20 where order
does not matter -- (20!)/(12!8!) -- which is what Alex was generating, and
what Lol McBride asked for.

Removing the "ordering does not matter" constraint (by dropping the
list.sort() call) would let it complete much faster, as the space to
select from becomes 60,339,831,552,000 in size.
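
A quick sanity check on those numbers -- a minimal sketch, untested:

def comb(n, k):
    # binomial coefficient n! / (k! * (n-k)!), built up without big factorials
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) / i   # each step divides evenly
    return result

print comb(20, 12)               # -> 125970 sorted subsets
print comb(20, 12) * 479001600   # times 12! -> 60339831552000 ordered ones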

Math saves the day!

- Josiah

Jul 18 '05 #7
Josiah:
> There are only 125,970 unique sequences of 12 items
> from 20 where order does not matter, (20!)/(12!8!),


Right! I was thinking the same thing *after* writing my last post ^_-

With random sampling, the generation of the last 10 combinations is a
bit slow; it requires a couple of minutes.
There are direct ways to compute them:
http://www.netlib.org/toms/382
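
For example, a direct scheme can "unrank" combinations: draw distinct
ranks with random.sample, then map each rank to its combination in
lexicographic order, so there are no collisions at all. A rough sketch of
the idea (my own, untested -- not the netlib code itself):

import random

def comb(n, k):
    # binomial coefficient, computed iteratively
    if k < 0 or k > n:
        return 0
    result = 1
    for i in range(1, k + 1):
        result = result * (n - k + i) / i
    return result

def unrank(index, n, k):
    # the index-th k-combination of range(n), in lexicographic order
    chosen = []
    x = 0
    while k > 0:
        c = comb(n - x - 1, k - 1)   # completions that take x as next element
        if index < c:
            chosen.append(x)
            k -= 1
        else:
            index -= c
        x += 1
    return chosen

# e.g. 100000 distinct picks from the 125970 possible, collision-free
ranks = random.sample(xrange(comb(20, 12)), 100000)
picks = [[x + 1 for x in unrank(r, 20, 12)] for r in ranks]  # pool is 1..20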

Bye,
bearophile
Jul 18 '05 #8
"bearophile" <be************@lycos.com> wrote in message
news:5c**************************@posting.google.com...
> You can see that it slows down a lot when the dictionary contains
> about 100000-120000 different sequences, because there are many
> collisions, and it keeps slowing down. [...] A different algorithm
> can be useful.


Is part of this slowdown the fact that the loop will exit only when rlen
reaches 200,000, when you have just shown that it can never be greater
than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample
if rlen approaches some threshold fraction of
len(seq)!/(picklen!(len(seq)-picklen)!), perhaps .8 or so. There should be
a crossover point at which it is cheaper to build a set of all 125,970
combinations and then select 100,000 of them using set.pop() (which
selects an arbitrary element from the set, and removes it).
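
A sketch of the exhaustive side of that crossover (untested; note that
itertools.combinations only arrives in Python 2.6, so for 2.4 read it as
pseudocode):

import itertools
import random

def picks_exhaustive(seq, rlen, picklen):
    # enumerate every combination once, then keep a random rlen of them
    all_combos = list(itertools.combinations(seq, picklen))
    if rlen > len(all_combos):
        raise ValueError("only %d combinations exist" % len(all_combos))
    return random.sample(all_combos, rlen)

result = picks_exhaustive(range(1, 21), 100000, 12)

random.sample over the full list has the same effect as repeated
set.pop(): rlen distinct combinations, with no rejection loop at all.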

-- Paul
Jul 18 '05 #9
"Paul McGuire" <pt***@austin.rr._bogus_.com> wrote in message
news:sR*******************@fe1.texas.rr.com...
"bearophile" <be************@lycos.com> wrote in message
news:5c**************************@posting.google.c om...
Alex Martelli's solution is *very* good, but there is a sampling
problem: putting a simple printing inside his program:

if not (len(results) % 5000):
print len(results)

You can see that it slows down a lot when the dictionary contains
about 100000-120000 different sequences, because there are many
collisions, and it keeps slowing down. Probably a little speed up of
this code cannot help. A different algoritm can be useful.
I don't know... direct sequences generation doesn't seem possibile.
Maybe a partially direct generation can be okay.

Hugs,
bearophile
Is part of this slowdown the fact that the loop will exit only when rlen
reaches 200,000, when you have just shown that it can never be greater

than 125,970?

Perhaps some heuristic would opt for using sets instead of random.sample if rlen approaches some threshold value of
len(seq)!/picklen!(len(seq)-picklen)! , perhaps .8 or so. There should be a crossover point at which it is cheaper to build a set of all 125,970
combinations, and then select 100,000 of them using set.pop() (which selects an arbitrary element from the set, and removes it).

-- Paul

... and of course, raising a MathError if rlen >
len(seq)!/(picklen!(len(seq)-picklen)!).

-- Paul
Jul 18 '05 #10
