By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,454 Members | 3,103 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 435,454 IT Pros & Developers. It's quick & easy.

How to select record based on probability?

TheSmileyCoder
Expert Mod 100+
P: 2,321
I have a table of names, with 2 fields in it, dblProbability and txName, where dblProbabilty is the relative probability that the Name in that row should be selected. How can I get 1 random name from my table, taking into account the probability that it should occur?
Jun 5 '12 #1
Share this Question
Share on Google+
25 Replies


Rabbit
Expert Mod 10K+
P: 12,366
Make a call to Rnd() seeded with the date and/or time, depending on how often you want run the query. If the returned value is less than or equal to your probability field, then return that row. Something like this:
Expand|Select|Wrap|Line Numbers
  1. SELECT *
  2. FROM someTable
  3. WHERE probabilityField >= Rnd(Second(Now()))
Now, normally you also seed it with a primary key so that each row gets a different random number. However, it is unnecessary in this instance because you're not comparing between rows but rather within the row.

PS: If you're going to validate, seed it with a negative. Otherwise each call to Rnd will return the next number in the sequence and the results will be out of sync with what you see on the screen.
Jun 5 '12 #2

zmbd
Expert Mod 5K+
P: 5,397
@Rabbit:
Are we sure that TheSmileyCoder isn't comparing between rows?
I took this question to mean: say given "name_alpha/5" vs. "name_beta/2" vs. "name_charlie/1" would state something along the lines that name_alpha would be 5 times more likely than name_charlie and 2.5 times more likely than name_beta to be picked and similarly name_beta would be twice as likely than name_charlie to be picked.
@ TheSmileyCoder:
Are you comparing the data against each other? How are you assigning the probabilities, in as I have given in the example above, which is somewhat arbitrary, or do the probabilities have to add up to 100% (i.e. so the above becomes close to "name_alpha/63" vs. "name_beta/25" vs. "name_charlie/12"… which might be easier to code for...)?
If no there is no requirement for summation to 100% then are you allowing equal probabilities to be assigned… that will toss a monkey-wrench in the logic!

In anticipation of your answer:
So, let’s go with the no monkey-wrench (sets the foundation for the monkey-wrench):
In the second method... use the rnd(1,100) wherein if the rnd(1,100)>=63 then name_alpha, rnd(1,100)=between63and12 then name_beta, rnd(1,100)<=12 then name_charlie. If we trust that the rnd function is truely random then the weighting should work.
Thus should work even if you don't require the sum to be 100 then convert the values to weighting as for weighted average... sum all of the probilities: 5+2+1= 8 then convert to percentage by division 5/8, 2/8, 1/8... gives approx 63, 25, 12.
From here, the query should return all records where the [probability] is greater than or equal to the rnd(1,100) (back convert to the arbitrary value if needed). Now logically we know that these records should be chosen more often than the remaining records. So do we return the record with the max([probability]) or the min([probability])? I opt for the min([probability]) as if the rnd(1,100) was small then max([probability]) would always returned. You have your single name.
Monkey-wrench:
Once again… pull the records as before, look for the min([probability]); however, this time look for count(min([probability]))>1. If you do have more than one record with that probability, then we need to pull all of the records wherein they have min([probability]) then do an un-weighted random pick between records as they would have had an equal chance to have been selected to begin with...
-z
Jun 5 '12 #3

Rabbit
Expert Mod 10K+
P: 12,366
I assumed they were comparing just within the row. If they are trying to pick one out of a cumulative probability, then it's pretty much the same thing except you will need a start range and end range and use an additional comparison.
Expand|Select|Wrap|Line Numbers
  1. SELECT * 
  2. FROM someTable 
  3. WHERE startProb >= Rnd(Second(Now()) * -1)
  4.    AND endProb < Rnd(Second(Now()) * -1)
Jun 5 '12 #4

TheSmileyCoder
Expert Mod 100+
P: 2,321
I currently have a list of 64000 lastnames, and how many of each is used by persons living in Denmark. Example:
Expand|Select|Wrap|Line Numbers
  1. Amount  Name
  2. 460470  Jensen
  3. 446646  Nielsen
  4. 382215  Hansen
  5. 271012  Pedersen
  6. ...
  7. ...
  8. ...
  9. 6       Kausky
  10. 5       Haj
  11. 4       Shorch
  12. 4       Van Sas
  13. 4       Osinka
  14. 4       Devitt
As you can see, some of these have a way higher weight then others, and some will have the same weight.

I haven't converted the Amount to a probability yet, though thats just a formality. I do not plan to continually update the data basis, so if any changes needs to be made now, in order to create a more efficient function that is fine.

I need to generate X random results from the list, where X can be anywhere from 1 to 100.000 or more. As such its important the function is relatively efficient.


EDIT: hmm. cumulative probability, that sounds like a good way to go, I will look into that some more, and report my findings.
Jun 5 '12 #5

zmbd
Expert Mod 5K+
P: 5,397
take a look at this
http://akinas.com/pages/en/blog/mysql_random_row
-z
Jun 5 '12 #6

zmbd
Expert Mod 5K+
P: 5,397
Should have mentioned this in my last post but I was called away....

cumulative probability is a very specific statistical model theory. I don't feel that, that, particular methodology is what you are after and research along that line gets into some very heavy math very quickly...

What I tried to explain in post #3 is a form of partial sums and normalization (mathematical normalization not RDB) skipping the normalization step. This method should work; however, for a very large data-set it might not be the best in terms of speed. Therefor I posted the link to the mysql thread.

What I think you should do is also do a Google for: "Weighted Random Sampling (2005; Efraimidis, Spirakis)"

very good paper... the one I have is copyrighted and the agreement has a re-post/reuse restriction (sucks) or I would excerpt it here; however, you should be able to find something online using that as the search (wink-wink-nudge-nudge-dontchknow).

there are also a few good papers out of Berkly and others on this topic... I'll try and attach an older pdf for a simple random... what I like about this paper is that they attempt to do a random sampling without having to build the entire record-set first! (note: I found this paper some years ago when I was building a random sample dataset, I hope it is in the public domain and is provided here soley for academic purposes. I renamed the file for my own document management... it's not a simple paper!) This paper sent me down the partial-sum-tree method.

put your thiking cap on and brew a stong cup of coffee... you're in for a heavy read.
-wc
Attached Files
File Type: pdf Database_randomsample_simple.pdf (971.9 KB, 949 views)
Jun 6 '12 #7

Rabbit
Expert Mod 10K+
P: 12,366
If your goal is to generate x number of names given a known distribution, meaning you're allowing for "duplicate" names, you can do something like this:

1) Create a table just with the numbers 1 to some arbitrarily high x.

2) In your probability table, have a start and end range for each name.

3) Use a query to generate a random number on the number table seeding with the number and time values.

4) Join that query to the probability table using a query similar to the previous post.
Jun 6 '12 #8

NeoPa
Expert Mod 15k+
P: 31,494
I think Rabbit's initial suggestion can be modified to work simply by comparing the random number generated (0 <= X < 1) against the result of :
Expand|Select|Wrap|Line Numbers
  1. [dblProbability] / Sum([dblProbability])
Clearly, the random number would need to be reset for each record (IE. Rerun, rather than necessarily reseeded).
Jun 6 '12 #9

zmbd
Expert Mod 5K+
P: 5,397
Rabbit:...1) Create a table just with the numbers 1 to some arbitrarily high x...
I understand what you're getting at; however, the probability is already part of the table as [dblProbability]... why not use the variation on partial sums tree method as I've suggested in #3 other than it may not be the fastest…
... sum all of the probabilities: 5+2+1= 8 then convert to percentage by division 5/8, 2/8, 1/8... gives approximately 63, 25, 12....
Now the 63 is actually 63% and the others 25%, 10%... cardinal sin not using the units... my bad.
Notice, this is basically the same as NeoPa's reply in #10.

Now, TheSmileyCoder's original question was for (as I understood it) a specific, single, name to be picked hence my reply and subsequent query... the query can be optimized to pull just the one record.
Now however in post #5 TheSmileyCoder tells us that we need to pull not just one record but create a record-set that may consist of between one and several thousand elements; thus, the two suggested papers and the link to the mysql forum so as to provide a path to the solution to a method to do this efficiently and to provide information leading to the consideration for the need for non-duplication or replacement etc... I am working under the notion that the subset must also be in random order and that duplication is allowed.

hina:...a cumulative probability, then it's pretty much the same thing ...
Hina, you've simply re-stated Rabbit's post in #4. As I said in #3, cumulative probability is a very specific statistical theory that for this question may very well be a red-herring. For the sake of argument, please take a look at this very simple explanation (one should also work thru the associated lessons): http://stattrek.com/probability-dist...tribution.aspx cumulative probability simply doesn’t appear, to me, to be the correct path to the solution.

-z
Jun 6 '12 #10

dsatino
100+
P: 393
I've attached a sample DB that will what you want based on your description.

The basis for the selection code is in Section 4.2.1 of the PDF that zmbd posted. There's actually a ton of other good stuff in there as well, but you don't need that for your purpose.

The only thing related to the sampling that's really worth pointing out is that this method uses replacement so if you want to use this for other purposes, you must consider that.

As for efficiency, I created a sample of 10,000 in about 30 seconds, but you can improve that. It's writing to disk every time it makes a selection so you best bet is probably to capture those and write them en masse.

Hope this helps
Attached Files
File Type: zip SampleSelect.zip (74.9 KB, 115 views)
Jun 6 '12 #11

Rabbit
Expert Mod 10K+
P: 12,366
@zmbd, I think you misunderstood my intention with that second table. I wasn't using it as a way to generate the probabilities. It's there to allow for "duplication".

I think, and this is just my interpretation, that the goal is to generate a random population. So let's say I'm modeling the inhabitants of a city. And I want to generate 1000 people to live in this city. Some of them are going to have the same name. Whichever way they decide to calculate the probability and pick one, they will need a way to allow the same name to pop up more than once. That's all that second table is for.
Jun 6 '12 #12

Rabbit
Expert Mod 10K+
P: 12,366
Here's what I mean.
tblSampleSize
Expand|Select|Wrap|Line Numbers
  1. n
  2. 1
  3. 2
  4. 3
  5. 4
  6. 5
tblProb
Expand|Select|Wrap|Line Numbers
  1. pName aProb bProb
  2. Bob   0     .5
  3. Andy  .5    .8
  4. Jill  .8    .95
  5. Kim   .95   1
Query
Expand|Select|Wrap|Line Numbers
  1. SELECT pName
  2. FROM tblSampleSize, tblProb
  3. WHERE n <= 3
  4.    AND Rnd(-1 * n * Second(Now()) * Minute(Now())) >= aProb
  5.    AND Rnd(-1 * n * Second(Now()) * Minute(Now())) < bProb
The query will generate 3 people, with the option of generating up to 5 people, with any of the names in the table. Allowing for names to be repeated.

I've attached a sample database. It will generate 15000 names from 20 available names in less than 3 seconds.
Attached Files
File Type: zip Random.zip (108.4 KB, 81 views)
Jun 6 '12 #13

100+
P: 759
Sometimes someone not very skilled can have a good idea (agree: very rarely). So:

Create a temporary table with two more fields:
Expand|Select|Wrap|Line Numbers
  1.  Amount      Name            Min                 Max
  2. 460470      Jensen        Max_Nielsen + 1    Max_Nielsen + 460470
  3. 446646      Nielsen       .............      ..................
  4. 382215      Hansen
  5. 271012      Pedersen
  6.     ...
  7.     ...
  8.     ...
  9. 6           Kausky            22                  27
  10. 5           Haj               17                  21
  11. 4           Shorch            13                  16
  12. 4           Van Sas            9                  12
  13. 4           Osinka             5                   8
  14. 4           Devitt             1                   4
Then generate a random number between MIN = 1 and MAX = Max_Nielsen + 460470

Use a query to select the record you need (The random number is between Min and Max)
Jun 7 '12 #14

NeoPa
Expert Mod 15k+
P: 31,494
That scores high on the logic front Mihail, but unfortunately not very high on the practicability front.

We have some other, obviously well educated, contributors in here who have suggested solutions which don't require pre-manipulation of the data.

This was always going to be a tougher than average question though. If Smiley needs to ask it then it's not likely to be too simple or straightforward ;-)

Keep up the good work anyway :-)
Jun 7 '12 #15

100+
P: 759
I know, NeoPa. I know.
But a bit from you, a (very small) bit from me, can give to anyone a better idea. Isn't it ?
Jun 7 '12 #16

zmbd
Expert Mod 5K+
P: 5,397
dsatino: Thank you for putting together the example for the “Accept/Reject” method. The drawback to this method is the time required for the re-iteration thru large I/O data sets. The real advantage here is that there is no need to pre-process the data-set.

Rabbit: Thank you for the example… it does confirm what I thought you were thinking. Your method; however, asks for Smilie to re-create/assign the probabilities which is something I was trying to avoid as I didn’t know the size of Smilie’s raw data table… which appears to be quite large.
I do however like the approach given that there is no requirement for VBA.
So… trying to use the best of both by using your tblSampleSize and the following query to create the weighted probability:
Expand|Select|Wrap|Line Numbers
  1.  SELECT tbl_name.[txtName], tbl_name.dblProbability, [tbl_name.dblProbability]/DSum("tbl_name.dblProbability","tbl_name") AS WProb
  2. FROM tbl_name
  3. GROUP BY tbl_name.[txtName], tbl_name.dblProbability;
(I know… the evil “D-functions” but this should be fairly fast over even several hundred records)
We could also do the A/R by [WProb]=( [tbl_name.dblProbability]/Dmax("tbl_name.dblProbability","tbl_name")
In either case… now working with your bounded query, there should be a modification that will consistently return the required record set. I have not as of yet solved that… normally… I push this to VBA. I’ll have to go back and re-read against the b-tree-partial-sums as it has been a long time since I read it last
-z
Jun 7 '12 #17

Rabbit
Expert Mod 10K+
P: 12,366
@zmbd, I can't say that I understood what was in the article. But if they have frequency counts, and you take the frequency and divide by the total sum of frequencies, you get what you call a weighted probability.

But with just the weighted probability, you can't select one at random, correct? You still need to calculate the probability range.

You can actually use a query to calculate that range. If you join the table to itself, you can create the ranges.
Expand|Select|Wrap|Line Numbers
  1. SELECT T1.Name,
  2.    SUM(T2.Frequency) - T1.Frequency + 1 AS startRange,
  3.    SUM(T2.Frequency) AS endRange
  4. FROM tableName AS T1,
  5.    tableName AS T2
  6. WHERE T1.Frequency >= T2.Frequency
  7.    AND T1.Name >= T2.Name
  8. GROUP BY T1.Name
Don't hold me to that query, it was off the top of my head. It may need some tweaking.
Jun 7 '12 #18

100+
P: 759
Sorry, NeoPa ! Me again :)
I am sure that Smiley is enough skilled to implement any logic if feet his needs.

What about that:
MAX (=460470) .......... 1
X ............RandomNumber

where 0 < RandomNumber <= 1

X = RandomNumber * MAX

Keep only that records where Amount >= X
Using other words, eliminate all records where the amount (as percentage) is less than the RandomNumber.

Any from the remaining records have equal chance to be selected.
So, a selection based on a new RandomNumber applied on the remaining records range, should be (in my opinion) the response.
Jun 7 '12 #19

dsatino
100+
P: 393
zmbd: there is no reiteration through recordsets.
rabbit: you definitely don't need the range.

I've uploaded another version that calculates the same, but takes efficiency into account. It returns 10,000 in under one second and 100,000 in a little over 4 seconds.


As for which one Smiley should use, it doesn't matter. They will both return properly distributed populations.
Attached Files
File Type: zip SampleSelect.zip (1.73 MB, 87 views)
Jun 7 '12 #20

zmbd
Expert Mod 5K+
P: 5,397
dsanto It certainly does re-itereate thru the records:
Expand|Select|Wrap|Line Numbers
  1.             'test for rejection *****
  2.             If Uj >= Pj Then GoTo ProcessStart
wherein "ProcessStart" goes back thru the process of rnd#, seek, get wj, calc wj/Wmax then the above comparison.
The other issue is, that while using the seek method only the first matching is returned so a lower/higher ranked entry may occur after the returned record unless the recordset is ordered on the ranking.
{edit}
Sorry, I should have added that the examples you've given are still very viable and a method that I have used myself in a smaller project (hence the posted document) - I was in a hurry to get to the other lab... no excuse, just an explanation.
{edit}


Rabbit YEP, YEP...
"...total sum of frequencies, you get what you call a weighted probability..."
YES! That is what I've been generating... then pull the normal distribution from the RND against the weighted p.

I';ve got to go to another lab so I'll have to take a closer look at the offered query this evening.

-z
Jun 7 '12 #21

dsatino
100+
P: 393
zmbd: No need to point out pieces of code. I wrote it so I know what it does.

Yes, it goes back through a portion of the 'process', but it does not 're-iterate' through any records. The only 'iteration' is the primary loop based on the sample size. The only movement through records is the .seek of the random record to test. And on that note, it's seeking on a unique index so there is no issue with using .seek in the example I provided.
Jun 7 '12 #22

zmbd
Expert Mod 5K+
P: 5,397
Sorry I haven’t gotten back to this sooner… my day off and I have the three kids. A 9yr and twins 2.5yr. Love them very much; however, they take a lot more of my time!
But it won’t last forever, so I have to get the hugs in now (esp. from my son...) all too soon I won’t be the cool Daddy for several years – then we get smarter and cooler again… when they have their kids! };-)

dsatino
No offense intended by quoting your code. Given, that you wrote the code, you very well understand the code.

I do not understand your anger... o-O

However, not everyone reading this forum will have the same level of understanding of vba nor will they have opened the attachment and have reviewed the code, or it may be the fact that they are unable to open the attachment for various reasons thus will not understand an oblique reference to the code.

How then is one to reference the section of code without quoting it and causing offense? O-o
I was under the impression that this was somewhat of an academic environment where such references as this would not be taken as a condescending action, instead, taken as a point of clarification and point of reference for a common understanding.

As for the question of (re)iteration:
Please read: http://www.merriam-webster.com/dictionary/iteration
1: the action or a process of iterating or repeating: as
a : a procedure in which repetition of a sequence of operations yields results successively closer to a desired result
b : the repetition of a sequence of computer instructions a specified number of times or until a condition is met — compare recursion
Your code meets this definition twice:
The outer "for...next" and then the inner A/R code. The fact that with each rejection of found record triggers another round of "seeking" wherein the command must then (re)read each record's [Probability] value, is a re-iteration not only of the code but of also each record. And while we’re at it, the “Seek” command contains a hidden third iteration of actions as the “Seek” is nothing more than a nice wrapper around the commands to move to a record, read a specific field value, and compare that result with a condition and repeat the process until the condition is met… the first time… then stop at that record’s position in the data table for further manipulation.

One assumes with “Seek” that the first record meeting the criteria is either the only one to do so or it will be “good enough.”

Now that I have established that in fact that there is an actual (re)iteration of the commands within your code lets tweak it.
Instead of using the "goto ProcessStart" to jump back to a label in the code, a do-until loop would be preferred. The “goto” “gosub” commands that were all too common in BASIC are now frowned upon as poor coding within VBA (although, we still use them for error trapping – go figure). So to use the loop one might code as follows (once again, using the code from your example):

dsatino: Please skip this code block as you already know and understand what modifications would be needed to tweak the code for a loop.)


Expand|Select|Wrap|Line Numbers
  1. For sLoop = 1 To smplSz
  2.     Do ''(<<<<<<Start of mod)
  3.         'set j ******
  4.         j = Int(Rnd() * recCt) + 1
  5.         '
  6.         'set Wj *****
  7.         '
  8.         With rst
  9.             '.Index = "nKey"
  10.             .Seek "=", j
  11.             Wj = !probability
  12.         End With
  13.         '
  14.         'set Pj and Uj *****
  15.         Pj = Wj / Wmax
  16.         Uj = Rnd()
  17.         'test for rejection *****
  18.     Loop Until Uj <= Pj ''(<<<<<<end of mod)
  19.     '
  20.     With rst2
  21.         .AddNew
  22.         .Fields(0) = j
  23.         .Update
  24.     End With
  25. '
  26. Next sLoop
Now, one should be more able to more clearly see the two iterations within your code.

Hey... while looking at that... why are we pulling a new random number each time we have a record fail the A/R test... just keep going thru the records until we find a passing test. SO... pull the two RND out of the inner loop to the outer for...next ... so much more I can do here.

On the next note:
While your example might very well use a unquie and indexed [Probability] (oh, I see that in the table it isn't) the [dblProbability] from Smilie's post (#5) very clearly does not require unique values and may not be indexed; thus, seeking on this field may not return a truely random dataset... in fact, using #5 again, if the test is such that a weight of 4 will meet the A/R requirements then when the test is ran the seek command will always return "Shoreh" as the name and will never return "Van Sas" (provided the records are in the order shown).

And while I’m at it:

The query you created returns way too many records in that it returns all of the records from the table “namepicktest” and the subsequent records against table “sample”:
(I pull it out here for those individules that can not open your example… dsatino, once again, please skip this code block as you already know and understand it.)
Expand|Select|Wrap|Line Numbers
  1. SELECT Sample.nKey, NamePickTest.Name, Count(Sample.nKey) AS InSample, NamePickTest.Occurrence, NamePickTest.Probability
  2. FROM Sample RIGHT JOIN NamePickTest ON Sample.nKey = NamePickTest.nKey
  3. GROUP BY Sample.nKey, NamePickTest.Name, NamePickTest.Occurrence, NamePickTest.Probability;
I believe that you wanted:
Expand|Select|Wrap|Line Numbers
  1. SELECT NamePickTest.nKey, NamePickTest.Name, NamePickTest.Occurrence, NamePickTest.Probability
  2. FROM NamePickTest INNER JOIN Sample ON NamePickTest.nKey = Sample.nKey;
The final issue I have with your example is [Probability] wherein you have stored the calculated value (which from post #5 would not be unique)... now this will certainly speed things up; however, IMHO not the best way... what happens when max[Frequency] is no longer the value you've chosen to pre-calculate [Probability]? (Of course, TheSmileyCoder stated that the values would <not> be updated... we must allow that for others with a similuar question may not have such static information. It may also be very well that for Smiley, the current static info may indeed need to be updated at a latter date)

It is a simple matter to find max[Frequency]... even a large dataset, store that value as a variable in the code and then do the calculation.... such as using the dreaded DMax or you could run a query such as (once again using your example):
Expand|Select|Wrap|Line Numbers
  1. SELECT SampleResult.nKey, SampleResult.Name, SampleResult.InSample, SampleResult.Occurrence, SampleResult.Probability
  2. FROM SampleResult 
  3. INNER JOIN 
  4. (SELECT Max(SampleResult.[Occurrence]) AS MaxOfOccurrence
  5. FROM SampleResult) as [maxq]
  6. ON SampleResult.Occurrence = [maxq].MaxOfOccurrence;
for fun, I pulled all of the fields; however, you really only need the "[SampleResult].[Occurrence]" in the outer select.

Actually… in this case, if I were using a query, I’d use just that inner select:
Expand|Select|Wrap|Line Numbers
  1. SELECT Max(SampleResult.[Occurrence]) AS MaxOfOccurrence
  2. FROM SampleResult);
to return the max value as the rest of the information is of no value for the subsequent code.

-z
Jun 8 '12 #23

zmbd
Expert Mod 5K+
P: 5,397
RABBIT
and
TheSmileyCoder
:
The range thing really took some effort...
Rabbit, taking your queries and my weighted probability calculations forcing them together:

Expand|Select|Wrap|Line Numbers
  1. SELECT Q1.txtName, Q1.dblProbability, Q1.frequency, startRange - Q1.frequency AS endRange, (SELECT sum(SQ3.frequency) AS Total
  2. FROM qry_name AS SQ3
  3. WHERE SQ3.dblProbability <= Q1.dblProbability) AS startRange
  4. FROM qry_Name AS Q1, qry_Name AS Q2
  5. WHERE (((Q1.Frequency)>=Q2.Frequency) And ((Q1.txtName)>=Q2.txtName))
  6. GROUP BY Q1.txtName, Q1.dblProbability, Q1.frequency
  7. ORDER BY Q1.frequency DESC;
Of note is the startRange and endRange... I used a variation on a theme for "running totals" for the startRange that I have used in other situations... don't remember where I got the method from; however, it was somewhere out there in the cloud - anyway, not my original idea and hopefully the author will forgive the lack of proper citation.

I'm still having some issues with:
Expand|Select|Wrap|Line Numbers
  1. SELECT qry_weighted_w_range.txtname
  2. FROM qry_weighted_w_range, tbl_SampleSize
  3. WHERE (((tbl_SampleSize.n)<=50) AND ((Rnd(-1*[n]*Second(Now())*Minute(Now())))>=[EndRange] And (Rnd(-1*[n]*Second(Now())*Minute(Now())))<[StartRange]));
consistently returning the desired number of records. Note that (tbl_SampleSize.n)<=50) will occasionally return 47ish to 56ish records; however, I think that is due to the distribution in the names table (basically the frequency of the English alphabet as used in typical American English) allowing for roughly 10 pairs of letters (names) with the same frequency...

Attached the database for your amusement... the query is set to return the 10K as per your original - quite fast.

Now I see why I used the VBA methodology instead of trying the SQL approach; however, I do like the fact that this will run without code enabled.

Hopefully Smilie will be able to use something here.

-z
Attached Files
File Type: zip random_pick.zip (118.3 KB, 91 views)
Jun 12 '12 #24

dsatino
100+
P: 393
zmbd

There was no anger. You directed your response specifically to me and given that much of the code on this site is not original work I thought I would save you the trouble of pointing it out.

As for argument pertaining to 'reiteraton', I do find it humorous you felt the need to post the definition and subsequently posted the definition of 'iteration'. From a computing standpoint, they are not the same.

The For sLoop=...Next Loop is the act of iterating through the values of sLoop. If, for example, the value produced for sLoop=10 is rejected, then the process is reiterated for sLoop=10.

Also, your interpretation of how the code and the .seek method work is not correct. At no time does this code 'read each records probability value'. It randomly picks a single record, navigates to that record using seek, and then captures the probability value of that record.

The .seek method does not move through recordsets making comparisons at each record, that sounds more like the .find methods, rather it uses an index value to move to a specific record. These are not the same thing. You are only correct in that it will stop at the first record to have that index value, but as i previously noted, the index that I'm using is unique so that can't occur in the example I provided.

Then there's this:
"Hey... while looking at that... why are we pulling a new random number each time we have a record fail the A/R test... just keep going thru the records until we find a passing test. SO... pull the two RND out of the inner loop to the outer for...next ... so much more I can do here."

If you understood what the code was doing you'd know why we are picking a new random number. First, the random number "j" is not a random number, rather it's a random record. Second, the code is not iterating through the record set testing for success and even if it was, moving to the next record and testing for a pass/ fail would make the order of the records important which would negate the randomness.

As for the query I created, it's exactly what I wanted. I know it returns all of the names.

Then there's the issue of storing the values. It's true the counts would allow it to be more extensible, but this was an example. At some point, people have to use their own brain. Mostly I find it humorous that your knocking the stored values, but the example you provided requires a table to store the amount of results you want. My example produced a sample of 100,000 in under 4 seconds. Your example can't even produce that without adding 85,000 records to the table. What if I want a random sample size of a half-million or more. In your example I need a table with at least that many records, just for the sake of being able to pull that sample.

Lastly, I have no idea why you felt the need to point out how to store the max value in a variable, since that's what I did.
Jun 19 '12 #25

NeoPa
Expert Mod 15k+
P: 31,494
I will continue to monitor this thread, but I hope that this discussion can get back on track. Involved members should check their PMs first though.
Jun 19 '12 #26

Post your reply

Sign in to post your reply or Sign up for a free account.