473,836 Members | 1,554 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

How to select record based on probability?

TheSmileyCoder
2,322 Recognized Expert Moderator Top Contributor
I have a table of names, with 2 fields in it, dblProbability and txName, where dblProbabilty is the relative probability that the Name in that row should be selected. How can I get 1 random name from my table, taking into account the probability that it should occur?
Jun 5 '12 #1
25 5645
Rabbit
12,516 Recognized Expert Moderator MVP
Make a call to Rnd() seeded with the date and/or time, depending on how often you want run the query. If the returned value is less than or equal to your probability field, then return that row. Something like this:
Expand|Select|Wrap|Line Numbers
  1. SELECT *
  2. FROM someTable
  3. WHERE probabilityField >= Rnd(Second(Now()))
Now, normally you also seed it with a primary key so that each row gets a different random number. However, it is unnecessary in this instance because you're not comparing between rows but rather within the row.

PS: If you're going to validate, seed it with a negative. Otherwise each call to Rnd will return the next number in the sequence and the results will be out of sync with what you see on the screen.
Jun 5 '12 #2
zmbd
5,501 Recognized Expert Moderator Expert
@Rabbit:
Are we sure that TheSmileyCoder isn't comparing between rows?
I took this question to mean: say given "name_alpha/5" vs. "name_beta/2" vs. "name_charl ie/1" would state something along the lines that name_alpha would be 5 times more likely than name_charlie and 2.5 times more likely than name_beta to be picked and similarly name_beta would be twice as likely than name_charlie to be picked.
@ TheSmileyCoder:
Are you comparing the data against each other? How are you assigning the probabilities, in as I have given in the example above, which is somewhat arbitrary, or do the probabilities have to add up to 100% (i.e. so the above becomes close to "name_alpha/63" vs. "name_beta/25" vs. "name_charl ie/12"… which might be easier to code for...)?
If no there is no requirement for summation to 100% then are you allowing equal probabilities to be assigned… that will toss a monkey-wrench in the logic!

In anticipation of your answer:
So, let’s go with the no monkey-wrench (sets the foundation for the monkey-wrench):
In the second method... use the rnd(1,100) wherein if the rnd(1,100)>=63 then name_alpha, rnd(1,100)=betw een63and12 then name_beta, rnd(1,100)<=12 then name_charlie. If we trust that the rnd function is truely random then the weighting should work.
Thus should work even if you don't require the sum to be 100 then convert the values to weighting as for weighted average... sum all of the probilities: 5+2+1= 8 then convert to percentage by division 5/8, 2/8, 1/8... gives approx 63, 25, 12.
From here, the query should return all records where the [probability] is greater than or equal to the rnd(1,100) (back convert to the arbitrary value if needed). Now logically we know that these records should be chosen more often than the remaining records. So do we return the record with the max([probability]) or the min([probability])? I opt for the min([probability]) as if the rnd(1,100) was small then max([probability]) would always returned. You have your single name.
Monkey-wrench:
Once again… pull the records as before, look for the min([probability]); however, this time look for count(min([probability]))>1. If you do have more than one record with that probability, then we need to pull all of the records wherein they have min([probability]) then do an un-weighted random pick between records as they would have had an equal chance to have been selected to begin with...
-z
Jun 5 '12 #3
Rabbit
12,516 Recognized Expert Moderator MVP
I assumed they were comparing just within the row. If they are trying to pick one out of a cumulative probability, then it's pretty much the same thing except you will need a start range and end range and use an additional comparison.
Expand|Select|Wrap|Line Numbers
  1. SELECT * 
  2. FROM someTable 
  3. WHERE startProb >= Rnd(Second(Now()) * -1)
  4.    AND endProb < Rnd(Second(Now()) * -1)
Jun 5 '12 #4
TheSmileyCoder
2,322 Recognized Expert Moderator Top Contributor
I currently have a list of 64000 lastnames, and how many of each is used by persons living in Denmark. Example:
Expand|Select|Wrap|Line Numbers
  1. Amount  Name
  2. 460470  Jensen
  3. 446646  Nielsen
  4. 382215  Hansen
  5. 271012  Pedersen
  6. ...
  7. ...
  8. ...
  9. 6       Kausky
  10. 5       Haj
  11. 4       Shorch
  12. 4       Van Sas
  13. 4       Osinka
  14. 4       Devitt
As you can see, some of these have a way higher weight then others, and some will have the same weight.

I haven't converted the Amount to a probability yet, though thats just a formality. I do not plan to continually update the data basis, so if any changes needs to be made now, in order to create a more efficient function that is fine.

I need to generate X random results from the list, where X can be anywhere from 1 to 100.000 or more. As such its important the function is relatively efficient.


EDIT: hmm. cumulative probability, that sounds like a good way to go, I will look into that some more, and report my findings.
Jun 5 '12 #5
zmbd
5,501 Recognized Expert Moderator Expert
take a look at this
http://akinas.com/pages/en/blog/mysql_random_row
-z
Jun 5 '12 #6
zmbd
5,501 Recognized Expert Moderator Expert
Should have mentioned this in my last post but I was called away....

cumulative probability is a very specific statistical model theory. I don't feel that, that, particular methodology is what you are after and research along that line gets into some very heavy math very quickly...

What I tried to explain in post #3 is a form of partial sums and normalization (mathematical normalization not RDB) skipping the normalization step. This method should work; however, for a very large data-set it might not be the best in terms of speed. Therefor I posted the link to the mysql thread.

What I think you should do is also do a Google for: "Weighted Random Sampling (2005; Efraimidis, Spirakis)"

very good paper... the one I have is copyrighted and the agreement has a re-post/reuse restriction (sucks) or I would excerpt it here; however, you should be able to find something online using that as the search (wink-wink-nudge-nudge-dontchknow).

there are also a few good papers out of Berkly and others on this topic... I'll try and attach an older pdf for a simple random... what I like about this paper is that they attempt to do a random sampling without having to build the entire record-set first! (note: I found this paper some years ago when I was building a random sample dataset, I hope it is in the public domain and is provided here soley for academic purposes. I renamed the file for my own document management... it's not a simple paper!) This paper sent me down the partial-sum-tree method.

put your thiking cap on and brew a stong cup of coffee... you're in for a heavy read.
-wc
Attached Files
File Type: pdf Database_randomsample_simple.pdf (971.9 KB, 1358 views)
Jun 6 '12 #7
Rabbit
12,516 Recognized Expert Moderator MVP
If your goal is to generate x number of names given a known distribution, meaning you're allowing for "duplicate" names, you can do something like this:

1) Create a table just with the numbers 1 to some arbitrarily high x.

2) In your probability table, have a start and end range for each name.

3) Use a query to generate a random number on the number table seeding with the number and time values.

4) Join that query to the probability table using a query similar to the previous post.
Jun 6 '12 #8
NeoPa
32,584 Recognized Expert Moderator MVP
I think Rabbit's initial suggestion can be modified to work simply by comparing the random number generated (0 <= X < 1) against the result of :
Expand|Select|Wrap|Line Numbers
  1. [dblProbability] / Sum([dblProbability])
Clearly, the random number would need to be reset for each record (IE. Rerun, rather than necessarily reseeded).
Jun 6 '12 #9
zmbd
5,501 Recognized Expert Moderator Expert
Rabbit:...1) Create a table just with the numbers 1 to some arbitrarily high x...
I understand what you're getting at; however, the probability is already part of the table as [dblProbability]... why not use the variation on partial sums tree method as I've suggested in #3 other than it may not be the fastest…
... sum all of the probabilities: 5+2+1= 8 then convert to percentage by division 5/8, 2/8, 1/8... gives approximately 63, 25, 12....
Now the 63 is actually 63% and the others 25%, 10%... cardinal sin not using the units... my bad.
Notice, this is basically the same as NeoPa's reply in #10.

Now, TheSmileyCoder' s original question was for (as I understood it) a specific, single, name to be picked hence my reply and subsequent query... the query can be optimized to pull just the one record.
Now however in post #5 TheSmileyCoder tells us that we need to pull not just one record but create a record-set that may consist of between one and several thousand elements; thus, the two suggested papers and the link to the mysql forum so as to provide a path to the solution to a method to do this efficiently and to provide information leading to the consideration for the need for non-duplication or replacement etc... I am working under the notion that the subset must also be in random order and that duplication is allowed.

hina:...a cumulative probability, then it's pretty much the same thing ...
Hina, you've simply re-stated Rabbit's post in #4. As I said in #3, cumulative probability is a very specific statistical theory that for this question may very well be a red-herring. For the sake of argument, please take a look at this very simple explanation (one should also work thru the associated lessons): http://stattrek.com/probability-dist...tribution.aspx cumulative probability simply doesn’t appear, to me, to be the correct path to the solution.

-z
Jun 6 '12 #10

Sign in to post your reply or Sign up for a free account.

Similar topics

5
14951
by: Derek Cooper | last post by:
I hope you can help me. I posted this in the microsoft sql server newsgroup a few days ago and got no response so I thought I'd try here. If I can provide any clarification I'll be glad to do so. I'm trying to calculate a column based on the value of the previous record. I'm not very experienced with SQL-Server. I'm using the following table: CREATE TABLE tblPayment (
1
2665
by: arthur-e | last post by:
How can you select records based on more than one combo box - I have a combobox that selects records based on name (I'm sure this has been asked a thousand times - web site answer/link could be helpful too; but I'm so bad with syntax that specifics will be MOST helpful) SELECT DISTINCT ., . FROM Union Select "<ALL>" , NULL From ;
1
1135
by: zeeshansohail | last post by:
I have two tables named as “COMPANY and BILL_DETAIL” with the following structure. BILL_DETAIL TABLE Name Type Nullable ------------- -------------- -------- BILL_ID NUMBER(7) S_HEAD_CODE VARCHAR2(12) SR_NO NUMBER(9) BILL_AMOUNT NUMBER(10) Y
4
3574
by: mcca0081 | last post by:
hi - i'm trying to delete one of the duplicate records based on the most current date. here's the code for my access 2000 db. any help would be appreciated!!! - thank you kindly Sub DeleteDuplicateRecords() ' Deletes duplicates from the specified table, keeping the most current received date record. ' No user confirmation is required.
2
2562
by: Shivajirp | last post by:
I have table in which Shift no, Start time. End time. if start time of shift is 6pm and end time of shift is 6 am then I want to select record between start time and end time.I need query to selct such record.
0
2386
by: TD | last post by:
I have a main form with two subforms (both in datasheet view), neither of which are linked to the main form. The main form is based on a query that uses the bound column of a combobox on the main form as the criteria for the query. The combobox is based on a query that retrieves the name and record id of the customers. In the afterUpdate event of the combobox on the main form and both subforms are requeried. The first subform is based...
5
1932
by: agarwasa2008 | last post by:
Hi, I would like to delete a record based on a user entered string. Here are the details. I have a txtFind textbox. A string is entered by the user. Based on that string value it displays that one record in the fields in the frmDeleteComponent form. There is also a lstDelete listBox which displays all the record. But for some reason it does not delete that particular
6
4074
by: mukeshrasm | last post by:
Hi I want to update records in database based on selected records using checkbox. if user does not select a record and clicks update button it should show the message please select record to be update if user selects say 3 records then there should be message are you sure to update 3 records and when clicks ok records should get updated. since records are coming dynamically so how can I get the number of checkboxes selected or not to...
4
2630
by: Manikrag | last post by:
Hi Team, Is it possible to sort select query based on input string? I am looking for somthing like: select TOP 20 PREFERRED_NAME from FRS_TABLE where Lower(PREFERRED_NAME) like Lower('%shar%') order by PREFERRED_NAME LIKE '%shar%'
11
5693
by: prashantdixit | last post by:
Hi, I am developing a stock control software. Iam stuck somewhere. I have a form "Add new stock" consisting of combobox, text boxes etc. which is used to add records in a table. I have another form which has "Add Similar To" button. The table which i used to store records contains a field call "Add More Quantity". If a user enters any numerical value in this field against any record and click on "Add Similar To" button then same "Add New...
0
9818
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
10843
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10589
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10254
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
9371
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
5648
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5825
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
2
4015
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
3112
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.