473,405 Members | 2,415 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,405 software developers and data experts.

Selecting a random row


Hi!

I have to select a random row from a table where primary key isn't
continuous (some rows have been deleted). Postgres just seems to do
something strange with my method.

--
-- Use the order by desc limit 1 -trick to get maximum value
--
CREATE OR REPLACE FUNCTION max_uid() RETURNS int4 AS
'SELECT uid FROM users WHERE status = ''a'' ORDER BY uid DESC LIMIT 1'
LANGUAGE 'sql';
--
-- Choose a random point between 0 and max_uid and select the first
-- value from the bigger part
--
CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS
'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >=
cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid
ASC LIMIT 1'
LANGUAGE 'sql';

--
-- testing and looks good
--
galleria=> SELECT max_uid();
max_uid
---------
126263

--
-- testing...
--
galleria=> SELECT random_uid(), random_uid(), random_uid(), random_uid(), random_uid();
random_uid | random_uid | random_uid | random_uid | random_uid
------------+------------+------------+------------+------------
322 | 601 | 266 | 427 | 369

.... but what is this? Values seem to vary from 0 to ~1000.
Not from 0 to 126263!!

How about doing some manual work...

--
-- Testing split point selection
--
galleria=> SELECT cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER);
int4
-------
43279

--
-- And inserting split point manually
--
galleria=> SELECT uid FROM users u WHERE u.status = 'a' AND uid >= 43279
ORDER BY uid ASC LIMIT 1;
uid
-------
43284

Works just fine!
Is there any explanation for this strange behavior or are there better
ways to select a random row?

I'm using PG 8.0 b2. Plan for the query is:

Limit (cost=0.00..5.19 rows=1 width=4)
-> Index Scan using users_pkey on users u (cost=0.00..69145.26 rows=13329 width=4)
Filter: ((status = 'a'::bpchar) AND (uid >= ((((max_uid() - 1))::double precision * random()))::integer))

|\__/|
( oo ) Kari Lavikka - tu***@bdb.fi
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #1
8 3812
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thursday 04 November 2004 12:36, Kari Lavikka wrote:
Is there any explanation for this strange behavior or are there better
ways to select a random row?


How about

SELECT ...whatever... ORDER BY random() LIMIT 1;

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpK G4gZmt4wCcD/ZC
KHBlBl1+5FZ4pgqkZlyzWQA=
=MrrE
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #2

Works but is too slooow. Shuffling whole table and selecting the first
row is not the way to go in this case.

Limit (cost=5340.74..5340.74 rows=1 width=4)
-> Sort (cost=5340.74..5440.70 rows=39986 width=4)
Sort Key: random()
-> Seq Scan on users (cost=0.00..2284.37 rows=39986 width=4)
Filter: (status = 'a'::bpchar)

|\__/|
( oo ) Kari Lavikka - tu***@bdb.fi
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""

On Thu, 4 Nov 2004, Holger Klawitter wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thursday 04 November 2004 12:36, Kari Lavikka wrote:
Is there any explanation for this strange behavior or are there better
ways to select a random row?


How about

SELECT ...whatever... ORDER BY random() LIMIT 1;

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpK G4gZmt4wCcD/ZC
KHBlBl1+5FZ4pgqkZlyzWQA=
=MrrE
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #3
IIRC, this was discussed a few times on this list, searching the
archives might get you some results. AFAIR, the only way to do it
efficiently is to have a column specially assigned for this purpose, and
populate it with random numbers in a big range. The column should be
indexed to assure fast access based on it. Then you can select the first
row with that column's value bigger (or smaller if you like) than a
random number in the same range.

HTH,
Csaba.

On Thu, 2004-11-04 at 14:34, Kari Lavikka wrote:
Works but is too slooow. Shuffling whole table and selecting the first
row is not the way to go in this case.

Limit (cost=5340.74..5340.74 rows=1 width=4)
-> Sort (cost=5340.74..5440.70 rows=39986 width=4)
Sort Key: random()
-> Seq Scan on users (cost=0.00..2284.37 rows=39986 width=4)
Filter: (status = 'a'::bpchar)

|\__/|
( oo ) Kari Lavikka - tu***@bdb.fi
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""

On Thu, 4 Nov 2004, Holger Klawitter wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thursday 04 November 2004 12:36, Kari Lavikka wrote:
Is there any explanation for this strange behavior or are there better
ways to select a random row?


How about

SELECT ...whatever... ORDER BY random() LIMIT 1;

Mit freundlichem Gruß / With kind regards
Holger Klawitter
- --
lists <at> klawitter <dot> de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)

iD8DBQFBiibF1Xdt0HKSwgYRAlJXAJ4nUpDfKBKCigPVMt8WpK G4gZmt4wCcD/ZC
KHBlBl1+5FZ4pgqkZlyzWQA=
=MrrE
-----END PGP SIGNATURE-----

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #4

Replying to myself..

Actually I found an answer. If a I wrap the split point selection to
subquery then the range of results is from 0 to maximum value (~120k in
this case)

galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
(select cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid
DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT 1;
uid
-------
91937
(1 row)

Limit (cost=1.73..3.53 rows=1 width=4)
InitPlan
-> Result (cost=1.71..1.73 rows=1 width=0)
InitPlan
-> Limit (cost=0.00..1.71 rows=1 width=4)
-> Index Scan Backward using users_pkey on users (cost=0.00..68423.70 rows=39986 width=4)
Filter: (status = 'a'::bpchar)
-> Index Scan using users_pkey on users u (cost=0.00..23983.04 rows=13329 width=4)
Index Cond: (uid >= $1)
Filter: (status = 'a'::bpchar)
However, without the additional nothing doing subquery the range of
results is something like 0 to ~1000 which is of course wrong.
galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid DESC
LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1;
uid
-----
587
(1 row)
Examining the query plan reveals that without subquery the random
comparison is made for each row. That causes a kind of "premature
selection".
galleria=> explain SELECT u.uid FROM users u WHERE u.status = 'a' AND uid
= cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid DESC

LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid ASC LIMIT 1;

------------------------------------------------------------------------------------------------------------
Limit (cost=1.71..6.89 rows=1 width=4)
InitPlan
-> Limit (cost=0.00..1.71 rows=1 width=4)
-> Index Scan Backward using users_pkey on users (cost=0.00..68423.70 rows=39986 width=4)
Filter: (status = 'a'::bpchar)
-> Index Scan using users_pkey on users u (cost=0.00..69042.18 rows=13329 width=4)
Filter: ((status = 'a'::bpchar) AND (uid >= (((($0 - 1))::double precision * random()))::integer))
(7 rows)
Well, it works now. Thanks anyway ;)
|\__/|
( oo ) Kari Lavikka - tu***@bdb.fi - (050) 380 3808
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #5
Kari Lavikka <tu***@bdb.fi> writes:
--
-- Choose a random point between 0 and max_uid and select the first
-- value from the bigger part
--
CREATE OR REPLACE FUNCTION random_uid() RETURNS int4 AS
'SELECT uid FROM users u WHERE u.status = ''a'' AND uid >=
cast(cast(max_uid() - 1 AS FLOAT) * random() AS INTEGER) ORDER BY uid
ASC LIMIT 1'
LANGUAGE 'sql';


This isn't going to do what you think because the random() function is
re-evaluated at every row of the table. (For that matter, so is
max_uid(), which means performance would suck even if it worked ...)

I'd suggest rewriting in plpgsql so you can assign the (max_uid-1)*random()
expression to a variable and then just use the variable in the SELECT.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #6
From: "Kari Lavikka" <tu***@bdb.fi>

Actually I found an answer. If a I wrap the split point selection to
subquery then the range of results is from 0 to maximum value (~120k in
this case)

galleria=> SELECT u.uid FROM users u WHERE u.status = 'a' AND uid >=
(select cast(cast((SELECT uid FROM users WHERE status = 'a' ORDER BY uid
DESC LIMIT 1) - 1 AS FLOAT) * random() AS INTEGER)) ORDER BY uid ASC LIMIT 1; uid
-------
91937
(1 row)


Tthe problem with this is that this is not very random.
If the uids 30000 to 39999 have been missing, but
the uids are more or less contiguous apart from that,
the uid 40000 would be 10000 times more likely to be selected
than average.

Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1
could be practical.

gnari

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #7
"gnari" <gn***@simnet.is> writes:
Tthe problem with this is that this is not very random.
If the uids 30000 to 39999 have been missing, but
the uids are more or less contiguous apart from that,
the uid 40000 would be 10000 times more likely to be selected
than average.


There is some discussion of selecting random rows in the list archives.
My recollection is that we decided the only way to be fast *and*
truly random was to dedicate an extra column to be a random key.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #8
Tthe problem with this is that this is not very random.
If the uids 30000 to 39999 have been missing, but
the uids are more or less contiguous apart from that,
the uid 40000 would be 10000 times more likely to be selected
than average.
There are some gaps but distribution of them is quite uniform. And results
seem to be random enuff for this particular purpose.
Maybe using an OFFSET of (count(*) * random()) and a LIMIT 1
could be practical.


Something like OFFSET (random() * 10) could be used for additional
randomness of course.

|\__/|
( oo ) Kari Lavikka - tu***@bdb.fi
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

5
by: Joe Six-Pack | last post by:
Hi, Im having problems in randomly selecting an element in an array that has not been selected before.. in other words, I have an array of answers to questions, then I want to select 5, with one...
9
by: Martin Christensen | last post by:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Howdy! Suppose I have a list A containing elements that I want to put into list B if they satisfy some property. The obvious way of doing it...
12
by: Alexander Schmolck | last post by:
Quite a few algortihms prominently feature choosing/removing a single random element from a set at a time. On first sight I can't think of anything better to achieve this with `sets.Set` than...
12
by: Joseph Shraibman | last post by:
Is there a way to get random rows besides ORDER BY random()? The problem with ORDER BY random() is that is has to get all the rows from the table before the results are returned. ...
2
by: VB Programmer | last post by:
I am interested to hear your suggestions on this... I have a table full of survey questions. The questions are individually classified as priority 1, 2 or 3. (Priority 1 means the question...
3
by: John Fairhurst | last post by:
Hi, The following code should select the specified number of records randomly from the database <% .... query = "SELECT FROM " Set RS = Server.CreateObject("ADODB.Recordset")
9
by: Generic Usenet Account | last post by:
I had a need to randomly select an element from an STL collection. It does not appear that this functionality is provided out-of-the-box with STL. Here is my crude implementation. I am using...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.