Bytes IT Community

Specifying many rows in a table

I have a large table (potentially tens or hundreds of millions of rows) and
I need to extract some number of these rows, defined by an integer primary
key.

So, the obvious answer is

select * from table where id in (1,3,4);

But I may want to extract a large number of rows, many thousands

select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);

This falls over when it exceeds the maximum expression depth of 10,000.
And I have a sneaky feeling that increasing max_expr_depth isn't the
right fix.

Performance is pretty important, so does anyone have a good suggestion
for how to phrase this query so that it'll at worst only be a single
seq-scan through the mondo table, and make a reasonable choice as to
whether to use an index-scan or seq-scan, depending on the number
of rows I'm pulling out?

Cheers,
Steve

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 22 '05 #1
10 Replies


On Wed, 28 Jan 2004, Steve Atkins wrote:
> I have a large table (potentially tens or hundreds of millions of rows) and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
> select * from table where id in (1,3,4);
>
> But I may want to extract a large number of rows, many thousands
>
> select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);

If the ranges are sequential, then BETWEEN would work. I have a feeling
that they aren't, though, looking at your examples.

> This falls over when it exceeds the maximum expression depth of 10,000.
> And I have a sneaky feeling that increasing max_expr_depth isn't the
> right fix.


The optimizations made for IN() queries in the 7.4 branch only really work
when there's a subselect/table inside the IN. You could try inserting those
numbers into a temp table and subselecting it.
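[Editor's note: a minimal, runnable sketch of the temp-table suggestion above. It uses Python's built-in sqlite3 purely so the example is self-contained; the thread is about PostgreSQL, where you would CREATE TEMP TABLE and fill it the same way. The table and column names here are invented for illustration.]

```python
import sqlite3

# sqlite3 stands in for PostgreSQL so this sketch runs anywhere.
# The SQL pattern is the same: load the ids into a temp table, then
# use IN (SELECT ...) instead of a huge scalar IN (...) list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(1, 1001)])

wanted = [1, 3, 5, 250, 999]        # ids we'd otherwise inline in IN (...)
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)",
                 [(i,) for i in wanted])

rows = conn.execute(
    "SELECT * FROM big WHERE id IN (SELECT id FROM wanted_ids)").fetchall()
print(sorted(r[0] for r in rows))   # [1, 3, 5, 250, 999]
```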


Nov 22 '05 #2

On Wed, Jan 28, 2004 at 01:15:27PM -0700, scott.marlowe wrote:
> On Wed, 28 Jan 2004, Steve Atkins wrote:
> > select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
> >
> > This falls over when it exceeds the maximum expression depth of 10,000.
> > And I have a sneaky feeling that increasing max_expr_depth isn't the
> > right fix.
>
> The optimizations made for in() queries in the 7.4 branch only really work
> when there's a subselect / table in the in. You could try inserting those
> numbers into a temp table and subselecting it.


I think using IN with such a big distinct set would suck performance-
wise, because it would have to resort to the old method which was slow.
I could be mistaken though. If IN (SELECT...) doesn't work, I'd try
rewriting the query to use an EXISTS pulling the values from a temp
table instead ...

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Before you were born your parents weren't as boring as they are now. They
got that way paying your bills, cleaning up your room and listening to you
tell them how idealistic you are." -- Charles J. Sykes' advice to teenagers


Nov 22 '05 #3

If you can insert the "in" values into a temporary table and then join
against your *large* table, you may find that you get better performance, at
the expense of having to populate the temp table with your "in" values.

Rather than having to populate and delete your "in" values ready for the
next query, consider adding a search counter column to the table, eg

(in pseudo code)
create temporary table INTABLE (searchid integer, invalue integer);

create a function InsertInValues(searchid integer, values text) which
takes a comma delimited list of "in" values, and uses a loop to insert
them into your (temp) INTABLE.

now your query is

select * from table t, intable it
where t.keyvalue = it.invalue
and searchid = XXXX;

Next query, call InsertInValues(searchid integer, values text) again with
the next set of data, incrementing the searchid value for the function
call, and then perform the next search.

Or write a function SearchUsingInValues(invalue integer) which does the
two steps above within the one function (generating its own internal
searchid), and returns a setof values.

Or do away with the searchid stuff, and either drop/recreate the temp
table between calls, or truncate the data.

Someone else may be able to suggest a better/faster way of generating the
"in" values...

I don't know if that's any use. Hope so.
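[Editor's note: a runnable sketch of the searchid scheme above, again using Python's sqlite3 in place of PostgreSQL so it is self-contained. The insert_in_values helper and the INTABLE layout follow the pseudocode; in PostgreSQL the helper would be a plpgsql function.]

```python
import sqlite3

def insert_in_values(conn, searchid, values_text):
    """Split a comma-delimited list of "in" values and tag each one
    with the given search id, as in the InsertInValues pseudocode."""
    rows = [(searchid, int(v)) for v in values_text.split(",")]
    conn.executemany("INSERT INTO intable VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (keyvalue INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(1, 101)])
conn.execute("CREATE TEMP TABLE intable (searchid INTEGER, invalue INTEGER)")

insert_in_values(conn, 1, "2,4,6")   # first search
insert_in_values(conn, 2, "10,20")   # next search: bump the searchid

rows = conn.execute(
    "SELECT b.* FROM big b, intable it "
    "WHERE b.keyvalue = it.invalue AND it.searchid = ?", (2,)).fetchall()
print(sorted(r[0] for r in rows))    # [10, 20]
```

Truncating or dropping the temp table between calls (as the post suggests) avoids the searchid bookkeeping at the cost of extra DDL per query.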

John Sidney-Woollett

Steve Atkins said:
> I have a large table (potentially tens or hundreds of millions of rows) and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
> select * from table where id in (1,3,4);
>
> But I may want to extract a large number of rows, many thousands
>
> select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
>
> This falls over when it exceeds the maximum expression depth of 10,000.
> And I have a sneaky feeling that increasing max_expr_depth isn't the
> right fix.
>
> Performance is pretty important, so does anyone have a good suggestion
> for how to phrase this query so that it'll at worst only be a single
> seq-scan through the mondo table, and make a reasonable choice as to
> whether to use an index-scan or seq-scan, depending on the number
> of rows I'm pulling out?
>
> Cheers,
> Steve


Nov 22 '05 #4

"scott.marlowe" <sc***********@ihs.com> writes:
> On Wed, 28 Jan 2004, Steve Atkins wrote:
> > But I may want to extract a large number of rows, many thousands
> > select * from table where id in (1, 3, 5, ...., 100000, 100017, 23000);
>
> You could try inserting those
> numbers into a temp table and subselecting it.


That's what I'd try too; it should work reasonably well in 7.4. Two
thoughts:

1. Be sure to "analyze" the temp table after filling it. The planner
must know the number of rows in the table to have any shot at choosing a
good plan.

2. If you know that the set of numbers you have emitted are all
different, don't use IN at all, but a straight join to the temp table:

select table.* from table, tmp where id = tmp.id;

With IN the system will have to work harder to eliminate duplicates from
the temp table; why make it do that if you know it's not necessary?

regards, tom lane
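[Editor's note: Tom's two points, sketched end to end. sqlite3 again stands in for PostgreSQL so the example runs on its own; sqlite happens to accept ANALYZE too, but in PostgreSQL you would run "ANALYZE tmp" after filling the temp table. All names are illustrative.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO big VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(1, 1001)])

# 1. Fill the temp table, then analyze it so the planner knows its size.
conn.execute("CREATE TEMP TABLE tmp (id INTEGER)")
conn.executemany("INSERT INTO tmp VALUES (?)", [(7,), (42,), (500,)])
conn.execute("ANALYZE")             # PostgreSQL: ANALYZE tmp;

# 2. The ids are known to be unique, so skip IN and join directly;
#    the planner never has to deduplicate the temp table.
rows = conn.execute(
    "SELECT big.* FROM big, tmp WHERE big.id = tmp.id").fetchall()
print(sorted(r[0] for r in rows))   # [7, 42, 500]
```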


Nov 22 '05 #5

On Wed, Jan 28, 2004 at 06:13:36PM -0500, Tom Lane wrote:
> "scott.marlowe" <sc***********@ihs.com> writes:
> > You could try inserting those
> > numbers into a temp table and subselecting it.
>
> That's what I'd try too; it should work reasonably well in 7.4. Two
> thoughts:
>
> 1. Be sure to "analyze" the temp table after filling it. The planner
> must know the number of rows in the table to have any shot at choosing a
> good plan.
>
> 2. If you know that the set of numbers you have emitted are all
> different, don't use IN at all, but a straight join to the temp table:
>
> select table.* from table, tmp where id = tmp.id;
>
> With IN the system will have to work harder to eliminate duplicates from
> the temp table; why make it do that if you know it's not necessary?


Yup.

I'm trying a number of variants (in(list), in(select), exists, join) and
benchmarking all of them across a range of numbers, randomly chosen or
clustered, with appropriate flushing of the DB and disk buffers between
runs.

Once the perl script has finished I'll post a pointer to the data.

An interesting thing, though, is that for 10,000,000 rows in big and
one row in little, everything indexed and analyzed, an EXISTS query
is painfully bad...

select * from big where exists (select 1 from little where s=big.id)

...expands to nested seqscans over big and little, rather than the
seqscan over little with a nested index scan over big I was expecting.

I'm pretty sure that in(list) will be a clear win for a small number
of rows, and a simple join will win for a larger number, but I'm
benchmarking the others out of random interest.

Cheers,
Steve


Nov 22 '05 #6

Steve Atkins <st***@blighty.com> writes:
> An interesting thing, though, is that for 10,000,000 rows in big and
> one row in little, everything indexed and analyzed, an EXISTS query
> is painfully bad...
>
> select * from big where exists (select 1 from little where s=big.id)
>
> ...expands to nested seqscans over big and little, rather than the
> seqscan over little with a nested index scan over big I was expecting.


The planner can figure this out for straight join and (as of 7.4) for
IN, but not for EXISTS(). I've forgotten the details, but there were
some semantic fine points that discouraged me from trying to tackle the
EXISTS() case at the same time as IN.

regards, tom lane


Nov 22 '05 #7

On Wed, Jan 28, 2004 at 04:14:21PM -0800, Steve Atkins wrote:
> I'm trying a number of variants (in(list), in(select), exists, join) and
> benchmarking all of them across a range of numbers, randomly chosen or
> clustered, with appropriate flushing of the DB and disk buffers between
> runs.
>
> Once the perl script has finished I'll post a pointer to the data.


On the off-chance anyone else is interested in the performance
variations (or is searching the archives from the year 2020)...

I have a table of 10,000,000 rows, each row having a few hundred bytes
of data. It has an integer primary key, and is clustered on that key.
Real data, not synthetic, not that it makes a difference for this
test.

I have in the application layer a list of the rows I want to retrieve
(a sorted list of unique integers) and want to retrieve all those rows
from the big table.

The test system is a single processor Athlon 1800 with a Reiser
filesystem on two striped decent scsi drives. PostgreSQL is pretty
well tuned. Between each test run PostgreSQL was shut down, a gig of
data read from the disk to flush OS caches and PostgreSQL restarted.

The big table is indexed and analysed. All temporary tables are analysed
(and times include construction, population, any indexing and analysing
of the table, but not destruction of it, nor time to transfer data to the
client).

The sequence of ids to be retrieved is identical in each run. In the
first batch of tests it's completely random, but sorted. In the second batch
it's a sequence (42, 43, 44, 45...) with a random starting point.

inscalar:     select * from big where id in (1, 17, 43 ...., 2071)
inselect:     select * from big where id in (select id from small)
exists:       select * from big where exists (select 1 from small where id=big.id)
join:         select * from big, small where big.id = small.id;

inselect-idx: As inselect, but small is indexed
exists-idx:   As exists, but small is indexed
join-idx:     As join, but small is indexed

In all but the first case, small is created as a temporary table, populated
with a sequence of inserts (inside a transaction), possibly indexed then
analysed.

number of random rows retrieved
                   1      10     100    1000   10000  100000
inscalar       0.394   0.818    3.80    23.5
inselect       0.752   0.770   0.751   0.977   0.798   0.790
join           0.728   0.746   0.625   0.766   0.703   0.806
inselect-idx   0.655   0.642   0.748   0.822   0.777   0.820
exists-idx      50.7    49.4    49.4    50.0    49.4    51.2
join-idx       0.657   0.703   0.760   0.730   0.731   0.815

number of sequential rows retrieved
                   1      10     100    1000   10000  100000
inscalar       0.464   0.474   0.488   0.529
inselect       0.820   0.826   0.778   0.776   0.612   0.770
join           0.754   0.855   0.831   0.829   0.899   0.742
inselect-idx   0.686   0.679   0.703   0.747   0.747   0.729
exists-idx      49.2    79.8    49.5    49.3    49.3    53.7
join-idx       0.625   0.754   0.697   0.770   0.750   0.783

Overall, IN(scalar) looks pretty good for single digit numbers. A simple
join looks nice for anything else - and indexing the temporary table
doesn't seem to buy anything.

Pretty impressive overall, though.

Cheers,
Steve


Nov 22 '05 #8


"scott.marlowe" <sc***********@ihs.com> writes:
> The optimizations made for in() queries in the 7.4 branch only really work
> when there's a subselect / table in the in. You could try inserting those
> numbers into a temp table and subselecting it.


I'll second this option.

But the other choice is to use a prepared statement like "select * from tab
where tab_id = ?" and simply execute it 10,000 times with different
parameters.

This involves lots of round trips to the server and is a lot less efficient,
but it doesn't require building a temporary table, which might be an
advantage. In fact if the data is coming from outside the database then you'll
probably end up building the temporary table using lots of executions like
this anyways, so it might not really be a disadvantage.
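[Editor's note: the per-row prepared-statement alternative, sketched with sqlite3 so it runs standalone. Binding via "?" mirrors a prepared "select * from tab where tab_id = ?"; in PostgreSQL each execute would be a server round trip, which is the cost Greg describes. Names are illustrative.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (tab_id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?)",
                 [(i, "row-%d" % i) for i in range(1, 101)])

ids = [3, 17, 99, 12345]            # 12345 doesn't exist in tab
results = []
for i in ids:
    # Same statement text every time, so the driver/server can reuse
    # the plan; the price is one round trip per id.
    row = conn.execute(
        "SELECT * FROM tab WHERE tab_id = ?", (i,)).fetchone()
    if row is not None:
        results.append(row)
print([r[0] for r in results])      # [3, 17, 99]
```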

In the future the ideal solution might be to have array-processing support
in postgres, where you prepare a statement, then ship the server an array of
parameter lists, and the server executes the statement once for each array
element. Ideally it would let you stream the array to the server and have it
execute as fast as you can ship the data, avoiding round-trip latency.
But I don't think there's any such support in postgres currently.

--
greg

Nov 22 '05 #9


----- Original Message -----
From: "Steve Atkins" <st***@blighty.com>
To: <pg***********@postgresql.org>
Sent: January 28, 2004 20:32
Subject: [GENERAL] Specifying many rows in a table

> I have a large table (potentially tens or hundreds of millions of rows) and
> I need to extract some number of these rows, defined by an integer primary
> key.
>
> So, the obvious answer is
>
> select * from table where id in (1,3,4);

Shouldn't it be select * from table where id in ('1','3','4'); ?
i.e. with single quotes around the numbers? I think I read somewhere in the
documentation that without the quotes an index scan may not always be used.



Nov 22 '05 #10

On Fri, Jan 30, 2004 at 12:38:43PM +0100, NTPT wrote:
> > I have a large table (potentially tens or hundreds of millions of rows) and
> > I need to extract some number of these rows, defined by an integer primary
> > key.
> >
> > So, the obvious answer is
> >
> > select * from table where id in (1,3,4);
>
> Shouldn't it be select * from table where id in ('1','3','4'); ?
> i.e. with single quotes around the numbers? I think I read somewhere in the
> documentation that without the quotes an index scan may not always be used.


That's a good rule in general, as an index will usually only be considered
if the type of the indexed field and the type of the value in the where
clause are the same. The usual example is an indexed bigint field
"foo" and a query "select * from table where foo=5". That will not use
the index, because "5" will be seen as an integer, and an integer isn't
a bigint.

"select * from table where foo=5::bigint" would work, but is clumsy.
Putting quotes around the "5" - "select * from table where foo='5'" -
delays deciding what type the "5" is until late enough in the planning
process that it can be treated as a bigint, and the index will be
considered.

That's not the issue here, though.

Cheers,
Steve


Nov 22 '05 #11
