Trying to get postgres to use an index

Mike Wertheim

Hi,

I'm using PostgreSQL 8.

I have two tables that I am doing a join on, and the join executes very
slowly.

The table called Notification has a text field called NotificationID,
which is its primary key. The Notification table also has an int4 field
called ItemID, and it has an index on the ItemID field. The table
called Item has an int4 field called ItemID, which is its primary key.
If I do a simple select on Notification using just the ItemID field, the
index is used...

explain select notificationID from NOTIFICATION n where n.itemID = 12;
QUERY PLAN

------------------------------------------------------------------------
---------------------
Index Scan using notification_4_idx on notification n
(cost=0.00..129.22 rows=57 width=44)
Index Cond: (itemid = 12)

This query runs in far less than one second.

But if I do a join, the index isn't used...

explain select notificationID from NOTIFICATION n, ITEM i where
n.itemID = i.itemID;
QUERY PLAN

------------------------------------------------------------------------
------
Hash Join (cost=47162.85..76291.32 rows=223672 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671
width=48)
-> Hash (cost=42415.28..42415.28 rows=741028 width=4)
-> Seq Scan on item i (cost=0.00..42415.28 rows=741028
width=4)

This query takes about 20 seconds to run.
I have run "vacuum analyze", and it didn't make any difference.

I've seen people say that sometimes the query optimizer will decide to
not use an index if it thinks that doing a sequential scan would be
faster. I don't know if that's what's happening here, but it seems to
me that using the index should be much faster than the performance I'm
getting here.

Does anyone have any suggestions on how to make this query run faster?

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #1

Subscribe Post Reply

9221

Troels Arvin

On Sat, 06 Nov 2004 12:00:02 -0800, Mike Wertheim wrote:

Does anyone have any suggestions on how to make this query run faster?

Does it help if you decrease the value of random_page_cost? - That value
can be changed run-time, from within psql. If you find that a certain,
lower value helps, you can make it permanent in postgresql.conf.

--
Greetings from Troels Arvin, Copenhagen, Denmark

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #2

Pierre-Frédéric Caillaud

> explain select notificationID from NOTIFICATION n, ITEM i where

n.itemID = i.itemID;
QUERY PLAN

------------------------------------------------------------------------
------
Hash Join (cost=47162.85..76291.32 rows=223672 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671
width=48)
-> Hash (cost=42415.28..42415.28 rows=741028 width=4)
-> Seq Scan on item i (cost=0.00..42415.28 rows=741028
width=4)

This query takes about 20 seconds to run.

Well, you're joining the entire two tables, so yes, the seq scan might be
faster.
Try your query with enable_seqscan=0 so it'll use an index scan and
compare the times.
You may be surprised to find that the planner has indeed made the right
choice.
This query selects 223672 rows, are you surprised it's slow ?

What are you trying to do with this query ? Is it executed often ?
If you want to select only a subset of this, use an additional where
condition and the planner will use the index.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #3

Joel Stevenson

At 10:11 PM +0100 11/6/04, Pierre-Frédéric Caillaud wrote:

explain select notificationID from NOTIFICATION n, ITEM i where
n.itemID = i.itemID;
QUERY PLAN

------------------------------------------------------------------------
------
Hash Join (cost=47162.85..76291.32 rows=223672 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671
width=48)
-> Hash (cost=42415.28..42415.28 rows=741028 width=4)
-> Seq Scan on item i (cost=0.00..42415.28 rows=741028
width=4)

This query takes about 20 seconds to run.

Well, you're joining the entire two
tables, so yes, the seq scan might be faster.
Try your query with enable_seqscan=0 so
it'll use an index scan and compare the times.
You may be surprised to find that the
planner has indeed made the right choice.
This query selects 223672 rows, are you surprised it's slow ?

I'm not a SQL guru by any stretch but would a
constrained sub-select be appropriate here?

e.g. a simple test setup where each record in
table test1 has a FK referenced to an entry in
test:

joels=# \d test
Table "public.test"
Column | Type | Modifiers
--------+--------------+-----------
id | integer | not null
foo | character(3) |
Indexes:
"test_pkey" primary key, btree (id)

joels=# \d test1
Table "public.test1"
Column | Type | Modifiers
---------+---------+-----------
id | integer | not null
test_id | integer |
Indexes:
"test1_pkey" primary key, btree (id)
"test1_test_id_idx" btree (test_id)
Foreign-key constraints:
"$1" FOREIGN KEY (test_id) REFERENCES test(id) ON DELETE CASCADE

joels=# select count(*) from test;
count
-------
10001
(1 row)

joels=# select count(*) from test1;
count
-------
10001
(1 row)

joels=# explain select test_id from test1 t1, test t where t1.test_id =t.id;
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=170.01..495.05 rows=10002 width=4)
Hash Cond: ("outer".test_id = "inner".id)
-> Seq Scan on test1 t1 (cost=0.00..150.01 rows=10001 width=4)
-> Hash (cost=145.01..145.01 rows=10001 width=4)
-> Seq Scan on test t (cost=0.00..145.01 rows=10001 width=4)
(5 rows)

joels=# explain select test_id from test1 t1
where test_id in (select id from test where id =
t1.test_id);
QUERY PLAN
------------------------------------------------------------------------------
Seq Scan on test1 t1 (cost=0.00..15269.02 rows=5001 width=4)
Filter: (subplan)
SubPlan
-> Index Scan using test_pkey on test (cost=0.00..3.01 rows=2 width=4)
Index Cond: (id = $0)
(5 rows)
So with the subselect the query planner would use
the primary key index on test when finding
referencing records in the test1 table.

Pierre, I seen the advice to use an additional
where condition in certain cases to induce an
index scan; how is this done?

my 1.2 pennies,
-Joel

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #4

mike.wertheim

I'm not a SQL guru by any stretch but would a
constrained sub-select be appropriate here? Well, you're joining the entire two tables, so yes, the seq scan might
be faster.

My mistake! When composing the email to state the problem, I accidentally
gave a wrong version of the join query.

Here is the corrected version, which still has the sequential scan...

explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
= i.itemID and i.projectID = 12;
QUERY PLAN
---------------------------------------------------------------------------
-----------------------
Hash Join (cost=2237.54..15382.32 rows=271 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671
width=48)
-> Hash (cost=2235.31..2235.31 rows=895 width=4)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2235.31 rows=895width=4)
Index Cond: (projectid = 12)

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #5

Tom Lane

<mi***********@linkify.com> writes:

Here is the corrected version, which still has the sequential scan... explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
= i.itemID and i.projectID = 12;
QUERY PLAN
---------------------------------------------------------------------------
-----------------------
Hash Join (cost=2237.54..15382.32 rows=271 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671
width=48)
-> Hash (cost=2235.31..2235.31 rows=895 width=4)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2235.31 rows=895width=4)
Index Cond: (projectid = 12)

This seems like a perfectly fine plan to me. If it were turned around
into a nested indexscan as you suggest, there would need to be 895
indexscans of NOTIFICATION (one for each row retrieved from ITEM)
and from your original mail we can see the planner thinks that an
indexscan on NOTIFICATION will take about 129 cost units, for a total
cost of 129 * 895 = 115455 units (and that's not counting the indexscan
on ITEM nor any join overhead). So at least according to these
estimates, using the index would take 10x more time than this plan.

If you want to see whether this costing is accurate, you could do
EXPLAIN ANALYZE for this way and the other (I expect that you'd get the
other if you did "set enable_seqscan = off"). But with a 10x
discrepancy I suspect the planner probably did the right thing.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #6

mike

Well, you're joining the entire two tables, so yes, the seq scan might
be faster.

My mistake. When composing the email to state the problem, I accidentally
gave a wrong versionof the join query.

Here is the corrected version, which still has the sequential scan...

explain select notificationID from NOTIFICATION n, ITEM i where n.itemID
= i.itemID andi.projectID = 12;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Hash Join (cost=2237.54..15382.32 rows=271 width=44)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..12023.71 rows=223671 width=48)
-> Hash (cost=2235.31..2235.31 rows=895 width=4)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2235.31 rows=895width=4)
Index Cond: (projectid = 12)

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #7

Mike Wertheim

I have some more info on my indexing situation, and a new question.

In my previous email, I told about 2 tables: Notification and Item,
which join on a field called ItemID. The joining query didn't execute
as quickly as I thought it should. I now notice that I have another
table, Folder, which joins with Item in a similar way, and the
performance of that join is excellent.

So my new questions is... What makes the Folder join faster than the
Notification join?
Here is some info on the tables, queries, and "explain analyze"
output...

Item's primary key is ItemID (int4).
Folder's primary key is ItemID (int4). Folder also contains 4 varchar
columns, 2 text columns, 6 bool columns, 7 datetime columns and 1 int4
column.
Notification has an index on its ItemID (int4) field. Notification also
contains 7 text columns (1 of them being the primary key), 3 timestamp
columns and 4 int4 columns.

Folder and Notification have a similar number of rows. "select count(*)
from folder" returns 193043. "select count(*) from notification"
returns 223689.
The first query is: "select count(*) from FOLDER f, ITEM i where
f.itemID = i.itemID and i.projectid=7720". This query returns the
result "5" and executes in less than 1 second.

The second query is: "select count(*) from NOTIFICATION n, ITEM i where
n.itemID = i.itemID and i.projectid=7720". This query returns the
result "2" and executes in about 40 seconds.
Here's the "explain analyze" output...

The Folder query uses the indexes:
explain analyze select count(*) from FOLDER f, ITEM i where f.itemID =
i.itemID and i.projectid=7720; Aggregate (cost=6371.88..6371.88
rows=1 width=0) (actual time=83.557..83.558 rows=1 loops=1)
-> Nested Loop (cost=0.00..6371.31 rows=227 width=0) (actual
time=17.929..83.502 rows=5 loops=1)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2105.51 rows=869 width=4) (actual time=0.098..19.409 rows=51
loops=1)
Index Cond: (projectid = 7720)
-> Index Scan using folder_pkey on folder f (cost=0.00..4.90
rows=1 width=4) (actual time=1.255..1.255 rows=0 loops=51)
Index Cond: (f.itemid = "outer".itemid)
Total runtime: 92.185 ms
The Notification query does a sequential scan on Notification:
explain analyze select count(*) from NOTIFICATION n, ITEM i where
n.itemID = i.itemID and i.projectid=7720;
Aggregate (cost=38732.31..38732.31 rows=1 width=0) (actual
time=40380.497..40380.498 rows=1 loops=1)
-> Hash Join (cost=2107.69..38731.65 rows=263 width=0) (actual
time=36341.174..40380.447 rows=2 loops=1)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..35502.89
rows=223689 width=4) (actual time=8289.236..40255.341 rows=223689
loops=1)
-> Hash (cost=2105.51..2105.51 rows=869 width=4) (actual
time=0.177..0.177 rows=0 loops=1)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2105.51 rows=869 width=4) (actual time=0.025..0.127 rows=51
loops=1)
Index Cond: (projectid = 7720)
Total runtime: 40380.657 ms
So my question is... What difference do you see between the Folder and
Notification tables that would account for such a big difference in
query performance? And how can I make the Notification query run about
as fast as the Folder query?

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #8

Martijn van Oosterhout

Firstly, check that all your columns are actually of the same type and
the indexes are where you say they are. Using \d will show this.
Secondly, if you do the EXPLAIN ANALYZE with "set enable_seqscan=off",
what is the output?

Hope this helps,

On Tue, Nov 09, 2004 at 05:02:01PM -0800, Mike Wertheim wrote:

I have some more info on my indexing situation, and a new question.

In my previous email, I told about 2 tables: Notification and Item,
which join on a field called ItemID. The joining query didn't execute
as quickly as I thought it should. I now notice that I have another
table, Folder, which joins with Item in a similar way, and the
performance of that join is excellent.

So my new questions is... What makes the Folder join faster than the
Notification join?

Here is some info on the tables, queries, and "explain analyze"
output...

Item's primary key is ItemID (int4).
Folder's primary key is ItemID (int4). Folder also contains 4 varchar
columns, 2 text columns, 6 bool columns, 7 datetime columns and 1 int4
column.
Notification has an index on its ItemID (int4) field. Notification also
contains 7 text columns (1 of them being the primary key), 3 timestamp
columns and 4 int4 columns.

Folder and Notification have a similar number of rows. "select count(*)
from folder" returns 193043. "select count(*) from notification"
returns 223689.

The first query is: "select count(*) from FOLDER f, ITEM i where
f.itemID = i.itemID and i.projectid=7720". This query returns the
result "5" and executes in less than 1 second.

The second query is: "select count(*) from NOTIFICATION n, ITEM i where
n.itemID = i.itemID and i.projectid=7720". This query returns the
result "2" and executes in about 40 seconds.

Here's the "explain analyze" output...

The Folder query uses the indexes:
explain analyze select count(*) from FOLDER f, ITEM i where f.itemID =
i.itemID and i.projectid=7720; Aggregate (cost=6371.88..6371.88
rows=1 width=0) (actual time=83.557..83.558 rows=1 loops=1)
-> Nested Loop (cost=0.00..6371.31 rows=227 width=0) (actual
time=17.929..83.502 rows=5 loops=1)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2105.51 rows=869 width=4) (actual time=0.098..19.409 rows=51
loops=1)
Index Cond: (projectid = 7720)
-> Index Scan using folder_pkey on folder f (cost=0.00..4.90
rows=1 width=4) (actual time=1.255..1.255 rows=0 loops=51)
Index Cond: (f.itemid = "outer".itemid)
Total runtime: 92.185 ms

The Notification query does a sequential scan on Notification:
explain analyze select count(*) from NOTIFICATION n, ITEM i where
n.itemID = i.itemID and i.projectid=7720;
Aggregate (cost=38732.31..38732.31 rows=1 width=0) (actual
time=40380.497..40380.498 rows=1 loops=1)
-> Hash Join (cost=2107.69..38731.65 rows=263 width=0) (actual
time=36341.174..40380.447 rows=2 loops=1)
Hash Cond: ("outer".itemid = "inner".itemid)
-> Seq Scan on notification n (cost=0.00..35502.89
rows=223689 width=4) (actual time=8289.236..40255.341 rows=223689
loops=1)
-> Hash (cost=2105.51..2105.51 rows=869 width=4) (actual
time=0.177..0.177 rows=0 loops=1)
-> Index Scan using item_ix_item_4_idx on item i
(cost=0.00..2105.51 rows=869 width=4) (actual time=0.025..0.127 rows=51
loops=1)
Index Cond: (projectid = 7720)
Total runtime: 40380.657 ms

So my question is... What difference do you see between the Folder and
Notification tables that would account for such a big difference in
query performance? And how can I make the Notification query run about
as fast as the Folder query?

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFBkfmNY5Twig3Ge+YRAmqwAKCrZCLZAOf1ls309w8ITM NhdwDhJgCeM5BF
NbpZi90BDGeVFCQumU5Wd/o=
=NzQs
-----END PGP SIGNATURE-----

Nov 23 '05 #9

Similar topics

Trying to create a GiST index in 7.3

by: Dmitry Tkach | last post by:

Hi, everybody! I am trying to create a custom GiST index in 7.3, but getting an error, that I don't know how to interpret: testdb=# create table gist_test (field int8); CREATE TABLE testdb=#...

PostgreSQL Database

Any way to Convince postgres to push join clause inside subquery aggregate?

by: Greg Stark | last post by:

I find I often want to be able to do joins against views where the view are aggregates on a column that has an index. Ie, something like SELECT a.*, v.n FROM a JOIN (select a_id,count(*) as n...

PostgreSQL Database

Feature Request for 7.5

by: Chris Travers | last post by:

Hi all; I have been looking into how to ensure that synchronous replication, etc. could best be implimented. To date, I see only two options: incorporate the replication code into the database...

PostgreSQL Database

XML & Postgres Functions

by: zerobearing2 | last post by:

Hi all- I'm migrating to postgres from the MS SQL Server land, as I can see a great potential with postgres, I was wondering if anyone has experimented or started a project with XML inside user...

PostgreSQL Database

Crash in postgres/linux on verly large database

by: Bernhard Ankenbrand | last post by:

Hi, we have a table width about 60.000.000 entrys and about 4GB storage size. When creating an index on this table the whole linux box freezes and the reiser-fs file system is corrupted on not...

PostgreSQL Database

trying to restore after a server failure...need some help

by: David A. Ulevitch | last post by:

Hi fellow pgsql users, I am helping my university's student union get back up and running after some major server issues they had. They had serious disk issues on a server, but not on the /var...

PostgreSQL Database

New to Postgres

by: Darkcamel | last post by:

Hello all, I am new to postgres and don't really understand how the database is set-up. I am very fluent with mysql and sql2000, but postgres is new to me. If anyone can point me to some good...

PostgreSQL Database

any solution for full text search in Postgres for partial words

by: Liu, Mingyi | last post by:

Sorry if this question has been asked before. I tried to search in postgres mailing lists at http://archives.postgresql.org/pgsql-general/ just now and it gave me error "An error occured! Can not...

PostgreSQL Database

making another super user other than postgres

by: Bob Powell | last post by:

Hello everyone; My systems admin says that he needs to have use of the Postgres user without a password. His Debian package manager requires this. He tells me that he can lock down that user...

PostgreSQL Database

Easy Steps to Fix "Canon Printer Won't Connect to WiFi Network"

by: taylorcarr | last post by:

A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...

General

How to turn on java script in a villaon keypad mobile phone

by: Charles Arthur | last post by:

How do i turn on java script on a villaon, callus and itel keypad mobile phone

Java

Migrating Website to Cloud - Emmanuel Katto

by: emmanuelkatto | last post by:

Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel

General

Navigating the Data Structures and Algorithms (DSA)

by: BarryA | last post by:

What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...

Algorithms / Advanced Math

Is that possible of reading the .csv file in column wise and the column have different lengths ?

by: Sonnysonu | last post by:

This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...

C / C++

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...

General

Changing the language in Windows 10

by: Hystou | last post by:

Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...

Windows Server

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

C / C++

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

Online Marketing