
performance of IN (subquery)

I'm using PG 7.4.3 on Mac OS X.

I am disappointed with the performance of queries like 'select foo from
bar where baz in (subquery)', or updates like 'update bar set foo = 2
where baz in (subquery)'. PG always seems to want to do a sequential
scan of the bar table. I wish there were a way of telling PG, "use the
index on baz in your plan, because I know that the subquery will return
very few results". Where it really matters, I have been constructing
dynamic queries by looping over the values for baz and building a
separate query for each one and combining with a UNION (or just
directly updating, in the update case). Depending on the size of the
bar table, I can get speedups of hundreds or even more than a thousand
times, but it is a big pain to have to do this.
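A minimal sketch of that hand-built rewrite (the values 41209 and 25047 stand in for whatever the subquery would return; they come from the example further below):

select bundle_id from build.elements where elementid = 41209
union
select bundle_id from build.elements where elementid = 25047;

-- update case: one statement per subquery value
update bar set foo = 2 where baz = 41209;
update bar set foo = 2 where baz = 25047;

With a literal key in each arm, the planner can use the index on the column.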

Any tips?

Thanks,
Kevin Murphy

Illustrated:

The query I want to do is very slow:

select bundle_id from build.elements
where elementid in (
    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1);
 bundle_id
-----------
      7644
      7644
(2 rows)
Time: 518.242 ms
The subquery is fast:

SELECT superlocs_2.element_id
FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
WHERE bundle_superlocs_2.protobundle_id = 1;
 element_id
------------
      41209
      25047
(2 rows)
Time: 3.268 ms
And using indexes on the main table is fast:

select bundle_id from build.elements
where elementid in (41209, 25047);
 bundle_id
-----------
      7644
      7644
(2 rows)
Time: 2.468 ms

The plan for the slow query:

egenome_test=# explain analyze select bundle_id from build.elements
where elementid in (
    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1);
                                            QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Join  (cost=70.33..72.86 rows=25 width=4) (actual time=583.051..583.059 rows=2 loops=1)
   Hash Cond: ("outer".element_id = "inner".elementid)
   ->  HashAggregate  (cost=47.83..47.83 rows=25 width=4) (actual time=0.656..0.658 rows=2 loops=1)
         ->  Hash Join  (cost=22.51..47.76 rows=25 width=4) (actual time=0.615..0.625 rows=2 loops=1)
               Hash Cond: ("outer".superloc_id = "inner".superloc_id)
               ->  Seq Scan on superlocs_2  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.004..0.012 rows=9 loops=1)
               ->  Hash  (cost=22.50..22.50 rows=5 width=4) (actual time=0.076..0.076 rows=0 loops=1)
                     ->  Seq Scan on bundle_superlocs_2  (cost=0.00..22.50 rows=5 width=4) (actual time=0.024..0.033 rows=2 loops=1)
                           Filter: (protobundle_id = 1)
   ->  Hash  (cost=20.00..20.00 rows=1000 width=8) (actual time=581.802..581.802 rows=0 loops=1)
         ->  Seq Scan on elements  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.172..405.243 rows=185535 loops=1)
 Total runtime: 593.843 ms
(12 rows)
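One thing worth noticing in this plan: the rows=1000 estimates for superlocs_2 and elements are the planner's defaults for tables without statistics, while elements actually returned 185535 rows. A hedged first step, assuming an index on elements.elementid exists, is to refresh the statistics and re-check the plan:

ANALYZE superlocs_2;
ANALYZE bundle_superlocs_2;
ANALYZE build.elements;

EXPLAIN ANALYZE
SELECT bundle_id FROM build.elements
WHERE elementid IN (
    SELECT superlocs_2.element_id
    FROM superlocs_2 NATURAL JOIN bundle_superlocs_2
    WHERE bundle_superlocs_2.protobundle_id = 1);

With accurate row counts, the planner should prefer an index scan on elements over the sequential scan.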
Nov 23 '05
Greg Stark <gs*****@mit.edu> writes:
> It's orthogonal. My point was that I have a bigger problem, but even if I
> address it by switching away from plpgsql, or I guess by using EXECUTE, I
> would still have a problem. I didn't realize you could run analyze in a
> transaction, but even being able to I wouldn't really want to have to do
> that repeatedly during the job.

Why not? Given the sampling behavior that's been in there for a release
or two, ANALYZE is pretty cheap on large tables; certainly much cheaper
than any processing you might be doing that's going to grovel over the
whole table.

> Except that the first thing the job does is delete all the old records.
> This is inside a transaction. So an estimate based on the heap size would
> be off by a factor of two by the time the job is done.

Could you use TRUNCATE? I dunno if locking the table is okay for you.
It is transaction safe though.

> With analyze in a transaction I'm not clear what the semantics should be
> though. I suppose it should only count tuples visible to the transaction
> running analyze?

It currently uses SnapshotNow, so would see committed tuples of other
transactions plus uncommitted ones of the present transaction. This is
not exactly the same thing as the transaction's snapshot, but close.
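A sketch of how that fits the batch job described above (bar is a stand-in table name; per the preceding exchange, both ANALYZE and TRUNCATE can run inside a transaction here):

BEGIN;
DELETE FROM bar;   -- or TRUNCATE bar, which takes a stronger lock
-- ... reload the table ...
ANALYZE bar;       -- uses SnapshotNow, so it sees this transaction's
                   -- uncommitted rows, as just described
COMMIT;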
> A sudden degradation is much more dangerous. Even if it's rare, a sudden
> degradation means an outage in prime time.

[ shrug ] You can get a sudden degradation with fixed plans, too. All
it takes is an addition of a lot of rows in some table that had been
small.

			regards, tom lane


Nov 23 '05 #21
Greg Stark wrote:
> Ideally I would want a guarantee that every query would *always* result in
> the same plan. Once I've tested them and approved the plans I want to know
> that only those approved plans will ever run, and I want to be present and
> be able to verify new plans before they go into production.

What you are saying is "never run ANALYZE", or, if you do, you have to
re-test all your plans. "*Always* the same plan" is nonsense, because the
plan depends on the data distribution; do you test your plans for each
histogram slice?

> I doubt I'm going to convince anyone today...

For sure not me.

Regards
Gaetano Mendola

Nov 23 '05 #22

Tom Lane <tg*@sss.pgh.pa.us> writes:
> > Except that the first thing the job does is delete all the old records.
> > This is inside a transaction. So an estimate based on the heap size would
> > be off by a factor of two by the time the job is done.
>
> Could you use TRUNCATE? I dunno if locking the table is okay for you.
> It is transaction safe though.

Well, if necessary I could, but if I can do it without downtime, all the
better. In any case I think I'll be OK with a factor-of-two misestimation. I
was just giving an example use case for you to chew on when analyzing this
new proposal.

I'm not sure where I stand with the idea. I like the idea that table sizes
would always be fairly reasonable even without statistics. But I also have a
really strong desire for plan stability.

> [ shrug ] You can get a sudden degradation with fixed plans, too. All
> it takes is an addition of a lot of rows in some table that had been
> small.

Well, presumably I should be aware if my data distribution is changing
drastically. That's under my control. At least the performance change will be
proportionate to the distribution change.

With plans changing on the fly, I could have a query that degrades 1% for
every row added and then suddenly becomes 10x slower when I add a 17th extra
row. Of course such a system isn't perfectly tuned, or the optimizer issue
should be found and fixed. But I would rather find out about it without
having my application fail.

--
greg

Nov 23 '05 #23
