By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
437,809 Members | 1,997 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 437,809 IT Pros & Developers. It's quick & easy.

Why does adding SUM and GROUP BY destroy performance?

P: n/a
Hi,

Why does adding SUM and GROUP BY destroy performance?
details follow.
Thanks, David Link

s1.sql:
SELECT
t.tid, t.title,
COALESCE(s0c100r100.units, 0) as w0c100r100units,
(COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0))
as r0c2r100units
FROM
title t
JOIN upc u1 ON t.tid = u1.tid
LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
AND s0c100r100.region = 100
LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
AND r1c2r100.region = 100
LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
AND y0c2r100.region = 100
LEFT OUTER JOIN media m ON t.media = m.key
LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
t.distributor != 'CONTROL LABEL'
ORDER BY
t.title ASC
LIMIT 50
;
s2.sql:
SELECT
t.tid, t.title,
SUM(COALESCE(s0c100r100.units, 0)) as w0c100r100units,
SUM((COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0)))
as r0c2r100units
FROM
title t
JOIN upc u1 ON t.tid = u1.tid
LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
AND s0c100r100.region = 100
LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
AND r1c2r100.region = 100
LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
AND y0c2r100.region = 100
LEFT OUTER JOIN media m ON t.media = m.key
LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
t.distributor != 'CONTROL LABEL'
GROUP BY
t.tid, t.title
ORDER BY
t.title ASC
LIMIT 50
;
Times:
s1.sql takes 0m0.124s
s2.sql takes 1m1.450s

Stats:
title table: 68,000 rows
sale_200331 table: 150,000 rows
ytd_200331 table: 0 rows
rtd table: 650,000 rows

Indexes are in place.

s1 explain plan:
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..65105.51 rows=50 width=132)
-> Nested Loop (cost=0.00..91726868.54 rows=70445 width=132)
Join Filter: ("outer".screen_format = "inner"."key")
-> Nested Loop (cost=0.00..91651668.74 rows=70445 width=127)
Join Filter: ("outer".media = "inner"."key")
-> Nested Loop (cost=0.00..91578053.95 rows=70445
width=122)
-> Nested Loop (cost=0.00..91236359.89
rows=70445 width=98)
-> Nested Loop (cost=0.00..90894665.82
rows=70445 width=74)
-> Nested Loop
(cost=0.00..90539626.76 rows=70445 width=50)
-> Index Scan using
title_title_ind on title t (cost=0.00..193051.67 rows=68775 width=38)
Filter: (distributor <>
'CONTROL LABEL'::character varying)
-> Index Scan using
davids_tid_index on upc u1 (cost=0.00..1309.24 rows=353 width=12)
Index Cond: ("outer".tid =
u1.tid)
-> Index Scan using
sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100
(cost=0.00..5.02 rows=1 width=24)
Index Cond: (("outer".upc =
s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel
= 100) AND (s0c100r100.region = 100))
-> Index Scan using
rtd_upc_year_chl_reg_ind on rtd r1c2r100 (cost=0.00..4.83 rows=1
width=24)
Index Cond: (("outer".upc =
r1c2r100.upc) AND (r1c2r100."year" = 2002) AND (r1c2r100.channel = 2)
AND (r1c2r100.region = 100))
-> Index Scan using ytd_200331_upc_wkchlreg_ind
on ytd_200331 y0c2r100 (cost=0.00..4.83 rows=1 width=24)
Index Cond: (("outer".upc = y0c2r100.upc)
AND (y0c2r100.week = 200331) AND (y0c2r100.channel = 2) AND
(y0c2r100.region = 100))
-> Seq Scan on media m (cost=0.00..1.02 rows=2
width=5)
-> Seq Scan on screen_format sf (cost=0.00..1.03 rows=3
width=5)
(21 rows)
s2 explain plan:

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=403996.99..403997.11 rows=50 width=132)
-> Sort (cost=403996.99..404014.60 rows=7044 width=132)
Sort Key: t.title
-> Aggregate (cost=402393.74..403274.30 rows=7044 width=132)
-> Group (cost=402393.74..402922.08 rows=70445
width=132)
-> Sort (cost=402393.74..402569.86 rows=70445
width=132)
Sort Key: t.tid, t.title
-> Hash Join (cost=375382.76..392011.46
rows=70445 width=132)
Hash Cond: ("outer".screen_format =
"inner"."key")
-> Hash Join
(cost=375381.72..390997.78 rows=70445 width=127)
Hash Cond: ("outer".media =
"inner"."key")
-> Merge Join
(cost=375380.70..390057.49 rows=70445 width=122)
Merge Cond: ("outer".upc =
"inner".upc)
Join Filter:
(("inner".week = 200331) AND ("inner".channel = 2) AND ("inner".region
= 100))
-> Merge Join
(cost=375380.70..382782.40 rows=70445 width=98)
Merge Cond:
("outer".upc = "inner".upc)
Join Filter:
(("inner"."year" = 2002) AND ("inner".channel = 2) AND ("inner".region
= 100))
-> Sort
(cost=375310.87..375486.98 rows=70445 width=74)
Sort Key:
u1.upc
-> Nested
Loop (cost=6348.20..367282.53 rows=70445 width=74)
-> Hash
Join (cost=6348.20..12243.46 rows=70445 width=50)

Hash Cond: ("outer".tid = "inner".tid)
->
Seq Scan on upc u1 (cost=0.00..2795.28 rows=70628 width=12)
->
Hash (cost=4114.93..4114.93 rows=68775 width=38)

-> Seq Scan on title t (cost=0.00..4114.93 rows=68775 width=38)

Filter: (distributor <> 'CONTROL LABEL'::character varying)
->
Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331
s0c100r100 (cost=0.00..5.02 rows=1 width=24)

Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week =
200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
-> Sort
(cost=69.83..72.33 rows=1000 width=24)
Sort Key:
r1c2r100.upc
-> Seq Scan
on rtd r1c2r100 (cost=0.00..20.00 rows=1000 width=24)
-> Index Scan using
ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100 (cost=0.00..52.00
rows=1000 width=24)
-> Hash (cost=1.02..1.02
rows=2 width=5)
-> Seq Scan on media m
(cost=0.00..1.02 rows=2 width=5)
-> Hash (cost=1.03..1.03 rows=3
width=5)
-> Seq Scan on screen_format sf
(cost=0.00..1.03 rows=3 width=5)
(36 rows)


__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 11 '05 #1
Share this Question
Share on Google+
4 Replies


P: n/a
On Wed, 2003-09-17 at 12:51, David Link wrote:
Hi,

Why does adding SUM and GROUP BY destroy performance?
details follow.
Thanks, David Link


PostgreSQL's generalized UDFs makes optimizing for the standard
aggregates very hard to do. There should be some improvement in
v7.4, though.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ro***********@cox.net
Jefferson, LA USA

"Those who would give up essential Liberty to purchase a little
temporary safety, deserve neither Liberty nor safety." or
something like that
Ben Franklin, maybe
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 11 '05 #2

P: n/a
In the last exciting episode, dv****@yahoo.com (David Link) wrote:
Why does adding SUM and GROUP BY destroy performance?


When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.

On some systems there are statistics mechanisms that can short-circuit
that. On PostgreSQL, the use of MVCC to let new data "almost
magically appear" :-) has the demerit, in the case of aggregates, of
not leaving much opening for short cuts.

There are some cases where you CAN do much better than the aggregates
do.

SELECT MAX(FIELD) FROM TABLE WHERE A='THIS' and B='THAT';

may be replaced with the likely-to-be-faster:

select field from table where a = 'THIS' and b='THAT' order by field
desc limit 1;

MIN() admits a similar rewriting. If there is an index on FIELD, this
will likely be _way_ faster than using MIN()/MAX().

In a sense, it's not that aggregates "destroy" performance; just that
there are no magical shortcuts to make them incredibly fast.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://www.ntlug.org/~cbbrowne/multiplexor.html
"And 1.1.81 is officially BugFree(tm), so if you receive any bug
reports on it, you know they are just evil lies." -- Linus Torvalds
Nov 11 '05 #3

P: n/a
Christopher Browne wrote:
In the last exciting episode, dv****@yahoo.com (David Link) wrote:
Why does adding SUM and GROUP BY destroy performance?

When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.


Er... not in this case, if I read David's email correctly.

His first query is walking through every tuple anyway.
His second query is the one summing them up, AFTER, here's the critical
part, GROUPing them by t.tid.

I suspect 7.4 (now in beta), or rewriting the query for <7.4 would speed
thing up. 7.4's Hash Aggregate would be the winner here.

As for rewriting this, David, try:

SELECT t.tid, t.title,
(select the stuff you want from lots of tables where something = t.tid)
FROM
title t;

Doubt it'll be as fast as using 7.4 though.

--
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386
GNU/Linux
3:00pm up 267 days, 6:26, 4 users, load average: 5.44, 5.26, 5.17

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE/aq6DNYbTUIgzwfARAqOwAJkBlUboVeUzzfzb46LtyGqNGkeyiQ CfROqY
mTkCLVPwCABnniHh7FSAqS8=
=Cy6c
-----END PGP SIGNATURE-----

Nov 11 '05 #4

P: n/a
Ang Chin Han <an***@bytecraft.com.my> writes:
Christopher Browne wrote:
When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.
Er... not in this case, if I read David's email correctly. His first query is walking through every tuple anyway.


No, it isn't, because he had a LIMIT. I think the real point is that
computing the first fifty groups requires sucking in a lot more tuples
than just computing the first fifty rows.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 11 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.