sign in | join about | help | sitemap
Connecting Tech Pros Worldwide
David Link's Avatar

Why does adding SUM and GROUP BY destroy performance?


Question posted by: David Link (Guest) on November 11th, 2005 11:53 PM
Hi,

Why does adding SUM and GROUP BY destroy performance?
details follow.
Thanks, David Link

s1.sql:
SELECT
t.tid, t.title,
COALESCE(s0c100r100.units, 0) as w0c100r100units,
(COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0))
as r0c2r100units
FROM
title t
JOIN upc u1 ON t.tid = u1.tid
LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
AND s0c100r100.region = 100
LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
AND r1c2r100.region = 100
LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
AND y0c2r100.region = 100
LEFT OUTER JOIN media m ON t.media = m.key
LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
t.distributor != 'CONTROL LABEL'
ORDER BY
t.title ASC
LIMIT 50
;


s2.sql:
SELECT
t.tid, t.title,
SUM(COALESCE(s0c100r100.units, 0)) as w0c100r100units,
SUM((COALESCE(r1c2r100.units, 0) + COALESCE(y0c2r100.units, 0)))
as r0c2r100units
FROM
title t
JOIN upc u1 ON t.tid = u1.tid
LEFT OUTER JOIN sale_200331 s0c100r100 ON u1.upc = s0c100r100.upc
AND s0c100r100.week = 200331 AND s0c100r100.channel = 100
AND s0c100r100.region = 100
LEFT OUTER JOIN rtd r1c2r100 ON u1.upc = r1c2r100.upc
AND r1c2r100.year = 2002 AND r1c2r100.channel = 2
AND r1c2r100.region = 100
LEFT OUTER JOIN ytd_200331 y0c2r100 ON u1.upc = y0c2r100.upc
AND y0c2r100.week = 200331 AND y0c2r100.channel = 2
AND y0c2r100.region = 100
LEFT OUTER JOIN media m ON t.media = m.key
LEFT OUTER JOIN screen_format sf ON t.screen_format = sf.key
WHERE
t.distributor != 'CONTROL LABEL'
GROUP BY
t.tid, t.title
ORDER BY
t.title ASC
LIMIT 50
;


Times:
s1.sql takes 0m0.124s
s2.sql takes 1m1.450s

Stats:
title table: 68,000 rows
sale_200331 table: 150,000 rows
ytd_200331 table: 0 rows
rtd table: 650,000 rows

Indexes are in place.

s1 explain plan:
QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..65105.51 rows=50 width=132)
-> Nested Loop (cost=0.00..91726868.54 rows=70445 width=132)
Join Filter: ("outer".screen_format = "inner"."key")
-> Nested Loop (cost=0.00..91651668.74 rows=70445 width=127)
Join Filter: ("outer".media = "inner"."key")
-> Nested Loop (cost=0.00..91578053.95 rows=70445
width=122)
-> Nested Loop (cost=0.00..91236359.89
rows=70445 width=98)
-> Nested Loop (cost=0.00..90894665.82
rows=70445 width=74)
-> Nested Loop
(cost=0.00..90539626.76 rows=70445 width=50)
-> Index Scan using
title_title_ind on title t (cost=0.00..193051.67 rows=68775 width=38)
Filter: (distributor <>
'CONTROL LABEL'::character varying)
-> Index Scan using
davids_tid_index on upc u1 (cost=0.00..1309.24 rows=353 width=12)
Index Cond: ("outer".tid =
u1.tid)
-> Index Scan using
sale_200331_upc_wk_chl_reg_ind on sale_200331 s0c100r100
(cost=0.00..5.02 rows=1 width=24)
Index Cond: (("outer".upc =
s0c100r100.upc) AND (s0c100r100.week = 200331) AND (s0c100r100.channel
= 100) AND (s0c100r100.region = 100))
-> Index Scan using
rtd_upc_year_chl_reg_ind on rtd r1c2r100 (cost=0.00..4.83 rows=1
width=24)
Index Cond: (("outer".upc =
r1c2r100.upc) AND (r1c2r100."year" = 2002) AND (r1c2r100.channel = 2)
AND (r1c2r100.region = 100))
-> Index Scan using ytd_200331_upc_wkchlreg_ind
on ytd_200331 y0c2r100 (cost=0.00..4.83 rows=1 width=24)
Index Cond: (("outer".upc = y0c2r100.upc)
AND (y0c2r100.week = 200331) AND (y0c2r100.channel = 2) AND
(y0c2r100.region = 100))
-> Seq Scan on media m (cost=0.00..1.02 rows=2
width=5)
-> Seq Scan on screen_format sf (cost=0.00..1.03 rows=3
width=5)
(21 rows)


s2 explain plan:

QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=403996.99..403997.11 rows=50 width=132)
-> Sort (cost=403996.99..404014.60 rows=7044 width=132)
Sort Key: t.title
-> Aggregate (cost=402393.74..403274.30 rows=7044 width=132)
-> Group (cost=402393.74..402922.08 rows=70445
width=132)
-> Sort (cost=402393.74..402569.86 rows=70445
width=132)
Sort Key: t.tid, t.title
-> Hash Join (cost=375382.76..392011.46
rows=70445 width=132)
Hash Cond: ("outer".screen_format =
"inner"."key")
-> Hash Join
(cost=375381.72..390997.78 rows=70445 width=127)
Hash Cond: ("outer".media =
"inner"."key")
-> Merge Join
(cost=375380.70..390057.49 rows=70445 width=122)
Merge Cond: ("outer".upc =
"inner".upc)
Join Filter:
(("inner".week = 200331) AND ("inner".channel = 2) AND ("inner".region
= 100))
-> Merge Join
(cost=375380.70..382782.40 rows=70445 width=98)
Merge Cond:
("outer".upc = "inner".upc)
Join Filter:
(("inner"."year" = 2002) AND ("inner".channel = 2) AND ("inner".region
= 100))
-> Sort
(cost=375310.87..375486.98 rows=70445 width=74)
Sort Key:
u1.upc
-> Nested
Loop (cost=6348.20..367282.53 rows=70445 width=74)
-> Hash
Join (cost=6348.20..12243.46 rows=70445 width=50)

Hash Cond: ("outer".tid = "inner".tid)
->
Seq Scan on upc u1 (cost=0.00..2795.28 rows=70628 width=12)
->
Hash (cost=4114.93..4114.93 rows=68775 width=38)

-> Seq Scan on title t (cost=0.00..4114.93 rows=68775 width=38)

Filter: (distributor <> 'CONTROL LABEL'::character varying)
->
Index Scan using sale_200331_upc_wk_chl_reg_ind on sale_200331
s0c100r100 (cost=0.00..5.02 rows=1 width=24)

Index Cond: (("outer".upc = s0c100r100.upc) AND (s0c100r100.week =
200331) AND (s0c100r100.channel = 100) AND (s0c100r100.region = 100))
-> Sort
(cost=69.83..72.33 rows=1000 width=24)
Sort Key:
r1c2r100.upc
-> Seq Scan
on rtd r1c2r100 (cost=0.00..20.00 rows=1000 width=24)
-> Index Scan using
ytd_200331_upc_wkchlreg_ind on ytd_200331 y0c2r100 (cost=0.00..52.00
rows=1000 width=24)
-> Hash (cost=1.02..1.02
rows=2 width=5)
-> Seq Scan on media m
(cost=0.00..1.02 rows=2 width=5)
-> Hash (cost=1.03..1.03 rows=3
width=5)
-> Seq Scan on screen_format sf
(cost=0.00..1.03 rows=3 width=5)
(36 rows)




__________________________________
Do you Yahoo!?
Yahoo! SiteBuilder - Free, easy-to-use web site design software
http://sitebuilder.yahoo.com

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

4 Answers Posted
Ron Johnson's Avatar
Guest - n/a Posts
#2: Re: Why does adding SUM and GROUP BY destroy performance?

On Wed, 2003-09-17 at 12:51, David Link wrote:[color=blue]
> Hi,
>
> Why does adding SUM and GROUP BY destroy performance?
> details follow.
> Thanks, David Link[/color]

PostgreSQL's generalized UDFs makes optimizing for the standard
aggregates very hard to do. There should be some improvement in
v7.4, though.

--
-----------------------------------------------------------------
Ron Johnson, Jr. Join Bytes!
Jefferson, LA USA

"Those who would give up essential Liberty to purchase a little
temporary safety, deserve neither Liberty nor safety." or
something like that
Ben Franklin, maybe


---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to Join Bytes!)

Christopher Browne's Avatar
Christopher Browne November 11th, 2005 11:54 PM
Guest - n/a Posts
#3: Re: Why does adding SUM and GROUP BY destroy performance?

In the last exciting episode, Join Bytes! (David Link) wrote:[color=blue]
> Why does adding SUM and GROUP BY destroy performance?[/color]

When you use SUM (or other aggregates), there are no short cuts to
walking through each and every tuple specified by the WHERE clause.

On some systems there are statistics mechanisms that can short-circuit
that. On PostgreSQL, the use of MVCC to let new data "almost
magically appear" :-) has the demerit, in the case of aggregates, of
not leaving much opening for short cuts.

There are some cases where you CAN do much better than the aggregates
do.

SELECT MAX(FIELD) FROM TABLE WHERE A='THIS' and B='THAT';

may be replaced with the likely-to-be-faster:

select field from table where a = 'THIS' and b='THAT' order by field
desc limit 1;

MIN() admits a similar rewriting. If there is an index on FIELD, this
will likely be _way_ faster than using MIN()/MAX().

In a sense, it's not that aggregates "destroy" performance; just that
there are no magical shortcuts to make them incredibly fast.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','acm.org').
http://www.ntlug.org/~cbbrowne/multiplexor.html
"And 1.1.81 is officially BugFree(tm), so if you receive any bug
reports on it, you know they are just evil lies." -- Linus Torvalds
Ang Chin Han's Avatar
Guest - n/a Posts
#4: Re: Why does adding SUM and GROUP BY destroy performance?

Christopher Browne wrote:[color=blue]
> In the last exciting episode, Join Bytes! (David Link) wrote:
>[color=green]
>>Why does adding SUM and GROUP BY destroy performance?[/color]
>
>
> When you use SUM (or other aggregates), there are no short cuts to
> walking through each and every tuple specified by the WHERE clause.[/color]

Er... not in this case, if I read David's email correctly.

His first query is walking through every tuple anyway.
His second query is the one summing them up, AFTER, here's the critical
part, GROUPing them by t.tid.

I suspect 7.4 (now in beta), or rewriting the query for <7.4 would speed
thing up. 7.4's Hash Aggregate would be the winner here.

As for rewriting this, David, try:

SELECT t.tid, t.title,
(select the stuff you want from lots of tables where something = t.tid)
FROM
title t;

Doubt it'll be as fast as using 7.4 though.

--
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386
GNU/Linux
3:00pm up 267 days, 6:26, 4 users, load average: 5.44, 5.26, 5.17

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE/aq6DNYbTUIgzwfARAqOwAJkBlUboVeUzzfzb46LtyGqNGkeyiQ CfROqY
mTkCLVPwCABnniHh7FSAqS8=
=Cy6c
-----END PGP SIGNATURE-----

Tom Lane's Avatar
Guest - n/a Posts
#5: Re: Why does adding SUM and GROUP BY destroy performance?

Ang Chin Han <angch@bytecraft.com.my> writes:[color=blue]
> Christopher Browne wrote:[color=green]
>> When you use SUM (or other aggregates), there are no short cuts to
>> walking through each and every tuple specified by the WHERE clause.[/color][/color]
[color=blue]
> Er... not in this case, if I read David's email correctly.[/color]
[color=blue]
> His first query is walking through every tuple anyway.[/color]

No, it isn't, because he had a LIMIT. I think the real point is that
computing the first fifty groups requires sucking in a lot more tuples
than just computing the first fifty rows.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

 
Not the answer you were looking for? Post your question . . .
196,839 members ready to help you find a solution.
Join Bytes.com

What is Bytes?

We are a network of experts and professionals in IT and software development that help one another with answers to tough questions and share insights. Get the best answers to your questions from over 196,839 network members.
Post your question now . . .
It's fast and it's free

Popular Articles

Top Community Contributors