Speeding up LIKE with placeholders?

Dan Sugalski

Is there any good way to speed up SQL that uses like and has placeholders?

Here's the scoop. I've got a system that uses a lot of pre-generated
SQL with placeholders in it. At runtime these SQL statements are
fired off (through the C PQexecParams function, if that matters) for
execution. No prepares or anything, just bare statements with $1 and
friends, with the values passed in as parameters. Straightforward,
and no big deal.

Unfortunately, performance is horrible. And when I mean horrible,
we're talking 6 orders of magnitude (101355.884 ms vs 0.267 ms) when
checked out via EXPLAIN ANALYZE. The slow version has the SQL defined
as a function with the parameters passed in, while the fast way has
the parameters substituted in, and the query plan for the slow
version notes that it's doing a sequential scan, while the fast
version uses one of the indexes. (And the field being LIKEd has a
b-tree index on it) The LIKE condition always has a constant prefix
-- it's 'S%' or 'S42343%' -- so it fits the index.

Now, I'd not be surprised for a generic function to do this, if the
plan is created when the function is created, and I can deal with
that. I'd figure, though, that since the parameters are being passed
into PQexecParams basically to get them out of band so I don't have
to deal with escaping, quoting, and suchlike things, that the
optimizer would look at things *after* the substitution was done.

Is there anything I can do to speed this up, short of doing the
parameter substitution myself and skipping PQexecParams here? (Which
I'd rather not, since it's a pain and somewhat error-prone (for me,
at least))
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
da*@sidhe.org have teddy bears and even
teddy bears get drunk

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #1

Subscribe Post Reply

2407

Tom Lane

Dan Sugalski <da*@sidhe.org> writes:

I'd figure, though, that since the parameters are being passed
into PQexecParams basically to get them out of band so I don't have
to deal with escaping, quoting, and suchlike things, that the
optimizer would look at things *after* the substitution was done.

You'd figure wrong :-(. The present mechanism for the LIKE-to-index
optimization requires the LIKE pattern to be a pure, unadulterated constant.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 23 '05 #2

Dan Sugalski

At 5:19 PM -0400 9/10/04, Tom Lane wrote:

Dan Sugalski <da*@sidhe.org> writes:
I'd figure, though, that since the parameters are being passed
into PQexecParams basically to get them out of band so I don't have
to deal with escaping, quoting, and suchlike things, that the
optimizer would look at things *after* the substitution was done.

You'd figure wrong :-(. The present mechanism for the LIKE-to-index
optimization requires the LIKE pattern to be a pure, unadulterated constant.

Well. Darn.

Would I regret it if I asked where in the source this lies so I could
go fix it?
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
da*@sidhe.org have teddy bears and even
teddy bears get drunk

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #3

Tom Lane

Dan Sugalski <da*@sidhe.org> writes:

Would I regret it if I asked where in the source this lies so I could
go fix it?

If it were easy to fix it would have been fixed before now ...

I have toyed with the notion of converting "var LIKE pattern" to
"var LIKE pattern AND var >= lowbound(pattern) AND var < highbound(pattern)"
where lowbound() and highbound() are actual functions that we leave in
the generated plan, rather than insisting that the planner derive these
bounds before making the plan at all. Then the pattern wouldn't have
to be a true constant. However, it falls down on this problem: what
shall those functions do if the supplied pattern isn't left-anchored at
all? highbound in particular doesn't have a valid result it can give
that's guaranteed larger than all possible values of var. Not to
mention that a full-table index scan is the very last thing you want ---
I think the planner would really be abdicating its responsibilities to
generate a plan with that kind of downside risk.

You could possibly sidestep this argument by envisioning a query like
var LIKE ('^' || $1)
but I doubt that anyone actually writes such things. In the end, LIKE
is the sort of thing that you really have to run a planning cycle for
in order to get a reasonable plan.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #4

Dan Sugalski

At 5:55 PM -0400 9/10/04, Tom Lane wrote:

Dan Sugalski <da*@sidhe.org> writes:
Would I regret it if I asked where in the source this lies so I could
go fix it?

If it were easy to fix it would have been fixed before now ...

Oh, I wasn't expecting it to be an *easy* fix... :) The question is
whether it's less work to make the fix in Postgres or in my back-end
library code. Worst case I fix it in my code and submit a doc patch,
but I'm up for at least investigating the general fix.

Since the only difference in this case is that the parameters are
pulled out for transport rather than being in band (a
properly-escaped string substitution could turn this case from a
PQexecParams call into a PQexec call) I was thinking the thing to do
would be to either teach the planner to look in the parameter list
when it gets handed $xxx variables, or have the back-end do the
substitution to the SQL before handing the code to the planner.

I'm not sure I like either option (the pre-substitution's got some
issues when handed large parameters, while teaching the planner to
use a parameter list may require thumping a lot of back-end code) and
it may amount to nothing, but I figured it was worth a look, if for
no other reason than to find a big mass of code I can safely run
screaming from. ;-)
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
da*@sidhe.org have teddy bears and even
teddy bears get drunk

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #5

Tom Lane

Dan Sugalski <da*@sidhe.org> writes:

Since the only difference in this case is that the parameters are
pulled out for transport rather than being in band (a
properly-escaped string substitution could turn this case from a
PQexecParams call into a PQexec call) I was thinking the thing to do
would be to either teach the planner to look in the parameter list
when it gets handed $xxx variables, or have the back-end do the
substitution to the SQL before handing the code to the planner.

This has already been considered and rejected. Oliver Jowett did the
part that is safe, which is to use the parameter values for estimation
purposes in other contexts, but pre-substituting a parameter value for
LIKE calls the mere correctness of the plan into question.

What it would take to make it workable is a change in the semantics of
the v3 protocol messages, such that there is no re-use of a plan. That,
no one is up for quite yet, when we just hacked the protocol last year ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #6

Dan Sugalski

At 6:33 PM -0400 9/10/04, Tom Lane wrote:

Dan Sugalski <da*@sidhe.org> writes:
Since the only difference in this case is that the parameters are
pulled out for transport rather than being in band (a
properly-escaped string substitution could turn this case from a
PQexecParams call into a PQexec call) I was thinking the thing to do
would be to either teach the planner to look in the parameter list
when it gets handed $xxx variables, or have the back-end do the
substitution to the SQL before handing the code to the planner.
This has already been considered and rejected. Oliver Jowett did the
part that is safe, which is to use the parameter values for estimation
purposes in other contexts, but pre-substituting a parameter value for
LIKE calls the mere correctness of the plan into question.

Ouch. Okay, fair 'nuff. (I figured the parameters could be factored
in before the plan was made. Wrongly, I see, now that I poke around
in the code a bit :) Plan B for me it is.
What it would take to make it workable is a change in the semantics of
the v3 protocol messages, such that there is no re-use of a plan. That,
no one is up for quite yet, when we just hacked the protocol last year ...

It might be possible with a backwards-compatible protocol change, but
that's more work than I'm up for, and this is the wrong list for it
anyway.
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
da*@sidhe.org have teddy bears and even
teddy bears get drunk

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #7

Greg Stark

Tom Lane <tg*@sss.pgh.pa.us> writes:

I think the planner would really be abdicating its responsibilities to
generate a plan with that kind of downside risk.
Sure, but what about the risk of using a sequential scan the other 99% of the
time? The downside risk of the index scan is a 5x slowdown or so. The downside
risk of the sequential scan is unbounded.
You could possibly sidestep this argument by envisioning a query like
var LIKE ('^' || $1)
but I doubt that anyone actually writes such things. In the end, LIKE
is the sort of thing that you really have to run a planning cycle for
in order to get a reasonable plan.

Actually ^ doesn't mean anything to LIKE. There's no way to anchor a LIKE
pattern except by ensuring it doesn't start with % or _.

I don't know. I wrote code that did "LIKE ?||'%'" on Oracle tons of times and
it always used an index scan. I was really impressed when I first checked
whether that worked and really happy when it did. And it always ran just fine.

In retrospect I would have done something like "LIKE escape(?)||'%'". Except
there's no such function. And if I had to write it myself I would do it in the
application. String manipulation in SQL always being such a pain. And in any
case I would have to check for an empty argument and handle that with some
friendly UI message, which can't be done with a simple function in the query.
So the database would be none the wiser and I still would have been
disappointed if it didn't use the index scan.

In the end it's always possible to fool the planner into producing a bad plan.
It's just got to pick the plan that's most likely to be the one the user
intended and least dangerous. It's hard to picture someone intentionally doing
?||'%' without thinking it would use an index scan. If they didn't check for
leading %s and _s or empty parameters then it was their oversight or they were
expecting it to be slow.

--
greg
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #8

Martijn van Oosterhout

On Sat, Sep 11, 2004 at 02:30:35AM -0400, Greg Stark wrote:

I don't know. I wrote code that did "LIKE ?||'%'" on Oracle tons of timesand
it always used an index scan. I was really impressed when I first checked
whether that worked and really happy when it did. And it always ran just fine.
Note, this would have worked fine in previous versions of postgresql,
because there was no such thing as prepared statements so the plan
needed to be recreated for each query, thus it could take advantage of
all knowledge available in the query itself.

Kinda ironic that a seemingly very nice feature (seperating query from
arguments) has such a side-effect (less information for planner).
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFBQsPvY5Twig3Ge+YRAvbhAJ9GJS5szbEaiZEfzasA7p tMoSaoFgCggkW2
AIEwYmv6OsF+XHBoaaUeYd4=
=NvJz
-----END PGP SIGNATURE-----

Nov 23 '05 #9

Pierre-Frédéric Caillaud

If I understand correctly your problem is that the plan for your prepared
query is bad because the LIKE parameter is not known at prepare time...

Immediate solutions :

Is it possible to use EXECUTE in your SQL to sidestep this and replan the
query with the right values ?
Can you put your query in a function so that it's planned on first
execution with real parameter values ?

There is still the old :
SELECT ... WHERE field BETWEEN 'prefix' AND 'prefiy'; but it kinda sucks
;)

---------------------------
Maybe this just indicates a lack of user-friendliness in the API ?
Suggestions for improvement (in random order):
Don't flame me for suggesting impossible things... I'm just wishing.

* Maybe a new operator like <column STARTSWITH 'prefix'> could replace
<column LIKE 'prefix%'> and tell the planner with absolute certainty that,
whatever the value is, use of an index is possible ? It's possible to
create user defined operators in Postgres... so...

* In a future version, when planning prepared queries, maybe the planner
could flag a query as "replan according to real parameter values every
time it's executed" when it detects a hard to plan condition like "LIKE
$1" or other hard to predict cases ? In this case the cost of planning
would occur at each execution, but the query parsing time would still be
saved. Or maybe the planner could write itself a note saying : "I planned
the query for the generic LIKE, but if $1 does not start with a %, I
should replan it when it's executed"

* Maybe there should be a keyword like "REPLAN" to allow the user to
specify that the query should be replanned every time ?

* Maybe there should be a keyword like "DEFER PLAN" to allow the user to
specify that the query should be not be planned when PREPARE is called,
but rather on its first execution, with real parameters, and the plan then
stored and used for subsequent queries ?
This has the disadvantage of depending on the next query to specify
adequate parameters.

Maybe we could tell the planner which parameters to use for preparing the
plan :
PREPARE myquery( text, integer )
WITH PARAMETERS ('aa%', 5)
AS SELECT * FROM mytable WHERE ...

so the plan stored for this prepared query would be generated using the
specified parameter values instead of just placeholders (so the planner
may generate a better plan, in this case it would notice the prefix nature
of the LIKE 'aa%').

* Maybe we could give the planner hints about the placeholders in a
prepared query. Something like

PREPARE myquery( text, integer )
AS SELECT * FROM mytable WHERE ...

would become :

PREPARE myquery( text, integer )
WHERE expression
AS SELECT * FROM mytable WHERE ...

where expression would be any boolean expression giving hints to the
planner, like :
$2 NOT NULL
in this case the planner would know that it should take into account only
the stats for $2 not null,10 ; and ignore the non-indexed NULLs for
example. However this is redundant as the "$2 NOT NULL" can as well be put
in the WHERE part of the query.

This could be used in the 'LIKE $1' case by specifying that $1 cannot
start with a '%'. The planner would have to be pretty smart to take this
into account...
When the expression is false, the query would be re-planned instead of
using the stored plan.

* Maybe we could add a clause to this poster's SELECT :

PREPARE myquery AS SELECT * FROM mytable
WHERE myfield LIKE $1 AND NOT ($1 LIKE '%%%');

In this case the planner would have to recognize the "NOT ($1 LIKE
'%%%')" part and take appropriate measures... When executing the query
this would not add any cost as it is a constant. However if the user
messes up and sends $1='%something', he'll get no results.

Anyway,
have a nice day !

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #10

Dan Sugalski

At 12:30 PM +0200 9/11/04, Pierre-Frédéric Caillaud wrote:

If I understand correctly your problem is
that the plan for your prepared query is bad
because the LIKE parameter is not known at
prepare time...

Almost. My problem is that queries made with the
PQexecParams C call are automatically split into
a prepare and execute pair. So while as far as my
code's concerned everything goes over to the
server in one big wad, the same as if I'd sent it
over with a PQexec call, the server sees it in
several pieces.

The only way to fix this is to change the way
that PQexecParams is handled, which from what I
can see of the postgres code would require a
protocol change and corresponding extension of
the server back end, or a change in the way
prepared queries are executed by the server,
neither of which is a particularly simple thing
to do. (Which I didn't realize initially)
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
da*@sidhe.org have teddy bears and even
teddy bears get drunk

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #11

Greg Stark

Martijn van Oosterhout <kl*****@svana.org> writes:

Kinda ironic that a seemingly very nice feature (seperating query from
arguments) has such a side-effect (less information for planner).

It was an known and expected downside.

But there is an upside too. For me it's actually a benefit, because it means
I'm less likely to called in the middle of the night because the web site when
someone entered some unusual parameters and caused a bad plan.

Whatever plan it's using when I test it will be the same as the plan it uses
for others, which may not be ideal but it won't be a surprise and I can
anticipate whether it'll be a reasonable plan in the future.

I can't anticipate all the potential plans the database might come up with for
all the possible parameters.

--
greg
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #12

Speeding up LIKE with placeholders?

Similar topics