Scalability Code question - PHP vs MySQL

rich

I'm having a tough time figuring out which of these two options is
best. This is a matter of processing my data in PHP vs. MySQL.
Usually that's a no-brainer, but I have a couple of gotchas here and
would love any and all opinions. I'm going to make this as short
and simple as I can...

This is for an e-commerce site with very high traffic, and the choice
will probably not be based on speed, but which is more scalable. I
need this to last. So here's my test code. You may not know all
these functions, but I think they're very straightforward:

// 2 ways of doing this.. 1 query or more?
$start = microtime(true);
// searchTemp is a large table of denormalized data
$productSql = "SELECT * FROM $searchTemp $productWhere $sort";
// this just returns a multidimensional array of the results
$searchResults = $my->returnTableAssoc($productSql, $selectFromSlave);

// this is an array_unique for a multidimensional array and will
// essentially be like group_by productid
$products = remove_dups($searchResults, 'productid');
// get the other columns of data needed
$brands = array();
$cats = array();
$colors = array();
$years = array();
$bootWidth = array();
$flex = array();
foreach ($searchResults as $sr)
{
    $brands[] = $sr['manufacturer'];
    $cats[] = $sr['categoryid'];
    $colors[] = $sr['colorcode'];
    $years[] = $sr['modelYear'];
    $bootWidth[] = $sr['bootWidth'];
    $flex[] = $sr['flexRating'];
}
$brands = array_unique($brands);
$cats = array_unique($cats);
$colors = array_unique($colors);
$years = array_unique($years);
$bootWidth = array_unique($bootWidth);
$flex = array_unique($flex);
$end = microtime(true);
echo "Did first in " . ($end - $start) . " seconds ";

// try again - just do a bunch of queries and let mysql do all the work
// reset the timer; otherwise the second figure also includes the first run
$start = microtime(true);
$productSql = "SELECT * FROM $searchTemp $productWhere GROUP BY productid $sort";
$products = $my->returnTableAssoc($productSql, $selectFromSlave);
$productSql = "SELECT distinct manufacturer FROM $searchTemp $productWhere";
$brands = $my->returnArray($productSql, $selectFromSlave);
$productSql = "SELECT distinct categoryid FROM $searchTemp $productWhere";
$cats = $my->returnArray($productSql, $selectFromSlave);
$productSql = "SELECT distinct colorcode FROM $searchTemp $productWhere";
$colors = $my->returnArray($productSql, $selectFromSlave);
$productSql = "SELECT distinct modelYear FROM $searchTemp $productWhere";
$years = $my->returnArray($productSql, $selectFromSlave);
$productSql = "SELECT distinct bootWidth FROM $searchTemp $productWhere";
$bootWidth = $my->returnArray($productSql, $selectFromSlave);
$productSql = "SELECT distinct flexRating FROM $searchTemp $productWhere";
$flex = $my->returnArray($productSql, $selectFromSlave);
$end = microtime(true);
echo "Did second in " . ($end - $start) . " seconds ";

So, on my development server, #1 runs in 0.9 seconds and #2 runs in
3.7 seconds. However, in my live production environment with 2
web servers and 2 database servers, they each run in approx 1.1
seconds. It's essentially a tie.

Another thing to keep in mind is whichever option I choose, I'll be
using memcache to speed things along also.

So, in short, both run at the same speed, but which one is more
scalable?
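
For anyone reading along: remove_dups() is not shown in the listing above. A minimal sketch of what such a helper might look like, assuming it keeps the first row seen for each distinct value of the given column (this body is a guess for illustration, not the site's actual code, and the sample rows are made up):

```php
<?php
// Hypothetical sketch of remove_dups(): keep the first row seen for each
// distinct value of $key, preserving the original row order.
function remove_dups(array $rows, $key)
{
    $seen = array();
    $unique = array();
    foreach ($rows as $row) {
        $value = $row[$key];
        if (!isset($seen[$value])) {
            $seen[$value] = true; // array keys act as a hash set
            $unique[] = $row;
        }
    }
    return $unique;
}

// Made-up sample rows: product 7 appears twice (two colors).
$rows = array(
    array('productid' => 7, 'colorcode' => 'red'),
    array('productid' => 7, 'colorcode' => 'blue'),
    array('productid' => 9, 'colorcode' => 'red'),
);
$products = remove_dups($rows, 'productid');
echo count($products), "\n"; // prints 2
```

Written this way the helper makes one pass over the rows, so its cost grows linearly with the size of the result set.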

Thanks.

Aug 1 '08 #1

Dale

"rich" <rb*****@gmail.com> wrote in message
news:df**********************************@b1g2000hsg.googlegroups.com...

> I'm having a tough time figuring out which of these two options is
> best. This is a matter of processing my data in PHP vs. MySQL.
> Usually that's a no-brainer, but I have a couple of gotchas here and
> would love any and all opinions. I'm going to make this as short
> and simple as I can...

<snip>

> So, in short, both run at the same speed, but which one is more
> scalable?

you've got to be kidding, right? if not, you're overlooking a lot of obvious
things. as much as possible, let a db do what it was designed to do. you
know very well your php scenario won't fly and has no chance of scaling!
right?

Aug 1 '08 #2

rich

On Aug 1, 10:58 am, "Dale" <the....@example.com> wrote:

> you've got to be kidding, right? if not, you're overlooking a lot of obvious
> things. as much as possible, let a db do what it was designed to do. you
> know very well your php scenario won't fly and has no chance of scaling!
> right?

Well yeah, I know to let MySQL do the work. But I think this requires
more thought than a textbook answer. This is probably (from a
performance standpoint) the most important spot on the whole site and
I want to make sure this is done right. You could argue that from a
query queue point of view, running the 1 query is way faster. Really,
I guess the ONLY question here is who gets to figure out the distinct
values.

Also, I think memcache would be more useful in the PHP scenario. I can
just store the main query's results and sort it out from there.

Don't get me wrong though... I'm leaning toward the MySQL way, I just
think this is important enough to get more opinions on before I charge
ahead.

Aug 1 '08 #3

rich

On Aug 1, 11:31 am, rich <rbro...@gmail.com> wrote:

> Don't get me wrong though... I'm leaning toward the MySQL way, I just
> think this is important enough to get more opinions on before I charge
> ahead.

Couple more things if anyone else wants to chime in here...

I ran the test again just now, and I got:

Did first in 1.1379570960999 seconds
Did second in 5.2290420532227 seconds

When I said before the times were tied, I think I was being dumb and
the queries were cached. I'm pretty sure there's no way I can count
on these being cached live because of all the possible combinations.
So again... option 1 uses far less MySQL time, AND if I stored that
main query in memcache, it'd be even faster. BUT that sounds like a
lot of webserver memory usage. Not great. Ugh.
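
Roughly what I have in mind for the memcache idea, as a sketch only: cache the main query's rows under a key built from the normalized filter set, so the same filters in a different order share one entry. ArrayCache is a made-up stand-in so this runs without a memcached server (PHP's Memcached class has get() and set() methods as well), and cache_key(), cached_search() and the $runQuery callback are hypothetical names, not anything from the real site.

```php
<?php
// Stand-in for a memcached client so the sketch runs anywhere (hypothetical).
class ArrayCache
{
    private $items = array();

    public function get($key)
    {
        // Return false on a miss, like a memcached client would.
        return isset($this->items[$key]) ? $this->items[$key] : false;
    }

    public function set($key, $value)
    {
        $this->items[$key] = $value;
        return true;
    }
}

// Build one key per distinct filter set, regardless of filter order.
function cache_key(array $filters)
{
    ksort($filters);
    return 'search:' . md5(serialize($filters));
}

// Cache-aside: return cached rows if present, otherwise run the query once
// and store its rows for the next request.
function cached_search($cache, array $filters, $runQuery)
{
    $key = cache_key($filters);
    $rows = $cache->get($key);
    if ($rows === false) {
        $rows = call_user_func($runQuery, $filters);
        $cache->set($key, $rows);
    }
    return $rows;
}

$cache = new ArrayCache();
$queryCount = 0;
// Fake query runner that counts how often the "database" is hit.
$runQuery = function (array $filters) use (&$queryCount) {
    $queryCount++;
    return array(array('productid' => 1, 'colorcode' => $filters['colorcode']));
};

$first = cached_search($cache, array('categoryid' => 5, 'colorcode' => 'red'), $runQuery);
$second = cached_search($cache, array('colorcode' => 'red', 'categoryid' => 5), $runQuery);
echo $queryCount, "\n"; // prints 1: the second call was served from the cache
```

Caching only the columns the page actually needs, rather than every column from SELECT *, would presumably keep each cached entry smaller.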

Aug 1 '08 #4

Dale

"rich" <rb*****@gmail.com> wrote in message
news:bd**********************************@56g2000hsm.googlegroups.com...
> On Aug 1, 11:31 am, rich <rbro...@gmail.com> wrote:
>
>> Don't get me wrong though... I'm leaning toward the MySQL way, I just
>> think this is important enough to get more opinions on before I charge
>> ahead.
>
> Couple more things if anyone else wants to chime in here...
>
> I ran the test again just now, and I got:
>
> Did first in 1.1379570960999 seconds
> Did second in 5.2290420532227 seconds
>
> When I said before the times were tied, I think I was being dumb and
> the queries were cached. I'm pretty sure there's no way I can count
> on these being cached live because of all the possible combinations.

== think again! and, those are pretty simple queries in your example. what
do your criteria look like?

Aug 1 '08 #5

rich

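On Aug 1, 12:00 pm, "Dale" <the....@example.com> wrote:

> == think again! and, those are pretty simple queries in your example. what
> does your criteria look like?

Since the table is just denormalized data, the queries are very
simple. Right now there's only one where clause to just pull product
for the current category. From there, the worst it'll get is give me
anything in red and size 11, etc. No multiple tables or joins here.
But this is why I say I don't think I can rely on the query cache.
For all the attributes we allow the user to filter for, there's
probably millions of combinations across all possible outcomes.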

Aug 1 '08 #6

Jerry Stuckle

rich wrote:

<snip>

Rich,

Let the database do its job.

In general, you will get the best performance with a single SQL call
returning all of the data.

But you indicate the database is denormalized. Although denormalizing a
database can at times improve speed, it cuts the scalability of the
application, and as your database grows, it can actually slow down
performance because duplicate data being returned uses up the caches
much more quickly. So the absolute last thing you should do to improve
performance is to denormalize your database.

But this is getting too much off topic here. You can get more info on
this in comp.databases.mysql, as well as help in tuning your mysql system.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
js*******@attglobal.net
==================

Aug 1 '08 #7

Dale

"rich" <rb*****@gmail.com> wrote in message
news:e3**********************************@m36g2000hse.googlegroups.com...
> On Aug 1, 12:00 pm, "Dale" <the....@example.com> wrote:
>
>> == think again! and, those are pretty simple queries in your example. what
>> does your criteria look like?
>
> Since the table is just denormalized data, the queries are very
> simple. Right now there's only one where clause to just pull product
> for the current category. From there, the worst it'll get is give me
> anything in red and size 11, etc. No multiple tables or joins here.
> But this is why I say I don't think I can rely on the query cache.
> For all the attributes we allow the user to filter for, there's
> probably millions of combinations across all possible outcomes.

== perhaps not for the millions, but i'm sure there will be several that are
recurring. even still, the question is scaling. as i said before, php cannot
cache and cannot index. php also has zero execution plan. further, you can
adjust a db's execution plan based on those recurring patterns of
user-defined criteria! if you're still talking about scaling and you're
still thinking php, you've just opted-out of three very key performance
enhancers.

also, do you need to select * from? why not optimize your query?

// first, you've specified only the columns you really need
// second, you've asked the db to make them as distinct
// as it can get them...

SELECT DISTINCT
    manufacturer AS brand,
    categoryid   AS category,
    colorCode    AS color,
    modelYear    AS modelYear,
    bootWidth    AS width,
    flexRating   AS rating
FROM products
<< search criteria >>

// at this point, there is far less data
// that php has to churn through
// ($sql holds the query above; the keys below match its column aliases)
$manufacturers = array();
$categories = array();
$colors = array();
$modelYears = array();
$bootWidths = array();
$flexRatings = array();
$records = db::execute($sql);
foreach ($records as $record)
{
    $manufacturers[$record['brand']] = $record['brand'];
    $categories[$record['category']] = $record['category'];
    $colors[$record['color']] = $record['color'];
    $modelYears[$record['modelYear']] = $record['modelYear'];
    $bootWidths[$record['width']] = $record['width'];
    $flexRatings[$record['rating']] = $record['rating'];
}

now, what are your test results with these changes?

Aug 1 '08 #8

Dale

"Jerry Stuckle" <js*******@attglobal.net> wrote in message
news:0Y******************************@comcast.com...

> rich wrote:

<snip>

> But you indicate the database is denormalized. Although denormalizing a
> database can at times improve speed, it cuts the scalability of the
> application, and as your database grows, it can actually slow down
> performance because duplicate data being returned uses up the caches much
> more quickly. So the absolute last thing you should do to improve
> performance is to denormalize your database.
>
> But this is getting too much off topic here. You can get more info on
> this in comp.databases.mysql, as well as help in tuning your mysql system.

in case you missed it, jerry-berry, his POV is that he's got a PHP system
and not a mysql one, as you've put it. his post directly deals with PHP, and
it just so happens that mysql and apache often come up in the course of
discussion. get used to it. better yet, IGNORE OT CONTENT. god knows you go
OT at every possible opportunity!

Aug 1 '08 #9

Dale

"Jerry Stuckle" <js*******@attglobal.net> wrote in message
news:0Y******************************@comcast.com...

> rich wrote:

<snip>

> Let the database do its job.
>
> In general, you will get the best performance with a single SQL call
> returning all of the data.
>
> But you indicate the database is denormalized. Although denormalizing a
> database can at times improve speed, it cuts the scalability of the
> application, and as your database grows, it can actually slow down
> performance because duplicate data being returned uses up the caches much
> more quickly. So the absolute last thing you should do to improve
> performance is to denormalize your database.

i will say however, jerry, that i couldn't agree more here. rich should
notice that having 'manufacturers', 'categories', 'colors', and 'flex
ratings' in their own tables would reduce the number of rows to be scanned
by a ton. for the other columns they may be non-standard lookups, it means
that a SELECT DISTINCT over the product table just for those columns would
greatly increase the number of 'duplicates' returned to php. his individual
selects for mfg's, cat's, etc. should be lightning fast at that point too.

it doesn't happen often, but you actually gave good advice here. i'm
shocked! :^)

Aug 1 '08 #10

Dale

"Dale" <th*****@example.com> wrote in message
news:2u*****************@newsfe09.iad...

> "Jerry Stuckle" <js*******@attglobal.net> wrote in message
> news:0Y******************************@comcast.com...
>> rich wrote:
>
> <snip>
>
>> Let the database do its job.
>>
>> In general, you will get the best performance with a single SQL call
>> returning all of the data.
>>
>> But you indicate the database is denormalized. Although denormalizing a
>> database can at times improve speed, it cuts the scalability of the
>> application, and as your database grows, it can actually slow down
>> performance because duplicate data being returned uses up the caches much
>> more quickly. So the absolute last thing you should do to improve
>> performance is to denormalize your database.
>
> i will say however, jerry, that i couldn't agree more here. rich should
> notice that having 'manufacturers', 'categories', 'colors', and 'flex
> ratings' in their own tables would reduce the number of rows to be scanned
> by a ton. for the other columns they may be non-standard lookups, it means
> that a SELECT DISTINCT over the product table just for those columns would
> greatly increase the number of 'duplicates' returned to php

errrr...should read, 'greatly *decrease*'.

:)

Aug 1 '08 #11

petersprc

Creating indexes for frequently-used combinations of search and result
columns will speed up case #2 and would be the most scalable approach.
Otherwise, a full table scan would be needed for each query. You can
use "explain" to check if your indexes are being used. An optional
refinement would be to create a stored proc that returns multiple sets
to avoid additional round trips between the DB and application.

In both cases you're storing all the matching products in an array, so
it looks like you expect this list to be a manageable size anyway and
presumably you'll be iterating over it. If so, and provided there
aren't too many duplicate entries in the DB or you can rework your DB
to eliminate these, approach #1 would be fine. You don't need
array_unique if you store the value in the key, e.g.:
"$brands[$sr['manufacturer']] = true;"

A third way, more common in batch apps, is to create a temp table with
the base results, then query this subset for each distinct field.
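
Going back to the key trick for approach #1, here is a rough sketch of collecting the distinct values of several columns in a single pass. The function name distinct_values() and the sample rows are made up for illustration; the column names come from the original post.

```php
<?php
// Collect the distinct values of several columns in one pass, using array
// keys as a set so that no array_unique() call is needed afterwards.
function distinct_values(array $rows, array $columns)
{
    $sets = array();
    foreach ($columns as $column) {
        $sets[$column] = array();
    }
    foreach ($rows as $row) {
        foreach ($columns as $column) {
            $sets[$column][$row[$column]] = true;
        }
    }
    // array_keys() turns each set back into a plain list of values.
    // Note: numeric values come back as integers because they were array keys.
    return array_map('array_keys', $sets);
}

// Made-up sample rows standing in for $searchResults.
$searchResults = array(
    array('manufacturer' => 'Acme', 'modelYear' => 2008),
    array('manufacturer' => 'Acme', 'modelYear' => 2009),
    array('manufacturer' => 'Bolt', 'modelYear' => 2008),
);
$facets = distinct_values($searchResults, array('manufacturer', 'modelYear'));
print_r($facets['manufacturer']); // lists Acme and Bolt once each
```

Each assignment is a hash insert, so this avoids building six full-length arrays and deduplicating them afterwards.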

Regards,

John Peters

Aug 1 '08 #12
