Bytes IT Community

Urgent Questions regarding Index usage & VIEWs, sorting, etc.

Hello,

I have a few rather urgent questions that I hope someone can help with (I
need to figure this out prior to a meeting tomorrow.) First, a bit of
background: The company I work for is developing a web-based application,
one part of which involves allowing the user the ability to page through
transaction "history" information.

The _summary_ history table will have the following fields: ServiceName,
Date, User-Ref1, User-Ref2, User-Ref3, TransactionID. The "User-Ref" fields
are supplied by the user to help them identify records of interest.

The user will be allowed to choose from several different, pre-defined sort
orders. For example, the default sort order might be: Date, ServiceName,
User-Ref1, User-Ref2, User-Ref3, TransactionID. The user can click on the
headings "Date", "ServiceName", and "User-Ref1", to make those columns,
respectively, the primary sort column. So, if they click on "ServiceName",
then the pre-defined sort order would be: ServiceName, Date, User-Ref1,
User-Ref2, User-Ref3, TransactionID. If they click on "User-Ref1", then the
pre-defined sort order would be: User-Ref1, Date, ServiceName, User-Ref2,
User-Ref3, TransactionID. (That may not be exactly right, but the main
point is that the sort orders the user can choose from are pre-defined,
based on the primary sort column - they can't change the order of the other
columns.)

The other item of importance is that the user needs to be able to page
through the data. We're planning on using an identity column on the table
in support of this - i.e., to pick up from where the last UID left off.

Finally, the user will be allowed to subset the data by date range.
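The identity-column paging scheme described above amounts to keyset (or "seek") pagination: remember the last UID served and resume after it. A minimal sketch, with Python's sqlite3 standing in for DB2 and an invented TRANHIST table (the UID/TRANDATE column names are assumptions; DB2 would use FETCH FIRST n ROWS ONLY where SQLite uses LIMIT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE TRANHIST (
    UID INTEGER PRIMARY KEY,   -- identity-style column used for paging
    TRANDATE TEXT, SERVICENAME TEXT,
    USERREF1 TEXT, USERREF2 TEXT, USERREF3 TEXT, TRANID TEXT)""")
rows = [(i, f"2004-07-{i:02d}", "svc", "a", "b", "c", f"T{i}")
        for i in range(1, 11)]
conn.executemany("INSERT INTO TRANHIST VALUES (?,?,?,?,?,?,?)", rows)

PAGE = 3
def next_page(last_uid, start, end):
    # Resume after the last UID seen, within the requested date range.
    return conn.execute(
        """SELECT UID FROM TRANHIST
           WHERE UID > ? AND TRANDATE BETWEEN ? AND ?
           ORDER BY UID
           LIMIT ?""", (last_uid, start, end, PAGE)).fetchall()

page1 = next_page(0, "2004-07-01", "2004-07-31")
page2 = next_page(page1[-1][0], "2004-07-01", "2004-07-31")
print(page1, page2)
```

Note this only works cleanly when the paging column matches the sort order, which is the catch the rest of the thread turns on.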

Okay, here are my questions:

Regarding VIEWs
------------------

1. One of my questions relates to the use of VIEWs. I was thinking that, in
order to make the stored procedure logic more concise, that we could create
a separate VIEW for each sort order. However, someone said that VIEWs
involve the use of dynamic SQL that must be prepared each time by DB/2. Is
that correct? I would think that, since the query that represents the VIEW
doesn't change, it would be static.

2. The same person also said that VIEWs are used to subset data (i.e., with
a WHERE clause), not for simple ordering. Is that correct? Again, my
thinking was to create a VIEW for each sort order - i.e., a SELECT with a
different ORDER BY for each VIEW.

Basically, I figured that if we used VIEWs for the sorting, then the actual
stored procedure code would be identical for each query (it would just
specify a different VIEW name for each.)

Regarding Indexes
-------------------

3. It was also stated that any indexes on the table would only be used by
the WHERE clause, not by the ORDER BY clause. Is that right? I was
thinking that we would need an index for the subsetting (date range) WHERE
clause, plus indexes for each pre-defined sort order (or at least the most
popular sort orders.) But it was implied that you can't do this.

Regarding Limiting the Result Set Size
--------------------------------------

4. The front-end web application has memory constraints, and so can only
handle a certain maximum number of rows in a result set. I was thinking
that we could use the "FETCH x ROWS ONLY" clause to limit the size of the
result set, and also limit the processing that DB/2 has to perform.
However, I was told that DB/2 would still need to process the entire result
set, and that "further constraints" would be necessary. Is that correct?
In particular, it was stated that if a sort (ORDER BY) is involved, then
DB/2 builds the complete result set after processing all qualifying (WHERE)
records. So, basically, if I have a SELECT with both WHERE and ORDER BY,
will FETCH x ROWS ONLY limit the amount of work that DB/2 has to perform or
not?

Thank you VERY MUCH for any help with these questions! If possible and
convenient, it would be great if you could also e-mail any responses to my
address below (please remove NOSPAM from the address.)

Regards,

Mike
mm********@NOSPAMyahoo.com


Nov 12 '05 #1
8 Replies


Ian
Mike wrote:
Hello,

I have a few rather urgent questions that I hope someone can help with (I
need to figure this out prior to a meeting tomorrow.) First, a bit of
background: The company I work for is developing a web-based application,
one part of which involves allowing the user the ability to page through
transaction "history" information.
You don't say which platform DB2 is running on, so I will assume Linux,
UNIX or Windows.

The other item of importance is that the user needs to be able to page
through the data. We're planning on using an identity column on the table
in support of this - i.e., to pick up from where the last UID left off.
You should probably search in this newsgroup (and in the DB2 UDB SQL
reference) for OLAP functions like rownumber().

This has been talked about extensively in this group with regard to
fetching "pages" of data at a time.


Regarding VIEWs
------------------

1. One of my questions relates to the use of VIEWs. I was thinking that, in
order to make the stored procedure logic more concise, that we could create
a separate VIEW for each sort order. However, someone said that VIEWs
involve the use of dynamic SQL that must be prepared each time by DB/2. Is
that correct? I would think that, since the query that represents the VIEW
doesn't change, it would be static.
Views don't have anything to do with static/dynamic SQL. Static/Dynamic
defines when the access path is chosen -- either at build time (static)
or run-time (dynamic).
2. The same person also said that VIEWs are used to subset data (i.e., with
a WHERE clause), not for simple ordering. Is that correct? Again, my
thinking was to create a VIEW for each sort order - i.e., a SELECT with a
different ORDER BY for each VIEW.
Views represent an unordered set of data (i.e., just like a table). You
can't specify an order-by in the view definition.
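To make the point concrete: a view behaves like an unordered table, and the ordering belongs on whichever query reads from it. A small sketch (SQLite here, not DB2; the table and view names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (SERVICENAME TEXT, AMOUNT INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [("b", 2), ("a", 1), ("c", 3)])

# The view definition carries no ordering of its own...
conn.execute("CREATE VIEW V AS SELECT SERVICENAME, AMOUNT FROM T")

# ...the ORDER BY goes on the query that reads from the view.
ordered = conn.execute("SELECT SERVICENAME FROM V ORDER BY SERVICENAME").fetchall()
print(ordered)
```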
Regarding Indexes
-------------------

3. It was also stated that any indexes on the table would only be used by
the WHERE clause, not by the ORDER BY clause. Is that right? I was
thinking that we would need an index for the subsetting (date range) WHERE
clause, plus indexes for each pre-defined sort order (or at least the most
popular sort orders.) But it was implied that you can't do this.
Indexes may or may not be used for predicates (where clause) or sorting.
Indexes on columns referenced in the predicate are (in my book) much
more important (and more likely to be used).

The SQL compiler will make the decision on which index(es) to use based
on filter factors on the predicates, cardinality of the table(s),
columns in the query, etc.
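One practical consequence: you can see which index (if any) the compiler actually chose by examining the access plan - the EXPLAIN facility on DB2. A rough analogue using SQLite's EXPLAIN QUERY PLAN (the plan text and the IX_DATE/TRANHIST names are SQLite-side assumptions, not DB2 output):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TRANHIST (TRANDATE TEXT, SERVICENAME TEXT)")
conn.execute("CREATE INDEX IX_DATE ON TRANHIST (TRANDATE)")
conn.executemany("INSERT INTO TRANHIST VALUES (?, ?)",
                 [(f"2004-07-{i:02d}", "svc") for i in range(1, 31)])

# Ask the planner how it would run the date-range query.
plan = conn.execute("""EXPLAIN QUERY PLAN
    SELECT * FROM TRANHIST
    WHERE TRANDATE BETWEEN '2004-07-10' AND '2004-07-12'""").fetchall()
print(plan)
```

Here the range predicate on the indexed column makes the planner pick the index; widen the range or drop the index and the plan falls back to a table scan.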
Regarding Limiting the Result Set Size
--------------------------------------

4. The front-end web application has memory constraints, and so can only
handle a certain maximum number of rows in a result set. I was thinking
that we could use the "FETCH x ROWS ONLY" clause to limit the size of the
result set, and also limit the processing that DB/2 has to perform.
However, I was told that DB/2 would still need to process the entire result
set, and that "further constraints" would be necessary. Is that correct?
In particular, it was stated that if a sort (ORDER BY) is involved, then
DB/2 builds the complete result set after processing all qualifying (WHERE)
records. So, basically, if I have a SELECT with both WHERE and ORDER BY,
will FETCH x ROWS ONLY limit the amount of work that DB/2 has to perform or
not?


Unless your application is doing something strange (like trying to store
the entire result set in memory on the client), the memory usage will be
primarily on the database server, not on the client, regardless of the
size of your result set. There are communication buffers between the
server and client that control how much data is transferred / buffered,
but this shouldn't be a big deal (the default is 32kb).
Good luck,

Nov 12 '05 #2


"Mike" <mw*********@NOSPAMyahoo.com> wrote in message
news:yq********************@giganews.com...
Hello,

I have a few rather urgent questions that I hope someone can help with (I
need to figure this out prior to a meeting tomorrow.)
It's probably not a good idea to leave urgent questions to the last minute
and then insist that they are urgent....
First, a bit of
background: The company I work for is developing a web-based application,
one part of which involves allowing the user the ability to page through
transaction "history" information.

The _summary_ history table will have the following fields: ServiceName,
Date, User-Ref1, User-Ref2, User-Ref3, TransactionID. The "User-Ref" fields are supplied by the user to help them identify records of interest.

The user will be allowed to choose from several different, pre-defined sort orders. For example, the default sort order might be: Date, ServiceName,
User-Ref1, User-Ref2, User-Ref3, TransactionID. The user can click on the
headings "Date", "ServiceName", and "User-Ref1", to make those columns,
respectively, the primary sort column. So, if they click on "ServiceName", then the pre-defined sort order would be: ServiceName, Date, User-Ref1,
User-Ref2, User-Ref3, TransactionID. If they click on "User-Ref1", then the pre-defined sort order would be: User-Ref1, Date, ServiceName, User-Ref2,
User-Ref3, TransactionID. (That may not be exactly right, but the main
point is that the sort orders the user can choose from are pre-defined,
based on the primary sort column - they can't change the order of the other columns.)

The other item of importance is that the user needs to be able to page
through the data. We're planning on using an identity column on the table
in support of this - i.e., to pick up from where the last UID left off.

Finally, the user will be allowed to subset the data by date range.
You haven't told us what platform you are on and what version of DB2 you are
using. This makes it rather difficult to answer your question since each
platform and version can be different in some particulars. For example,
ORDER BY is not permitted in a view definition in DB2 V7.2 for Windows,
Linux, and Unix. ORDER BY may or may not be allowed in view definitions for
other platforms and versions.
Okay, here are my questions:

Regarding VIEWs
------------------

1. One of my questions relates to the use of VIEWs. I was thinking that, in order to make the stored procedure logic more concise, that we could create a separate VIEW for each sort order. However, someone said that VIEWs
involve the use of dynamic SQL that must be prepared each time by DB/2. Is that correct? I would think that, since the query that represents the VIEW doesn't change, it would be static.
NOTE: the product we are discussing on this newsgroup is DB2, not DB/2. DB2
is IBM's flagship relational database. If you are actually asking about
DBASE II, which some people abbreviate 'DB/2', you should disregard any
answers you get on this newsgroup because they are almost certainly not
applicable to you.

Your first question is probably moot since I believe that all or at least
most of the recent versions of DB2 don't permit ORDER BY in a view
definition. As noted above though, some platforms and versions may permit
this so you should check CREATE VIEW in the SQL Reference for your
particular platform and version to be sure.
2. The same person also said that VIEWs are used to subset data (i.e., with a WHERE clause), not for simple ordering. Is that correct? Again, my
thinking was to create a VIEW for each sort order - i.e., a SELECT with a
different ORDER BY for each VIEW.
Views are used for two primary reasons: to provide a subset of rows and
columns of a table or a join of tables and to give you more granular control
over security.
Basically, I figured that if we used VIEWs for the sorting, then the actual stored procedure code would be identical for each query (it would just
specify a different VIEW name for each.)
As noted above, views will not help you by providing alternate sort orders
for your data unless you happen to be using a DB2 version and platform that
supports ORDER BY in a view definition. I'm not sure if there is such a
version and platform; I have used DB2 on most platforms at one point or
another but in some cases, like DB2 on AS/400 or z/OS, I haven't used the
current versions.
Regarding Indexes
-------------------

3. It was also stated that any indexes on the table would only be used by
the WHERE clause, not by the ORDER BY clause. Is that right? I was
thinking that we would need an index for the subsetting (date range) WHERE
clause, plus indexes for each pre-defined sort order (or at least the most
popular sort orders.) But it was implied that you can't do this.
It is not true that the WHERE clause is the only clause that can use an
index. ORDER BY can also use an index. (I think GROUP BY can use indexes too
in some cases, although I'm not sure.)

You can create indexes for each sort order if you like but bear in mind that
DB2 never has to use any given index (or any index at all) for an access
path; it is always DB2's choice. It's the old "you can lead a horse to water
but you can't make him drink" thing: you can create any number of indexes
but DB2 can decline to use these indexes. So don't be angry if you create an
index but DB2 doesn't use it.

There are two main reasons for DB2 declining to use an index: the query will
not benefit from the index (in DB2's estimation) or your SQL is poorly
written.
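A small experiment supporting the point that ORDER BY can use an index: with no suitable index the plan contains an explicit sort step, and after creating an index on the sort column the sort disappears. SQLite stands in for DB2 here (the "TEMP B-TREE" plan text is SQLite's, not DB2's), and the table/index names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (C TEXT, D INTEGER)")
conn.executemany("INSERT INTO T VALUES (?, ?)", [("b", 1), ("a", 2), ("c", 3)])

def plan(sql):
    # Collapse the plan rows into one string for easy inspection.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT C FROM T ORDER BY C")   # needs an explicit sort
conn.execute("CREATE INDEX IX_C ON T (C)")
after = plan("SELECT C FROM T ORDER BY C")    # index delivers the order
print(before)
print(after)
```

As Rhino says, though, creating the index only makes this possible; the optimizer still decides whether to use it.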
Regarding Limiting the Result Set Size
--------------------------------------

4. The front-end web application has memory constraints, and so can only
handle a certain maximum number of rows in a result set. I was thinking
that we could use the "FETCH x ROWS ONLY" clause to limit the size of the
result set, and also limit the processing that DB/2 has to perform.
However, I was told that DB/2 would still need to process the entire result set, and that "further constraints" would be necessary. Is that correct?
In particular, it was stated that if a sort (ORDER BY) is involved, then
DB/2 builds the complete result set after processing all qualifying (WHERE) records. So, basically, if I have a SELECT with both WHERE and ORDER BY,
will FETCH x ROWS ONLY limit the amount of work that DB/2 has to perform or not?
I haven't done this sort of programming for a while so my memory on this
point is fuzzy and possibly out-of-date so take what I say with a grain of
salt; maybe someone who is current can confirm or correct what I'm saying on
this point.

Essentially, your friend told you the truth: FETCH x ROWS ONLY does not
limit the size of the result set; it is primarily a hint to the optimizer
that you want the first 'x' rows of the result as quickly as possible. If
possible, DB2 will try to do some "tricks" to get you those rows a little
faster than it otherwise would but there is no guarantee it will be
successful. Also, even if it is successful, if you then go on to read more
rows than the 'x' which you specified, the additional rows may come back
slower than they otherwise would have. FETCH x ROWS ONLY is intended
primarily for an application that will do paging of large result sets and
would mostly be used where the result might be very large while most users
would normally only want 'x' of the rows. If you have an application which
has to page through a lot of rows and you actually expect users to look at
most of those rows most of the time, FETCH x ROWS ONLY probably won't help
you.

If you need to page both forwards and backwards within a result set that is
(potentially) many pages in size, the technique for doing so tends to depend
on the language you are using. For instance, if you use a relatively recent
version of Java and JDBC, you can use scrollable cursors to move backwards
and forwards fairly easily. I just did a very superficial look through the
DB2 V7.2 Application Development manual trying to find a detailed
explanation of paging techniques for each language and didn't find one but I
know the IBM DB2 courses used to explain this technique so there may be
another manual somewhere that explains this. If you can't find anything like
that, post here again and perhaps someone can give you an example that
illustrates the technique for your language and platform.

Thank you VERY MUCH for any help with these questions! If possible and
convenient, it would be great if you could also e-mail any responses to my
address below (please remove NOSPAM from the address.)

Posted and emailed.

Rhino
Nov 12 '05 #3

Just a follow-up to answer some questions others have raised:

1. The DB2 (sorry, not "DB/2") platform in question is v7 (soon v8), running
on z/OS.
2. The front-end will be written in Java. The stored procedures will
probably be written in COBOL.
3. Regarding the use of cursors for paging: AFAIK, this would require
keeping one connection allocated to each user. We have to plan for high
concurrency, and so this would not be acceptable. (Yes, we will of course
pool the database connections.)
4. Regarding result set size: Two objectives here: 1) Minimize the number
of rows that gets sent to the front-end in any single query (so that we
don't tax memory constraints on the front-end), and 2) Minimize the number
of rows that DB2 has to process (if the front-end can only handle, say, 100
rows, then we would like for DB2 to stop processing the result set after 100
rows - if possible).
5. Regarding Rhino's comments about FETCH FIRST x ROWS ONLY being a 'hint' -
actually, that sounds like you are talking about the OPTIMIZE FOR x
ROWS clause. According to the SQL Reference manual, FETCH FIRST x ROWS ONLY
can save work for DB2 - but I'm not sure under what circumstances, and if
those apply in my case.
6. Regarding "waiting to the last minute" - I haven't done that. I can't
help that this was discussed at a meeting just today! My intention is to be
well-prepared for tomorrow's meeting.

Let me know if there are any other questions. Your comments are
appreciated!

Regards,

Mike
Nov 12 '05 #4

Whether an index is used for an ORDER BY depends on the query and the table.
When there are fields in your SELECT that are not part of the index,
DB2 might decide a table scan is better so that it can do sequential
prefetch. Especially if the index sequence is not the same as the
physical row sequence, retrieving rows via the index might be much more
expensive than retrieving them sequentially. It also depends on the size
of your table, of course.

Mike wrote:
Regarding Indexes
-------------------

3. It was also stated that any indexes on the table would only be used by
the WHERE clause, not by the ORDER BY clause. Is that right? I was
thinking that we would need an index for the subsetting (date range) WHERE
clause, plus indexes for each pre-defined sort order (or at least the most
popular sort orders.) But it was implied that you can't do this.
Regards,

Mike
mm********@NOSPAMyahoo.com


--
Anton Versteeg
IBM Certified DB2 Specialist
IBM Netherlands

Nov 12 '05 #5

"Mike" <mw*********@NOSPAMyahoo.com> wrote in message news:<c7********************@giganews.com>...
3. Regarding the use of cursors for paging: AFAIK, this would require
keeping one connection allocated to each user. We have to plan for high
concurrency, and so this would not be acceptable. (Yes, we will of course
pool the database connections.)

4. Regarding result set size: Two objectives here: 1) Minimize the number
of rows that gets sent to the front-end in any single query (so that we
don't tax memory constraints on the front-end), and 2) Minimize the number
of rows that DB2 has to process (if the front-end can only handle, say, 100
rows, then we would like for DB2 to stop processing the result set after 100
rows - if possible).


Your options are pretty restricted if you won't dedicate a connection
to each paging query: either you stick with a single sort order, or you
cut separate tables each period (day, hour, whatever), each with its
sort orders and paging ids pre-calculated.

However, if you do keep the single connection for this activity, you
do have other options than a cursor: you can also toss the result set
into a temporary table and use an identity or row_number to populate a
field used for paging. Whether or not that works better than using
row_number() on the original table would depending on factors such as
whether or not you can use an index or have to do a page scan of the
original table, how many next page queries are likely to occur, etc.
Nov 12 '05 #6

Thanks for the info. Just to add a bit more info, here is what I'm thinking
the query would look like - note that my original idea was to use a VIEW for
each different ORDER BY clause, but since that doesn't appear to be
possible, this is how I believe it will look:
Page *Forward* Through Transaction History Summary - Sort Seq #1
-----------------------------------------------------------------------
SELECT * FROM TRANHIST
WHERE UID > :startUID AND
DATE >= :startDate AND
DATE <= :endDate
ORDER BY UID ASC, DATE DESC, SERVICENAME ASC, USERREF1 ASC,
USERREF2 ASC, USERREF3 ASC, TRANID ASC
FETCH FIRST 200 ROWS ONLY;
Page *Backward* Through Transaction History Summary - Sort Seq #1
-------------------------------------------------------------------------
SELECT * FROM TRANHIST
WHERE UID < :startUID AND
DATE >= :startDate AND
DATE <= :endDate
ORDER BY UID DESC, DATE DESC, SERVICENAME ASC, USERREF1 ASC,
USERREF2 ASC, USERREF3 ASC, TRANID ASC
FETCH FIRST 200 ROWS ONLY;
There would then be similar queries for paging forward & backward through
the history, but with different ORDER BY clauses. There would be about 3 or
4 basic sort orders (such as SERVICENAME first, or USERREF1 first), and for
each one of these the user will have the ability to select ASCENDING or
DESCENDING sort order on the *primary* sort column (not the subordinate
columns; those will be pre-defined.) Note that the WHERE clause only
includes the DATE and UID fields.

Anyway, as I mentioned before, one key question is whether the FETCH FIRST x
ROWS ONLY will save DB2 from doing extraneous work or not - i.e., will it
"short-circuit" the query at 200 rows, or will it have to process the entire
result set first? And if it won't save any extra work, is there anything
else I can do to help it?

Again, due to volume & concurrency, we want to save memory utilization on
the front-end, as well as the amount of work DB2 has to do on the back-end.
If the front-end can only cache 200 rows at a time, I'd like to prevent DB2
from having to process more than that, if possible.

Thanks,

Mike

"Anton Versteeg" <an************@nnll.ibm.com> wrote in message
news:cd***********@sp15en20.hursley.ibm.com...
It depends if an index is used or not with an ORDER BY.
When there are fields in your select that are not part of the index,
DB2 might decide a table scan is better so it can do sequential
prefetch. Specially if the index sequence is not the same as the
physical row sequence, retrieving rows via the index might be much more
expensive than retrieving them sequentially. It also depends on the size
of your table of course.

Mike wrote:
Regarding Indexes
-------------------

3. It was also stated that any indexes on the table would only be used by the WHERE clause, not by the ORDER BY clause. Is that right? I was
thinking that we would need an index for the subsetting (date range) WHERE clause, plus indexes for each pre-defined sort order (or at least the most popular sort orders.) But it was implied that you can't do this.

Regards,

Mike
mm********@NOSPAMyahoo.com


--
Anton Versteeg
IBM Certified DB2 Specialist
IBM Netherlands

Nov 12 '05 #7


"Buck Nuggets" <bu*********@yahoo.com> wrote in message
news:66**************************@posting.google.com...
"Mike" <mw*********@NOSPAMyahoo.com> wrote in message news:<c7********************@giganews.com>...
3. Regarding the use of cursors for paging: AFAIK, this would require
keeping one connection allocated to each user. We have to plan for high
concurrency, and so this would not be acceptable. (Yes, we will of course pool the database connections.)

4. Regarding result set size: Two objectives here: 1) Minimize the number of rows that gets sent to the front-end in any single query (so that we
don't tax memory constraints on the front-end), and 2) Minimize the number of rows that DB2 has to process (if the front-end can only handle, say, 100 rows, then we would like for DB2 to stop processing the result set after 100 rows - if possible).


Your solutions are pretty restricted if you want separate connections
for each paging query: either you only have one sorting order, or you
cut separate tables each period (day, hour, whatever) that have
separate orders & separate ids in each pre-calculated.


Thanks for the comments. But, do you agree that if high concurrency is a
concern, that we can't really allocate a connection to a single user, while
the user "thinks" about paging forward/backward?

About the temporary table - wouldn't that be on a per-connection basis,
though? Or could I access the same temporary table from *any* connection?
(Of course I realize that I could optimize the connection pool such that the
front-end will *try* to re-use the same connection, if available - but I
don't want to count on this.)

Thanks,

Mike
Nov 12 '05 #8

"Mike" <mw*********@NOSPAMyahoo.com> wrote in message news:<Lf********************@giganews.com>...
Thanks for the comments. But, do you agree that if high concurrency is a
concern, that we can't really allocate a connection to a single user, while
the user "thinks" about paging forward/backward?
sure, i can see how this would be a concern.
About the temporary table - wouldn't that be on a per-connection basis,
though? Or could I access the same temporary table from *any* connection?
(Of course I realize that I could optimize the connection pool such that the
front-end will *try* to re-use the same connection, if available - but I
don't want to count on this.)


no, you can't share it across connections. However, there are still
quite a few other, similar options: create a permanent table instead.
You'd have to figure out when to asynchronously drop it, it's slower to
create, and you'd have to figure out what to name it, but otherwise it
could work the same way as the temp table.

Still, the first query's retrieval time might be expensive. I'm not
sure what kind of insertion rate you're looking at, or what the
model looks like (unique key?), but... have you considered using
something like multiple MQTs, with ids representing the various sort
orders? They would be impossible to keep perfectly up to date (update
hourly?), but retrieval would be very fast since you could easily
rely on an index to grab a small number of rows at a time.
Nov 12 '05 #9
