Bytes IT Community

Best practice

I'm building a web based auction system for multiple clients. I have
auctions, items, and invoices I need to assign generated keys for. Since
each of these entities is represented by a table, the keys are assigned
as GENERATED BY DEFAULT. While this works, and is easy to reference
programmatically, I wonder if it is the best technique. All of the keys
are drawn from one sequence, not depending on the customer or the
specific auction. Would it be better to have an auction_id sequence for
each client, an item_id sequence for each auction (the client never sees
either one of these columns, but there is no natural key for these
tables that is ready and known when they are instantiated) and
invoice_no sequence for each client or auction? If so, how should I
implement these to both minimize SQL calls and ensure no duplication?
With sequences, I can do something like:

INSERT INTO auctions (client_id, auction_id, ...)
VALUES ($client, NEXT VALUE FOR auction_id_seq_$client, ...)

If instead I use the MAX(auction_id), what is the syntax to select the
MAX and INSERT the new row in one atomic operation?

If I use a column associated with each user for next_auction_id, the
same question occurs.

The same question, only increased in difficulty occurs when I have many
rows to add at once with one INSERT statement.
Jul 22 '06 #1
18 Replies


"Bob Stearns" <rs**********@charter.net> wrote in message
news:Et*****************@fe06.lga...
<original question snipped>
This will generate a unique value for auction_id within a client:

INSERT INTO auctions (client_id, auction_id, ...)
VALUES ($client,
        (SELECT COALESCE(MAX(auction_id), 0) + 1
           FROM auctions
          WHERE client_id = $client), ...);

If you want to insert multiple rows with one insert statement, you will have
to increment the +1 in your code:

INSERT INTO auctions (client_id, auction_id, ...)
VALUES
($client, (SELECT COALESCE(MAX(auction_id), 0) + 1
             FROM auctions WHERE client_id = $client), ...),
($client, (SELECT COALESCE(MAX(auction_id), 0) + 2
             FROM auctions WHERE client_id = $client), ...),
($client, (SELECT COALESCE(MAX(auction_id), 0) + 3
             FROM auctions WHERE client_id = $client), ...),
($client, (SELECT COALESCE(MAX(auction_id), 0) + 4
             FROM auctions WHERE client_id = $client), ...);

If you need to know the value of the auction_id you just inserted, you
can use this statement to do the insert and find the value all in one
statement:

SELECT auction_id
FROM FINAL TABLE (INSERT INTO auctions (client_id, auction_id, ...)
                  VALUES ($client,
                          (SELECT COALESCE(MAX(auction_id), 0) + 1
                             FROM auctions
                            WHERE client_id = $client), ...));

If the PK is (client_id, auction_id), then you probably want to make the PK
index the clustering index so that all auctions for a particular client will
be physically close together. That means you will have to create the unique
index on these columns before you define the PK.
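For instance, something along these lines (an untested sketch; the index and
constraint names are made up for illustration):

```sql
-- Create the unique clustering index first, then define the PK on top of
-- it; DB2 will reuse the existing index to enforce the constraint.
CREATE UNIQUE INDEX auctions_ix
    ON auctions (client_id, auction_id)
    CLUSTER;

ALTER TABLE auctions
    ADD CONSTRAINT auctions_pk PRIMARY KEY (client_id, auction_id);
```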

IBM: When are we going to get ALTER INDEX to specify the clustering index?
DB2 for z/OS has had this for at least 15 years!!!
Jul 22 '06 #2

Mark A wrote:
IBM: When are we going to get ALTER INDEX to specify the clustering index?
DB2 for z/OS has had this for at least 15 years!!!
My personal opinion:
MDC is much more powerful than clustering indexes.
Makes much more sense to invest into that area, IMHO.

Cheers
Serge

--
Serge Rielau
DB2 Solutions Development
IBM Toronto Lab

IOD Conference
http://www.ibm.com/software/data/ond...ness/conf2006/
Jul 22 '06 #3

Bob Stearns wrote:
<original question snipped>
Most of your questions were answered by Mark; I'll just add the delta:
Don't use MAX. Sequences and IDENTITY were invented to get around the
limitations of MAX.
My recommendation is to use a sequence, such as:

SELECT pk FROM NEW TABLE(
  INSERT INTO T INCLUDE (i INTEGER)
    SELECT NEXT VALUE FOR S1 AS pk, V.*
    FROM (VALUES (CAST(? AS INT), CAST(? AS DOUBLE), 1),
                 (?, ?, 2), ...) AS V(c1, c2, i))
ORDER BY i;

Now you also know which input row goes with which sequence value.
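A single-row flavor of the same idea, using the OP's table names (the
sequence name here is an assumption):

```sql
-- Hypothetical sequence; one per tenant is possible, but a single
-- sequence shared by all clients is simpler and still duplicate-free.
CREATE SEQUENCE auction_seq AS INTEGER START WITH 1;

-- Insert and read back the generated key in one statement.
SELECT auction_id
FROM NEW TABLE (INSERT INTO auctions (auction_id, client_id)
                VALUES (NEXT VALUE FOR auction_seq, ?));
```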

Cheers
Serge

Jul 22 '06 #4

"Serge Rielau" <sr*****@ca.ibm.com> wrote in message
news:4i************@individual.net...
My personal opinion:
MDC is much more powerful than clustering indexes.
Makes much more sense to invest into that area, IMHO.

Cheers
Serge
MDC is overkill in some situations, especially if the desired clustering
column has very high cardinality.

For example, I have a hotel_property table with a cardinality of 75,000 and
a row length of 300, and I want a clustering index on (LONGITUDE, LATITUDE)
for the following very high frequency search (both columns are defined as
DECIMAL(9,6)):

SELECT hotel_nbr, hotel_name, ...
FROM hotel_property
WHERE longitude between ? and ?
AND latitude between ? and ?;

MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE).
Jul 22 '06 #5

Mark A wrote:
> MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE).
You roll it up on 2 dimensions, e.g. LONGITUDE / 10 and LATITUDE / 10.
Now the beast can serve up BOTH dimensions. A clustering index requires a
reorg to keep it clustered and covers only a single dimension.

Cheers
Serge

Jul 22 '06 #6

"Serge Rielau" <sr*****@ca.ibm.com> wrote in message
news:4i************@individual.net...
> You roll it up on 2 dimensions, e.g. LONGITUDE / 10 and LATITUDE / 10.
> Now the beast can serve up BOTH dimensions. A clustering index requires a
> reorg to keep it clustered and covers only a single dimension.
I don't quite understand the "/ 10". Every longitude and latitude (taken to
6 decimal places) is unique (except in rare cases).

My Hotel table has very few updates and does not need to be reorged often.

In the example of the Original Poster, I would think that a lot of
client_id's have only one or a few auctions, so it seems that there would be
a lot of wasted space, and a big penalty if a table scan was ever needed.
Jul 22 '06 #7

Mark A wrote:
> I don't quite understand the "/ 10". Every longitude and latitude (taken
> to 6 decimal places) is unique (except in rare cases).
CREATE TABLE places(id INT NOT NULL PRIMARY KEY,
                    name VARCHAR(100),
                    latitude DECIMAL(8, 6),   -- +/-90 degrees
                    longitude DECIMAL(9, 6),  -- +/-180 degrees
                    latstripe INTEGER GENERATED ALWAYS
                      AS (INTEGER(latitude)),
                    longstripe INTEGER GENERATED ALWAYS
                      AS (INTEGER(longitude) / 2))
ORGANIZE BY DIMENSIONS (latstripe, longstripe);

Now each row gets bucketized into a 180 * 180 grid.
The block indexes will be a single page each.
To find any place by longitude and/or latitude, DB2 will AND the
block indices and then scan within the grid sector.
Of course you can add regular indices as well.
You never need to reorg the beast except for page overflow due to updates.
Want to find all the places in Europe? No problem, box Europe in and go
from there.
If your table gets really big you may want to allow for finer granularity.
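For example, boxing in Europe might look something like this (the
coordinates are rough illustrative values):

```sql
-- Rough bounding box for Europe; DB2 can AND the latstripe/longstripe
-- block indexes and then scan only the matching grid cells.
SELECT id, name
FROM places
WHERE latitude  BETWEEN 35 AND 70
  AND longitude BETWEEN -10 AND 40;
```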

I love MDC :-)

Cheers
Serge
Jul 22 '06 #8

"Serge Rielau" <sr*****@ca.ibm.com> wrote in message
news:4i************@individual.net...
<CREATE TABLE example snipped>
I will try this and run some explains and maybe test execution time, but I
am still skeptical. All searches for hotels are within a relatively small
radius, usually about 25 miles (of course it is actually a rectangle and not
a radius).
Jul 22 '06 #9

Mark A wrote:

I will try this and run some explains and maybe test execution time, but I
am still skeptical. All searches for hotels are within a relatively small
radius, usually about 25 miles (of course it is actually a rectangle and not
a radius).
The biggest mistake made with MDC is to choose too fine a granularity.
Think of it as range partitioning... squared, and automatic.

Cheers
Serge
Jul 23 '06 #10

Mark A wrote:
> MDC would not be good on (LONGITUDE), or (LONGITUDE, LATITUDE).
Agreed. But in such cases, a spatial index would be better in any case.

--
Knut Stolze
DB2 Information Integration Development
IBM Germany
Jul 24 '06 #11

"Knut Stolze" <st****@de.ibm.com> wrote in message
news:ea**********@lc03.rz.uni-jena.de...
> Agreed. But in such cases, a spatial index would be better in any case.
In theory yes, but in practice no. Most of the time someone is searching for
a hotel within 5 miles or less of a reference point, and that means (for a
5-mile radius) a square of 10 miles in each dimension (figuring out the true
radius is not worth the effort in most cases).

On average, there are maybe 25 hotels or fewer that would be returned by
such a query. Taking into account other reference points with the same
longitude coordinates, but that are in a totally different latitude range
(also 10 miles for a 5-mile radius), the number is usually only doubled, or
tripled at worst. It is very fast to just filter those out by including
latitude as the second column in the index (along with longitude).

Certainly, some geo applications require more sophisticated search
algorithms for maximum efficiency and performance, but this one doesn't.
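A sketch of the index and query this describes (table and column names taken
from earlier in the thread; the 0.1-degree half-width is an assumed stand-in
for the 5-mile box):

```sql
-- Two-column index: longitude drives the range scan, latitude filters
-- out the false positives from other latitudes at the same longitude.
CREATE INDEX hotel_geo_ix ON hotel_property (longitude, latitude);

SELECT hotel_nbr, hotel_name
FROM hotel_property
WHERE longitude BETWEEN ? - 0.1 AND ? + 0.1
  AND latitude  BETWEEN ? - 0.1 AND ? + 0.1;
```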
Jul 24 '06 #12

Raj
MDC is bad with columns having high cardinality...
I have tried MDC on a table on item_key and cust_key; each has a card
of 10,000, and about 80 million records.
The MDC slice (card = 1000) was made up of gen_item_key (= item_key/1000)
and gen_cust_key (= cust_key/1000).

The MDC table size was comparable to the original table, but the index
was not being used because of non-uniform density of the generated
columns?
Also table scans take 1-2 min longer, and sorts after a table scan
take almost twice the time compared to the original table.

Jul 24 '06 #13

Mark A wrote:
> Certainly, some geo applications require more sophisticated search
> algorithms for maximum efficiency and performance, but this one doesn't.
OK, back to the original debate on clustering indexes.
The difference between a clustered index and a non-clustered index (to
me as an optimizer amateur) is whether DB2 bothers to throw in a
RIDSORT/FETCH compared to a straight FETCH. Sorting 25 RIDs (4 bytes each
in V8, 6 bytes each in V9) is truly not anything that sounds worth
troubling myself for with a clustering index :-)

Cheers
Serge
Jul 24 '06 #14

Raj wrote:
> Also table scans take 1-2 min longer, and sorts after a table scan
> take almost twice the time compared to the original table.
It took more than a minute to scan 80M rows? What kind of system was
that? I worked on a Linux BCU (a dozen data nodes) a couple of weeks ago
where we burned through 1.4B rows in the fact table plus several
dimensions, and rolled up the intermediate result set from the join of 140M
rows with some aggregation in some 20 seconds (no MQTs, of course).
MDC was used on the fact table.

I don't see a reason why a sort would care about the table's structure at
all. Could it be MDC was the scapegoat here?
Jul 24 '06 #15

Serge Rielau wrote:
> [...] Sorting 25 RIDs is truly not anything that sounds worth
> troubling myself for with a clustering index :-)
In the example of the Hotel table, after the hotels within a particular
latitude and longitude range are determined via the index scan, I want
to be able to retrieve the data rows on the table with as few data page
fetches, and using as few bufferpool pages for those data pages, as
possible.

Jul 24 '06 #16

Mark A wrote:
In the example of the Hotel table, after the hotels within a particular
latitude and longitude range are determined via the index scan, I want
to be able to retrieve the data rows on the table with as few data page
fetches, and using as few bufferpool pages for those data pages, as
possible.
OK, that sounds good, but how much slower is your system going to be if
your buffer pool hit ratio drops by 1%? Won't you have the most popular
locations in the buffer pool anyway?
I confess I'm pushing you here a bit. The point I'm trying to make is
that there is a point where the effort doesn't justify the return.
When I look at troubled apps, a clustering index is one of the less
popular arrows in my quiver.
Whenever I see a post from a novice or casual user I try to KISS and
assume that 85% of theoretical best throughput will make that customer
quite happy as long as it's easy to get there.

Cheers
serge

Jul 24 '06 #17

"Serge Rielau" <sr*****@ca.ibm.com> wrote in message
news:4i************@individual.net...
> Whenever I see a post from a novice or casual user I try to KISS and
> assume that 85% of theoretical best throughput will make that customer
> quite happy as long as it's easy to get there.
To me, it is easier to design things in the most optimal manner than it is
to design in a less optimal manner and then try to measure or speculate
about whether there is any negative impact, or how much difference there
would be. I don't see any real cost in clustering tables that have
relatively low insert or update activity. Once a person learns how to do it
properly, it becomes second nature. I sleep better at night knowing that DB2
is accessing the least number of pages necessary to process the SQL
statements.

If, like the OP, one has a web application to keep track of a relatively
small number of auctions, then maybe the performance difference does not
matter much, or at all.

If I have a worldwide air, hotel, and rental car reservation system with
hundreds of transactions per second, and hundreds of tables that I can make
clustering decisions on, then it could make a difference. Especially if I am
memory constrained with a 32-bit instance and can't allocate bufferpools
over 2GB (I inherited this system with XML Extender, which is not supported
on 64-bit).

Clustering is especially important on DB2 for z/OS because if you don't
choose a clustering index, DB2 will choose one for you (the first index
created). So it is not always a question of clustering or not clustering,
but of bad clustering versus good clustering. I can guarantee you that when
I started working with DB2 systems on MVS in the late 1980's, proper
clustering of tables made a huge difference in performance in most cases. I
still believe that it makes a difference on many systems.

When talking about natural keys versus dumb keys (the subject of the OP),
clustering is a relevant consideration because natural keys provide a viable
means of clustering dependent tables with composite PK's. It makes no sense
to cluster a single column PK with a generated or random number. Using
natural keys provides a candidate clustering index that theoretically
improves performance (the degree to which varies) and therefore it is one
advantage of natural keys that should be mentioned.
Jul 25 '06 #18

Raj
No, that's not what I said: a table scan takes around 11 mins on the
non-MDC table and about 13 mins on the MDC table. There is at least a
100,000-timeron increase in the access plan.
I now have the table built on a non-unique column (card = 80). It
works great with that.
Serge Rielau wrote:
> It took more than a minute to scan 80M rows? What kind of system was
> that? [...] Could it be MDC was the scapegoat here?
Jul 25 '06 #19

This discussion thread is closed; replies have been disabled.