Bytes IT Community

How does DB2's estimate size tool calculate avg row size when the table is empty?

Does DB2 just fudge it when the table is empty? Is there a "formula"
for average row size when you have variable-length records, or do you
really have to know what your application is packing into those
VARCHAR columns?

Bill Leung
le****@aptea.com
Nov 12 '05 #1
7 Replies


Until you've run RUNSTATS, most of the stats will be -1, which is DB2's
way of saying it does not know. So there is no fudging or guessing; the
optimizer simply proceeds on the assumption that the data is
unavailable. Take cardinality (the number of rows in the table) as an
example: for a new table, DB2 could assume 0 or 1, but the -1 value for
cardinality says DB2 does not know; maybe the table was created with an
import that populated it with 10,000 rows.


Nov 12 '05 #2

For most datatypes, the average row length is calculated by adding up
AVGCOLLEN in SYSCAT.COLUMNS (which is updated when you run RUNSTATS).
Space usage for long columns is based on their defined size, and on top
of everything else a 10-byte row overhead is added. So even though the
table may be empty, the average row length will have some value.
The query below calculates the average row length; I use it in a common
table expression in my SQL version of REORGCHK, and the results it
produces match those from REORGCHK in DB2 v8 FixPak 4.
By the way, I've noticed in the past that some of the calculations for
REORGCHK output are done differently from those of the size estimator;
it looks like the two tools were developed independently of each other.

SELECT TABSCHEMA, TABNAME,
       CASE
           WHEN MIN(CASE
                        WHEN A.LOGGED <> '' THEN 0
                        ELSE A.COLCARD
                    END) >= 0
           THEN SUM(CASE
                        WHEN A.LOGGED = ''          THEN A.AVGCOLLEN
                        WHEN A.LENGTH <= 1024       THEN 72
                        WHEN A.LENGTH <= 8192       THEN 96
                        WHEN A.LENGTH <= 65536      THEN 120
                        WHEN A.LENGTH <= 524000     THEN 144
                        WHEN A.LENGTH <= 4190000    THEN 168
                        WHEN A.LENGTH <= 134000000  THEN 200
                        WHEN A.LENGTH <= 536000000  THEN 224
                        WHEN A.LENGTH <= 1070000000 THEN 256
                        WHEN A.LENGTH <= 1470000000 THEN 280
                        WHEN A.LENGTH <= 2147483647 THEN 316
                    END
                    + CASE
                          WHEN A.LOGGED <> '' AND A.NULLS = 'Y' THEN 1
                          ELSE 0
                      END) + 10
           ELSE -1
       END AS AVGROWLEN
FROM SYSCAT.COLUMNS A
GROUP BY TABSCHEMA, TABNAME
Nov 12 '05 #3

Doesn't the optimizer use "guesstimates" in this case?
If you run an explain on SELECT * FROM tablename with no data and no
stats, Visual Explain will show you default values that it calculates.
These are not stored anywhere but are generated at explain time, since
the optimizer has nothing else to work with. So I would guess that in
this case it does do some fudging, because when looking at the output of
show statistics, there are values!
Does that make sense? Pierre.


Nov 12 '05 #4

How did you come up with those numbers in your CASE statement for long
columns? I have a table with a CLOB(1M) column. If I do
AVG(LENGTH(CLOB_COL)), I get about 2000; but if I were to use your
formula, it would be 168.

Bill Leung
le****@aptea.com


Nov 12 '05 #5

The table itself stores only descriptors for LOB columns; these
descriptors point to the location of the LOB data. The row size
therefore includes the size of the descriptor, not the actual LOB data.
The descriptor sizes are given in the DB2 SQL Reference, under the
CREATE TABLE section:
Maximum LOB Length    LOB Descriptor Size
             1 024                     72
             8 192                     96
            65 536                    120
           524 000                    144
         4 190 000                    168
       134 000 000                    200
       536 000 000                    224
     1 070 000 000                    256
     1 470 000 000                    280
     2 147 483 647                    316
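[Editor's note: the descriptor-size table can be expressed as a small lookup. This is a hypothetical Python sketch (the function name and structure are not part of any DB2 API) that returns the descriptor size charged in-row for a LOB column of a given defined length:]

```python
import bisect

# (maximum defined LOB length, descriptor size in bytes),
# from the DB2 SQL Reference table quoted above
_LOB_DESCRIPTOR_SIZES = [
    (1024, 72), (8192, 96), (65536, 120), (524000, 144),
    (4190000, 168), (134000000, 200), (536000000, 224),
    (1070000000, 256), (1470000000, 280), (2147483647, 316),
]

def lob_descriptor_size(defined_length):
    """Return the in-row descriptor size for a LOB of this defined length."""
    limits = [limit for limit, _ in _LOB_DESCRIPTOR_SIZES]
    i = bisect.bisect_left(limits, defined_length)  # first limit >= length
    if i == len(limits):
        raise ValueError("LOB length exceeds the 2 GB maximum")
    return _LOB_DESCRIPTOR_SIZES[i][1]
```

For the CLOB(1M) column discussed above, `lob_descriptor_size(1048576)` gives 168, matching the CASE expression in the earlier query.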



Nov 12 '05 #6

Thanks Amit.

Not to beat a dead horse, but IBM probably should have left it at -1 or
0 when there are zero rows in the table. I have described my CLOB table
below. Using Amit's calculation, it would yield an AVGROWLEN of 210, or
211 with the nullability byte: (17 + 4 + 168 + 11) + 10. The Control
Center's estimator gives 193. I am beginning to think that it might have
factored in the "Nulls" attribute; perhaps nullable columns account for
80% or 90% of the allowable column length. Unless the Toronto developers
divulge their formula, I am not convinced.
$ db2 "describe table TNX_DOC"

Column Type
name name Length Scale Nulls
--------------------------- ------------------ -------- ----- ------
TNX_ID VARCHAR 13 0 No
TNX_DOC_CNT INTEGER 4 0 No
TNX_PY_CLB CLOB 1048576 0 Yes
TNX_INS_DTTIME TIMESTAMP 10 0 Yes

4 record(s) selected.

$ db2 "select SUBSTR(TABSCHEMA,1,8) as tabschema,
SUBSTR(TABNAME,1,16) as tabname, SUBSTR(COLNAME,1,16) as colname,
LOGGED, AVGCOLLEN from SYSCAT.COLUMNS WHERE TABSCHEMA='LDEV' and
TABNAME='TNX_DOC'" | pg

TABSCHEMA TABNAME COLNAME LOGGED AVGCOLLEN
--------- ------------ ---------------- ------ ---------
LDEV TNX_DOC TNX_ID 17
LDEV TNX_DOC TNX_INS_DTTIME 11
LDEV TNX_DOC TNX_PY_CLB Y -1
LDEV TNX_DOC TNX_DOC_CNT 4

4 record(s) selected.



Nov 12 '05 #7

For your example, DB2 RUNSTATS sets the average column lengths to
4, 4, -1, 11 even when the table is empty. My query is based on those
average column lengths and will yield 198 (5 more than the Control
Center).
The query is actually a subquery I use for doing REORGCHK via SQL: using
the average row length it produces, I can mimic the TSIZE result
(estimated table size = average row length * cardinality) of the DB2
REORGCHK utility. I have no clue why IBM chose to use different formulas
for the Control Center and REORGCHK; I used the REORGCHK formula since
it is documented.
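[Editor's note: the arithmetic behind that 198 can be sketched in Python. This is a hypothetical mock-up (the function and tuple layout are not DB2 artifacts) of the REORGCHK-style formula: sum the RUNSTATS AVGCOLLEN values for regular columns, charge each LOB its descriptor size plus one byte if nullable, then add the 10-byte row overhead:]

```python
def avg_row_len(columns):
    """columns: (avgcollen, is_lob, descriptor_size, nullable) tuples."""
    total = 10  # fixed per-row overhead
    for avgcollen, is_lob, descriptor_size, nullable in columns:
        if is_lob:
            total += descriptor_size  # e.g. 168 for a CLOB(1M)
            if nullable:
                total += 1            # extra byte for a nullable LOB
        else:
            total += avgcollen        # AVGCOLLEN from SYSCAT.COLUMNS
    return total

# TNX_DOC with the table empty: RUNSTATS reports AVGCOLLEN 4, 4, -1, 11
tnx_doc = [
    (4,  False, None, False),  # TNX_ID         VARCHAR(13)
    (4,  False, None, False),  # TNX_DOC_CNT    INTEGER
    (-1, True,  168,  True),   # TNX_PY_CLB     CLOB(1M), descriptor 168
    (11, False, None, True),   # TNX_INS_DTTIME TIMESTAMP
]
print(avg_row_len(tnx_doc))  # 4 + 4 + (168 + 1) + 11 + 10 = 198
```

Multiplying this by the table's cardinality gives the TSIZE estimate that REORGCHK reports; the Control Center's 193 evidently comes from a different, undocumented formula.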
Nov 12 '05 #8

This discussion thread is closed; replies have been disabled.