Bytes IT Community

CHAR to VARCHAR conversion in DB2(iSeries)

Background
When remodeling a legacy application, one of an architect's important tasks is to make optimum use of the database's storage capabilities. Most legacy applications are constrained by the technology available at the time of their development and are therefore no longer optimal by current standards. One such case is the extensive use of CHAR fields, which are no longer an efficient choice for storage. This article presents a systematic approach to converting CHAR data fields to VARCHAR in DB2 for iSeries.

Problem Statement
The basic limitation of the CHAR data type is its inflexibility: it cannot allocate only the space actually required. A CHAR(50) field always reserves 50 bytes in the database, irrespective of its content. If most values in the field are shorter than 35 characters, the remaining 15 bytes are pure waste, since the database cannot use them for anything else. The result is extra disk space consumption.
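As a back-of-the-envelope illustration (with hypothetical numbers), the wasted space can be estimated as the declared length minus the average data length, times the row count:

```python
def char_waste_bytes(declared_len, avg_data_len, row_count):
    """Bytes lost to trailing pad in a fixed-length CHAR column."""
    return (declared_len - avg_data_len) * row_count

# a CHAR(50) column averaging 35 characters of real data,
# in a hypothetical table of one million rows:
print(char_waste_bytes(50, 35, 1_000_000))  # 15000000, i.e. about 15 MB
```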

Solution
In such a case, the most immediate option for saving storage space is to let the database store only the relevant data, without unnecessary trailing blanks. For this, architects simply convert CHAR fields to VARCHAR.

Is it really a solution?
Is simply converting all CHAR columns to VARCHAR an appropriate solution, and if not, what is the optimum way to do it? This is where the real catch lies; ignoring it can lead to performance degradation that is hard to track down at later stages.
Ideally it is a two-step process:
(a) Identifying the relevant columns for conversion: Blindly changing all columns from CHAR to VARCHAR may require huge effort, as not only the table definitions but also the related programs that insert/update/delete these columns need to be changed accordingly. Therefore only those columns should be chosen where the space saving is significant. Please refer to the section “Recommendation for architects” to select appropriate columns.
(b) Converting the data definition of the identified columns and related programs from CHAR to VARCHAR: This requires some knowledge of how DB2 on iSeries implements VARCHAR storage. Please refer to the section “Implementation of VARCHAR storage in DB2 (iSeries)” below.

Implementation of VARCHAR storage in DB2 (iSeries)
Variable-length column (VARCHAR) support allows you to define any number of columns in a table as variable length, and using it can usually reduce the size of a table. Data in a variable-length column is stored internally in two areas: a fixed-length (ALLOCATE) area and an overflow area. The ALLOCATE area behaves like a CHAR column: irrespective of the actual data length, the database reserves as many bytes as specified in the ALLOCATE clause. Any data beyond that length is stored in the variable, or overflow, area. If a default value is specified, the allocated length is at least as large as that value.
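The two-area behaviour described above can be sketched as a small model (an illustration of the splitting rule only, not DB2's actual on-disk format):

```python
def varchar_layout(data_len, allocate):
    """Split a VARCHAR value's length between the fixed (ALLOCATE)
    area and the overflow area, per the rule described above."""
    fixed_part = min(data_len, allocate)
    overflow_part = max(0, data_len - allocate)
    return fixed_part, overflow_part

print(varchar_layout(5, 10))   # (5, 0): fits entirely in the fixed area
print(varchar_layout(38, 10))  # (10, 28): 28 bytes go to overflow
```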
Recommendation for architects
The following points help you determine the best way to use your storage area. If the primary goal is:
 Space saving: simply use the VARCHAR data type (instead of CHAR) in the table definition.
 Performance: use VARCHAR with the ALLOCATE clause. Create the table with the ALLOCATE keyword, for example:
CREATE TABLE tab (col1 VARCHAR(40) ALLOCATE(10),
                  col2 VARCHAR(40) ALLOCATE(10),
                  col3 VARCHAR(40) ALLOCATE(7))

In many applications, performance must be considered. If you use the default VARCHAR definition, i.e. ALLOCATE(0), it can double the disk unit traffic: ALLOCATE(0) requires two reads, one for the fixed-length portion of the row and one for the overflow area. A variable-length implementation with a carefully chosen ALLOCATE minimizes overflow and space and maximizes performance.

It is possible to balance space savings and performance by using the ALLOCATE clause properly. When you define a table with variable-length data, you must decide the width of the ALLOCATE area: it should be wide enough to hold at least 90% to 95% of the column's values.
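The 90%-95% rule can be applied mechanically once a length histogram is available; a minimal sketch, using a made-up histogram:

```python
def choose_allocate(length_counts, target=0.95):
    """Smallest ALLOCATE width covering `target` fraction of rows,
    given a {data_length: row_count} histogram."""
    total = sum(length_counts.values())
    cumulative = 0
    for length in sorted(length_counts):
        cumulative += length_counts[length]
        if cumulative / total >= target:
            return length
    return max(length_counts)

# toy histogram (hypothetical numbers): 95% of rows are 20 bytes or shorter
print(choose_allocate({5: 60, 10: 25, 20: 10, 40: 5}))  # 20
```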

Now the question is how to find the columns that need to be converted from CHAR to VARCHAR, and how to arrive at the magical figure for each column's ALLOCATE clause, because choosing irrelevant columns for conversion may lead to high effort but few gains. Normally, columns with very high variation in data length are the most suitable for conversion. Columns such as remarks, comments, etc. fall into this category, and columns in this category usually have length 35 or above. Also, the corresponding table should have enough rows to justify the effort involved, as the space saving is directly proportional to the number of rows. To get the list of relevant columns and their corresponding tables, run the SQL script listed below:

SELECT a.table_name, a.column_name, a.data_type, a.length, b.number_rows
FROM qsys2.syscolumns a , qsys2.systablestat b
WHERE a.table_schema = 'lib'
AND a.length >= 35
AND b.table_schema = a.table_schema
AND b.table_name = a.table_name
AND b.number_rows > 5000

Here, lib is the name of the library where the database tables reside. This SQL provides the list of all relevant columns. Architects should adjust the values for a.length and b.number_rows to suit the nature of their application. Next, we need to analyze the existing data and find the ALLOCATE figure that will hold 90-95% of the data. For this, run the following SQL against the database concerned, where col1 and tab1 are the column and its table under consideration.

SELECT LENGTH(RTRIM(col1)) Column_Length, count(*) Count
FROM tab1
GROUP BY LENGTH(RTRIM(col1))
ORDER BY LENGTH(RTRIM(col1))

This SQL classifies the column's values by their data length. Paste this information into an Excel spreadsheet and derive the cumulative sum and related percentage. Below is such data for a column defined as CHAR(50) in a table of 1,626,986 rows.

Length   Count     Cumulative Sum   Percentage

null 997280 997280
0 581 997861 0.613318738
1 116 997977 0.613390035
2 188 998165 0.613505586
3 603 998768 0.61387621
4 961 999729 0.614466873
5 14 999743 0.614475478
6 2 999745 0.614476707
7 23 999768 0.614490844
8 14 999782 0.614499449
9 26 999808 0.614515429
10 2786 1002594 0.616227798
11 500 1003094 0.616535115
12 2454 1005548 0.618043425
13 4438 1009986 0.620771168
14 12623 1022609 0.628529686
15 21423 1044032 0.641696978
16 33377 1077409 0.662211599
17 43236 1120645 0.68878589
18 57116 1177761 0.723891293
19 59734 1237495 0.760605807
20 65324 1302819 0.800756122
21 64123 1366942 0.840168262
22 59600 1426542 0.876800415
23 49011 1475553 0.906924214
24 40671 1516224 0.931921971
25 30856 1547080 0.9508871
26 23382 1570462 0.96525846
27 17017 1587479 0.975717677
28 12170 1599649 0.983197766
29 8537 1608186 0.988444891
30 5674 1613860 0.991932321
31 4108 1617968 0.994457236
32 2778 1620746 0.996164687
33 1922 1622668 0.997346013
34 1315 1623983 0.998154256
35 1030 1625013 0.998787328
36 686 1625699 0.999208967
37 523 1626222 0.99953042
38 280 1626502 0.999702517
39 169 1626671 0.99980639
40 145 1626816 0.999895512
41 56 1626872 0.999929932
42 37 1626909 0.999952673
43 20 1626929 0.999964966
44 10 1626939 0.999971112
45 19 1626958 0.99998279
46 5 1626963 0.999985863
47 3 1626966 0.999987707
50 20 1626986 1
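The cumulative sum and percentage columns above can also be derived with a short script instead of a spreadsheet; a sketch using toy data, not the real column:

```python
def cumulative_pct(length_rows, total_rows):
    """length_rows: (length, count) pairs from the GROUP BY query;
    total_rows: total rows in the table. Returns rows of
    (length, count, cumulative_sum, cumulative_fraction)."""
    out, running = [], 0
    for length, count in length_rows:
        running += count
        out.append((length, count, running, running / total_rows))
    return out

# hypothetical query output for a 10-row table:
for row in cumulative_pct([(0, 2), (10, 5), (25, 2), (40, 1)], 10):
    print(row)
```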

Almost 95% of the data has length 25 or less, so choosing ALLOCATE(25) will save around 35 MB of space with virtually no impact on performance. This can be roughly estimated as:
(Total_rows * percentile * saved_space_per_row) - (Total_rows * remaining_percentile * average_variable_space_per_overflowed_row)
As the latter half is not expected to be significant, an approximate value of the first half provides a rough estimate. Moreover, we need not be very precise; any close estimate can aid the decision. For tables with millions of rows, this saving can reach 300 MB from just one column.
The result of this analysis also helps decide whether a column is worth converting at all: a small storage saving indicates that the conversion effort for that column is not justified.
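Plugging the example's figures into the first half of the estimate gives the quoted figure of roughly 35 MB; a sketch, assuming a 2-byte length prefix per VARCHAR value (an assumption, not a figure stated in the article):

```python
def varchar_saving_bytes(rows, pct_within, char_len, allocate, prefix=2):
    """First-order saving from converting CHAR(char_len) to VARCHAR
    with ALLOCATE(allocate), counting only rows that fit in the
    allocate area and ignoring the overflow term, as in the text.
    The 2-byte length prefix per value is an assumption."""
    saved_per_row = char_len - allocate - prefix
    return rows * pct_within * saved_per_row

# the example above: 1,626,986 rows, ~95% within 25 bytes, CHAR(50)
print(varchar_saving_bytes(1_626_986, 0.95, 50, 25) / 1e6)  # roughly 35.5 MB
```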

This type of analysis is easy for existing columns, but when designing new tables, what should the ALLOCATE size be? Here the data architect initially has to choose an appropriate value according to the functionality of the field (75% of the total length is a reasonable starting point if there is no other input to aid the decision). After a considerable amount of production data has accumulated, the data architect should check the performance of the field with the SQL listed below:

SELECT DOUBLE_PRECISION(overflow)/(number_rows)
FROM qsys2.systablestat
WHERE table_name = 'tab1'
AND table_schema = 'lib'

Any value below 5% is fine; anything above 5% means the ALLOCATE length needs to be increased.
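The 5% rule of thumb can be expressed as a trivial check on the ratio returned by that query (the row counts here are hypothetical):

```python
def allocate_ok(overflow_rows, total_rows, threshold=0.05):
    """True when the SYSTABLESTAT overflow ratio is within the ~5%
    rule of thumb; False means the ALLOCATE width should grow."""
    return overflow_rows / total_rows <= threshold

print(allocate_ok(3_000, 100_000))   # True: 3% overflow, width is fine
print(allocate_ok(12_000, 100_000))  # False: 12% overflow, widen ALLOCATE
```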

Possible impact on code
Apart from the table definition, programs that use these fields may also need to be changed if host variables are involved. If you use host variables to insert or update variable-length columns, the host variables should themselves be variable length. Because blanks are not truncated from fixed-length host variables, using fixed-length host variables would cause more rows to spill into the overflow area and would increase the size of the table. In the following example, fixed-length host variables are used to insert a row into a table:
01 COL1 PIC X(40).
...
MOVE "SMITH" TO COL1.
EXEC SQL INSERT INTO tab1 VALUES
(:COL1, :COL2, :COL3)
END-EXEC.
The host variable COL1 is not variable length. The string "SMITH", followed by 35 blanks, is inserted into the VARCHAR column COL1. The 40-byte value is longer than the allocate size of 25, so fifteen of the thirty-five trailing blanks end up in the overflow area. In the next example, variable-length host variables are used to insert a row into a table:
01 VCol1.
49 Col1-LEN PIC S9(4) BINARY.
49 Col1-DATA PIC X(40).
...
MOVE "SMITH" TO Col1-DATA.
MOVE 5 TO Col1-LEN.
EXEC SQL INSERT INTO tab1 VALUES
(:VCol1, :VCol2, :VCol3)
END-EXEC.
The host variable VCol1 is variable length. The actual length of the data is set to 5. The value is shorter than the allocated length, so it can be placed in the fixed portion of the column and incurs no performance overhead.
Jan 16 '08 #1