DB2 Universal Language support

Kieran Green

Greetings,

We are building a Windows application in C++ that uses OLE DB to
connect to DB2 8.2 on AIX. The app holds all string data as wchar_t
and generates dynamic SQL, typically with parameters bound as
DBTYPE_WSTR, so it is a Unicode app.

We don't know whether to use the vargraphic datatype for storing
strings, or varchar, and which database character sets to support.

Option a - support only a UTF-8 database and use vargraphics.
Performance should be better since the app stores Unicode strings in
Unicode columns (UCS-2). However, from a product perspective this is
limiting because customers may want to use, say, IBM-1252, because they
want to! But they could be told that UTF-8 will store their data just
fine. Yes, we could also use vargraphics in any MBCS, but there are
conversion issues (control characters convert to SUBs in IBM-943).

Option b - support any database character set, and use varchar string
columns. This requires us to apply a multiplier (e.g. 3 or 4) to the
column length (to support Asian languages when UTF-8 is chosen),
which heavily devours the limited row size in DB2 (the declared column
length counts against the page size at column creation time). Also,
going with a large page size like 32K may hurt performance. Yes, we
could make our code choose whether to multiply depending on the
language, but for simplicity we don't. UTF-8 is nice because, when
mostly ASCII is stored, it doesn't store a lot of extra bytes the way
vargraphics do (saves disk space).
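
To make the space trade-off between the two options concrete, here is a
minimal sketch that counts how many bytes the same text would take as
UTF-8 (varchar in a UTF-8 database) versus UCS-2 (vargraphic). It is an
illustration only, not anything DB2 or OLE DB provides: the function
names are hypothetical, char16_t stands in for the UTF-16 code units a
Windows wchar_t holds, and counting an unpaired surrogate as 3 bytes is
an assumption of the sketch.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store a UTF-16 string as UTF-8.
// Assumption: an unpaired surrogate is counted as 3 bytes (the size of U+FFFD).
std::size_t utf8_length(const std::u16string& s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;                       // ASCII
        } else if (c < 0x800) {
            bytes += 2;                       // e.g. Latin-1 supplement, Greek, Cyrillic
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size() &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair: one supplementary character
            ++i;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP, including most CJK
        }
    }
    return bytes;
}

// Bytes needed to store the same code units in a 2-byte-per-unit column.
std::size_t ucs2_length(const std::u16string& s) {
    return 2 * s.size();
}

// Declared byte length of a varchar column sized for `chars` characters
// with the given multiplier (hypothetical helper mirroring option b).
std::size_t worst_case_varchar_bytes(std::size_t chars, std::size_t multiplier) {
    return chars * multiplier;
}
```

For plain ASCII the UTF-8 count is half the UCS-2 count, while for text
that is mostly CJK the UTF-8 count is one and a half times the UCS-2
count, which is the tension between options a and b.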

Our concern is that we support the most popular character set that
real people use. What is most prevalent? If we choose option a, will
there be a customer that balks because they want IBM-943?
Specifically: will customers be perfectly happy with option a, or
will some demand to use other MBCSs (such as IBM-943), or would they
prefer to have the varchars?

Thanks,
Kieran

Nov 12 '05 #1

Mark Yudkin

vargraphic is for DBCS, not Unicode. Use a Unicode database with
(var)chars. (Var)chars in a Unicode db are actually UCS-2, although it
appears counterintuitive!

Nov 12 '05 #2

Kieran Green

MS SQL Server uses nvarchar for Unicode. Isn't the analogy: MSSQL's
nvarchar = DB2's vargraphic? Assuming basic "Unicode" is two bytes,
and there are schemes to encode Unicode, such as UTF-8, wouldn't
MSSQL's nvarchar and DB2's vargraphic both store double-byte Unicode
in the basic form of UCS-2?

The DB2 docs state: "When a Unicode database is created, CHAR,
VARCHAR, [etc] data are stored in UTF-8, and GRAPHIC, VARGRAPHIC,
[etc] data are stored in UCS-2." It would seem that when our Unicode
OLEDB app inserts into a varchar column (DB2 database is created as
UTF-8), the Unicode data gets encoded and stored as UTF-8. Is that
right?
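
For reference, this is roughly what "encoded as UTF-8" means for a
buffer of UTF-16 code units such as one bound as DBTYPE_WSTR. It is a
sketch of the standard UTF-16 to UTF-8 transformation, not DB2's or the
OLE DB provider's actual conversion code; the function name is
hypothetical, char16_t stands in for Windows wchar_t, and replacing an
unpaired surrogate with U+FFFD is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to a UTF-8 byte string.
// Assumption: an unpaired surrogate is replaced by U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A single UTF-16 code unit never expands to more than 3 UTF-8 bytes,
since a 4-byte sequence always comes from a two-unit surrogate pair.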

I've heard of "pure DBCS", in reference to IBM Asian character sets.
By "DBCS", do you mean one of the CCSID numbers relating to specific
IBM language encodings in double-byte? If so, then vargraphics would
be great for "DBCS", but if the vargraphic is UCS-2, isn't it as good
a receptacle to store Unicode as MSSQL's nvarchar is?

I'm also concerned about glossing over subtleties with language
encodings if we employ Option a or b, such as loss of support of
characters. So if anyone has some real expertise in use-cases in
Options a or b, that would be useful.

Much Thanks!
Kieran

Nov 12 '05 #3

Mark Yudkin

DB2 does not have an equivalent to MS SQL's nvarchar - DB2 does not have a
Unicode data type. I too would like to see such a solution, but that's not
the way IBM decided to do things. Vargraphic is not Unicode, it is DBCS, an
earlier standard for handling CJK languages.

Provided your database has a Unicode code page, the data will be Unicode.
The internal encoding is not really important.

As I implied, I don't recommend either of your options a or b.

Nov 12 '05 #4
