>>On 1/16/2008 at 3:40 PM, in message <fm**********@news.tiscali.fr>,
Colin Booth <co*********@gmail.com> wrote:
> Frank Swarbrick wrote:
>> Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
>> application uses code page 1252 will it perform better because no code
>> page translation is required? I assume so. What type of performance
>> hit might I expect when connecting to a UTF-8 database? What
>> advantages would I get by using a UTF-8 database? Obviously it can
>> store the entire Unicode 'plane' (or whatever that's called), but if
>> my PC can't display it anyway what do I really care? And I guess that
>> storing XML data requires UTF-8? But I don't think we plan on
>> utilizing this.
>> What else should we know to make our decision?
>> Thanks,
>> Frank
> Hi
>
> Some characters that are single byte in 1252 are multi-byte in UTF-8.
> With a standard UK keyboard I think that there are 3 or 4 characters
> that are multi-byte in UTF-8.
>
> I like and prefer UTF-8, but the applications must be coded for UTF-8.
> E.g. if you have an 8 byte character column and an 8 byte (1252) entry
> field, and fill the entry field using at least 1 of the UTF-8
> multi-byte characters, you will get a data truncation error. Also you
> need to be careful about the number of characters in a column, as the
> byte count is not necessarily the character count.
>
> Things are becoming much more global. I have moved to France but still
> have some accounts and investments in the UK. I also purchase some
> things from the UK, and my address contains accents.
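(Your truncation scenario is easy to reproduce outside DB2. Here's a short Python sketch I put together; the string is purely illustrative, but it shows an 8-character value that fits an 8-byte 1252 column yet overflows the same byte width in UTF-8:)

```python
# Characters that fit in one byte under code page 1252 can take two or
# three bytes under UTF-8, so an 8-byte column can overflow even though
# the string is only 8 characters long.
s = "ABCDEF\u00a3\u20ac"   # 8 characters, ending in pound and euro signs

print(len(s))                      # character count: 8
print(len(s.encode("cp1252")))     # bytes in 1252: 8 -- fits an 8-byte column
print(len(s.encode("utf-8")))      # bytes in UTF-8: 11 -- would be truncated
```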
I question your comment that "the applications must be coded for UTF-8".
I just wrote an OpenCobol application with embedded DB2 SQL. No special
"UTF-8" coding, whatever that might mean. All it does is connect to the
database, retrieve the "string" and "hex" values of a set of VARCHAR(25)
columns, and display those values.
I run this against two databases:
TEST1 is a database defined as codeset IBM-1252.
UTFDB is a database defined as codeset UTF-8.
Here are the results:
CONNECT TO test1
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
A654455354A6
+0006: ¦TEST¦
80
+0001: €
CONNECT TO utfdb
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
C2A654455354C2A6
+0006: ¦TEST¦
E282AC
+0001: €
(+0001: € <== that actually shows as the euro symbol in Notepad.)
As you can see, for the UTF-8 database the euro symbol was stored as
x'E282AC'. But since my application uses code page 1252, DB2 was smart
enough to translate it to x'80', which is the value for the euro in code
page 1252.

Of course, when a symbol exists in UTF-8 but not in 1252, there will be
a problem.
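(The translation, and its limit, can be checked with a couple of lines of Python. This only sketches the codec behaviour, not DB2 itself, and the CJK character is just an arbitrary example of something outside 1252:)

```python
# The bytes the UTF-8 database stored for the euro sign:
stored = bytes.fromhex("E282AC")
ch = stored.decode("utf-8")
print(ch)                          # the euro sign
print(ch.encode("cp1252").hex())   # '80' -- the 1252 value my application saw

# A character with no 1252 equivalent cannot be converted:
try:
    "\u4e2d".encode("cp1252")
except UnicodeEncodeError as err:
    print("no 1252 code point:", err)
```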
I guess your point is, and it's a good one, that if a CHAR or VARCHAR
column is defined in a UTF-8 database then you, in a sense, have to
"over-define" the length to take into account the possibility of
multi-byte characters. For instance, a 1-character field that could
possibly contain a multi-byte UTF-8 character (such as the euro symbol)
would have to be defined in the database as, say, CHAR(3).
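(My working assumption on the sizing arithmetic, worth checking against the DB2 manuals: UTF-8 needs up to 3 bytes per character in the Basic Multilingual Plane and 4 for anything beyond it, so a column meant to hold n characters would be declared with 3n or 4n bytes. A trivial sketch:)

```python
# Worst-case byte budget for a UTF-8 column that must hold n characters.
def utf8_bytes_needed(n_chars, allow_supplementary=False):
    # 3 bytes covers any BMP character; 4 covers all of Unicode.
    return n_chars * (4 if allow_supplementary else 3)

print(utf8_bytes_needed(1))            # 3 -- enough for one euro sign
print(utf8_bytes_needed(25))           # 75 -- a VARCHAR(25)'s worth of BMP text
print(len("\u20ac".encode("utf-8")))   # 3, the euro case from my test above
```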
This does bring to mind a question I have been pondering. Is there any
harm in defining 'string' columns to be much larger than the largest
value you would ever expect? Take an address line: it might be 50 or so
characters. Is there harm in defining it as VARCHAR(250) or even
VARCHAR(32000)? Does it waste space or any other resource?
Thanks for your help.
Frank