Bytes | Software Development & Data Engineering Community

Storing Japanese characters in UTF-8 database

I am having some data rejected on an insert of Japanese characters into a
Varchar column in my DB2 UTF-8 database. I am trying to understand how
big to make the Varchar column for the insert to work successfully for
all of my data. I would also like to understand whether Vargraphic is a
better approach and what ramifications that might have.

This data is sourced from a UTF-8 Oracle database (using byte semantics)
which defines a column as Varchar2(255). I had defined my DB2 column as
Varchar(255), but the insert rejects some rows.

I notice that for Oracle, if I apply the length function to a specific
column, I get a value of 8 for a specified key. However, for this
same key, which successfully loaded on the DB2 side, I get a length of
33. Sometimes the length on the DB2 side is more than 5 times the
Oracle side. I do not understand how this can be.

Also, can someone clarify how much storage DB2 allocates, in bytes, for
the following (assume non-nullable)? My assumptions are below -

CHAR(1) - assume 1 byte
CHAR(100) - assume 100 bytes
VARCHAR(1) - assume length + 1 byte = 2 bytes
VARCHAR(100) - assume length + 100 bytes = 101 bytes at most
VARGRAPHIC(100) - ?

Thank you in advance for your assistance.

Apr 12 '06 #1
"mike_dba" <mi*************@yahoo.com> wrote in message
news:11**********************@i39g2000cwa.googlegr oups.com...
I am having some data rejected on an insert of Japanese characters into a
Varchar column in my DB2 UTF-8 database. I am trying to understand how
big to make the Varchar column for the insert to work successfully for
all of my data. I would also like to understand whether Vargraphic is a
better approach and what ramifications that might have.

This data is sourced from a UTF-8 Oracle database (using byte semantics)
which defines a column as Varchar2(255). I had defined my DB2 column as
Varchar(255), but the insert rejects some rows.

I notice that for Oracle, if I apply the length function to a specific
column, I get a value of 8 for a specified key. However, for this
same key, which successfully loaded on the DB2 side, I get a length of
33. Sometimes the length on the DB2 side is more than 5 times the
Oracle side. I do not understand how this can be.

Also, can someone clarify how much storage DB2 allocates, in bytes, for
the following (assume non-nullable)? My assumptions are below -

CHAR(1) - assume 1 byte
CHAR(100) - assume 100 bytes
VARCHAR(1) - assume length + 1 byte = 2 bytes
VARCHAR(100) - assume length + 100 bytes = 101 bytes at most
VARGRAPHIC(100) - ?

Thank you in advance for your assistance.


I would triple the size of the VARCHAR in DB2. You can go up to 32,672
bytes (just under 32K) with VARCHAR in DB2.

DB2 uses 2 extra bytes for the length, but that does not reduce the amount
of space available for the data. If the column is nullable, it uses one
additional byte.
Apr 12 '06 #2
Thank you for your response. Tripling the size will work in most cases.
However, we have noticed the following scenario -

I have a varchar(255) in Oracle. For a particular data item in Oracle,
I apply the Oracle length function and get 86 characters back. I
also apply the Oracle byte-size function and get back 238.

Inserting this same data into DB2 varchar(255) rejects. Inserting
into DB2 varchar(2000) works. The DB2 length function returns 466 (this
should be bytes, correct?). How could this be?

Apr 12 '06 #3
I don't know how Oracle measures the length of a UTF-8 varchar field or
how it physically stores it, but UDB stores UTF-8 data as 1 to 4 bytes
per character. Characters in pictographic languages, especially
Japanese, Chinese, and other languages used in that part of the world,
are more often the longer three- and four-byte ones.

A varchar definition specifies the number of bytes, not the number of
characters to be stored. The data capacity of the column can't be
precisely determined because each character may occupy a different
number of bytes.

A single UTF-8 character occupies a maximum of four bytes. Specify a
varchar length four times the number of characters you expect to store
and you'll always have enough space. Unfortunately, this will NOT
prevent storing more characters than you want if each character is
shorter than the four-byte-per-character worst case.
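The expansion is easy to verify outside the database; a short Python sketch (the sample string is arbitrary) shows the character count versus the UTF-8 byte count:

```python
# Character count vs. UTF-8 byte count for a short Japanese string.
# Each of these characters happens to encode to 3 bytes in UTF-8.
text = "日本語のテスト"           # 7 characters
encoded = text.encode("utf-8")    # the bytes a UTF-8 database would store

print(len(text))     # 7  characters
print(len(encoded))  # 21 bytes -- three bytes per character here
```

The same seven characters would overflow a VARCHAR(7) but fit a VARCHAR(21), which is exactly the byte-vs-character mismatch seen above.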

Philip Sherman

mike_dba wrote:
Thank you for your response. Tripling the size will work in most cases.
However, we have noticed the following scenario -

I have a varchar(255) in Oracle. For a particular data item in Oracle,
I apply the Oracle length function and get 86 characters back. I
also apply the Oracle byte-size function and get back 238.

Inserting this same data into DB2 varchar(255) rejects. Inserting
into DB2 varchar(2000) works. The DB2 length function returns 466 (this
should be bytes, correct?). How could this be?

Apr 12 '06 #4
Is there any way in which you can change the character semantics for
the field? In particular, I know that in Oracle you can define a field's
data type to use either byte or character length semantics:

A) varchar2(120 BYTE)
B) varchar2(120 CHAR)

where A is 120 bytes and B is 120 characters.

Also, just out of curiosity: if I were to attempt to load a UTF-16
character into a field in a UTF-8 database, should I expect DB2 to use 8
bytes for it?
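The difference between the two declarations is easy to see in a quick Python sketch (hypothetical sample data, repeating one three-byte character):

```python
# BYTE vs. CHAR semantics: 120 Japanese characters need more than
# 120 bytes once encoded as UTF-8.
text = "語" * 120                  # 120 characters

print(len(text))                   # 120 -- would fit VARCHAR2(120 CHAR)
print(len(text.encode("utf-8")))   # 360 -- overflows VARCHAR2(120 BYTE)
```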

Apr 12 '06 #5
"mike_dba" <mi*************@yahoo.com> wrote in message
news:11**********************@t31g2000cwb.googlegr oups.com...
Thank you for your response. Tripling the size will work in most cases.
However, we have noticed the following scenario -

I have a varchar(255) in Oracle. For a particular data item in Oracle,
I apply the Oracle length function and get 86 characters back. I
also apply the Oracle byte-size function and get back 238.

Inserting this same data into DB2 varchar(255) rejects. Inserting
into DB2 varchar(2000) works. The DB2 length function returns 466 (this
should be bytes, correct?). How could this be?


You obviously have too much time on your hands.
Apr 12 '06 #7
Hi,
I have a table which I use to store names and other information (in
English). Now the requirement is that the table needs to store local
languages like Japanese, Chinese, Hindi, etc.

What do I have to change in the database? The database is a
DB2 UTF-8 database.

Also, how can I test whether the data is being inserted, and how do I
retrieve it?
I don't have a clue. Can someone guide me?

Thanks in advance
RaInDeEr.

Apr 13 '06 #8
mike_dba wrote:
Thank you for your response. Tripling the size will work in most cases.
However, we have noticed the following scenario -

I have a varchar(255) in Oracle. For a particular data item in Oracle,
I apply the Oracle length function and get 86 characters back. I
also apply the Oracle byte-size function and get back 238.

Inserting this same data into DB2 varchar(255) rejects. Inserting
into DB2 varchar(2000) works. The DB2 length function returns 466 (this
should be bytes, correct?). How could this be?

The following functions have been introduced in Viper
-------
* CHARACTER_LENGTH
* OCTET_LENGTH
* POSITION
* SUBSTRING
The modified existing functions include:
* LENGTH
* LOCATE
These functions process strings along character boundaries rather than
along byte or double-byte boundaries. Each function (except
OCTET_LENGTH) accepts an argument specifying the code unit, or string
length unit of the result:

* CODEUNITS16 specifies that the result is to be expressed in
16-bit UTF-16 code units
* CODEUNITS32 specifies that the result is to be expressed in
32-bit UTF-32 code units
* OCTETS specifies that the result is to be expressed in bytes

This argument is optional for the existing functions.
----------
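Python has rough analogues of these units; the sketch below (assuming the standard meanings of CODEUNITS16, CODEUNITS32, and OCTETS) counts the same string three ways:

```python
# Counting one string in three different units, mirroring DB2's options.
s = "日本語"  # three characters, all in the Basic Multilingual Plane

print(len(s))                           # 3 -- CODEUNITS32: one unit per code point
print(len(s.encode("utf-16-le")) // 2)  # 3 -- CODEUNITS16: no surrogate pairs needed here
print(len(s.encode("utf-8")))           # 9 -- OCTETS: three bytes per character
```

For characters outside the BMP the CODEUNITS16 count would exceed the CODEUNITS32 count, which is why the unit argument matters.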
More information is needed to figure out the issue on bind-in.
But the fact that the byte length you noted is > 400 gives some hint.
Unicode has an n-to-m relationship between glyphs and code points.
It could be that in your DB2 app you end up with longer code points for
the same glyph.

Just a wild guess. You could compare the HEX() values between the two.
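Normalization is one concrete way this can happen: a precomposed character and its decomposed form render the same glyph but occupy different numbers of bytes. A Python sketch of the HEX() comparison:

```python
import unicodedata

composed = "ぱ"                                      # U+3071, one precomposed code point
decomposed = unicodedata.normalize("NFD", composed)  # U+306F + combining U+309A

print(len(composed.encode("utf-8")))    # 3 bytes
print(len(decomposed.encode("utf-8")))  # 6 bytes -- same glyph, twice the storage
print(composed.encode("utf-8").hex())   # compare with decomposed.encode("utf-8").hex()
```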

Cheers
Serge
--
Serge Rielau
DB2 Solutions Development
IBM Toronto Lab
Apr 13 '06 #9
After accepting the fact that for a table with a varchar(255) in
Oracle, I will need to define the DB2 column as varchar(3*255), I find
that I exceed the 32K page size (as someone pointed out earlier). So
how come Oracle can have a table defined exactly the same as DB2 and
yet fit the data in a 32K page, while DB2 cannot?

Apr 20 '06 #10
Please disregard my last post, as I was misinformed. The Oracle table
has fewer columns.

Apr 20 '06 #12
Something was missed: by setting the DB2CODEPAGE=1208 variable on
the client side, the data loads correctly. Apparently, the client was
not picking up the locale from the OS. There was no need for me to
expand the table by a factor of three.
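For anyone hitting the same problem, the fix is just an environment variable on the client, shown here for a POSIX shell (the echo only confirms the setting; you would then connect and load as usual):

```shell
# Force the DB2 client to treat its data as UTF-8 (code page 1208)
# instead of whatever locale it fails to pick up from the OS.
export DB2CODEPAGE=1208
echo "DB2CODEPAGE=$DB2CODEPAGE"
```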

As an FYI, I have found that Oracle permits page chaining (i.e., Oracle
can have a row span multiple data pages).

Apr 27 '06 #14

This thread has been closed and replies have been disabled. Please start a new discussion.
