Bytes | Developer Community

Unicode vs SQL_ASCII DBs

Hi

I need to store accented characters in a postgres (7.4) database, and
access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

1) Is there a performance loss using (multibyte) UNICODE vs (single byte)
SQL_ASCII/LATINxxx character encoding? (In terms of extra data, and
searching/sorting speeds).

2) Can SQL_ASCII be used for accented characters?

3) If I want accented characters to sort correctly, must I select UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?

4) I'm not initially expecting arabic, chinese, cyrillic or other language
types to be stored in the database. But if they were, would UNICODE be the
best encoding scheme to use for future proofing the data?

5) If I decide not to support/store non-latin languages, is it better to
use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs on the JDBC list, but I'll ask anyway]

7) Because the database is being used to backend a java web application,
are there other issues that I need to be aware of, for example, do I have
to convert all data received to UTF-8 before writing it into the database?
And do I have to ensure that the response (from the webserver)
content-type is always set to UTF-8 to be rendered correctly in a user's
browser?

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; one created using UNICODE
encoding, and the other using SQL_ASCII encoding. The cluster's initdb
command specified no encoding, so I assume that SQL_ASCII is the default
encoding. The results are below:

I created the following table in two databases:

Table "public.table1"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
data | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with
SQL_ASCII encoding.

List of databases
Name | Owner | Encoding
--------------+----------+-----------
johntest | postgres | UNICODE
johntest2 | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client
encoding to LATIN1). Here is the data (the unusual value is the word
"test" with the e replaced by an e-acute, Latin-1 code 233):

set client_encoding to 'LATIN1';

insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');

Now select the data, ordered:

(UNICODE DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
1 | tést
3 | tost

(SQL_ASCII DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
3 | tost
1 | tést

NOTE: Bad ordering using SQL_ASCII...

Then I did some other tests: I set the client encoding to UNICODE and
retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
3 | tost
1 | tst

NOTE: You can see that the e-acute has been "lost"...

(UNICODE DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
1 | tést
3 | tost



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)



On Sat, 31 Jan 2004, John Sidney-Woollett wrote:
Hi

I need to store accented characters in a postgres (7.4) database, and
access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

2) Can SQL_ASCII be used for accented characters?
Not with the JDBC driver. A client which is blissfully unaware of
encoding issues can pass data into and out of a SQL_ASCII db without knowing
what the encoding is, but Java must know.

3) If I want accented characters to sort correctly, must I select UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locale determines the correct
sort order, and you must choose an encoding that works with your locale.

4) I'm not initially expecting arabic, chinese, cyrillic or other language
types to be stored in the database. But if they were, would UNICODE be the
best encoding scheme to use for future proofing the data?
Yes.
7) Because the database is being used to backend a java web application,
are there other issues that I need to be aware of, for example, do I have
to convert all data received to UTF-8 before writing it into the database?
And do I have to ensure that the response (from the webserver)
content-type is always set to UTF-8 to be rendered correctly in a user's
browser?


The jdbc driver will correctly handle conversions between the database
encoding and the encoding the jvm is run under. Receiving data from a web
application is a little different because you must convert data from the
client's encoding to the JVM's encoding for this to work. The simplest
way to do this is just to make sure that you are using Unicode in all
three places (server, JVM, and client).
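
A quick sanity check from psql (a sketch, using the database names from
this thread; the JDBC driver issues the client_encoding change itself):

```sql
-- Confirm both ends of the conversation agree on the encoding.
SHOW server_encoding;   -- should report UNICODE for the johntest database
SHOW client_encoding;   -- the JDBC driver sets this to UNICODE on connect
```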

Other things to note:

LOWER()/UPPER() only work correctly in a single-byte encoding (not
Unicode)

If using binary data (bytea) via JDBC you may need to use a unicode
db. I don't know if this has been fixed, but the server would attempt to
do an encoding conversion on the binary data:

http://archives.postgresql.org/pgsql...1/msg00045.php

Kris Jurka

On Jan 31, 2004, at 7:32 AM, John Sidney-Woollett wrote:

4) I'm not initially expecting arabic, chinese, cyrillic or other
language
types to be stored in the database. But if they were, would UNICODE be
the
best encoding scheme to use for future proofing the data?


If there is a remote chance that you might ever need CJK support, you
should start now with Unicode encoding. Switching the encoding of your
database is a major, major pain. I have a SQL_ASCII database that I
wish were UNICODE, but the conversion is such an ugly undertaking that I
haven't even tried.

culley
Kris, thanks for your feedback. Can you give me any further info on the
questions below?

Kris Jurka said:
3) If I want accented characters to sort correctly, must I select
UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locale determines the correct
sort order, and you must choose an encoding that works with your locale.


Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.

Any idea why that would be?

here is output from pg_controldata:

pg_control version number: 72
Catalog version number: 200310211
Database cluster state: in production
pg_control last modified: Mon 02 Feb 2004 11:21:29 GMT
Current log file ID: 0
Next log file segment: 2
Latest checkpoint location: 0/124B958
Prior checkpoint location: 0/1149DFC
Latest checkpoint's REDO location: 0/124B958
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 16
Latest checkpoint's NextXID: 527327
Latest checkpoint's NextOID: 26472
Time of latest checkpoint: Mon 02 Feb 2004 11:21:27 GMT
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: en_GB.UTF-8
LC_CTYPE: en_GB.UTF-8

and

Name | Owner | Encoding
---------------+----------+-----------
johntest | postgres | UNICODE
johntest2 | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
Other things to note:

LOWER()/UPPER() only work correctly in a single byte encoding (not
unicode)


Are there any other gotchas that I need to be aware of with a UNICODE
encoded database?

I saw mention by Tom Lane of a bug: [quote] The bug turns out not to be
Fedora-specific at all. I believe it will happen on any platform if you
are using both a multibyte database encoding (such as Unicode) *and* a
non-C locale. PG 7.4 has a more restricted form of the bug --- it's not
locale-specific but does still require a multibyte encoding. [end quote]

I basically need "english" sorting, and accented-character support without
any JDBC access/conversion problems. Do you think that my current DB
locale (en_GB.UTF-8) and a UNICODE-encoded database are the best solution?
Or can you suggest something better?

Thanks

John Sidney-Woollett




On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
Kris, thanks for your feedback. Can you give me any further info on the
questions below?

Kris Jurka said:
3) If I want accented characters to sort correctly, must I select
UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locale determines the correct
sort order, and you must choose an encoding that works with your locale.


Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.


Right, note the "and you must choose an encoding that works with your
locale" clause. A SQL_ASCII encoding and a UTF-8 locale don't work together.
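
Concretely (a sketch, using the database names from this thread): with an
en_GB.UTF-8 cluster locale, give each database a matching multibyte
encoding explicitly rather than letting it inherit SQL_ASCII from the
template. Note that 7.4 will not stop you from picking an incompatible one.

```sql
-- Encoding matches the cluster's UTF-8 locale:
CREATE DATABASE johntest WITH ENCODING 'UNICODE';

-- Omitting ENCODING inherits the template's encoding (here SQL_ASCII),
-- which does not match a UTF-8 locale:
-- CREATE DATABASE johntest2;
```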
I basically need "english" sorting, and accented-character support without
any JDBC access/conversion problems. Do you think that my current DB
locale (en_GB.UTF-8) and a UNICODE-encoded database are the best solution?
Or can you suggest something better?


If you need "english" sorting like "en_GB" then that is the best option,
but if you just need regular sorting the C locale might be better. It is
sometimes confusing how en_US (I assume GB is similar) sorts strings with
spaces and punctuation and so on.

Kris Jurka
Kris Jurka said:
If you need "english" sorting like "en_GB" then that is the best option,
but if you just need regular sorting the C locale might be better. It is
sometimes confusing how en_US (I assume GB is similar) sorts strings with
spaces and punctuation and so on.
If I switch from the "en_GB" locale to the "C" locale (by recreating the
cluster using "initdb --no-locale"), will accented characters still be
sorted correctly, ie tast, test, tést, tost (if the DB encoding is
UNICODE)?

From what I've read, the "C" locale will give me better performance and
optimization for certain functions/expressions (eg LIKE).

Do you think that the "C" locale (which affects the LC_CTYPE setting) will
allow me to assume that searching and sorting will operate as I
hope/expect, and that all my existing functions etc will work as expected?
LOWER()/UPPER() only work correctly in a single-byte encoding (not
Unicode)


In one pl/pgsql function, I need the ability to lowercase (LOWER) the
three-character file extension of a filename stored in the database. Will
the LOWER call do nothing with a char/varchar object in a Unicode
database?

If so, is there some kind of workaround? I "know" the file extensions that
I'm interested in - can I replace the call to LOWER with a *huge* if/elsif
block of code which does a string match to work out the replacement
extension (in lowercase)?
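
One alternative I'm considering instead of a huge if/elsif block (an
untested sketch, assuming the extensions are plain ASCII, so only
single-byte characters are ever mapped and no locale handling is involved):

```sql
-- Sketch: lowercase an ASCII-only extension without LOWER(), by mapping
-- the 26 ASCII capitals explicitly; any other character passes through.
SELECT translate('JPG',
                 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                 'abcdefghijklmnopqrstuvwxyz');
-- returns: jpg
```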

Thanks again for your patience and help

John Sidney-Woollett

ps I will do some more testing, but with this topic I'm groping around in
the dark...

Kris Jurka <bo***@ejurka.com> writes:
On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.
Right, note the "and you must choose an encoding that works with your
locale." clause. A SQL_ASCII encoding and a UTF-8 locale don't work.


In practice, any given locale setting assumes a particular encoding and
will not work if some other encoding is used. For instance, on recent
Red Hat releases:

$ locale -a | grep ^de_DE
de_DE
de_DE.iso88591
de_DE.iso885915@euro
de_DE.utf8
de_DE.utf8@euro
de_DE@euro

I'm not too sure which encoding "de_DE" uses, but the others are
clearly named to reflect their expected encoding.

It is really a bug that PG allows you to select incompatible locale and
encoding settings. We'd fix it if we could figure out a portable way of
determining which encoding a locale expects --- unfortunately the
standard APIs for libc omit this information ...

regards, tom lane
