473,382 Members | 1,147 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,382 software developers and data experts.

Unicode vs SQL_ASCII DBs

Hi

I need to store accented characters in a postgres (7.4) database, and
access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

1) Is there a performance loss using (multibyte) UNICODE vs (single byte)
SQL_ASCII/LATINxxx character encoding? (In terms of extra data, and
searching/sorting speeds).

2) Can SQL_ASCII be used for accented characters.

3) If I want accented characters to sort correctly, must I select UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?

4) I'm not initially expecting arabic, chinese, cyrillic or other language
types to be stored in the database. But if they were, would UNICODE be the
best encoding scheme to use for future proofing the data?

5) If I decide not to support/store non-latin languages, is it better to
use a LATIN encoding over UNICODE?

6) What is MULE_INTERNAL? Does it offer performance advantages?

[This next question probably belongs to the JDBC list, but I'll ask anyway]

7) Because the database is being used to backend a java web application,
are there other issues that I need to be aware of, for example, do I have
to convert all data received to UTF-8 before writing it into the database?
And do I have to ensure that the response (from the webserver)
content-type is always set to UTF-8 to be rendered correctly in a user's
browser?

Thanks for any help/advice.

John Sidney-Woollett

ps I did some tests between two databases; once created using UNICODE
encoding, and the other using SQL_ASCII encoding. The database initdb
command specified no encoding, so I guess that SQL_ASCII is the default
encoding. The results are below:

I created the following table in two databases:

Table "public.table1"
Column | Type | Modifiers
--------+------------------------+-----------
id | integer | not null
data | character varying(255) |

Here are the databases; one with UNICODE encoding, and the other with
SQL_ASCII encoding.

List of databases
Name | Owner | Encoding
--------------+----------+-----------
johntest | postgres | UNICODE
johntest2 | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII

I inserted the data into both databases in psql (first setting the client
encoding to LATIN1). Here is the data (the funny character is the word
"test" with the e replaced by an e acute (233).

set client_encoding to 'LATIN1';

insert into table1 values (1, 'tést');
insert into table1 values (2, 'tast');
insert into table1 values (3, 'tost');
insert into table1 values (4, 'test');

Now select the data ordered

(UNICODE DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
1 | tést
3 | tost

(SQL_ASCII DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
3 | tost
1 | tést

NOTE: Bad ordering using SQL_ASCII...

Now I did some other tests, I set the client encoding to UNICODE, and
retrieved the data from both databases:

set client_encoding to 'UNICODE';

(SQL_ASCII DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
3 | tost
1 | tst

NOTE: You can see that the e-acute has been "lost"...

(UNICODE DB): select * from table1 order by data;
id | data
----+------
2 | tast
4 | test
1 | tést
3 | tost



---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 22 '05 #1
6 6590


On Sat, 31 Jan 2004, John Sidney-Woollett wrote:
Hi

I need to store accented characters in a postgres (7.4) database, and
access the data (mostly) using the postgres JDBC driver (from a web app).

Does anyone know if:

2) Can SQL_ASCII be used for accented characters.
Not with the JDBC driver. A client which is blissfully unaware of
encoding issues can pass data into and out of an ascii db without knowing
what the encoding is, but java must know.

3) If I want accented characters to sort correctly, must I select UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locales determines the correct
sort order and you must choose an encoding that works with your locale.

4) I'm not initially expecting arabic, chinese, cyrillic or other language
types to be stored in the database. But if they were, would UNICODE be the
best encoding scheme to use for future proofing the data?
Yes.
7) Because the database is being used to backend a java web application,
are there other issues that I need to be aware of, for example, do I have
to convert all data received to UTF-8 before writing it into the database?
And do I have to ensure that the response (from the webserver)
content-type is always set to UTF-8 to be rendered correctly in a user's
browser?


The jdbc driver will correctly handle conversions between the database
encoding and the encoding the jvm is run under. Receiving data from a web
application is a little different because you must convert data from the
client's encoding to the jvm's encoding for this to work. The simplest
way to do this is just to make sure that you are using unicode in all
three places (server,jvm, and client).

Other things to note:

LOWER()/UPPER() only work correctly in a single byte encoding (not
unicode)

If using binary data (bytea) via JDBC you may need to use a unicode
db. I don't know if this has been fixed, but the server would attempt to
do an encoding conversion on the binary data:

http://archives.postgresql.org/pgsql...1/msg00045.php

Kris Jurka
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 22 '05 #2

On Jan 31, 2004, at 7:32 AM, John Sidney-Woollett wrote:

4) I'm not initially expecting arabic, chinese, cyrillic or other
language
types to be stored in the database. But if they were, would UNICODE be
the
best encoding scheme to use for future proofing the data?


If there is a remote chance that you might ever need CJK support you
should start now with unicode encoding. Switching the encoding of your
database is a major, major pain. I have a SQL_ASCII database that I
wish were UNICODE but the conversion is such an ugly undertaking that I
haven't even tried.

culley
---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 22 '05 #3
Kris, thanks for you feedback. Can you give me any further info on the
questions below?

Kris Jurka said:
3) If I want accented characters to sort correctly, must I select
UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locales determines the correct
sort order and you must choose an encoding that works with your locale.


Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.

Any idea why would that be?

here is output from pg_controldata:

pg_control version number: 72
Catalog version number: 200310211
Database cluster state: in production
pg_control last modified: Mon 02 Feb 2004 11:21:29 GMT
Current log file ID: 0
Next log file segment: 2
Latest checkpoint location: 0/124B958
Prior checkpoint location: 0/1149DFC
Latest checkpoint's REDO location: 0/124B958
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 16
Latest checkpoint's NextXID: 527327
Latest checkpoint's NextOID: 26472
Time of latest checkpoint: Mon 02 Feb 2004 11:21:27 GMT
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: en_GB.UTF-8
LC_CTYPE: en_GB.UTF-8

and

Name | Owner | Encoding
---------------+----------+-----------
johntest | postgres | UNICODE
johntest2 | postgres | SQL_ASCII
template0 | postgres | SQL_ASCII
template1 | postgres | SQL_ASCII
Other things to note:

LOWER()/UPPER() only work correctly in a single byte encoding (not
unicode)


Are there any other gotchas that I need to be aware of with a UNICODE
encoded database?

I save mention by Tom Lane of a bug: [quote] The bug turns out not to be
Fedora-specific at all. I believe it will happen on any platform if you
are using both a multibyte database encoding (such as Unicode) *and* a
non-C locale. PG 7.4 has a more restricted form of the bug --- it's not
locale specific but does still require a multibyte encoding. [END QUOTE]

I basically need "english" sorting, and accented character support without
any JDBC access/conversion problems. Do you think that my current DB
locale (en_GB.UTF-8) and UNICODE encoded database the best solution? Or
can you suggest something better?

Thanks

John Sidney-Woollett


---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 22 '05 #4


On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
Kris, thanks for you feedback. Can you give me any further info on the
questions below?

Kris Jurka said:
3) If I want accented characters to sort correctly, must I select
UNICODE
(or the appropriate ISO 8859 char set) over SQL_ASCII?
You are confusing encoding with locale. Locales determines the correct
sort order and you must choose an encoding that works with your locale.


Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.


Right, note the "and you must choose an encoding that works with your
locale." clause. A SQL_ASCII encoding and a UTF-8 locale don't work.
I basically need "english" sorting, and accented character support without
any JDBC access/conversion problems. Do you think that my current DB
locale (en_GB.UTF-8) and UNICODE encoded database the best solution? Or
can you suggest something better?


If you need "english" sorting like "en_GB" then that is the best option,
but if you just need regular sorting the C locale might be better. It is
sometimes confusing how en_US (I assume GB is similar) sorts strings with
spaces and punctuation and so on.

Kris Jurka
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 22 '05 #5
Kris Jurka said:
If you need "english" sorting like "en_GB" then that is the best option,
but if you just need regular sorting the C locale might be better. It is
sometimes confusing how en_US (I assume GB is similar) sorts strings with
spaces and punctuation and so on.
If I switch from "en_GB" locale to "C" locale (by recreating the cluster
using "initdb --no-locale"), will I still get accented characters be
sorted correctly, ie tast, test, tést, tost (if the DB encoding is
UNICODE)?

From what I've read, the "C" locale will give me better performance and
optimization for certain functions/expressions (eg "like").

Do you think that the "C" locale (which affects the LC_CTYPE setting) will
allow me to assume that searching and sorting will operate as I
hope/expect, and that all my existing functions etc will work as expected?
LOWER()/UPPER() only work correctly in a single byte encoding (not
unicode)


In one pl/pgsql function, I need the ability to lowercase (LOWER) the
three character file extension of a filename stored in the datbase. Will
the LOWER call do nothing with a char/varchar object in a unicode
database?

If so, is there somekind of workaround - I "know" the file extensions that
I'm interested in - can I replace the call to LOWER by a *huge* if then
elsif block of code which does a string match to workout the replacement
extension (in lowercase)?

Thanks again for your patience and help

John Sidney-Woollett

ps I will do some more testing, but with this topic I'm groping around in
the dark...

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 22 '05 #6
Kris Jurka <bo***@ejurka.com> writes:
On Mon, 2 Feb 2004, John Sidney-Woollett wrote:
Except that in my test, the two differently encoded databases were in the
same 7.4.1 cluster with the same locale, yet they sorted the *same* data
differently - implying the encoding is a factor.
Right, note the "and you must choose an encoding that works with your
locale." clause. A SQL_ASCII encoding and a UTF-8 locale don't work.


In practice, any given locale setting assumes a particular encoding and
will not work if some other encoding is used. For instance, on recent
Red Hat releases:

$ locale -a | grep ^de_DE
de_DE
de_DE.iso88591
de_DE.iso885915@euro
de_DE.utf8
de_DE.utf8@euro
de_DE@euro

I'm not too sure which encoding "de_DE" uses, but the other two are
clearly named to reflect their expected encoding.

It is really a bug that PG allows you to select incompatible locale and
encoding settings. We'd fix it if we could figure out a portable way of
determining which encoding a locale expects --- unfortunately the
standard APIs for libc omit this information ...

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 22 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: Tambet Matiisen | last post by:
My Postgres databases used to have default (SQL_ASCII) encoding. I could store any 8-bit character in it regardless of actual charset, because all clients also used default encoding and no charset...
2
by: David Lutz | last post by:
Hello, I want to convert an existing database with SQL_ASCII encoding to UNICODE encoding. (postgresql ver 7.3.2) I thought that it might be as easy as: pg_dump mydatabase > dump.sql...
0
by: Amin Schoeib | last post by:
Hi, I have a database with encoding SQL_ASCII. Now I need to convert the encoding to UNICODE. Can somebody tell me how I can do that???? Thanxx Schoeib 4Tek Gesellschaft für angewandte...
6
by: peter pilsl | last post by:
postgres 7.3.2 I store unicode-data in postgresql. The data is retrieved via webinterfaces, processed with perl and then stored in postgresql (and viceversa). All is going nice with one...
1
by: Rajesh Kumar Mallah | last post by:
Hi , my current database in 7.3.4 is SQL_ASCII are there any benifits of coverting it to UNICODE encoding in 7.4 ? If so what is the process. Regds Mallah.
2
by: Priem, Alexander | last post by:
Hi everyone, I have a PostgreSQL 7.4 database running, which was initdb-ed using standard (SQL_ASCII) encoding, with -lc-collate=C option set. Everything is running fine, but I just discovered...
6
by: Dennis Gearon | last post by:
This is what has to be eventually done:(as sybase, and probably others do it) http://www.ianywhere.com/whitepapers/unicode.html I'm not sure how that will affect LIKE and REGEX. ...
1
by: peter pilsl | last post by:
postgres 7.4 on linux, glibc 2.2.4-6 I've a table containing unicode-data and the lower()-function does not work proper. While it lowers standard letters like A->a,B->b ... it fails on special...
0
by: Jason Tesser | last post by:
I have a database in sql_ascii that I need to convert to Unicode. I tried using pg_dump -Fc .. but it fails on certain characters. like this one "è" How can I get the data transferred? ...
0
by: Faith0G | last post by:
I am starting a new it consulting business and it's been a while since I setup a new website. Is wordpress still the best web based software for hosting a 5 page website? The webpages will be...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 3 Apr 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome former...
0
by: ryjfgjl | last post by:
In our work, we often need to import Excel data into databases (such as MySQL, SQL Server, Oracle) for data analysis and processing. Usually, we use database tools like Navicat or the Excel import...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: ryjfgjl | last post by:
In our work, we often receive Excel tables with data in the same format. If we want to analyze these data, it can be difficult to analyze them because the data is spread across multiple Excel files...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.