
'order by' sorts "wrong" with Unicode chars (German umlauts)


postgres 7.3.2

I store Unicode data in PostgreSQL. The data is retrieved via web interfaces,
processed with Perl, and then stored in PostgreSQL (and vice versa).

Everything works nicely, with one problem: when performing a "select * order by
field" query, the result is not what I expected.

German umlauts (e.g. "Ö") are stored as double chars ("Ö" becomes the two-byte
sequence 0xC3 0x96), and only the first byte seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
Amerika
Pakistan

instead of

Amerika
Österreich
Pakistan

How do I deal with this problem? Of course, converting to Latin-1 before storing
would be a solution, but we plan to offer support for many non-Latin languages
later, and the point of Unicode is to get rid of all this converting after all.
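
A minimal reproduction of what I mean (table and values are just an example):

CREATE TABLE countries (name varchar);
INSERT INTO countries VALUES ('Amerika');
INSERT INTO countries VALUES ('Österreich');  -- arrives from Perl as the two bytes 0xC3 0x96 plus 'sterreich'
INSERT INTO countries VALUES ('Pakistan');
SELECT name FROM countries ORDER BY name;     -- 'Österreich' does not sort where I expect it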

thnx,
peter

--
IT-Consulting
mag. peter pilsl
tel:+43-699-1-3574035
fax:+43-699-4-3574035
pi***@goldfisch.at
http://www.goldfisch.at



Nov 11 '05 #1
6 Replies


On Friday 19 September 2003 13:11, peter pilsl wrote:
> postgres 7.3.2
>
> I store Unicode data in PostgreSQL. The data is retrieved via web
> interfaces, processed with Perl, and then stored in PostgreSQL (and
> vice versa).
>
> Everything works nicely, with one problem: when performing a
> "select * order by field" query, the result is not what I expected.
>
> German umlauts (e.g. "Ö") are stored as double chars ("Ö" becomes the
> two-byte sequence 0xC3 0x96), and only the first byte seems to be taken
> into account when sorting.
>
> So it happens that the order is like:
>
> Österreich
> Amerika
> Pakistan
>
> instead of
>
> Amerika
> Österreich
> Pakistan

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.
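
A quick way to see both settings (a sketch; I am assuming pg_controldata ships with your 7.3 installation and that it reports the cluster locale on that version):

-- inside psql: the encoding side
SHOW client_encoding;
\l                -- lists each database together with its encoding

# from the shell: the locale side (LC_COLLATE is what drives ORDER BY)
pg_controldata /path/to/your/data/directory | grep LC_

LC_COLLATE and LC_CTYPE are fixed for the whole cluster when initdb runs.
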
> How do I deal with this problem? Of course, converting to Latin-1
> before storing would be a solution, but we plan to offer support for
> many non-Latin languages later, and the point of Unicode is to get rid
> of all this converting after all.


What sorting order do you want? You probably have options like C, en_GB, de, it, etc.
If you are storing non-Latin-1 symbols as well as Latin-1 ones, I can't think of
what a reasonable sort order would be.

Unfortunately, PG only supports one locale at a time, and it is set during
initdb. See the chapter on Localisation in the manuals for details.
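
For the record, choosing that one locale looks roughly like this (a sketch only; locale names vary between systems, and I have not checked the exact 7.3 options):

LC_ALL=de_DE initdb -D /usr/local/pgsql/data    # locale is recorded here, for the whole cluster
pg_ctl -D /usr/local/pgsql/data start

Anything already in the old cluster has to be dumped beforehand and reloaded afterwards.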

--
Richard Huxton
Archonet Ltd


Nov 11 '05 #2

> I'm no expert on locales, but I think you're confusing two things.
> Your character-set determines what symbols you can store.
> Your locale determines sorting rules. Check the end of the postgresql.conf
> file for details of what your current settings are.


I don't think that this is my problem.

I get my text from a web form, process it via Perl, and store it in PostgreSQL
via the DBI interface. The Unicode text appears as multibyte in Perl, and I
suspect that PostgreSQL simply takes this multibyte text and doesn't even
recognize that it could be Unicode.

If I store an uppercase German umlaut O ("Ö") in Postgres and then retrieve it
using the lower() function, I don't get a lowercase umlaut o ("ö") at all.
Only the first byte is converted to lowercase and the second is left untouched,
while with "real" Unicode lowercasing the first byte would stay untouched and the
second would change.
I still don't know how to tell Postgres that the data it receives is Unicode and
not just single-byte.
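
The only knob I have found that sounds related is the client encoding; a sketch of what I mean (assuming the target database itself was created as UNICODE):

SET client_encoding TO 'UNICODE';   -- declare that this session sends UTF-8
SHOW client_encoding;               -- verify it took effect

Apparently the same thing can also be set through the PGCLIENTENCODING environment variable, but I have not tested whether that helps through DBD::Pg.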

I'll rethink my problem and post a somewhat more precise question to the main list
then, but any comments to shorten and improve my rethinking are highly welcome.

thnx,
peter


Nov 11 '05 #3

On Saturday 20 September 2003 13:56, peter pilsl wrote:
>> I'm no expert on locales, but I think you're confusing two things.
>> Your character-set determines what symbols you can store.
>> Your locale determines sorting rules. Check the end of the
>> postgresql.conf file for details of what your current settings are.
>
> I don't think that this is my problem.


Sorry - looks like the sorting part of your question threw me off track.

> I get my text from a web form, process it via Perl, and store it in
> PostgreSQL via the DBI interface. The Unicode text appears as multibyte
> in Perl, and I suspect that PostgreSQL simply takes this multibyte text
> and doesn't even recognize that it could be Unicode.

Could be the case - try "show client_encoding" in psql to see what encoding
you are using.

> If I store an uppercase German umlaut O ("Ö") in Postgres and then
> retrieve it using the lower() function, I don't get a lowercase umlaut o
> ("ö") at all. Only the first byte is converted to lowercase and the
> second is left untouched, while with "real" Unicode lowercasing the
> first byte would stay untouched and the second would change.
> I still don't know how to tell Postgres that the data it receives is
> Unicode and not just single-byte.


If it turns out you want to change the encoding to multibyte, I think you'll need
to dump and initdb again. See the chapter on localization (multibyte encodings)
for details.
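
Roughly this sort of cycle (a sketch, untested on 7.3; if only the database encoding needs to change, the per-database variant may be enough):

pg_dump olddb > olddb.sql        # dump the existing database
createdb -E UNICODE newdb        # create a fresh database with Unicode encoding
psql -f olddb.sql newdb          # reload the data into it

If the cluster locale has to change as well, it becomes pg_dumpall, a fresh initdb run under the desired LC_ALL setting, and a reload of the full dump.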
--
Richard Huxton
Archonet Ltd


Nov 11 '05 #4

peter pilsl <pi***@goldfisch.at> writes:
> I get my text from a web form, process it via Perl, and store it in
> PostgreSQL via the DBI interface. The Unicode text appears as multibyte
> in Perl, and I suspect that PostgreSQL simply takes this multibyte text
> and doesn't even recognize that it could be Unicode.


If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.
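
In 7.3-era syntax that would be something like this (a sketch; 'mydb' is just a placeholder, and 'UNICODE' is the encoding name these releases use):

CREATE DATABASE mydb WITH ENCODING 'UNICODE';
-- or, equivalently, from the shell:  createdb -E UNICODE mydb

The encoding is fixed per database at creation time, so existing data has to be dumped and reloaded into such a database rather than converted in place.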

regards, tom lane


Nov 11 '05 #5



It makes no difference if I use a database with encoding unicode:
# \l
        List of databases
   Name   | Owner | Encoding
----------+-------+-----------
 <skip>
 test     | peter | SQL_ASCII
 unicode  | peter | UNICODE
 unicode2 | peter | LATIN1
I tried with all these databases. The problem stays exactly the same. The German
umlaut o appears as the double-byte sequence "ö" (0xC3 0xB6), and when ordering it
appears between "A" and "B", because IMHO only the first byte ("Ã") is taken into account.

I still have no idea whether the problem is outside PostgreSQL (in Perl, or maybe
in the DBD::Pg interface) or inside PostgreSQL. I still struggle with serious
debugging because I don't know how to insert "real" Unicode into Postgres ... Any
debugging hints welcome also :)
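
One crude check (a sketch; octet_length()/char_length() assumed to behave on 7.3 as documented, and the column name below is a placeholder) is to compare byte count and character count of a stored value:

SELECT val,
       octet_length(val) AS bytes,   -- bytes physically stored
       char_length(val)  AS chars    -- characters as the server counts them
  FROM test;

If a single umlaut comes back as 2 bytes and 2 characters, the server is treating the UTF-8 sequence as two separate single-byte characters rather than as one Unicode character.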

Dealing with the UNICODE database raises new problems:

unicode=# insert into test values ('Österreich');
ERROR:  Unicode >= 0x10000 is not supported
unicode=# show client_encoding;
 client_encoding
-----------------
 UNICODE
(1 row)
thnx,
peter
Quoting Tom Lane <tg*@sss.pgh.pa.us>:
> peter pilsl <pi***@goldfisch.at> writes:
>> I get my text from a web form, process it via Perl, and store it in
>> PostgreSQL via the DBI interface. The Unicode text appears as multibyte
>> in Perl, and I suspect that PostgreSQL simply takes this multibyte text
>> and doesn't even recognize that it could be Unicode.
>
> If you have set the database encoding as SQL_ASCII, then that's exactly
> what it will (and should) do. You need to make the database encoding
> be unicode.
>
> regards, tom lane



Nov 11 '05 #6

On Sat, 20 Sep 2003 18:39:35 +0200, pi***@goldfisch.at (peter pilsl) wrote:
> It makes no difference if I use a database with encoding unicode:


ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l
     List of databases
   Name    |  Owner   | Encoding
-----------+----------+-----------
 ah        | ah       | SQL_ASCII

ah=# show client_encoding;
 client_encoding
-----------------
 SQL_ASCII
(1 row)
CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

 f1
----
 A
 B
 Ü
 Æ
 Ä
 Ø
 Ö
 Å
 AA
(9 rows)
Looks OK to me ;-)

Nov 11 '05 #7

This discussion thread is closed

Replies have been disabled for this discussion.