Connecting Tech Pros Worldwide Help | Site Map

'order by' does "wrong" with unicode-chars (german umlauts)

peter pilsl
Guest
 
Posts: n/a
#1: Nov 11 '05

postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via webinterfaces,
processed with perl and then stored in postgresql (and viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
America
Pakistan

instead of

Amerika
Österreich
Pakistan


How to deal with this Problem ? Of course converting to latin before storing
would be a solution but we plan to offer support for many non-latin languages
later and the meaning of unicode is to get rid of all this converting-stuff
after all.

thnx,
peter



--
IT-Consulting
mag. peter pilsl
tel:+43-699-1-3574035
fax:+43-699-4-3574035
pilsl@goldfisch.at
http://www.goldfisch.at




---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Richard Huxton
Guest
 
Posts: n/a
#2: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)


On Friday 19 September 2003 13:11, peter pilsl wrote:[color=blue]
> postgres 7.3.2
>
> I store unicode-data in postgresql. The data is retrieved via
> webinterfaces, processed with perl and then stored in postgresql (and
> viceversa).
>
> All is going nice with one problem. If performing a "select * order by
> field"-query the result is not what I expected.
>
> German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
> first char seems to be taken into account when sorting.
>
> So it happens that the order is like:
>
> Österreich
> America
> Pakistan
>
> instead of
>
> Amerika
> Österreich
> Pakistan[/color]

I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.
[color=blue]
> How to deal with this Problem ? Of course converting to latin before
> storing would be a solution but we plan to offer support for many non-latin
> languages later and the meaning of unicode is to get rid of all this
> converting-stuff after all.[/color]

What sorting-order do you want? You probably have options like: C, en_GB, de,
it etc.
If you are storing non-latin1 symbols as well as latin1, I can't think of what
a reasonable sort order would be.

Unfortunately, PG only supports one locale at a time, and gets set during
initdb. See the chapter on Localisation in the manuals for details.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

peter pilsl
Guest
 
Posts: n/a
#3: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)


>[color=blue]
> I'm no expert on locales, but I think you're confusing two things.
> Your character-set determines what symbols you can store.
> Your locale determines sorting rules. Check the end of the postgresql.conf
> file for details of what your current settings are.
>[/color]

I dont think that this is my problem.

I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If I store a german-umlaut-O (uppercase) to postgres and then retrieve it using
the lower-function on it I dont get a german-umlaut-o (lowercase) at all.
Only the first byte is converted to lowercase and the second is left untouched,
while in "real" unicode-lowercasing the first byte would stay untouched and the
second would change.
I still dont know how to tell postgres that the data it receives is unicode and
not just "singlebyte".

I'll rethink my problem and post a somehow more precise question to the mainlist
then, but any comments to shorten and improve my rethinking are highly welcome.

thnx,
peter



---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Richard Huxton
Guest
 
Posts: n/a
#4: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)


On Saturday 20 September 2003 13:56, peter pilsl wrote:[color=blue][color=green]
> > I'm no expert on locales, but I think you're confusing two things.
> > Your character-set determines what symbols you can store.
> > Your locale determines sorting rules. Check the end of the
> > postgresql.conf file for details of what your current settings are.[/color]
>
> I dont think that this is my problem.[/color]

Sorry - looks like the sorting part of your question threw me off track.
[color=blue]
> I get my text from a web-form, process it via perl and store it in
> postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl
> and I got the suspect that postgresql simply takes this multibyte-text and
> doesnt even reckognize that it could be unicode.[/color]

Could be the case - try "show client_encoding" in psql to see what encoding
you are using.
[color=blue]
> If I store a german-umlaut-O (uppercase) to postgres and then retrieve it
> using the lower-function on it I dont get a german-umlaut-o (lowercase) at
> all. Only the first byte is converted to lowercase and the second is left
> untouched, while in "real" unicode-lowercasing the first byte would stay
> untouched and the second would change.
> I still dont know how to tell postgres that the data it receives is unicode
> and not just "singlebyte".[/color]

If it turns out you want to change encoding to multibyte, I think you'll need
to dump an initdb again. See the chapter on localization - multi-byte
encodings for details.


--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Tom Lane
Guest
 
Posts: n/a
#5: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)


peter pilsl <pilsl@goldfisch.at> writes:[color=blue]
> I get my text from a web-form, process it via perl and store it in postgreSQL
> via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
> suspect that postgresql simply takes this multibyte-text and doesnt even
> reckognize that it could be unicode.[/color]

If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

peter pilsl
Guest
 
Posts: n/a
#6: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)




It makes no difference if I use a database with encoding unicode:


# \l
List of databases
Name | Owner | Encoding
--------------------+--------------+-----------
<skip>
test | peter | SQL_ASCII
unicode | peter | UNICODE
unicode2 | peter | LATIN1


I tried with all these databases. The problem stays exactely the same. The
german umlaut-O appears as double-byte "ö" and when ordering it appears between
"A" and "B", cause imho only the first byte "Ã" is taking into account.

I still have no idea, if there is a problem outside postgreSQL (in perl or maybe
in the DBD::Pg-interface) or inside postgreSQL. I still struggle with serious
debugging cause I dont know how to insert "real" unicode into postgres ... Any
debugging-hints welcome also :)


Dealing with the UNICODE-database raises new problems:
unicode=# insert into test values ('österreich');
ERROR: Unicode >= 0x10000 is not supported
unicode=# show client_encoding;
client_encoding
-----------------
UNICODE
(1 row)


thnx,
peter


Quoting Tom Lane <tgl@sss.pgh.pa.us>:
[color=blue]
> peter pilsl <pilsl@goldfisch.at> writes:[color=green]
> > I get my text from a web-form, process it via perl and store it in[/color]
> postgreSQL[color=green]
> > via DBI-Interface. The unicode-text appears as multibyte in perl and I got[/color]
> the[color=green]
> > suspect that postgresql simply takes this multibyte-text and doesnt even
> > reckognize that it could be unicode.[/color]
>
> If you have set the database encoding as SQL_ASCII, then that's exactly
> what it will (and should) do. You need to make the database encoding
> be unicode.
>
> regards, tom lane
>
>[/color]

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Andreas Hinz
Guest
 
Posts: n/a
#7: Nov 11 '05

re: 'order by' does "wrong" with unicode-chars (german umlauts)


Sat, 20 Sep 2003 18:39:35 +0200
skrev pilsl@goldfisch.at (peter pilsl):
[color=blue]
> It makes no difference if I use a database with encoding unicode:
>[/color]

ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l

List of databases
Name | Owner | Encoding
-----------+----------+-----------
ah | ah | SQL_ASCII


ah=# show client_encoding;

client_encoding
-----------------
SQL_ASCII
(1 row)


CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

t
----
A
B
Ü
Æ
Ä
Ø
Ö
Å
AA
(9 rows)


Looks OK to me ;-)


---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Closed Thread