472,780 Members | 1,751 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,780 software developers and data experts.

'order by' does "wrong" with unicode-chars (german umlauts)


postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via webinterfaces,
processed with perl and then stored in postgresql (and viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
America
Pakistan

instead of

Amerika
Österreich
Pakistan
How to deal with this Problem ? Of course converting to latin before storing
would be a solution but we plan to offer support for many non-latin languages
later and the meaning of unicode is to get rid of all this converting-stuff
after all.

thnx,
peter

--
IT-Consulting
mag. peter pilsl
tel:+43-699-1-3574035
fax:+43-699-4-3574035
pi***@goldfisch.at
http://www.goldfisch.at


---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 11 '05 #1
6 5495
On Friday 19 September 2003 13:11, peter pilsl wrote:
postgres 7.3.2

I store unicode-data in postgresql. The data is retrieved via
webinterfaces, processed with perl and then stored in postgresql (and
viceversa).

All is going nice with one problem. If performing a "select * order by
field"-query the result is not what I expected.

German umlauts (ie: Ö) are stored as doublechars ("Ö" is "Ö") and only the
first char seems to be taken into account when sorting.

So it happens that the order is like:

Österreich
America
Pakistan

instead of

Amerika
Österreich
Pakistan
I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.
How to deal with this Problem ? Of course converting to latin before
storing would be a solution but we plan to offer support for many non-latin
languages later and the meaning of unicode is to get rid of all this
converting-stuff after all.


What sorting-order do you want? You probably have options like: C, en_GB, de,
it etc.
If you are storing non-latin1 symbols as well as latin1, I can't think of what
a reasonable sort order would be.

Unfortunately, PG only supports one locale at a time, and gets set during
initdb. See the chapter on Localisation in the manuals for details.

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 11 '05 #2
>
I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the postgresql.conf
file for details of what your current settings are.


I dont think that this is my problem.

I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.

If I store a german-umlaut-O (uppercase) to postgres and then retrieve it using
the lower-function on it I dont get a german-umlaut-o (lowercase) at all.
Only the first byte is converted to lowercase and the second is left untouched,
while in "real" unicode-lowercasing the first byte would stay untouched and the
second would change.
I still dont know how to tell postgres that the data it receives is unicode and
not just "singlebyte".

I'll rethink my problem and post a somehow more precise question to the mainlist
then, but any comments to shorten and improve my rethinking are highly welcome.

thnx,
peter

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 11 '05 #3
On Saturday 20 September 2003 13:56, peter pilsl wrote:
I'm no expert on locales, but I think you're confusing two things.
Your character-set determines what symbols you can store.
Your locale determines sorting rules. Check the end of the
postgresql.conf file for details of what your current settings are.
I dont think that this is my problem.


Sorry - looks like the sorting part of your question threw me off track.
I get my text from a web-form, process it via perl and store it in
postgreSQL via DBI-Interface. The unicode-text appears as multibyte in perl
and I got the suspect that postgresql simply takes this multibyte-text and
doesnt even reckognize that it could be unicode.
Could be the case - try "show client_encoding" in psql to see what encoding
you are using.
If I store a german-umlaut-O (uppercase) to postgres and then retrieve it
using the lower-function on it I dont get a german-umlaut-o (lowercase) at
all. Only the first byte is converted to lowercase and the second is left
untouched, while in "real" unicode-lowercasing the first byte would stay
untouched and the second would change.
I still dont know how to tell postgres that the data it receives is unicode
and not just "singlebyte".


If it turns out you want to change encoding to multibyte, I think you'll need
to dump an initdb again. See the chapter on localization - multi-byte
encodings for details.
--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 11 '05 #4
peter pilsl <pi***@goldfisch.at> writes:
I get my text from a web-form, process it via perl and store it in postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.


If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 11 '05 #5


It makes no difference if I use a database with encoding unicode:
# \l
List of databases
Name | Owner | Encoding
--------------------+--------------+-----------
<skip>
test | peter | SQL_ASCII
unicode | peter | UNICODE
unicode2 | peter | LATIN1
I tried with all these databases. The problem stays exactely the same. The
german umlaut-O appears as double-byte "ö" and when ordering it appears between
"A" and "B", cause imho only the first byte "Ã" is taking into account.

I still have no idea, if there is a problem outside postgreSQL (in perl or maybe
in the DBD::Pg-interface) or inside postgreSQL. I still struggle with serious
debugging cause I dont know how to insert "real" unicode into postgres ... Any
debugging-hints welcome also :)
Dealing with the UNICODE-database raises new problems:
unicode=# insert into test values ('österreich');
ERROR: Unicode >= 0x10000 is not supported
unicode=# show client_encoding;
client_encoding
-----------------
UNICODE
(1 row)
thnx,
peter
Quoting Tom Lane <tg*@sss.pgh.pa.us>:
peter pilsl <pi***@goldfisch.at> writes:
I get my text from a web-form, process it via perl and store it in

postgreSQL
via DBI-Interface. The unicode-text appears as multibyte in perl and I got

the
suspect that postgresql simply takes this multibyte-text and doesnt even
reckognize that it could be unicode.


If you have set the database encoding as SQL_ASCII, then that's exactly
what it will (and should) do. You need to make the database encoding
be unicode.

regards, tom lane


---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 11 '05 #6
Sat, 20 Sep 2003 18:39:35 +0200
skrev pi***@goldfisch.at (peter pilsl):
It makes no difference if I use a database with encoding unicode:


ah@ahb:~$ LC_ALL=da_DK initdb
ah@ahb:~$ su postgres -c "/usr/local/pgsql/bin/createuser -ad ah"
ah@ahb:~$ createdb ah
ah@ahb:~$ psql ah

ah=# \l

List of databases
Name | Owner | Encoding
-----------+----------+-----------
ah | ah | SQL_ASCII
ah=# show client_encoding;

client_encoding
-----------------
SQL_ASCII
(1 row)
CREATE TABLE test (f1 varchar);

INSERT INTO test VALUES ('A');
INSERT INTO test VALUES ('B');
INSERT INTO test VALUES ('AA');
INSERT INTO test VALUES ('Æ');
INSERT INTO test VALUES ('Å');
INSERT INTO test VALUES ('Ø');
INSERT INTO test VALUES ('Ä');
INSERT INTO test VALUES ('Ö');
INSERT INTO test VALUES ('Ü');

SELECT * FROM test ORDER BY f1;

t
----
A
B
Ü
Æ
Ä
Ø
Ö
Å
AA
(9 rows)
Looks OK to me ;-)
---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 11 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
by: yurps | last post by:
Hello, I installed both, 1st mysql 4.1.7 but heard there was a change with password hashing so now got 4.0.9 Basically when I choose a page with php_info() I can load it fine, but when I call a...
2
by: Asfand Yar Qazi | last post by:
Hi, xmllint --valid gives error output on some code (included below:) file "Shotgun_Robot.dtd" (don't ask) <!ELEMENT Image (#PCDATA) > <!-- #PCDATA == image path --> <!-- x/y values...
18
by: day | last post by:
I know I've seen this issue described before, but I can't find it, or the solution now that I need it. I have some css-specified floating divs that contain images or text. The text divs have a...
6
by: wukexin | last post by:
Help me, good men. I find mang books that introduce bit "mang header files",they talk too bit,in fact it is my too fool, I don't learn it, I have do a test program, but I have no correct doing...
2
by: joewhitehair | last post by:
Using the XSD.exe tool, I created a number of classes from my XSD file. When I generate the WSDL for my web service, the schema does not have the proper Occurance constraints for the attributes. In...
2
by: ~~~ .NET Ed ~~~ | last post by:
It is not the first time I see this happen. I am using VS.NET 2003 with .NET Framework 1.1. In this particular situation I have a custom user control in a windows form. There is a member variable...
42
by: Holger | last post by:
Hi guys Tried searching for a solution to this, but the error message is so generic, that I could not get any meaningfull results. Anyways - errormessage:...
15
by: David White | last post by:
The size of a struct can be affected by compiler packing. Suppose you need it to be a specific value for some reason (e.g., in firmware). How can you get the compiler to generate an error for the...
1
by: Java Guy | last post by:
I'm trying to view a web page. IE tells me there are (Java?) errors on the page. Here they are: Line: 15 Char: 7 Error: Wrong number of arguments or invalid propert assignment Code: 0 URL:...
2
by: sukatoa | last post by:
I've invoke that .COM file using the code below, Runtime.getRuntime().exec("cmd /C start FINAL6.COM"); The program's behavior is wrong when i use this implementation.... When just double...
0
by: Rina0 | last post by:
Cybersecurity engineering is a specialized field that focuses on the design, development, and implementation of systems, processes, and technologies that protect against cyber threats and...
0
by: erikbower65 | last post by:
Using CodiumAI's pr-agent is simple and powerful. Follow these steps: 1. Install CodiumAI CLI: Ensure Node.js is installed, then run 'npm install -g codiumai' in the terminal. 2. Connect to...
0
linyimin
by: linyimin | last post by:
Spring Startup Analyzer generates an interactive Spring application startup report that lets you understand what contributes to the application startup time and helps to optimize it. Support for...
0
by: kcodez | last post by:
As a H5 game development enthusiast, I recently wrote a very interesting little game - Toy Claw ((http://claw.kjeek.com/))。Here I will summarize and share the development experience here, and hope it...
0
by: Taofi | last post by:
I try to insert a new record but the error message says the number of query names and destination fields are not the same This are my field names ID, Budgeted, Actual, Status and Differences ...
0
by: Rina0 | last post by:
I am looking for a Python code to find the longest common subsequence of two strings. I found this blog post that describes the length of longest common subsequence problem and provides a solution in...
0
by: lllomh | last post by:
How does React native implement an English player?
0
by: Mushico | last post by:
How to calculate date of retirement from date of birth
2
by: DJRhino | last post by:
Was curious if anyone else was having this same issue or not.... I was just Up/Down graded to windows 11 and now my access combo boxes are not acting right. With win 10 I could start typing...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.