Unicode + LC_COLLATE

Priem, Alexander said:

> I recreated my entire database (luckily I keep scripts for
> table/index/view creation) and initdb-ed it using --lc-collate=C
> --encoding=UNICODE. In my psqlODBC DSN settings I added
> "set client_encoding='LATIN9';" to the Connect Settings and that solved
> all my problems regarding the special characters.

Does anyone know what the effect of --lc-collate=C --encoding=UNICODE will
be for sorts (and indexes?) when a multibyte unicode character is
encountered?

Is --lc-collate=C --encoding=UNICODE even valid? And if it's valid what
unexpected nasties could it cause?

Is it also true that if LC_COLLATE != 'C' that indexes cannot be used for
LIKE comparisons (and is this also true for en_US.iso885915)?

Our database is UNICODE with LC_COLLATE=en_US.iso885915. Does anyone know
what happens when someone stores a Cyrillic, Chinese or Korean character?
(We are using JDBC with a webapp, so all the Unicode concerns are
handled transparently, apparently.) When the data is extracted from the DB,
will it render correctly in the browser provided we send all responses
encoded in UTF-8?

Although http://www.postgresql.org/docs/7.4/i...e/charset.html
describes the Postgres-specific implementation and how to configure for a
given locale, the subtle nuances of combining an encoding with an
LC_COLLATE setting, and the tradeoffs involved, are not entirely clear (to
me at least). For example, are the performance penalties of using UNICODE
over ASCII significant?

Maybe it's just my inexperience but this topic seems to cause lots of
questions. A good/simple technote would be really useful... I'd do one but
I really don't know my ass from my elbow around this topic (and probably
many others too!).

Thanks for any answers/feedback/more info.

John Sidney-Woollett

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #1

Tom Lane

"John Sidney-Woollett" <jo****@wardbrook.com> writes:

> Does anyone know what the effect of --lc-collate=C --encoding=UNICODE will
> be for sorts (and indexes?) when a multibyte unicode character is
> encountered?

C locale basically means "sort by the byte sequence values". It'll do
something self-consistent, but maybe not what you'd like for UTF8
characters.

> Our database is UNICODE with LC_COLLATE=en_US.iso885915.

Does that sort rationally at all? I should think you'd need to specify
an LC_COLLATE setting that's designed for UTF8 encoding, not 8859-15.

If you only ever store characters that are in 7-bit ASCII then none of
this will affect you, and you can get away with broken combinations of
encoding and locale. But if you'd like to sort characters outside the
minimal ASCII set then you need to get it right ...

regards, tom lane
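
To make "sort by the byte sequence values" concrete, here is a minimal Python sketch. It is not PostgreSQL code; it only assumes that the strings are stored as UTF-8 and compared byte by byte, and the word list is made up for illustration.

```python
# Illustrative words only; none of them come from the thread.
# "\u00c4rger" is "Ärger" and "\u00e9migr\u00e9" is "émigré".
words = ["zebra", "Zulu", "apple", "\u00e9migr\u00e9", "eagle", "\u00c4rger"]


def c_locale_sort(strings):
    """Order strings by their raw UTF-8 bytes, as a byte-wise comparison would."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


result = c_locale_sort(words)
print(result)
# Upper-case ASCII sorts before lower-case ASCII, and every accented
# letter sorts after "z" because its UTF-8 bytes are all >= 0x80.
```

The order is self-consistent, as described above, but "Ärger" and "émigré" end up after "zebra", which is rarely what a human reader expects.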

Nov 23 '05 #2

Peter Eisentraut

On Thursday, 22 April 2004 13:17, John Sidney-Woollett wrote:

> Does anyone know what the effect of --lc-collate=C --encoding=UNICODE will
> be for sorts (and indexes?) when a multibyte unicode character is
> encountered?

You get your strings sorted in binary order of the UTF-8 encoding, which is
probably not very interesting, but it's possible.

> Is it also true that if LC_COLLATE != 'C' that indexes cannot be used for
> LIKE comparisons (and is this also true for en_US.iso885915)?

No, see <http://www.postgresql.org/docs/7.4/static/indexes-opclass.html>.

> Our database is UNICODE with LC_COLLATE=en_US.iso885915. Does anyone know
> what the effect of someone storing a cyrillic/chinese or korean character
> is?

This setup will result in UTF-8 characters being sorted by the system thinking
they are actually ISO-8859-15 characters. So the result will be random at
best.

> (We are using JDBC with a webapp so all the unicode concerns are
> handled transparently, apparently). When the data is extracted from the DB
> will it render correctly in the browser provided we send all responses
> encoded in UTF-8?

If your database is in UNICODE and you're using JDBC then you should be all
set as far as PostgreSQL is concerned. Of course, your HTML pages need to
declare the encoding correctly as well.
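
The mismatch described above (UTF-8 data handled as if it were ISO-8859-15) can be illustrated with a short Python sketch. It does not model the C library's collation tables; it only shows how the same bytes turn into different characters under the wrong assumption. The helper name as_seen_by_latin9 and the sample strings are made up for this example.

```python
def as_seen_by_latin9(s):
    """Encode as UTF-8 (what a UNICODE database stores), then reinterpret
    every byte as one ISO-8859-15 character (what the locale assumes)."""
    return s.encode("utf-8").decode("iso8859_15")


# "caf\u00e9" is "café"; "\u042f" is the Cyrillic letter "Я".
for original in ["caf\u00e9", "\u042f"]:
    misread = as_seen_by_latin9(original)
    print(repr(original), len(original), "->", repr(misread), len(misread))
```

Each non-ASCII character becomes two or more unrelated Latin characters, so any collation rule applied to them has nothing to do with the character that was actually stored. Pure ASCII text passes through unchanged, which is why such a setup can appear to work for a long time.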

Nov 23 '05 #4

John Sidney-Woollett

Tom Lane said:

> C locale basically means "sort by the byte sequence values". It'll do
> something self-consistent, but maybe not what you'd like for UTF8
> characters.

OK, that explains that. I guess I will need to try it out to see what the
effect is on extended character sets.

> > Our database is UNICODE with LC_COLLATE=en_US.iso885915.
>
> Does that sort rationally at all? I should think you'd need to specify
> an LC_COLLATE setting that's designed for UTF8 encoding, not 8859-15.

Er..., actually the LC_COLLATE for the DB in question is C - I was looking
at the wrong database (wrong telnet session)! So your comments above apply
in this case.

> If you only ever store characters that are in 7-bit ASCII then none of
> this will affect you, and you can get away with broken combinations of
> encoding and locale. But if you'd like to sort characters outside the
> minimal ASCII set then you need to get it right ...

Tom, thanks for the answers above.

I guess if I have some time I should build some different DBs with
different combinations of encoding and collations and summarise my
findings using different types of data and sort/search commands, in case
anyone else has the same level of confusion that I do...

John Sidney-Woollett

Nov 23 '05 #6

Karsten Hilbert

John,

> I guess if I have some time I should build some different DBs with
> different combinations of encoding and collations and summarise my
> findings using different types of data and sort/search commands, in case
> anyone else has the same level of confusion that I do...

that'd be excellent. Be sure to offer the writeup for
inclusion into the techdocs site.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

Nov 23 '05 #8

John Sidney-Woollett

Peter Eisentraut said:

> On Thursday, 22 April 2004 13:17, John Sidney-Woollett wrote:
>
> You get your strings sorted in binary order of the UTF-8 encoding, which
> is probably not very interesting, but it's possible.

Agreed.

> > Is it also true that if LC_COLLATE != 'C' that indexes cannot be used
> > for LIKE comparisons (and is this also true for en_US.iso885915)?
>
> No, see <http://www.postgresql.org/docs/7.4/static/indexes-opclass.html>.

I wish I understood what this page actually was trying to say.

Is it saying that varchar_pattern_ops sorts according to the 'C' locale
regardless of LC_COLLATE, and that varchar_ops sorts according to the
current value of LC_COLLATE?

> This setup will result in UTF-8 characters being sorted by the system
> thinking they are actually ISO-8859-15 characters. So the result will be
> random at best.

Actually the LC_COLLATE is currently 'C', not en_US.iso885915 as I reported.

What would be a correct LC_COLLATE value for my database if we want to
primarily service ISO-8859-1, but allow for
cyrillic/chinese/japanese/korean characters too and have them sorting and
indexing correctly? We are building a multilanguage website...

ls /usr/share/locale produces:
ca de en@boldquot en_SE fi hr ko no sk zh_TW
cs el en_GB en_US fr it locale.alias pl sv
da en en@quot es gl ja nl pt_BR tr

Thanks for any more info.

John Sidney-Woollett
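
On the varchar_pattern_ops question above: the linked documentation page describes the pattern operator classes as comparing values strictly character by character instead of by the locale's collation rules. The Python sketch below shows why such an ordering lets a prefix match like LIKE 'ann%' become a simple range scan. It is only an illustration of the idea under that assumption, not PostgreSQL's implementation; the function names and sample names are hypothetical.

```python
import bisect


def prefix_upper_bound(prefix):
    """Return the smallest byte string greater than every string that starts
    with prefix, or None if there is none (empty or all-0xFF prefix)."""
    p = bytearray(prefix)
    while p and p[-1] == 0xFF:
        p.pop()
    if not p:
        return None
    p[-1] += 1
    return bytes(p)


def like_prefix(sorted_keys, prefix):
    """Emulate  col LIKE 'prefix%'  as a range scan over byte-sorted keys."""
    lo_key = prefix.encode("utf-8")
    hi_key = prefix_upper_bound(lo_key)
    lo = bisect.bisect_left(sorted_keys, lo_key)
    hi = len(sorted_keys) if hi_key is None else bisect.bisect_left(sorted_keys, hi_key)
    return [k.decode("utf-8") for k in sorted_keys[lo:hi]]


# Made-up sample data; "ann\u00e9e" is "année" and "\u00e9mile" is "émile".
names = ["Zoe", "anna", "anne", "ann\u00e9e", "bob", "\u00e9mile"]
index = sorted(n.encode("utf-8") for n in names)  # byte order, like the C locale

matches = like_prefix(index, "ann")
print(matches)
```

In byte order, everything starting with "ann" sits in one contiguous run between "ann" and "ano", so two binary searches find it. Under a locale-aware collation that contiguity is not guaranteed, which is presumably why a separate operator class is needed when LC_COLLATE is not C.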

Nov 23 '05 #10
