473,795 Members | 2,892 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

langauges, locales, regex, LIKE

If I've read everything right, in order to get:

multiple languages on a site

with the functionality of ALL of:

REGEX
LIKE
Correctly sorted text

A site would have to:

create a cluster for every language needed
run a separate database instance for every language
and have the database instances each have their own port
and use 8 bit encoding for that specific language

because:

Sorting is fixed at cluster/directory creation per single
database instance
And LIKE only works on C Locale with an eight bit encoding
and sorting (MAYBE?) works only on 8 bit encoding
when using C Locale.

If anyone can correct me on this, I'd love to hear it.

Boy, the old LOCALE system has really got to go someday.

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #1
7 2052
> If I've read everything right, in order to get:

multiple languages on a site

with the functionality of ALL of:

REGEX
LIKE
Correctly sorted text

...


You might want to look at this:

http://www.unicode.org/reports/tr10/

to get an idea of where things stand at the Unicode Consortium.

But, for sorting mixed content, what sort order should take effect
between "now" and $B!V8=;~E@!W( B (the latter being Japanese)?

A little more down-to-earth, would you want "genjiten" and the hiragana
equivalent, $B!V$2$s$8$F$s !W(B to fold together in the collation? You
definitely would need a Japanese locale for that, if it ever could work.
(I'm not sure it could work, unless you had some way to specify the
method of romanization in a sublocale or something.)

And what happens when you mix "genjiten" with "genealogy" ? Or the US
English name "Moench" with the German name using the correct character
for the "oe", with the romanized Japanese "moeru", and the same ($B!V$b$((B
$B$k!W(B) in hiragana?

--
Joel <re**@ddcom.co. jp>
Just ranting, ignore me.
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #2
Dennis Gearon wrote:
If I've read everything right, in order to get:

multiple languages on a site

with the functionality of ALL of:

REGEX
LIKE
Correctly sorted text

A site would have to:

create a cluster for every language needed
run a separate database instance for every language
and have the database instances each have their own port
and use 8 bit encoding for that specific language
You'd need a separate database, not a separate cluster. Each database
can then have their own encoding and locale.
because:

Sorting is fixed at cluster/directory creation per single
database instance
To clarify, a cluster is a group of databases that share user logins and
can all be accessed via the same server.
And LIKE only works on C Locale with an eight bit encoding
and sorting (MAYBE?) works only on 8 bit encoding
when using C Locale.
You can sort, and I believe use LIKE on UTF etc. However, index use is a
different matter.
If anyone can correct me on this, I'd love to hear it.

Boy, the old LOCALE system has really got to go someday.


The issue isn't so much the difficulty of supporting multiple locales
(AFAIK). I believe it's more to do with interactions. If you have a
table containing multiple languages in the same column, what does it
mean to sort that table? Do you sort by language-name then by languages?
If you don't, what rules do you follow?

What happens if we compare different languages?
Does fr/fr:"a" == en/gb:"a"?
Does en/gb:"hello" == en/us:"hello"?

Messy, isn't it?

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #3
For what it's worth, we have a unicode 7.4.1 database which gives us the
sorting and searching behaviour that we expect (with the exception of
the upper and lower functions). We access the data via jdbc so we don't
have to deal with encoding issues per se as the driver does any
translation for us.

Currently we don't use any LIKE statements, but if we did, and wanted
them optimized then we'd use the appropriate OP Class when defining the
index. We also don't use any REGEX expressions. And we'll shortly be
experimenting with tsearch2...

List of databases
Name | Owner | Encoding
---------------+----------+----------
test | postgres | UNICODE

Setting the psql client encoding to Latin1 and inserting the following
data...

# select * from johntest;
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
5 | TEST
6 | TÉST
7 | TÈST
8 | TASTE
(8 rows)

and then extracting the data in sorted order works as we would expect

# select * from johntest order by value (no index on the value field)
id | value
----+-------
8 | TASTE
5 | TEST
7 | TÈST
6 | TÉST
4 | taste
1 | test
3 | tèst
2 | tést
(8 rows)

however, applying the UPPER function to the data does not work as
expected, problem with ids 6,7,3,2 - should be ordered (3,7 or 7,3) ,
(6,2 or 2,6)

# select * from johntest order by upper(value);
id | value
----+-------
4 | taste
8 | TASTE
1 | test
5 | TEST
7 | TÈST
6 | TÉST
3 | tèst
2 | tést
(8 rows)

using a LIKE operation also works as expected (again no index on value
field)

# select * from johntest where value like 't%';
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
(4 rows)

Here's our pg_controldata output:
version number: 72
Catalog version number: 200310211
Database cluster state: in production
pg_control last modified: Thu 24 Jun 2004 07:18:56 GMT
Current log file ID: 0
Next log file segment: 29
Latest checkpoint location: 0/1CA5F8D8
Prior checkpoint location: 0/1C8F2074
Latest checkpoint's REDO location: 0/1CA5F8D8
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 17
Latest checkpoint's NextXID: 42355483
Latest checkpoint's NextOID: 29814
Time of latest checkpoint: Thu 24 Jun 2004 07:18:54 GMT
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: floating-point numbers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

and our locale is

locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB .UTF-8"
LC_NUMERIC="en_ GB.UTF-8"
LC_TIME="en_GB. UTF-8"
LC_COLLATE="en_ GB.UTF-8"
LC_MONETARY="en _GB.UTF-8"
LC_MESSAGES="en _GB.UTF-8"
LC_PAPER="en_GB .UTF-8"
LC_NAME="en_GB. UTF-8"
LC_ADDRESS="en_ GB.UTF-8"
LC_TELEPHONE="e n_GB.UTF-8"
LC_MEASUREMENT= "en_GB.UTF-8"
LC_IDENTIFICATI ON="en_GB.UTF-8"
LC_ALL=

We are intending to support various European languages on our website,
and so far the unicode seems to be working for us.

But maybe we' re just happy in our ignorance!

John Sidney-Woollett

Dennis Gearon wrote:
If I've read everything right, in order to get:

multiple languages on a site

with the functionality of ALL of:

REGEX
LIKE
Correctly sorted text

A site would have to:

create a cluster for every language needed
run a separate database instance for every language
and have the database instances each have their own port
and use 8 bit encoding for that specific language

because:

Sorting is fixed at cluster/directory creation per single
database instance
And LIKE only works on C Locale with an eight bit encoding
and sorting (MAYBE?) works only on 8 bit encoding
when using C Locale.

If anyone can correct me on this, I'd love to hear it.

Boy, the old LOCALE system has really got to go someday.

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddres sHere" to ma*******@postg resql.org)

Nov 23 '05 #4
Richard Huxton <de*@archonet.c om> writes:
You'd need a separate database, not a separate cluster. Each database
can then have their own encoding and locale.


Not so, unfortunately. You get one locale per cluster, at least for
LC_COLLATE and LC_CTYPE which are the significant settings for this
discussion.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postg resql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #5
Richard Huxton wrote:
Dennis Gearon wrote:
If I've read everything right, in order to get:

multiple languages on a site

with the functionality of ALL of:
REGEX
LIKE
Correctly sorted text

A site would have to:

create a cluster for every language needed
run a separate database instance for every language
and have the database instances each have their own port
and use 8 bit encoding for that specific language

You'd need a separate database, not a separate cluster. Each database
can then have their own encoding and locale.


If I wanted all the languages to be running concurently, I can't switch clusters that the database is connected to on the fly, right? The database stays in the cluster it was started in, right? So, if that's true, then I need separate database instances if I want truly accurate sorting.
because:

Sorting is fixed at cluster/directory creation per single
database instance

To clarify, a cluster is a group of databases that share user logins and
can all be accessed via the same server.
And LIKE only works on C Locale with an eight bit encoding
and sorting (MAYBE?) works only on 8 bit encoding
when using C Locale.

You can sort, and I believe use LIKE on UTF etc. However, index use is a
different matter.


Yup, there is no facility to declare character sets for indexes.
If anyone can correct me on this, I'd love to hear it.

Boy, the old LOCALE system has really got to go someday.

The issue isn't so much the difficulty of supporting multiple locales
(AFAIK). I believe it's more to do with interactions. If you have a
table containing multiple languages in the same column, what does it
mean to sort that table? Do you sort by language-name then by languages?
If you don't, what rules do you follow?

What happens if we compare different languages?
Does fr/fr:"a" == en/gb:"a"?
Does en/gb:"hello" == en/us:"hello"?

Messy, isn't it?

Without languge specific characters, they will sort exactly the same.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #6


Dennis Gearon wrote:
John Sidney-Woollett wrote:
For what it's worth, we have a unicode 7.4.1 database which gives us
the sorting and searching behaviour that we expect (with the
exception of the upper and lower functions). We access the data via
jdbc so we don't have to deal with encoding issues per se as the
driver does any translation for us.

Currently we don't use any LIKE statements, but if we did, and wanted
them optimized then we'd use the appropriate OP Class when defining
the index. We also don't use any REGEX expressions. And we'll shortly
be experimenting with tsearch2...

List of databases
Name | Owner | Encoding
---------------+----------+----------
test | postgres | UNICODE

Setting the psql client encoding to Latin1 and inserting the
following data...

# select * from johntest;
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
5 | TEST
6 | TÉST
7 | TÈST
8 | TASTE
(8 rows)

[snip]

using a LIKE operation also works as expected (again no index on
value field)

# select * from johntest where value like 't%';
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
(4 rows)

Like works, but it can't use an index, and so would have horibble
performance vs. the situation where it CAN use an index. I believe
this is how Postgres is working now.

If you use one of the OPCLASSes then LIKE operations using indexes
should work, I believe.

See http://www.postgresql.org/docs/7.4/s...s-opclass.html

John Sidney-Woollett

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postg resql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #7
John Sidney-Woollett wrote:
For what it's worth, we have a unicode 7.4.1 database which gives us the
sorting and searching behaviour that we expect (with the exception of
the upper and lower functions). We access the data via jdbc so we don't
have to deal with encoding issues per se as the driver does any
translation for us.

Currently we don't use any LIKE statements, but if we did, and wanted
them optimized then we'd use the appropriate OP Class when defining the
index. We also don't use any REGEX expressions. And we'll shortly be
experimenting with tsearch2...

List of databases
Name | Owner | Encoding
---------------+----------+----------
test | postgres | UNICODE

Setting the psql client encoding to Latin1 and inserting the following
data...

# select * from johntest;
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
5 | TEST
6 | TÉST
7 | TÈST
8 | TASTE
(8 rows)

and then extracting the data in sorted order works as we would expect

# select * from johntest order by value (no index on the value field)
id | value
----+-------
8 | TASTE
5 | TEST
7 | TÈST
6 | TÉST
4 | taste
1 | test
3 | tèst
2 | tést
(8 rows)

however, applying the UPPER function to the data does not work as
expected, problem with ids 6,7,3,2 - should be ordered (3,7 or 7,3) ,
(6,2 or 2,6)

# select * from johntest order by upper(value);
id | value
----+-------
4 | taste
8 | TASTE
1 | test
5 | TEST
7 | TÈST
6 | TÉST
3 | tèst
2 | tést
(8 rows)

using a LIKE operation also works as expected (again no index on value
field)

# select * from johntest where value like 't%';
id | value
----+-------
1 | test
2 | tést
3 | tèst
4 | taste
(4 rows)

Like works, but it can't use an index, and so would have horibble performance vs. the situation where it CAN use an index. I believe this is how Postgres is working now.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #8

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
1747
by: Ellen Herzfeld | last post by:
I have been tearing my hair out trying to get Python to sort strings containing accented characters according to the French locale settings on Mac OS X (10.3). I think I've done it correctly because on a Linux box it works... What I find in the Mac's entrails is this: /usr/share/locale/fr_FR.UTF-8 contains LC_COLLATE -> ../la_LN.UTF-8/LC_COLLATE LC_CTYPE -> ../la_LN.UTF-8/LC_CTYPE
2
1757
by: garykpdx | last post by:
Is there some sort of tutorial on locales or the locale module? I can't seem to find a list showing all possible locales. I made 'en' work alright, but when I tried 'de' or 'de_DE' or 'es_ES', etc. it said that those were not valid locaes. Worst of all, when I tried 'es' it said that this was estonian?! Obviously, this is not what ISO 639 says. 'es' should be Spanish, and Estonian should be 'et. Anyway, I'd love to have a method called...
2
2411
by: Neil Morris | last post by:
I would like to know the benefits for javascripts' prototype objects as compared to say java class-based objects. The main points that I have on my mind are: 1. Is there a way to mark properties in prototype-based langauges as 'final or 'private' as you can in java so that some properties are 'hidden' from child objects. 2. Is there a way for objects to limit the delete function in javascript, or any prototype-based langauge, to...
7
3083
by: saroj.yadav | last post by:
As I understand it (correct me, if I am wrong) Unicode came into picture so that a document containing multiple language characters can be supported like somebody can write a document comparing Korean and Chinese in French language. Now, I am looking at all UNIX platforms and seems like all Unix (AIX, HP, Solaris) platforms support Unicode by supporting language/region specific UTF-8 locales like fr_FR.UTF-8, ja_JP.UTF-8, ko_KR.UTF-8...
17
3981
by: clintonG | last post by:
I'm using an .aspx tool I found at but as nice as the interface is I think I need to consider using others. Some can generate C# I understand. Your preferences please... <%= Clinton Gallagher http://forta.com/books/0672325667/
3
1538
by: Joshua Stewart | last post by:
Hi, I am trying to characterise a problem I am seeing on our C#/C++ xml driven application. We have recently added some basic Spanish language support to our application, but it seems that there is an issue with certain locale settings. Under Regional and Language Options -Standards and Formats I would normally change the applications language by selecting Spanish (Mexico). It looks like the application ignores the Location field.
8
4712
by: Rik | last post by:
Hi all, is there a in PHP to get the available locales without shellacces (i.e. locale -a)? Grtz, -- Rik Wasmus
2
1881
by: Adrian | last post by:
Hi All, Is there anyway to change what isspace thinks is a space character. I am parsing some log files and it would be nice to just read a field as what ever is between quotes or between 's ie clf log files I know I can go char by char or find_last_of etc, but I would like to know if it is possible with locales and facets? Also are there any suggestions for outputting in multiple languages? Can I use locales again? I was just...
4
1806
by: Gilles Ganault | last post by:
Hello What does it take to support locales? The following test code found on the PHP site doesn't work as planned: ======= setlocale(LC_TIME, "C"); echo strftime("%A"); setlocale(LC_TIME, "fi_FI"); echo strftime(" in Finnish is %A,");
0
9673
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
1
10165
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10002
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
1
7543
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...
0
5437
by: TSSRALBI | last post by:
Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
0
5565
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4113
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
3728
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2921
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.