473,836 Members | 1,477 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

unicode and sorting(at least)

This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html

I'm not sure how that will affect LIKE and REGEX.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #1
6 2266

On Wed, 23 Jun 2004, Dennis Gearon wrote:
This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html


Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)
Allow one or more collations per character set
Allow columns to specify character set and collation
Allow literals in multiple character sets
Allow translations and encoding conversions (as supported)
Allow explicit COLLATE clauses to control ordering and comparisons.
Handle identifiers in multiple character sets

plus some misc things like allowing sets that control the default
character set for literals for this session and such.
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #2
> On Wed, 23 Jun 2004, Dennis Gearon wrote:
This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html
Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)


Could Please explain more details for above. In my understanding a
character set can have multiple encodings but...
--
Tatsuo Ishii
Allow one or more collations per character set
Allow columns to specify character set and collation
Allow literals in multiple character sets
Allow translations and encoding conversions (as supported)
Allow explicit COLLATE clauses to control ordering and comparisons.
Handle identifiers in multiple character sets

plus some misc things like allowing sets that control the default
character set for literals for this session and such.
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #3
All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimes they can handle several langauges besides English, and are actually designed to handle a family of langauges.

The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCS series. (UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esoteric ones postulated, and there is room for future languages.

So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family that the database's encoding was defined for(in postgres currently, choosing encoding down to the column level is available on several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't have UTF16 or UTF32 available)

Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kind of encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which as an encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instance of Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/character set encoding is chosen for the database.

It so happens that most non UTF encodings are designed to sort well in an extended ascii/8 bit environment, which is what the ISO8xxxx and LatinX encodings actually are. I'm not sure that it's perfect though. So, if SQL_ASCII is chosen for the LOCALE, and the encoding is ISO8xxx or LATINx, it will probably sort OK.

UTF8/16/32 is built the same way. However, this only applies per character, and only works painlessly on UTF32, which has fixed width characters. UTF8/16 OTOH, have variable length characters (in multiples of 8 bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't sort correctly under SQL_ASCII locale, I believe.
Tatsuo Ishii wrote:
On Wed, 23 Jun 2004, Dennis Gearon wrote:

This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html


Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)

Could Please explain more details for above. In my understanding a
character set can have multiple encodings but...
--
Tatsuo Ishii

Allow one or more collations per character set
Allow columns to specify character set and collation
Allow literals in multiple character sets
Allow translations and encoding conversions (as supported)
Allow explicit COLLATE clauses to control ordering and comparisons.
Handle identifiers in multiple character sets

plus some misc things like allowing sets that control the default
character set for literals for this session and such.
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #4
> All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimes they can handle several langauges besides English, and are actually designed to handle a family of langauges.

ISO 8xxx series are not encodings but character sets. For example,
ISO-8859-1 can be expressed in 8-bit encoding form, it also can be
expressed in 7-bit encoding form. This is called ISO-2022. I know that
PostgreSQL treats ISO-8859-1 as an encoding but it's just a short hand
for "8-bit encoded ISO-8859-1".

Also, let's not mix together "languages" and "character
sets". Langugaes are defined by human, not by computers. While
character sets are perfectly definable by computers. More important
thing is that a language can be expressed in several character
sets. For example language Japanese can be expressed in EUC-JP of
cousrse. It also can be expressed in ASCII by using ROMAJI script.

What I want to say here is talking about "languages" is almost
useless and we have to talk about character sets and encodings.
The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCS series. (UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esoteric ones postulated, and there is room for future languages.

So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family that the database's encoding was defined for(in postgres currently, choosing encoding down to the column level is available on several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't have UTF16 or UTF32 available)

Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kind of encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which as an encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instance of Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/character set encoding is chosen for the database.
The sorting order problem is not neccessary limited to "clutser
vs. locale" one. My example about ROMAJI above raises another question
"How to sort ROMAJI Japanese?" If we regard it just ASCII strings, we
could sort it in alphabetical order. But if we regard it as Japanaese,
probably sorting in alphabetical order is not appropreate. This
example shows that the sorting order should be defined by users or
applications, not by systems or DBMSs. This is why the SQL standard
has "COLLATION" concept IMO.
It so happens that most non UTF encodings are designed to sort well in an extended ascii/8 bit environment, which is what the ISO8xxxx and LatinX encodings actually are. I'm not sure that it's perfect though. So, if SQL_ASCII is chosen for the LOCALE, and the encoding is ISO8xxx or LATINx, it will probably sort OK.

UTF8/16/32 is built the same way. However, this only applies per character, and only works painlessly on UTF32, which has fixed width characters. UTF8/16 OTOH, have variable length characters (in multiples of 8 bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't sort correctly under SQL_ASCII locale, I believe.
Tatsuo Ishii wrote:
On Wed, 23 Jun 2004, Dennis Gearon wrote:
This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html

Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)

Could Please explain more details for above. In my understanding a
character set can have multiple encodings but...
--
Tatsuo Ishii

Allow one or more collations per character set
Allow columns to specify character set and collation
Allow literals in multiple character sets
Allow translations and encoding conversions (as supported)
Allow explicit COLLATE clauses to control ordering and comparisons.
Handle identifiers in multiple character sets

plus some misc things like allowing sets that control the default
character set for literals for this session and such.
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #5
On Thu, 24 Jun 2004 09:06:44 -0700
Dennis Gearon <ge*****@firese rve.net> wrote
All of the ISO 8xxx encodings and LATINX encodings can handle two
langauges, English and at least one other. Sometimes they can handle
several langauges besides English, and are actually designed to handle a
family of langauges.
And that's where the confusion with locales started, I think. But it's
not really true. Every encoding can handle the Latin/English based
computer (C) locale plus one other. In one specific case, it's the C
locale plus (almost real) English, even.
The ONLY encodings that can handle a significant amount of multiple
langauges and character sets are the ISO/UTF/UCS series. (UCS is giving
way to UTF).
Providing a little emphasis, here:

Unicode is a character set. There is currently one encoding defined for
it, but there are several transformations of that encoding, which we
refer to as UTF-n where n tells us something about the bit width that
the transformation is optimized for.

In theory, there could be other encodings of Unicode. (No one expects it
to actually happen, but they did try to leave a door open, just in case.)
In fact they can handle every human langauge ever used,
Again, adding some emphasis here, (almost) every language currently
known in our modern society can be handled about as well as or better
wth Unicode than we were able to handle English with just ASCII. (I
have a personal interest in some of the dark corners, but that's OT.)
plus some esoteric ones postulated, and there is room for future
languages.
(Can't you just feel the chills run up and down your spine? This could
be a wild ride, boys! heh. A little drama for your lunch hour.)
...

UTF8/16/32 is built the same way. However, this only applies per
character, and only works painlessly on UTF32, which has fixed width
characters.
Again, a little point of emphasis, UTF32 is fixed width on the code
points, but there are still composite characters. (You only thought it
was safe to go back to the beach.)
UTF8/16 OTOH, have variable length characters (in multiples
of 8
or 16
bits). Since SQL_ASCII sorts in a binary fashion, UTF8/16 won't
sort correctly under SQL_ASCII locale, I believe.
It might almost sort well enough to cause you real pain later, too.
Tatsuo Ishii wrote:
On Wed, 23 Jun 2004, Dennis Gearon wrote:
This is what has to be eventually done:(as sybase, and probably others do it)

http://www.ianywhere.com/whitepapers/unicode.html

Actually, what probably has to be eventually done is what's in the SQL
spec.

Which is AFAICS basically:
Allow multiple encodings
Allow multiple character sets (within an encoding)

Could Please explain more details for above. In my understanding a
character set can have multiple encodings but...
Well, UTF-8 was originally, IIRC, intended to be a _Universal_ly
applicable transformation. (The scheme would apply as easily to the JIS
character sets.) But I don't think that's what they were talking about.

I think they are talking about multiple major locales (or character
subsets) within Unicode. (I could be wrong, of course.)
...


--
Joel <re**@ddcom.co. jp>
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #6
On Fri, 25 Jun 2004 10:19:05 +0900 (JST)
Tatsuo Ishii <t-*****@sra.co.jp > wrote
All of the ISO 8xxx encodings and LATINX encodings can handle two langauges, English and at least one other. Sometimes they can handle several langauges besides English, and are actually designed to handle a family of langauges.
ISO 8xxx series are not encodings but character sets. For example,
ISO-8859-1 can be expressed in 8-bit encoding form, it also can be
expressed in 7-bit encoding form. This is called ISO-2022. I know that
PostgreSQL treats ISO-8859-1 as an encoding but it's just a short hand
for "8-bit encoded ISO-8859-1".

Also, let's not mix together "languages" and "character
sets". Langugaes are defined by human, not by computers. While
character sets are perfectly definable by computers. More important
thing is that a language can be expressed in several character
sets. For example language Japanese can be expressed in EUC-JP of
cousrse. It also can be expressed in ASCII by using ROMAJI script.


(Which isn't to say that everyone will find romanized Japanese easy to
read for meaning.)

But we should point out that there are several variations on the
romanization of Japanese (some of which are anything but regular).
What I want to say here is talking about "languages" is almost
useless and we have to talk about character sets and encodings.
The ONLY encodings that can handle a significant amount of multiple langauges and character sets are the ISO/UTF/UCS series. (UCS is giving way to UTF). In fact they can handle every human langauge ever used, plus some esoteric ones postulated, and there is room for future languages.

So, for a column to handle multiple langauges/character sets, the languages/character sets have to be in the family that the database's encoding was defined for(in postgres currently, choosing encoding down to the column level is available on several databases and is the SQL spec), OR, the encoding for the database has to be UTF8 (since we don't have UTF16 or UTF32 available)

Right now, the SORTING algorithm and functionality is fixed for the database cluster, which contains databases of any kind of encodings. It really does not do much good to have a different locale than the encoding, except for UTF8, which as an encoding is langauge/character set neutral, or SQL_ASCII and an ISO8xxx or LatinX encoding. Since a running instance of Postgres can only be connected to one cluster, a database engine has FIXED sorting, no matter what language/character set encoding is chosen for the database.
The sorting order problem is not neccessary limited to "clutser
vs. locale" one. My example about ROMAJI above raises another question
"How to sort ROMAJI Japanese?" If we regard it just ASCII strings, we
could sort it in alphabetical order. But if we regard it as Japanaese,
probably sorting in alphabetical order is not appropreate.


I think we should say that, while there are some contexts in which
ordinary alphabetic order would be okay, there are some, for instance,
in which we'd want to mirror the kana order as much as possible. (Not
exactly a straightforward map-this-code-point-to-this-collation-value
exercise, but should be doable.)
This
example shows that the sorting order should be defined by users or
applications, not by systems or DBMSs. This is why the SQL standard
has "COLLATION" concept IMO.
...

--
Joel <re**@ddcom.co. jp>
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postg resql.org so that your
message can get through to the mailing list cleanly

Nov 23 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

8
7092
by: sebastien.hugues | last post by:
Hi I would like to retrieve the application data directory path of the logged user on windows XP. To achieve this goal i use the environment variable APPDATA. The logged user has this name: sébastien. The second character is not an ascii one and when i try to encode the path that contains this name in utf-8,
1
1247
by: saha.goutam | last post by:
I have created Unicode XML (in Notepad and selected Unicode encoding) as follows: <?xml version="1.0" encoding="UTF-8" ?> <doc> <language> <english>English</english> <german>Strategien für Ost</german> <french>être évacuées mi-août</french> <norway>Geir & Gunnar og
1
1640
by: bpalczewski | last post by:
Ciao !!! I'm working on the project that requires multi-lingual database ... I have never done it before and i have problem to configure mysql server and create databases which will allow me to store data in different languages, which includes e.g. hebrew or arabic ... I will be very gratefull if someone send me a dumped database with a table containing couple of records (or more) of data in unicode - at least i will have a starting...
6
2519
by: Tim Edwards | last post by:
Currently I'm running PostgreSQL 7.2.3 and having a problem sorting. I've got two colums of data, one Int4 one Varchar. When I sort ASC on the varchar I get some strange results. Here a section of data cut after running a sort. It starts with RM- then does RMT- Then goes back for more RM-. I've checked for spaces or tabs or anything else odd and cant find anything that would effect the sort. 63 RM-791
0
1303
by: Simon Gregory | last post by:
I am currently attempting to figure out how the new databinding stucture works in ASP.NET 2.0 after working with v1.0 & v1.1 for several years. It seems that if you wish to do set up databinding at Design Time, you are restricted to using one of the 5 xxxDataSource controls. The data in our websites are usually obtained via a set of webservices and are contained in a set of strongly typed relational DataSets.
4
4289
by: FBM | last post by:
Hi, I am working on a program that simulates one of the elements of ATM. The simulation stores events which occurs every some milliseconds for a certain amount of time. Every time that an event is stored in a double linked list, the whole list is sorted for the next round. My problem appears when subjecting the program to heavy load, that is, when I run the simulation for more than 10,000 miliseconds (every event occurs in...
15
2126
by: John Salerno | last post by:
Forgive my newbieness, but I don't quite understand why Unicode is still something that needs special treatment in Python (and perhaps elsewhere). I'm reading Dive Into Python right now, and it constantly refers to a 'regular string' versus a 'Unicode string' and how you need to convert back and forth. But why isn't Unicode considered a regular string by now? Is it for historical reasons that we still use ASCII and Latin-1? Why can't...
8
2211
by: sara | last post by:
Hi - I have looked at all posts and tried both Allen Browne's Report Sorting at run Time ( Select Case Forms!frmChooseSort!grpSort Case 1 'Name Me.GroupLevel(0).ControlSource = "LastName" Me.GroupLevel(1).ControlSource = "FirstName") ..... And another post that was OrderByOn = True and setting the sort Order.
3
5265
by: Supra | last post by:
Hi, I'm using the following code to send an email from a form on a PHP page. If there're unicode characters in message body, they get carried over fine in the email sent out, however if there are some in the subject or name fields, they come out as blanks in the email received. Does anyone know how this can be fixed? #!/usr/bin/perl
0
9672
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
10559
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
10601
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
10260
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
6981
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...
0
5829
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4460
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
4023
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
3116
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.