Bytes IT Community

UTF-8 and =, LIKE problems

I am running a web-based accounting package (SQL-Ledger) that supports
multiple languages on PostgreSQL. When a database encoding is set to
Unicode, multilingual operation is possible.

However, when a user's input language is set to, say, English, and the
user enters data such as "79", the data that is sent back to PostgreSQL
for storage is U+FF17 U+FF19, which are the Unicode fullwidth
forms of "79". So far so good.

Now, if the user switches languages and enters "79" as a search key, the
previously entered row will not be found with the LIKE or = operators,
and all other comparison operations will fail too. The problem is that
the browser now sends back U+0037 U+0039, which are the ordinary
halfwidth (ASCII) characters for "79".

Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037
U+0039, but of course they aren't if a simple-minded byte-by-byte or
character-by-character comparison is done.
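The mismatch described above is easy to reproduce; a quick illustration
(Python is used here purely for demonstration, SQL-Ledger itself is Perl):

```python
fullwidth = "\uff17\uff19"  # FULLWIDTH DIGIT SEVEN, FULLWIDTH DIGIT NINE
halfwidth = "\u0037\u0039"  # plain ASCII "79"

# Semantically the same number, but a character-by-character
# comparison (which is effectively what = and LIKE do) sees
# two entirely different strings:
print(fullwidth == halfwidth)  # False
```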

In the ideal case, one would probably want to convert all fullwidth
chars to their halfwidth equivalents, because the numbers look weird on
the screen (e.g., "７９ Ｂｒｉｓｂａｎｅ Ｓｔｒｅｅｔ" instead of "79
Brisbane Street"). Is there any way to get PostgreSQL to do so?

Failing this, is there any way to get PostgreSQL to be a bit smarter in
doing comparisons? I think I'm SOL, but I thought I'd ask anyway.
....Edmund.

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #1



On Nov 4, 2004, at 1:24 PM, Edmund Lian wrote:
I am running a web-based accounting package (SQL-Ledger) that supports
multiple languages on PostgreSQL. When a database encoding is set to
Unicode, multilingual operation is possible.

<snip />
Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037
U+0039, but of course they aren't if a simple-minded byte-by-byte or
character-by-character comparison is done.

In the ideal case, one would probably want to convert all fullwidth
chars to their halfwidth equivalents because the numbers look weird
on the screen (e.g., "７９ Ｂｒｉｓｂａｎｅ Ｓｔｒｅｅｔ" instead of
"79 Brisbane Street"). Is there any way to get PostgreSQL to do so?

Failing this, is there any way to get PostgreSQL to be a bit smarter
in doing comparisons? I think I'm SOL, but I thought I'd ask anyway.


I've thought this would be a useful addition to PostgreSQL, but
currently I think it's best handled in the application layer. A brief
glance at the SQL-Ledger homepage shows that it's written in Perl. I'm
still in the early learning stages of Perl (heck, I'm in the early
learning stages of nearly everything), but I'd assume that with Perl's
good Unicode support there should be a way to do this, similar to PHP's
mb_convert_kana (which handles much more than just kana, btw). Ideally,
I'd think you'd want to store all numbers and Latin characters as
single-width characters, so you'd filter them before they enter the
database.
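In Perl one would likely reach for Unicode::Normalize; the same idea is
sketched below in Python. NFKC compatibility normalization folds
fullwidth digits and Latin letters to their single-width equivalents
(the function name is made up for illustration, it is not part of
SQL-Ledger):

```python
import unicodedata

def double_to_single(s: str) -> str:
    """Fold fullwidth (double-width) characters to halfwidth.

    NFKC compatibility normalization maps e.g. U+FF17 U+FF19
    ("７９") to plain ASCII "79". Run this on user input before
    it is stored or used as a search key.
    """
    return unicodedata.normalize("NFKC", s)

print(double_to_single("\uff17\uff19 Brisbane Street"))  # 79 Brisbane Street
```

Applying the same filter to both stored data and search keys makes the
= and LIKE comparisons behave as the user expects.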

I'd think this might be best placed in the SQL-Ledger code, though you
might be able to fashion a plperl function that would do the same
thing. You could either update all entries (UPDATE foo SET bar =
double_to_single(bar)) or make a functional index on
double_to_single(bar).
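A rough sketch of what that might look like (the function and index
names are made up for illustration; plperl availability, the tr///
range trick, and the IMMUTABLE declaration are all assumptions that
would need checking against your PostgreSQL version):

```sql
-- Hypothetical plperl helper: shift the fullwidth ASCII variants
-- (U+FF01..U+FF5E) down to their ordinary ASCII counterparts.
CREATE OR REPLACE FUNCTION double_to_single(text) RETURNS text AS '
    my $s = shift;
    $s =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    return $s;
' LANGUAGE plperl IMMUTABLE;

-- Normalize existing data once:
UPDATE foo SET bar = double_to_single(bar);

-- Or leave the stored data alone and index the normalized form,
-- then query with WHERE double_to_single(bar) = '79':
CREATE INDEX foo_bar_norm_idx ON foo (double_to_single(bar));
```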

I'm not sure which would be best; others out there have more
informed opinions than mine, which I'd love to read.

Hope this helps a bit.

Michael

Nov 23 '05 #2
