473,385 Members | 1,908 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,385 software developers and data experts.

UTF-8 and =, LIKE problems

I am running a web-based accounting package (SQL-Ledger) that supports
multiple languages on PostgreSQL. When a database encoding is set to
Unicode, multilingual operation is possible.

However, when a user's input language is set to say English, and the
user enters data such as "79", the data that is sent back to PostgreSQL
for storage is U+FF17 U+FF19, which are the Unicode half width
characters "79". So far so good.

Now, if the user switches languages and enters "79" as a search key, the
previously entered row will not be found with the LIKE or = operators,
and all other comparison operations will fail too. The problem is that
the browser now sends back U+0037 U+0039, which are Unicode full width
characters for "79".

Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037
U+0039, but of course they aren't if a simple-minded byte-by-byte or
character-by-character comparison is done.

In the ideal case, one would probably want to convert all full width
chars to their half width equivalents because the numbers look wierd on
the screen (e.g., "7 9 B r i s b a n e S t r e e t" instead of "79
Brisbane Street". Is there any way to get PostgreSQL to do so?

Failing this, is there any way to get PostgreSQL to be a bit smarter in
doing comparisons? I think I'm SOL, but I thought I'd ask anyway.
....Edmund.

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #1
1 3259

On Nov 4, 2004, at 1:24 PM, Edmund Lian wrote:
I am running a web-based accounting package (SQL-Ledger) that supports
multiple languages on PostgreSQL. When a database encoding is set to
Unicode, multilingual operation is possible.

<snip />
Semantically, one might expect U+FF17 U+FF19 to be identical to U+0037
U+0039, but of course they aren't if a simple-minded byte-by-byte or
character-by-character comparison is done.

In the ideal case, one would probably want to convert all full width
chars to their half width equivalents because the numbers look wierd
on the screen (e.g., "7 9 B r i s b a n e S t r e e t" instead of
"79 Brisbane Street". Is there any way to get PostgreSQL to do so?

Failing this, is there any way to get PostgreSQL to be a bit smarter
in doing comparisons? I think I'm SOL, but I thought I'd ask anyway.


I've thought this would be a useful addition to PostgreSQL, but
currently I think it's best handled in the application layer. A brief
glance at the SQL-Ledger homepage shows that it's written in Perl. I'm
still in the early learning stages of Perl (heck, I'm the in the early
learning stages of nearly everthing), but I'd assume with Perl's good
Unicode support there should be a way to do this, similar to PHP's
mb_convert_kana (which handles much more than just kana, btw). Ideally,
I'd think you'd want to store all numbers and latin characters as
single-width characters, so you'd filter them before they enter the
database.

I'd think this might be best placed in the SQL-Ledger code, though you
might be able to fashion a plperl function that would do the same
thing. You could either update all entries (UPDATE foo SET bar =
double_to_single(bar)) or make a functional index on
double_to_single(bar).

I'm not sure which would be the best, and others out there have more
informed opinions than mine which I'd love to read.

Hope this helps a bit.

Michael
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 23 '05 #2

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

9
by: lawrence | last post by:
Someone on www.php.net suggested using a seems_utf8() method to test text for UTF-8 character encoding but didn't specify how to write such a method. Can anyone suggest a test that might work?...
4
by: Alban Hertroys | last post by:
Another python/psycopg question, for which the solution is probably quite simple; I just don't know where to look. I have a query that inserts data originating from an utf-8 encoded XML file....
12
by: Mike Dee | last post by:
A very very basic UTF-8 question that's driving me nuts: If I have this in the beginning of my Python script in Linux: #!/usr/bin/env python # -*- coding: UTF-8 -*- should I - or should I...
38
by: Haines Brown | last post by:
I'm having trouble finding the character entity for the French abbreviation for "number" (capital N followed by a small supercript o, period). My references are not listing it. Where would I...
6
by: archana | last post by:
Hi all, can someone tell me difference between unicode and utf 8 or utf 18 and which one is supporting more character set. whic i should use to support character ucs-2. I want to use ucs-2...
7
by: Jimmy Shaw | last post by:
Hi everybody, Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be mixed up, but is it possible that all UTF-16 "code points" that are 16 bits long appear just the same in UTF-32,...
1
by: sheldon.regular | last post by:
I am new to unicode so please bear with my stupidity. I am doing the following in a Python IDE called Wing with Python 23. äöü äöü '\xc3\xa4\xc3\xb6\xc3\xbc' u'\xe4\xf6\xfc'...
4
by: shreshth.luthra | last post by:
Hi All, I am having a GUI which accepts a Unicode string and searches a given set of xml files for that string. Now, i have 2 XML files both of them saved in UTF-8 format, having characters...
10
by: Jed | last post by:
I have a form that needs to handle international characters withing the UTF-8 character set. I have tried all the recommended strategies for getting utf-8 characters from form input to email...
23
by: Allan Ebdrup | last post by:
I hava an ajax web application where i hvae problems with UTF-8 encoding oc chineese chars. My Ajax webapplication runs in a HTML page that is UTF-8 Encoded. I copy and paste some chineese chars...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.