
MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

jrs_14618@yahoo.com
#1: Jun 13 '06
Hello All,

This post is essentially a reply to a previous post/thread
here on this mailing.database.myodbc group titled:

MySQL 4.0, FULL-TEXT Indexing and Search Arabic Data, Unicode

[This version has a couple of subtle edits from the original I posted
on mailing.database.myodbc - I'm cross-posting here on this
topic/subject-related newsgroup]

I was wondering if anybody has experienced the same issues and
challenges I'm experiencing, which I'll describe shortly. Once
resolved, some fascinating and powerful multi-lingual
apps incorporating non-English/Latin character sets can be
realized by many developers.

I have a Unicode utf8 English - Arabic - Hebrew - Greek (and
several other languages) database in Microsoft Excel. I KNOW
that it is Unicode utf8 data because MySQL tells me it
recognizes the encoding as such but not in the context I want.

Allow me to explain ...

I can search the Unicode utf8 encoding with no problem in
Excel. While in Excel I highlight a complete word or a
partial string of an Arabic word and copy it to the clipboard
(i.e. memory). I then do a find, and the search succeeds
just as it would for an English string.

MySQL 5.0 is supposed to handle Unicode utf8.

I created a MySQL database I named: languages

CREATE DATABASE languages ;

and I implemented the following command on a MySQL
command prompt:

ALTER DATABASE languages DEFAULT CHARACTER SET utf8;

No problem so far: MySQL seemingly recognized utf8 and
accepted it. My understanding is that, with the ALTER command,
the tables I create in languages will default to utf8.

I now created a table I named mainlang which denotes it
will be the main table for my languages.

mysql>CREATE TABLE mainlang
->(
->langNumID varchar(30),
->colB varchar(30),
->colC varchar(30),
->primary key (langNumID, colB)
->);

Again, so far no problem: the table was created successfully.
My third column 'colC' is where the Unicode data
will be stored.

I now attempt to import the database from my
Excel file into my MySQL database as follows:

mysql>load data infile 'c:\\arabicdictionary.csv'
->into table mainlang
->fields terminated by ','
->lines terminated by '\n'
->(langNumID, colB, colC);
ERROR 1406 (22001): Data too long for 'colC' at row 1
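
One thing worth ruling out at this point is the file itself. The
sketch below (Python 3, standard library only) guesses whether a
run of raw bytes is valid UTF-8. It is an assumption, not something
established in this thread, that the Excel export might be in
another encoding such as UTF-16 or a Windows code page. The function
name guess_encoding and the sample word are made up for illustration.

```python
import codecs

def guess_encoding(raw: bytes) -> str:
    """Return a rough guess at the encoding of raw bytes."""
    # A byte order mark at the start is the strongest hint.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8 (with BOM)"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # Without a BOM, check whether the bytes decode cleanly as UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"

sample = "سلام"  # hypothetical sample word, not from the original data

print(guess_encoding(sample.encode("utf-8")))    # bytes that really are UTF-8
print(guess_encoding(sample.encode("utf-16")))   # Python prepends a BOM here
print(guess_encoding(sample.encode("cp1256")))   # Windows Arabic code page
```

To check a real export, the same function can be given the bytes
read from the file in binary mode. If the result is anything other
than utf-8, the utf8 table is being fed bytes it cannot interpret
as utf8 text.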

So what to do? I did a search and found other
people seemingly had the same problem and someone
suggested:

ALTER DATABASE languages DEFAULT CHARACTER SET cp1250;

I dropped mainlang, recreated it, redid the load and,
lo and behold, it seemed to work. No Data too long
error occurred, and when I ran the following query:

mysql>select langNumID, colB, colC
->from mainlang
->where colB = '4994';

I see langNumID has a correct numeric value, colB a
correct numeric value (4994) and colC a string of
unintelligible characters with diacritical marks,
umlauts etc., which I take to be the cp1250
interpretation of the Unicode utf8 data. The raw utf8
bytes are similarly unintelligible in their own right.
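
The garbling described above can be reproduced outside MySQL. This
is a minimal sketch in Python 3, assuming a sample word that is not
taken from the original data: it encodes the word as UTF-8 and then
reads the same bytes back as cp1250, which is roughly what happens
when UTF-8 bytes are treated as cp1250 text. The helper name
view_as_cp1250 is made up for illustration.

```python
def view_as_cp1250(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as cp1250."""
    raw = text.encode("utf-8")
    # errors="replace" keeps the sketch from failing on the few byte
    # values that cp1250 leaves undefined.
    return raw.decode("cp1250", errors="replace")

word = "سلام"  # hypothetical sample: four Arabic letters
garbled = view_as_cp1250(word)

print(len(word))                  # characters in the original word
print(len(word.encode("utf-8")))  # bytes once encoded as UTF-8
print(len(garbled))               # cp1250 yields one character per byte
print(garbled)                    # accented Latin letters and symbols
```

Each Arabic letter takes two bytes in UTF-8, and cp1250 turns every
byte into its own character, so the garbled string is twice as long
as the word it came from.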

Next I copy the obscure colC cp1250 character string
to the clipboard and then tweak the original
select statement to see if I can search on the
(now) cp1250 character string:

mysql>select langNumID, colB, colC
->from mainlang
->where colc = 'paste of the cp1250 character string';

The computer would not allow a paste unless I pressed
the escape key. On running this select command
I got an empty set (no match).

My questions are:

Has anyone been successful creating a Unicode utf8
MySQL database that accepts Arabic?

If yes, how did you get around or not encounter the
Data too long issue?

Have you tried the cp1250 (or cp1251 - same mechanics,
same results) workaround as I have? Are you
able to search the cp1250 character string (my colC)?
If yes, how did you successfully manage to do it?

Lastly, if I take the cp1250 encoded string and paste
it into Excel ... I can string search the cp1250
encoding with no problem.

Also, here's how I know my Unicode utf-8 data is
correct apart from my own manual cross-referencing
and being recognized by MySQL in some respect:

When I copy the Unicode utf8 encoding and try to
paste it into the select command to see what would
happen I get the following error:

ERROR 1257 (HY000): Illegal mix of collations
(cp1250_general_ci, IMPLICIT) and
(utf8_general_ci, COERCIBLE) for operation '='

So what I have here is a situation where MySQL
is recognizing Unicode utf8 encoding in a query but not
when it comes to populating a table!

Go Figure ...

Any insight or solution anyone wishes to share would
be GREATLY appreciated. I promise if I find a solution
I will share it.

Thank you Very Much, Shukran Jiddan, Todah Rabah,
Muchas Gracias ...

Joel S
(585) 255-0997
jrs_14618 at yahoo.com


Jerry Stuckle
#2: Jun 14 '06

re: MySQL 5.0, FULL-TEXT Indexing and Search Arabic Data, Unicode



No idea, Joel. Why don't you try asking in a MySQL database newsgroup, such as
comp.databases.mysql? This newsgroup is for PHP programming.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================