Tsearch2 and Unicode?

I'm trying to use tsearch2 with a database that uses the 'UNICODE' encoding.
It works fine for English text, but since I intend to search Polish texts, I did:

insert into pg_ts_cfg values ('default_polish', 'default', 'pl_PL.UTF-8');
(and I updated the other pg_ts_* tables as described in the manual).

However, it seems the Polish-specific characters are being eaten alive.
For example, running select to_tsvector('default_polish', body) from messages;
returns a list of words, but with the national characters stripped...
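
To make the symptom concrete, here is a minimal Python sketch. It assumes (this is a guess for illustration, not taken from the tsearch2 source) a tokenizer that works byte by byte and keeps only 7-bit ASCII, so every byte of a multi-byte UTF-8 character is dropped. The function name and the sample sentence are hypothetical.

```python
def ascii_only_words(text: str) -> list[str]:
    """Hypothetical stand-in for a word splitter that is not Unicode-aware."""
    raw = text.encode("utf-8")
    # Assumption: any byte >= 0x80 (part of a multi-byte UTF-8 sequence)
    # is not recognised as a letter and is silently dropped.
    kept = bytes(b for b in raw if b < 0x80)
    return kept.decode("ascii").split()


# Polish pangram fragment; 9 of its 17 characters are non-ASCII.
print(ascii_only_words("zażółć gęślą jaźń"))  # ['za', 'gl', 'ja']
```

English text passes through such a splitter unchanged, which would match the observation that only the Polish characters go missing.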

I wonder: am I doing something wrong, or does tsearch2 simply not grok
Unicode, despite the locale setting? The same question applies to
ispell_dict and how it handles Unicode, but that's another story.

If Unicode is unsupported, I should perhaps... oh, convert the data to
iso8859 prior to feeding it to to_tsvector()... an interesting idea,
but so far I have failed to actually do it. Maybe I could store the data
as 'bytea' and add a column with encoding information (assuming I don't
want to recreate the whole database with a new encoding, and that I want
to use Unicode for some columns, so I don't have to keep the encoding
with every text everywhere...).
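
As a rough illustration of the conversion idea only, the Python sketch below re-encodes Polish text from Unicode into a single-byte charset. It assumes ISO-8859-2 (Latin-2) is the intended "iso8859" variant, since that is the part covering Polish; the helper name and sample text are hypothetical. It says nothing about whether to_tsvector() would accept such data in a 'UNICODE' database.

```python
POLISH_SAMPLE = "zażółć gęślą jaźń"


def to_latin2(text: str) -> bytes:
    """Encode text as ISO-8859-2 (assumed target charset).

    Raises UnicodeEncodeError if a character has no Latin-2 equivalent.
    """
    return text.encode("iso8859_2")


latin2_bytes = to_latin2(POLISH_SAMPLE)
utf8_bytes = POLISH_SAMPLE.encode("utf-8")

# Each Polish letter is one byte in Latin-2 but two bytes in UTF-8.
print(len(latin2_bytes), len(utf8_bytes))

# The conversion is lossless for Polish text: decoding gives the original back.
assert latin2_bytes.decode("iso8859_2") == POLISH_SAMPLE
```

Note that ISO-8859-1 (Latin-1) would not work here: most Polish letters are missing from it, so the charset has to be chosen per language, which is what the per-row encoding column would have to record.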

And while we are at it, how do you feel about keeping the extra tsvector
column and its index away from my data, in a separate table (so I can
safely get rid of them if need be)?
[ I intend to index around 2,000,000 records, a few KB of text each ]...

Regards,
Dawid Kuroczko

Nov 23 '05 #1
Dawid,

Unfortunately, tsearch2 doesn't support Unicode yet.
If you keep the tsvector separately from the data, then you'll need one more join.

Oleg
On Wed, 17 Nov 2004, Dawid Kuroczko wrote:
[...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Nov 23 '05 #2
