By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
446,361 Members | 1,683 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 446,361 IT Pros & Developers. It's quick & easy.

tsearch2, ispell, utf-8 and german special characters

P: n/a
Hi!

Sorry to bother you, but I just don't know how to get tsearch2 configured correctly for my setup. I've got a 7.4.3 database-cluster initdb'ed with de_DE@euro as locale, the database is with Unicode encoding.

I made and installed contrib/tsearch2 after installing the dump/reload-patch http://www.sai.msu.su/~megera/postgr...e_7.4.patch.gz as advised by the docs. So far everything is looking good, I have generated a snowball stemmer dictionary and an ispell dictionary as described in the docs and created a new configuration 'default_german' as described.

This is working somehow:
SELECT to_tsvector('default_german',
'tsearch2 erlernen ist wie zur Schule zu gehen');
-> 'gehen':10 'schulen':8 'erlernen':3 'tsearch2':2

though I don't quite understand why "Schule" is converted to "schulen" and not the other way round, but so be it. My problem lies, as every so often, with the non-ascii-characters, namely german umlauts and the .

SELECT to_tsvector('default_german',
'ich mu tsearch2 begreifen ');

returns null. So does any phrase which contains or anything that's beyond ASCII.

Another thing is the ISpell functionality; the docs are quite vague on thispart when it comes to explaining which file(s) to use to create german.med.. In ISpell conventions, umlauts seem to be represented as A" a" O" o" U" u" and thus when doing

SELECT lexize('de_ispell', 'ther');
I receive NULL

whereas
SELECT lexize('de_ispell', 'A"ther');
gives me {"a\"ther"}
as result.

I downloaded igerman98-20030222.tar.bz2 from http://j3e.de/ispell/igerman98/dict/ which seems to be the recommended ISpell dictionary distribution forthe german language as noted on http://fmg-www.cs.ucla.edu/fmg-membe...l#German-dicts

Of course there are no german.0 or german.1 files in this distribution which would be the obvious counterparts to english.0 and english.1 mentioned inthe tsearch2-docs; there is however a file all.words built on installation, which seems to be the basis for building the hash-file later on. The first few lines of this file are

A"bte/N
A"btissin/F
a"chten/DIXY
A"chtens
A"chtung/P
a"chzen/DIXY
a"chzt/EGPX
A"cker/N

In order to get the .med-File I did sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words

There is an option to generate another wordlist via make isowordlist - but this didn't resolve the umlaut-issue either, neither in the standard encoding provided in the package nor after conversion to UTF-8 (I tried both withand without a BOM).

Now has anybody actually managed to get a working configuration with tsearch2 and german language support in a unicode-database? What am I doing wrong? I just can't find any more hints in the docs, and there's a topic on the OpenFTS-Mailinglist with somewhat similar issues ( http://sourceforge.net/mailarchive/f...&forum_id=7671 ), but nothing which would actually help to resolve it.

Kind regards

Markus

Nov 23 '05 #1
Share this question for a faster answer!
Share on Google+

This discussion thread is closed

Replies have been disabled for this discussion.