tsearch2, ispell, utf-8 and german special characters

Hi!

Sorry to bother you, but I just don't know how to get tsearch2 configured correctly for my setup. I've got a 7.4.3 database cluster initdb'ed with de_DE@euro as locale, and the database uses Unicode encoding.

I made and installed contrib/tsearch2 after installing the dump/reload-patch http://www.sai.msu.su/~megera/postgr...e_7.4.patch.gz as advised by the docs. So far everything is looking good, I have generated a snowball stemmer dictionary and an ispell dictionary as described in the docs and created a new configuration 'default_german' as described.

This is working somehow:
SELECT to_tsvector('default_german',
'tsearch2 erlernen ist wie zur Schule zu gehen');
-> 'gehen':10 'schulen':8 'erlernen':3 'tsearch2':2

though I don't quite understand why "Schule" is converted to "schulen" and not the other way round, but so be it. My problem lies, as so often, with the non-ASCII characters, namely German umlauts and the ß.

SELECT to_tsvector('default_german',
'ich muß tsearch2 begreifen');

returns NULL. So does any phrase which contains an umlaut, an ß, or anything else that's beyond ASCII.
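
To make the failing input easier to inspect, here is a minimal Python sketch (not part of the tsearch2 setup; the helper name non_ascii_report is made up for illustration). It lists each non-ASCII character of a phrase together with its bytes in UTF-8 and in ISO-8859-15, which is the character set a de_DE@euro locale normally implies. Whether this locale/encoding mismatch is what actually makes to_tsvector return NULL is only an assumption, not something confirmed here.

```python
def non_ascii_report(text):
    """Return (char, utf8_bytes, latin9_bytes) for each non-ASCII character.

    Hypothetical helper: it only shows how the same character is encoded
    in UTF-8 (the database encoding) and ISO-8859-15 (assumed to be the
    character set behind the de_DE@euro locale).
    """
    report = []
    for ch in text:
        if ord(ch) > 127:
            report.append((ch, ch.encode("utf-8"), ch.encode("iso-8859-15")))
    return report


# Print one line per non-ASCII character: the character and both encodings in hex.
for ch, utf8, latin9 in non_ascii_report("ich muß tsearch2 begreifen"):
    print(ch, utf8.hex(), latin9.hex())
```

The ß occupies two bytes in UTF-8 but one byte in ISO-8859-15, so code that classifies characters byte by byte under a single-byte locale would not see it as one letter.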

Another thing is the ISpell functionality; the docs are quite vague on this part when it comes to explaining which file(s) to use to create german.med. In ISpell conventions, umlauts seem to be represented as A" a" O" o" U" u" and thus when doing

SELECT lexize('de_ispell', 'Äther');
I receive NULL

whereas
SELECT lexize('de_ispell', 'A"ther');
gives me {"a\"ther"}
as result.

I downloaded igerman98-20030222.tar.bz2 from http://j3e.de/ispell/igerman98/dict/ which seems to be the recommended ISpell dictionary distribution for the German language as noted on http://fmg-www.cs.ucla.edu/fmg-membe...l#German-dicts

Of course there are no german.0 or german.1 files in this distribution which would be the obvious counterparts to english.0 and english.1 mentioned in the tsearch2 docs; there is however a file all.words built on installation, which seems to be the basis for building the hash file later on. The first few lines of this file are

A"bte/N
A"btissin/F
a"chten/DIXY
A"chtens
A"chtung/P
a"chzen/DIXY
a"chzt/EGPX
A"cker/N

In order to get the .med file I ran
sort -u -t/ +0f -1 +0 -T /usr/tmp -o german.med all.words

There is an option to generate another wordlist via make isowordlist, but this didn't resolve the umlaut issue either, neither in the standard encoding provided in the package nor after conversion to UTF-8 (I tried both with and without a BOM).
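
For illustration, the A" a" O" o" U" u" notation from all.words could be rewritten into real umlauts before building german.med. The following Python sketch is an assumption-laden example, not the method recommended by the tsearch2 docs: the mapping table and the function name ispell_to_unicode are made up here, only the six umlaut forms quoted above are handled (ß is left out), and the affix file would presumably need the same treatment.

```python
# Hypothetical mapping, covering only the six umlaut forms quoted in the post.
ISPELL_TO_UNICODE = {
    'A"': "Ä", 'a"': "ä",
    'O"': "Ö", 'o"': "ö",
    'U"': "Ü", 'u"': "ü",
}


def ispell_to_unicode(line):
    """Convert the word part of an all.words line, keeping the /FLAGS part."""
    # Split "word/FLAGS" at the first slash; lines without flags keep sep == "".
    word, sep, flags = line.partition("/")
    for ascii_form, umlaut in ISPELL_TO_UNICODE.items():
        word = word.replace(ascii_form, umlaut)
    return word + sep + flags


# Sample lines taken from the beginning of all.words as quoted above.
sample = ['A"bte/N', 'a"chten/DIXY', 'A"chtens', 'A"cker/N']
for entry in sample:
    print(ispell_to_unicode(entry))
```

The resulting strings would still have to be written out in the encoding the dictionary loader expects, which is exactly the open question here.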

Now has anybody actually managed to get a working configuration with tsearch2 and german language support in a unicode-database? What am I doing wrong? I just can't find any more hints in the docs, and there's a topic on the OpenFTS-Mailinglist with somewhat similar issues ( http://sourceforge.net/mailarchive/f...&forum_id=7671 ), but nothing which would actually help to resolve it.

Kind regards

Markus

Nov 23 '05 #1