Bytes IT Community

making tsearch2 dictionaries

Ben
I'm trying to make myself a dictionary for tsearch2 that converts
numbers to their english word equivalents. This seems to be working
great, except that I can't figure out how to make my lexize function
return multiple lexemes. For instance, I'd like "100" to get converted
to {one,hundred}, not {"one hundred"} as is currently happening.

How do I specify the output of the lexize function so that this will
happen?
---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 22 '05 #1
16 Replies

Ben
Okay, so I was actually able to answer this question on my own, in a
manner of speaking. It seems the way to do this is to merely return a
larger char** array, with one element for each word. But I was having
trouble with postgres crashing, because (I think) it tries to free each
element independently before using all of them. I had set each element
to a different null-terminated chunk of the same palloc'd memory
segment. Having never written C stored procs before, I take it that's
bad practice?

Anyway, now that this is working, my next question is: can I take the
lexemes from one dictionary lookup and pipe them into another
dictionary? I see that I can have redundant dictionaries, such that if
lexemes aren't found in one it'll try another, but that's not quite the
same.

For instance, the en_stem dictionary converts "hundred" into "hundr".
Right now, my dictionary converts "100" into "one" and "hundred", but
I'd like it to filter both one and hundred through the en_stem
dictionary to arrive at "one" and "hundr".

It also occurs to me I could pipe things through an ispell dictionary
and be able to handle misspellings....

On Sun, 2004-02-15 at 15:35, Ben wrote:
I'm trying to make myself a dictionary for tsearch2 that converts
numbers to their english word equivalents. This seems to be working
great, except that I can't figure out how to make my lexize function
return multiple lexemes. For instance, I'd like "100" to get converted
to {one,hundred}, not {"one hundred"} as is currently happening.

How do I specify the output of the lexize function so that this will
happen?


Nov 22 '05 #2

From http://www.sai.msu.su/~megera/oddmus...ch_V2_in_Brief

This is the table for storing dictionaries. The dict_init field stores the Oid of the function that initializes the dictionary. Dict_init takes one option, a text value from dict_initoption, and should return the internal representation (structure) of the dictionary. The structure must be malloc'd, or palloc'd in TopMemoryContext. Dict_init is called only once per process.
The dict_lexize field stores the Oid of the function that lemmatizes a lexeme.
Input values: the dictionary structure, a pointer to a string, and its length. Output: a pointer to an array of pointers to C strings. The last pointer in the array must be NULL. Returning NULL means the dictionary can't resolve the word; returning an empty array means the dictionary knows the word but considers it a stop word.

Ben wrote:
I'm trying to make myself a dictionary for tsearch2 that converts
numbers to their english word equivalents. This seems to be working
great, except that I can't figure out how to make my lexize function
return multiple lexemes. For instance, I'd like "100" to get converted
to {one,hundred}, not {"one hundred"} as is currently happening.

How do I specify the output of the lexize function so that this will
happen?


--
Teodor Sigaev E-mail: te****@sigaev.ru


Nov 22 '05 #3

Ben <be***@silentmedia.com> writes:
Okay, so I was actually able to answer this question on my own, in a
manner of speaking. It seems the way to do this is to merely return a
larger char** array, with one element for each word. But I was having
trouble with postgres crashing, because (I think) it tries to free each
element independently before using all of them. I had set each element
to a different null-terminated chunk of the same palloc'd memory
segment. Having never written C stored procs before, I take it that's
bad practice?


Given Teodor's response, I think the issue is probably that you were
palloc'ing in too short-lived a context. But whatever the problem is,
you'll narrow it down a lot faster if you build with --enable-cassert.
I wouldn't ever recommend trying to debug C functions without that.

regards, tom lane


Nov 22 '05 #4

Excuse me, but I was too brief.
I meant that your dictionary's lexize method should return a pointer to an array with three elements: the first should point to the "one" C string, the second to the "hundred" C string, and the third should be NULL.
The array and the C strings should be palloc'd in a short-lived context, because they only live while the text is being parsed.


Tom Lane wrote:
Ben <be***@silentmedia.com> writes:
Okay, so I was actually able to answer this question on my own, in a
manner of speaking. It seems the way to do this is to merely return a
larger char** array, with one element for each word. But I was having
trouble with postgres crashing, because (I think) it tries to free each
element independently before using all of them. I had set each element
to a different null-terminated chunk of the same palloc'd memory
segment. Having never written C stored procs before, I take it that's
bad practice?

Given Teodor's response, I think the issue is probably that you were
palloc'ing in too short-lived a context. But whatever the problem is,
you'll narrow it down a lot faster if you build with --enable-cassert.
I wouldn't ever recommend trying to debug C functions without that.

regards, tom lane



--
Teodor Sigaev E-mail: te****@sigaev.ru


Nov 22 '05 #5

Ben
Thanks for the replies. Just to clarify what I was doing, quaicode
looked something like:

phrase = palloc(8);
phrase = "foo\0bar\0";
res = palloc(3);
res[0] = phrase[0];
res[1] = phrase[5];
res[2] = 0;

That crashed. Once I changed it to:

res = palloc(3);
res[0] = palloc(4);
res[0] = "foo\0";
res[1] = palloc(4);
res[2] = "bar\0";
res[3] = 0;

it worked.

Anyway, I'm happy to forget my pain with this if only I could figure out
how to pipe the lexemes from one dictionary into another dictionary. :)

On Mon, 2004-02-16 at 08:09, Teodor Sigaev wrote:
Excuse me, but I was too brief.
I meant that your dictionary's lexize method should return a pointer to an array with three elements: the first should point to the "one" C string, the second to the "hundred" C string, and the third should be NULL.
The array and the C strings should be palloc'd in a short-lived context, because they only live while the text is being parsed.


Tom Lane wrote:
Ben <be***@silentmedia.com> writes:
Okay, so I was actually able to answer this question on my own, in a
manner of speaking. It seems the way to do this is to merely return a
larger char** array, with one element for each word. But I was having
trouble with postgres crashing, because (I think) it tries to free each
element independently before using all of them. I had set each element
to a different null-terminated chunk of the same palloc'd memory
segment. Having never written C stored procs before, I take it that's
bad practice?

Given Teodor's response, I think the issue is probably that you were
palloc'ing in too short-lived a context. But whatever the problem is,
you'll narrow it down a lot faster if you build with --enable-cassert.
I wouldn't ever recommend trying to debug C functions without that.

regards, tom lane


Nov 22 '05 #6

Ben
Like I said, quasicode. :)

And in fact I see I even put an off-by-one error in this last email that
wasn't in my function. (Honest!) Should have been "res[1] = phrase[4]"
in the first section.

Are there docs for making parsers? Or anything like gendict?

On Mon, 2004-02-16 at 09:25, Teodor Sigaev wrote:
:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.


Nov 22 '05 #7



Ben wrote:
Thanks for the replies. Just to clarify what I was doing, the quasicode
looked something like:

phrase = palloc(8);
phrase = "foo\0bar\0";
res = palloc(3);
res[0] = phrase[0];
res[1] = phrase[5];
res[2] = 0;

That crashed. Once I changed it to:

res = palloc(3);
res[0] = palloc(4);
res[0] = "foo\0";
res[1] = palloc(4);
res[2] = "bar\0";
res[3] = 0;

it worked.

:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.

--
Teodor Sigaev E-mail: te****@sigaev.ru


Nov 22 '05 #8

Small docs are available at
http://www.sai.msu.su/~megera/oddmus...ch_V2_in_Brief

and in the current implementation at contrib/tsearch2/wparser_def.c. Most of
the code there deals with headline generation.

Ben wrote:
Like I said, quasicode. :)

And in fact I see I even put an off-by-one error in this last email that
wasn't in my function. (Honest!) Should have been "res[1] = phrase[4]"
in the first section.

Are there docs for making parsers? Or anything like gendict?

On Mon, 2004-02-16 at 09:25, Teodor Sigaev wrote:

:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.


--
Teodor Sigaev E-mail: te****@sigaev.ru


Nov 22 '05 #9

btw, Ben, if you get your dictionary working, could you describe the
development process so that other people can benefit from your work? This
part of the tsearch2 documentation is very weak.

Oleg

On Mon, 16 Feb 2004, Teodor Sigaev wrote:


Ben wrote:
Thanks for the replies. Just to clarify what I was doing, the quasicode
looked something like:

phrase = palloc(8);
phrase = "foo\0bar\0";
res = palloc(3);
res[0] = phrase[0];
res[1] = phrase[5];
res[2] = 0;

That crashed. Once I changed it to:

res = palloc(3);
res[0] = palloc(4);
res[0] = "foo\0";
res[1] = palloc(4);
res[2] = "bar\0";
res[3] = 0;

it worked.

:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.


Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 22 '05 #10

Ben
So I noticed. ;) The dictionary's working, and I'd be happy to expand
upon the documentation. Just point me at something to work on.

But, like I said, I really want to figure out a way to pipe the output
of my dictionary through another dictionary. If I can't do that, it
doesn't seem as useful, because "100" (handled by my dictionary) and
"one hundred" (handled by en_stem) currently don't generate the same
ts_vector.

Once I figure out how to tweak the parser to parse things the way I
want, I can expand upon those docs too. Looks like I'm going to need to
reach waaaay back into my brain and dust off my flex knowledge for that,
though....

On Mon, 2004-02-16 at 10:33, Oleg Bartunov wrote:
btw, Ben, if you get your dictionary working, could you describe the
development process so that other people can benefit from your work? This
part of the tsearch2 documentation is very weak.

Oleg

On Mon, 16 Feb 2004, Teodor Sigaev wrote:


Ben wrote:
Thanks for the replies. Just to clarify what I was doing, the quasicode
looked something like:

phrase = palloc(8);
phrase = "foo\0bar\0";
res = palloc(3);
res[0] = phrase[0];
res[1] = phrase[5];
res[2] = 0;

That crashed. Once I changed it to:

res = palloc(3);
res[0] = palloc(4);
res[0] = "foo\0";
res[1] = palloc(4);
res[2] = "bar\0";
res[3] = 0;

it worked.

:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.


Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 22 '05 #11

On Mon, 16 Feb 2004, Ben wrote:
So I noticed. ;) The dictionary's working, and I'd be happy to expand
upon the documentation. Just point me at something to work on.

I think you could just write a paper, "How I built a custom dictionary for tsearch2".
From what I've read, your dictionary could be interesting to people,
especially if you describe the motivation and usage.
Do you want '100' and 'hundred' to be fully equivalent? So that if you
search for '100' you will find documents containing 'hundred'? Interestingly,
you would also find '123', because '123' becomes 'one hundred twenty three'.
But, like I said, I really want to figure out a way to pipe the output
of my dictionary through another dictionary. If I can't do that, it
doesn't seem as useful, because "100" (handled by my dictionary) and
"one hundred" (handled by en_stem) currently don't generate the same
ts_vector.
What's the problem? You can configure which dictionaries, and in what order,
are used for a given type of token (the pg_ts_cfgmap table).
Aha, I see your problem:

www=# select * from ts_debug('one hundred');
     ts_name     | tok_type | description |  token  | dict_name | tsvector
-----------------+----------+-------------+---------+-----------+----------
 default_russian | lword    | Latin word  | one     | {en_stem} | 'one'
 default_russian | lword    | Latin word  | hundred | {en_stem} | 'hundr'

'hundred' becomes 'hundr'. You could use the synonym dictionary, which is
rather simple
(see http://www.sai.msu.su/~megera/oddmus...earch_V2_Notes for details).
Once a word is recognized by the synonym dictionary, it will not be passed
on to the next dictionary! This is how tsearch2 works with every dictionary.
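For reference, the mapping mentioned above lives in the pg_ts_cfgmap table; a change along these lines chooses which dictionaries, and in what order, are tried for Latin-word tokens. The configuration and dictionary names here are illustrative assumptions, not taken from this thread:

```sql
-- Hypothetical tsearch2 setup: try a synonym dictionary first,
-- fall back to en_stem. Note this is fallback, not pipelining:
-- the first dictionary that recognizes a token wins.
UPDATE pg_ts_cfgmap
   SET dict_name = '{synonym,en_stem}'
 WHERE ts_name   = 'default'
   AND tok_alias = 'lword';
```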


Once I figure out how to tweak the parser to parse things the way I
want, I can expand upon those docs too. Looks like I'm going to need to
reach waaaay back into my brain and dust off my flex knowledge for that,
though....
What do you want from the parser?

On Mon, 2004-02-16 at 10:33, Oleg Bartunov wrote:
btw, Ben, if you get you dictionary working, could you describe process
of developing so other people will appreciate your work. This part of
tsearch2 documentation is very weak.

Oleg

On Mon, 16 Feb 2004, Teodor Sigaev wrote:


Ben wrote:
> Thanks for the replies. Just to clarify what I was doing, quaicode
> looked something like:
>
> phrase = palloc(8);
> phrase = "foo\0bar\0";
> res = palloc(3);
> res[0] = phrase[0];
> res[1] = phrase[5];
> res[2] = 0;
>
> That crashed. Once I changed it to:
>
> res = palloc(3);
> res[0] = palloc(4);
> res[0] = "foo\0";
> res[1] = palloc(4);
> res[2] = "bar\0";
> res[3] = 0;
>
> it worked.
>
:)
I hope you mean:
res = palloc(3);
res[0] = palloc(4);
memcpy(res[0] ,"foo", 4);
res[1] = palloc(4);
memcpy(res[1] ,"bar", 4);
res[2] = 0;

Look at indexes of res.


Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83



Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 22 '05 #12

Ben
On Tue, 2004-02-17 at 03:15, Oleg Bartunov wrote:
Do you want '100' and 'hundred' to be fully equivalent? So that if you
search for '100' you will find documents containing 'hundred'? Interestingly,
you would also find '123', because '123' becomes 'one hundred twenty three'.
Yeah, for a general case of documents I'm not sure how accurate it would
make things, but I'm trying to index music artist names and song titles,
where I'd get things like "3 Dog Night".... or is that "Three Dog
Night"? :)
What's the problem? You can configure which dictionaries, and in what order,
are used for a given type of token (the pg_ts_cfgmap table).
Aha, I see your problem: Once a word is recognized by the synonym dictionary,
it will not be passed on to the next dictionary! This is how tsearch2 works with every dictionary.
Yep, that's my problem. :) And it seems that if I could pass the normal
words into an ispell dictionary before passing them on to the en_stem
dictionary, I'd get spell checking for free. Unless there's a better way
to give "did you mean: <your search spelled correctly>?" results....?

I know doing this would increase the size of the generated ts_vector,
but for my case, where what I'm indexing is generally only a few words
anyway, that's not an issue. As it is, I'm already going to get rid of
the stop words file, so that I can actually find things like "The Who."

How hard do you think it would be to change the behavior to make this
happen?
What do you want from the parser?


I want to be able to recognize symbols, such as the degree (°) and
vulgar half (½) symbols.

Nov 22 '05 #13

On Tue, 17 Feb 2004, Ben wrote:
On Tue, 2004-02-17 at 03:15, Oleg Bartunov wrote:
Do you want '100' and 'hundred' to be fully equivalent? So that if you
search for '100' you will find documents containing 'hundred'? Interestingly,
you would also find '123', because '123' becomes 'one hundred twenty three'.
Yeah, for a general case of documents I'm not sure how accurate it would
make things, but I'm trying to index music artist names and song titles,
where I'd get things like "3 Dog Night".... or is that "Three Dog
Night"? :)
What's the problem? You can configure which dictionaries, and in what order,
are used for a given type of token (the pg_ts_cfgmap table).
Aha, I got your problem:

Once a word is recognized by the synonym dictionary, it will not be passed
on to the next dictionary! This is how tsearch2 works with every dictionary.


Yep, that's my problem. :) And it seems that if I could pass the normal
words into an ispell dictionary before passing them on to the en_stem
dictionary, I'd get spell checking for free. Unless there's a better way
to give "did you mean: <your search spelled correctly>?" results....?


If the ispell dictionary recognizes a word, that word will not be passed to en_stem.
We know how to add a "query spelling" feature to tsearch2; we're just waiting
for sponsorship :) Meanwhile, you could use our trgm module, which
implements trigram-based spelling correction. You need to maintain a
separate table with all the words of interest (say, taken from your tsvectors)
and search for query words in that table using the bestmatch function.
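As a rough sketch of the trgm approach. The function and operator names below follow the later pg_trgm contrib packaging of this code, not necessarily the trgm module as it existed at the time of this thread, so treat them as assumptions:

```sql
CREATE EXTENSION pg_trgm;            -- modern packaging of the trgm module

-- Maintain a table of known words (e.g. harvested from your tsvectors),
-- then look up the closest match for a possibly misspelled query word:
CREATE TABLE words (word text);

SELECT word, similarity(word, 'hundrd') AS sml
  FROM words
 WHERE word % 'hundrd'               -- % is the trigram similarity operator
 ORDER BY sml DESC
 LIMIT 1;
```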
I know doing this would increase the size of the generated ts_vector,
but for my case, where what I'm indexing is generally only a few words
anyway, that's not an issue. As it is, I'm already going to get rid of
the stop words file, so that I can actually find things like "The Who."

How hard do you think it would be to change the behavior to make this
happen?
What do you want from the parser?
I want to be able to recognize symbols, such as the degree (°) and
vulgar half (½) symbols.


You mean '(TA)', '(TH)'? I think it's not very difficult. What would the
token type be (parenthesis_word?)


Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 22 '05 #14

Ben
On Tue, 17 Feb 2004, Oleg Bartunov wrote:
If the ispell dictionary recognizes a word, that word will not be passed to en_stem.
We know how to add a "query spelling" feature to tsearch2; we're just waiting
for sponsorship :) Meanwhile, you could use our trgm module, which
implements trigram-based spelling correction. You need to maintain a
separate table with all the words of interest (say, taken from your tsvectors)
and search for query words in that table using the bestmatch function.


Hm, I'll take a look at this approach. I take it you think piping
dictionary output to more dictionaries in the chain is a bad idea? :)
What do you want from the parser?


I want to be able to recognize symbols, such as the degree (°) and
vulgar half (½) symbols.


You mean '(TA)', '(TH)'? I think it's not very difficult. What would the
token type be (parenthesis_word?)


Uh, not sure how you got (TA) and (TH)... if you look at the original
message with UTF-8 Unicode encoding, the symbols come out fine. Or maybe
you'd just have better luck pointing a browser at a page like
http://homepages.comnet.co.nz/~r-mah...text/utf8.html. I want to be
able to recognize a subset of these symbols, and I'd want another
dictionary I'd make to handle the symbol token to return both the symbol
and its common name as lexemes, in case people spell out the symbol
instead of entering it.

Nov 22 '05 #15

On Tue, 17 Feb 2004, Ben wrote:
On Tue, 17 Feb 2004, Oleg Bartunov wrote:
If the ispell dictionary recognizes a word, that word will not be passed to en_stem.
We know how to add a "query spelling" feature to tsearch2; we're just waiting
for sponsorship :) Meanwhile, you could use our trgm module, which
implements trigram-based spelling correction. You need to maintain a
separate table with all the words of interest (say, taken from your tsvectors)
and search for query words in that table using the bestmatch function.
Hm, I'll take a look at this approach. I take it you think piping
dictionary output to more dictionaries in the chain is a bad idea? :)


It's unpredictable, and I still don't get your idea of pipelining, but
in general I have nothing against it.
> What do you want from the parser?

I want to be able to recognize symbols, such as the degree (°) and
vulgar half (½) symbols.
You mean '(TA)', '(TH)'? I think it's not very difficult. What would the
token type be (parenthesis_word?)


Uh, not sure how you got (TA) and (TH)... if you look at the original
message with UTF-8 Unicode encoding, the symbols come out fine. Or maybe
you'd just have better luck pointing a browser at a page like


Yup:)
http://homepages.comnet.co.nz/~r-mah...text/utf8.html. I want to be
able to recognize a subset of these symbols, and I'd want another
dictionary I'd make to handle the symbol token to return both the symbol
and the common name as lexemes, in case people spell out the symbol
instead of entering it.


Aha, the same way we handle complex hyphenated words: we return both
the whole word and its parts. So you need to introduce a new token type
in the parser and use a synonym dictionary, which in turn will return
both the symbol token and the human-readable word.

Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 22 '05 #16

Ben
On Tue, 17 Feb 2004, Oleg Bartunov wrote:
It's unpredictable, and I still don't get your idea of pipelining, but
in general I have nothing against it.
Oh, well, the idea is that instead of the dictionary search stopping at
the first dictionary in the chain that returns a lexeme, it would take
each of the lexemes returned and pass them on to the next dictionary in
the chain.

So if I specified that numbers were to be handled by my num2english dictionary,
followed by en_stem, and then tried to get a vector for "100",
num2english would return "one" and "hundred". Then both "one" and
"hundred" would each be looked up in en_stem, and the union of these
lexemes would be the final result.

Similarly, if a Latin word gets piped through an ispell dictionary before
being sent to en_stem, each possible spelling would be stemmed.
Aha, the same way we handle complex hyphenated words: we return both
the whole word and its parts. So you need to introduce a new token type
in the parser and use a synonym dictionary, which in turn will return
both the symbol token and the human-readable word.


Okay, that makes sense. I'll look more into how hyphenated words are being
handled now.

Nov 22 '05 #17
