Hi,
I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.
The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".
In SQL Server I am using the clause CONTAINSTABLE(p roducts, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).
I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.
I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.
I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.
As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!
Regards
Derek 6 3321
You might want to look into Lucene.NET if you're planning to move away from
database land
--
Sriram Krishnan http://www.dotnetjunkies.com/weblog/sriram
"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com... Hi,
I am programming a search catalogue with 200000 items (and growing). I am currently using the SQL Server 2000 fulltext engine for this task but it does not fit the requirements anymore.
The products typically do have a verbose name, "canadian superapples: red tasty juicy macintosh apple from toronto" and the like. If a customer is looking for "canadian apple" this product needs to match, but also if he is looking for "juicy mac".
In SQL Server I am using the clause CONTAINSTABLE(p roducts, name, '"juicy*" ~ "mac*", 50000) to archieve this but first of all this is not really fast (I guess the fulltext index service is more into really long text - or just too slow) and I am running into the limitation that people search for "green apple" and there are 80.000 products in the database which contain the word "green" in their name, therefore sql server will come back with an empty result table (since the fulltext service only digs up to about 20000 items and breaks thereafter).
I would also love to get some search tolerance, if people enter "africa banona" it would be great if the algorithm could ignore the typo.
I am wondering if it is time to implement my own search routine for that, and that is why I post this message to the csharp and the sqlserver newsgroup. I am free to use any .net and/or sql server solution.
I thought it would be a good idea to split the names into single expressions and use a 1:n relation between the expressions and the matching products. So I can use a clustered index search on the table with the expressions. The search for "juicy mac" would then get a lock on "juicy" which matches 20000 products and macintosh which matches 5000 products and the intersection of the two expressions is the result list of 3000 products. It turned out, that this solution is slower than using the fulltext service.
As you can see I am not particulary strong about search algorithms (database spoiled), so any comments on how to get started would help me a lot!
Regards Derek
DC,
As far as ignoring the typos, have you considered the SOUNDEX function
(and the DIFFERENCE function as well)? It will allow you to determine what
words are similar (check the SOUNDEX documentation for more details).
Also, I suspect that CONTAINSTABLE is probably causing the slowdown.
However, I can't see why the query expression is slower. Can you post the
query? It should be a simple join (between the item, and the words that
describe it), with a filter on the specific words (an operation that should
be very fast).
Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard. caspershouse.co m
"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com... Hi,
I am programming a search catalogue with 200000 items (and growing). I am currently using the SQL Server 2000 fulltext engine for this task but it does not fit the requirements anymore.
The products typically do have a verbose name, "canadian superapples: red tasty juicy macintosh apple from toronto" and the like. If a customer is looking for "canadian apple" this product needs to match, but also if he is looking for "juicy mac".
In SQL Server I am using the clause CONTAINSTABLE(p roducts, name, '"juicy*" ~ "mac*", 50000) to archieve this but first of all this is not really fast (I guess the fulltext index service is more into really long text - or just too slow) and I am running into the limitation that people search for "green apple" and there are 80.000 products in the database which contain the word "green" in their name, therefore sql server will come back with an empty result table (since the fulltext service only digs up to about 20000 items and breaks thereafter).
I would also love to get some search tolerance, if people enter "africa banona" it would be great if the algorithm could ignore the typo.
I am wondering if it is time to implement my own search routine for that, and that is why I post this message to the csharp and the sqlserver newsgroup. I am free to use any .net and/or sql server solution.
I thought it would be a good idea to split the names into single expressions and use a 1:n relation between the expressions and the matching products. So I can use a clustered index search on the table with the expressions. The search for "juicy mac" would then get a lock on "juicy" which matches 20000 products and macintosh which matches 5000 products and the intersection of the two expressions is the result list of 3000 products. It turned out, that this solution is slower than using the fulltext service.
As you can see I am not particulary strong about search algorithms (database spoiled), so any comments on how to get started would help me a lot!
Regards Derek
On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan" <ks*****@NOSPAM gmx.net>
wrote: You might want to look into Lucene.NET if you're planning to move away from database land
I though Lucene was java only... :-P After googling for a while, I couldn't
find any reference to a .net version.
you could insert the programming before the data gets to the database, i.e.
instead of searching through the verbose descriptions, parse the verbose
descriptions out into words that are stored in a separate table with a
primary key and a one to many relationship with the products (i.e. one
product has many words associated with it). Then when the user types their
search into the box, you again parse that up into words and use those words
to match products by doing a join on the two tables and a where clause on
the word table (making a big "word='juic y' or word='mac' or word='apple' "
string is faster than using "word in ('juicy', 'apple', 'mac') ")
"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com... Hi,
I am programming a search catalogue with 200000 items (and growing). I am currently using the SQL Server 2000 fulltext engine for this task but it does not fit the requirements anymore.
The products typically do have a verbose name, "canadian superapples: red tasty juicy macintosh apple from toronto" and the like. If a customer is looking for "canadian apple" this product needs to match, but also if he is looking for "juicy mac".
In SQL Server I am using the clause CONTAINSTABLE(p roducts, name, '"juicy*" ~ "mac*", 50000) to archieve this but first of all this is not really fast (I guess the fulltext index service is more into really long text - or just too slow) and I am running into the limitation that people search for "green apple" and there are 80.000 products in the database which contain the word "green" in their name, therefore sql server will come back with an empty result table (since the fulltext service only digs up to about 20000 items and breaks thereafter).
I would also love to get some search tolerance, if people enter "africa banona" it would be great if the algorithm could ignore the typo.
I am wondering if it is time to implement my own search routine for that, and that is why I post this message to the csharp and the sqlserver newsgroup. I am free to use any .net and/or sql server solution.
I thought it would be a good idea to split the names into single expressions and use a 1:n relation between the expressions and the matching products. So I can use a clustered index search on the table with the expressions. The search for "juicy mac" would then get a lock on "juicy" which matches 20000 products and macintosh which matches 5000 products and the intersection of the two expressions is the result list of 3000 products. It turned out, that this solution is slower than using the fulltext service.
As you can see I am not particulary strong about search algorithms (database spoiled), so any comments on how to get started would help me a lot!
Regards Derek http://sourceforge.net/projects/dotlucene/
--
Sriram Krishnan http://www.dotnetjunkies.com/weblog/sriram
"Fernando Rodríguez" <fe************ *******@fernand o-rodriguez.com> wrote in
message news:n6******** *************** *********@4ax.c om... On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan" <ks*****@NOSPAM gmx.net> wrote:
You might want to look into Lucene.NET if you're planning to move away from database land
I though Lucene was java only... :-P After googling for a while, I couldn't find any reference to a .net version.
Hi,
thanks for the suggestions! Since the opinion seems to be, that this
should be doable with SQL Server alone, I have tried it again.
I tried different approaches, and the fastest I could find is this:
I take appart all product names. E.g. I have a funtion that returns
the table
ye
yel
yell
yello
yellow
ap
app
appl
apple
when I enter the term "yellow apple". Splitting the tokens saves some
work later, since I don not have to use the "like" operator when
searching (my application gernerally requires right truncation). I
store the result of this function in a table "tbl_products_t okens",
which is 1:n related to an intersection table which is n:1 related to
the product table. The tbl_products_to kens table does have a grouped
index on the tokens. I also store the number of occurences of each
token in tbl_products_to kens.
tbl_products_to kens:
token id incidents
ye 1 200
yel 2 198
yell 3 197
....
apple 100 25000
tbl_products_to kensis:
id_token id_product
1 2
1 3
....
100 1000
tbl_products:
id name
2 yellow apple
3 yellow submarine
....
1000 red apple
Now, if the search phrase is "yel apple", I am first looking up the
matchin tokens, sorted by incidents (asc):
token incidents
yel 197
apple 25000
Now I fill a table variable with the product id's from table
intersections which match "yel". Then I join this with a second table
variable which contains the product id's matching "apple". To
demonstrate this process I append some t-sql.
The overall performance is now about as mediocre as the fulltext index
was, but it does no longer have the limitation of getting corrupted if
a single token (e.g. "apple") matches more than 20000 products.
I do have quiet some search maintenance to do on my own with this,
tbl_products_to kens contains about 800,000 rows, tbl_products_to kensis
10,000,000 rows.
What I cannot archieve with this approach is search tolerance. I
implemented SOUNDEX in an earlier approach but that does not help much
with the (typically non-english) product names.
I also wonder how product like Google Desktop and Lookout (the Outlook
plugin) manage to search huge amounts of fulltext data as lightening
fast as they do. I would carry out the product search in CSharp
outside SQL Server if I could expect to archieve a performance like
that.
Kind regards
Derek
declare @m1 table ([id] int primary key)
declare @m2 table ([id] int primary key)
declare @tokens table (id int identity(1,1), token varchar(63),
incidents int)
insert into @tokens (token, incidents)
select ts.token, sum(incidents)
from tbl_products_to kens t
right join dbo.func_tokeni ze_select(@sear chfor) ts on t.token like
ts.token
group by ts.token
order by sum(incidents) asc
set @mtoks = @@rowcount
set @mtok_id = 1
select * from @tokens
if @mtoks > 0
begin
select top 1 @token = token from @tokens where id = 1
insert into @m1 (id)
select id from dbo.func_search _token(@token)
set @mtok_id = 2
select @mcount = count(id) from @m1
end
while @mtoks >= @mtok_id
begin
select top 1 @token = token from @tokens where id = @mtok_id
if @mtok_id % 2 = 0
begin
insert into @m2
select m1.id from @m1 m1
join dbo.func_search _token(@token) m2 on m2.id = m1.id
delete @m1
select @mcount = count(id) from @m2
end
else
begin
insert into @m1
select m2.id from @m2 m2
join dbo.func_search _token(@token) m1 on m1.id = m2.id
delete @m2
select @mcount = count(id) from @m1
end
set @mtok_id = @mtok_id + 1
if @mcount = 0
break
end
create function func_search_tok en (@token varchar(63))
returns @matches table ([id] int)
as
begin
insert into @matches
select its.id_product
from tbl_products_to kensis its
join tbl_products_to kens it on it.id = its.id_token
where it.token = @token
return
end This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: mark | r |
last post by:
anyone know where i can find a free product catalogue system that has
categorised products and administration of product data and images?
thanks
mark
|
by: SoftComplete Development |
last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text
Indexing and Retrieval library that will completely satisfy your
application's indexing and retrieval needs.
AlphaTIX indexing technology provides you with highest indexing
performance, possibility to index very large sets of data in minimal
time even with memory constraints and...
|
by: Steve |
last post by:
I notice that search engines are now finding robots.txt files and catalogue
their contents. Is this wise I wonder? Is it a possible security risk?
I even found the White House robots.txt file on Google. Surely disclosing
detains of the directory structure is an open invitation for hackers to 'take a
look'?
Does anyone else feel the same...
|
by: Adam Clauss |
last post by:
I am needing to automate some installation tasks. I first need to determine
if there is already a version of a program installed. The program appears
in the Add/Remove Programs in control panel, and I know this is all stored
in the registry. But short of searching through the entire registry, I
cannot figure out how to locate the program.
...
|
by: www.web20developers.com |
last post by:
http://www.web20developers.com
http://www.web20developers.com/index.php?option=com_content&task=view...
Ajallerix : AJAX, simple, fast Web image gallery demo ; at Novell
AJAX - microlink pattern tutorial : A microlink is a link that opens up
| |
by: matt.b.williams |
last post by:
I am writing a program for a handheld device. It is a simple C
program and it runs in a DOS like environment. I'm compiling it in
Microsoft Visual C v1.x. Anyway, as part of this application, I have
to search a rather large lookup file. Simply looping on a flat file
takes far too long and the processor isn't fast enough to compile an...
|
by: lapkorts |
last post by:
Hi,
I am looking for product catalogue/shop, preferable for Joomla or
known CMS with the following two features:
1. Multidimentional. I.e. each product could belong to several trees/
branches same time. For instance I want to find particular product via
brand and classification accepted inside particular brand. It is the
first dimension....
|
by: =?Utf-8?B?MjJQb20=?= |
last post by:
Hi All,
This is all new to me so please be patient with me.
What I have is a very large 'Al-In-One' program, not yet complete, that has
over 70 Forms/Modules/Classes in it and needs to be broken up into smaller
programs/modules that can be called from the main program. I understand that
I would need to, via a Click Button to close one...
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it. ...
|
by: tracyyun |
last post by:
Dear forum friends,
With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
| |
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules.
He will explain when you may want to use classes...
|
by: adsilva |
last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
|
by: 6302768590 |
last post by:
Hai team
i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
|
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...
| |