473,386 Members | 1,804 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,386 software developers and data experts.

How to program a really fast product search catalogue

DC
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(products, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek
Nov 16 '05 #1
6 3310
You might want to look into Lucene.NET if you're planning to move away from
database land

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram
"DC" <dc@upsize.de> wrote in message
news:5b**************************@posting.google.c om...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(products, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #2
DC,

As far as ignoring the typos, have you considered the SOUNDEX function
(and the DIFFERENCE function as well)? It will allow you to determine what
words are similar (check the SOUNDEX documentation for more details).

Also, I suspect that CONTAINSTABLE is probably causing the slowdown.
However, I can't see why the query expression is slower. Can you post the
query? It should be a simple join (between the item, and the words that
describe it), with a filter on the specific words (an operation that should
be very fast).

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

"DC" <dc@upsize.de> wrote in message
news:5b**************************@posting.google.c om...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(products, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #3
On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan" <ks*****@NOSPAMgmx.net>
wrote:
You might want to look into Lucene.NET if you're planning to move away from
database land


I though Lucene was java only... :-P After googling for a while, I couldn't
find any reference to a .net version.
Nov 16 '05 #4
you could insert the programming before the data gets to the database, i.e.
instead of searching through the verbose descriptions, parse the verbose
descriptions out into words that are stored in a separate table with a
primary key and a one to many relationship with the products (i.e. one
product has many words associated with it). Then when the user types their
search into the box, you again parse that up into words and use those words
to match products by doing a join on the two tables and a where clause on
the word table (making a big "word='juicy' or word='mac' or word='apple' "
string is faster than using "word in ('juicy', 'apple', 'mac') ")

"DC" <dc@upsize.de> wrote in message
news:5b**************************@posting.google.c om...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(products, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #5
http://sourceforge.net/projects/dotlucene/

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram
"Fernando Rodríguez" <fe*******************@fernando-rodriguez.com> wrote in
message news:n6********************************@4ax.com...
On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan"
<ks*****@NOSPAMgmx.net>
wrote:
You might want to look into Lucene.NET if you're planning to move away
from
database land


I though Lucene was java only... :-P After googling for a while, I
couldn't
find any reference to a .net version.

Nov 16 '05 #6
DC
Hi,

thanks for the suggestions! Since the opinion seems to be, that this
should be doable with SQL Server alone, I have tried it again.

I tried different approaches, and the fastest I could find is this:

I take appart all product names. E.g. I have a funtion that returns
the table

ye
yel
yell
yello
yellow
ap
app
appl
apple

when I enter the term "yellow apple". Splitting the tokens saves some
work later, since I don not have to use the "like" operator when
searching (my application gernerally requires right truncation). I
store the result of this function in a table "tbl_products_tokens",
which is 1:n related to an intersection table which is n:1 related to
the product table. The tbl_products_tokens table does have a grouped
index on the tokens. I also store the number of occurences of each
token in tbl_products_tokens.

tbl_products_tokens:
token id incidents
ye 1 200
yel 2 198
yell 3 197
....
apple 100 25000

tbl_products_tokensis:
id_token id_product
1 2
1 3
....
100 1000

tbl_products:
id name
2 yellow apple
3 yellow submarine
....
1000 red apple

Now, if the search phrase is "yel apple", I am first looking up the
matchin tokens, sorted by incidents (asc):
token incidents
yel 197
apple 25000

Now I fill a table variable with the product id's from table
intersections which match "yel". Then I join this with a second table
variable which contains the product id's matching "apple". To
demonstrate this process I append some t-sql.

The overall performance is now about as mediocre as the fulltext index
was, but it does no longer have the limitation of getting corrupted if
a single token (e.g. "apple") matches more than 20000 products.

I do have quiet some search maintenance to do on my own with this,
tbl_products_tokens contains about 800,000 rows, tbl_products_tokensis
10,000,000 rows.

What I cannot archieve with this approach is search tolerance. I
implemented SOUNDEX in an earlier approach but that does not help much
with the (typically non-english) product names.

I also wonder how product like Google Desktop and Lookout (the Outlook
plugin) manage to search huge amounts of fulltext data as lightening
fast as they do. I would carry out the product search in CSharp
outside SQL Server if I could expect to archieve a performance like
that.

Kind regards

Derek
declare @m1 table ([id] int primary key)
declare @m2 table ([id] int primary key)
declare @tokens table (id int identity(1,1), token varchar(63),
incidents int)

insert into @tokens (token, incidents)
select ts.token, sum(incidents)
from tbl_products_tokens t
right join dbo.func_tokenize_select(@searchfor) ts on t.token like
ts.token
group by ts.token
order by sum(incidents) asc

set @mtoks = @@rowcount
set @mtok_id = 1

select * from @tokens

if @mtoks > 0
begin
select top 1 @token = token from @tokens where id = 1
insert into @m1 (id)
select id from dbo.func_search_token(@token)
set @mtok_id = 2
select @mcount = count(id) from @m1
end

while @mtoks >= @mtok_id
begin
select top 1 @token = token from @tokens where id = @mtok_id
if @mtok_id % 2 = 0
begin
insert into @m2
select m1.id from @m1 m1
join dbo.func_search_token(@token) m2 on m2.id = m1.id
delete @m1
select @mcount = count(id) from @m2
end
else
begin
insert into @m1
select m2.id from @m2 m2
join dbo.func_search_token(@token) m1 on m1.id = m2.id
delete @m2
select @mcount = count(id) from @m1
end

set @mtok_id = @mtok_id + 1

if @mcount = 0
break
end

create function func_search_token (@token varchar(63))
returns @matches table ([id] int)

as
begin
insert into @matches
select its.id_product
from tbl_products_tokensis its
join tbl_products_tokens it on it.id = its.id_token
where it.token = @token
return
end
Nov 16 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
by: mark | r | last post by:
anyone know where i can find a free product catalogue system that has categorised products and administration of product data and images? thanks mark
0
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing...
8
by: Steve | last post by:
I notice that search engines are now finding robots.txt files and catalogue their contents. Is this wise I wonder? Is it a possible security risk? I even found the White House robots.txt file on...
5
by: Adam Clauss | last post by:
I am needing to automate some installation tasks. I first need to determine if there is already a version of a program installed. The program appears in the Add/Remove Programs in control panel,...
1
by: www.web20developers.com | last post by:
http://www.web20developers.com http://www.web20developers.com/index.php?option=com_content&task=view... Ajallerix : AJAX, simple, fast Web image gallery demo ; at Novell AJAX -...
6
by: matt.b.williams | last post by:
I am writing a program for a handheld device. It is a simple C program and it runs in a DOS like environment. I'm compiling it in Microsoft Visual C v1.x. Anyway, as part of this application, I...
3
by: lapkorts | last post by:
Hi, I am looking for product catalogue/shop, preferable for Joomla or known CMS with the following two features: 1. Multidimentional. I.e. each product could belong to several trees/ branches...
4
by: =?Utf-8?B?MjJQb20=?= | last post by:
Hi All, This is all new to me so please be patient with me. What I have is a very large 'Al-In-One' program, not yet complete, that has over 70 Forms/Modules/Classes in it and needs to be...
0
by: taylorcarr | last post by:
A Canon printer is a smart device known for being advanced, efficient, and reliable. It is designed for home, office, and hybrid workspace use and can also be used for a variety of purposes. However,...
0
by: aa123db | last post by:
Variable and constants Use var or let for variables and const fror constants. Var foo ='bar'; Let foo ='bar';const baz ='bar'; Functions function $name$ ($parameters$) { } ...
0
by: ryjfgjl | last post by:
If we have dozens or hundreds of excel to import into the database, if we use the excel import function provided by database editors such as navicat, it will be extremely tedious and time-consuming...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.