473,545 Members | 2,113 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

How to program a really fast product search catalogue

DC
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(p roducts, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek
Nov 16 '05 #1
6 3321
You might want to look into Lucene.NET if you're planning to move away from
database land

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram
"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(p roducts, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #2
DC,

As far as ignoring the typos, have you considered the SOUNDEX function
(and the DIFFERENCE function as well)? It will allow you to determine what
words are similar (check the SOUNDEX documentation for more details).

Also, I suspect that CONTAINSTABLE is probably causing the slowdown.
However, I can't see why the query expression is slower. Can you post the
query? It should be a simple join (between the item, and the words that
describe it), with a filter on the specific words (an operation that should
be very fast).

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard. caspershouse.co m

"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(p roducts, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #3
On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan" <ks*****@NOSPAM gmx.net>
wrote:
You might want to look into Lucene.NET if you're planning to move away from
database land


I though Lucene was java only... :-P After googling for a while, I couldn't
find any reference to a .net version.
Nov 16 '05 #4
you could insert the programming before the data gets to the database, i.e.
instead of searching through the verbose descriptions, parse the verbose
descriptions out into words that are stored in a separate table with a
primary key and a one to many relationship with the products (i.e. one
product has many words associated with it). Then when the user types their
search into the box, you again parse that up into words and use those words
to match products by doing a join on the two tables and a where clause on
the word table (making a big "word='juic y' or word='mac' or word='apple' "
string is faster than using "word in ('juicy', 'apple', 'mac') ")

"DC" <dc@upsize.de > wrote in message
news:5b******** *************** ***@posting.goo gle.com...
Hi,

I am programming a search catalogue with 200000 items (and growing). I
am currently using the SQL Server 2000 fulltext engine for this task
but it does not fit the requirements anymore.

The products typically do have a verbose name, "canadian superapples:
red tasty juicy macintosh apple from toronto" and the like. If a
customer is looking for "canadian apple" this product needs to match,
but also if he is looking for "juicy mac".

In SQL Server I am using the clause CONTAINSTABLE(p roducts, name,
'"juicy*" ~ "mac*", 50000) to archieve this but first of all this is
not really fast (I guess the fulltext index service is more into
really long text - or just too slow) and I am running into the
limitation that people search for "green apple" and there are 80.000
products in the database which contain the word "green" in their name,
therefore sql server will come back with an empty result table (since
the fulltext service only digs up to about 20000 items and breaks
thereafter).

I would also love to get some search tolerance, if people enter
"africa banona" it would be great if the algorithm could ignore the
typo.

I am wondering if it is time to implement my own search routine for
that, and that is why I post this message to the csharp and the
sqlserver newsgroup. I am free to use any .net and/or sql server
solution.

I thought it would be a good idea to split the names into single
expressions and use a 1:n relation between the expressions and the
matching products. So I can use a clustered index search on the table
with the expressions. The search for "juicy mac" would then get a lock
on "juicy" which matches 20000 products and macintosh which matches
5000 products and the intersection of the two expressions is the
result list of 3000 products. It turned out, that this solution is
slower than using the fulltext service.

As you can see I am not particulary strong about search algorithms
(database spoiled), so any comments on how to get started would help
me a lot!

Regards
Derek

Nov 16 '05 #5
http://sourceforge.net/projects/dotlucene/

--
Sriram Krishnan

http://www.dotnetjunkies.com/weblog/sriram
"Fernando Rodríguez" <fe************ *******@fernand o-rodriguez.com> wrote in
message news:n6******** *************** *********@4ax.c om...
On Mon, 18 Oct 2004 21:38:46 +0530, "Sriram Krishnan"
<ks*****@NOSPAM gmx.net>
wrote:
You might want to look into Lucene.NET if you're planning to move away
from
database land


I though Lucene was java only... :-P After googling for a while, I
couldn't
find any reference to a .net version.

Nov 16 '05 #6
DC
Hi,

thanks for the suggestions! Since the opinion seems to be, that this
should be doable with SQL Server alone, I have tried it again.

I tried different approaches, and the fastest I could find is this:

I take appart all product names. E.g. I have a funtion that returns
the table

ye
yel
yell
yello
yellow
ap
app
appl
apple

when I enter the term "yellow apple". Splitting the tokens saves some
work later, since I don not have to use the "like" operator when
searching (my application gernerally requires right truncation). I
store the result of this function in a table "tbl_products_t okens",
which is 1:n related to an intersection table which is n:1 related to
the product table. The tbl_products_to kens table does have a grouped
index on the tokens. I also store the number of occurences of each
token in tbl_products_to kens.

tbl_products_to kens:
token id incidents
ye 1 200
yel 2 198
yell 3 197
....
apple 100 25000

tbl_products_to kensis:
id_token id_product
1 2
1 3
....
100 1000

tbl_products:
id name
2 yellow apple
3 yellow submarine
....
1000 red apple

Now, if the search phrase is "yel apple", I am first looking up the
matchin tokens, sorted by incidents (asc):
token incidents
yel 197
apple 25000

Now I fill a table variable with the product id's from table
intersections which match "yel". Then I join this with a second table
variable which contains the product id's matching "apple". To
demonstrate this process I append some t-sql.

The overall performance is now about as mediocre as the fulltext index
was, but it does no longer have the limitation of getting corrupted if
a single token (e.g. "apple") matches more than 20000 products.

I do have quiet some search maintenance to do on my own with this,
tbl_products_to kens contains about 800,000 rows, tbl_products_to kensis
10,000,000 rows.

What I cannot archieve with this approach is search tolerance. I
implemented SOUNDEX in an earlier approach but that does not help much
with the (typically non-english) product names.

I also wonder how product like Google Desktop and Lookout (the Outlook
plugin) manage to search huge amounts of fulltext data as lightening
fast as they do. I would carry out the product search in CSharp
outside SQL Server if I could expect to archieve a performance like
that.

Kind regards

Derek
declare @m1 table ([id] int primary key)
declare @m2 table ([id] int primary key)
declare @tokens table (id int identity(1,1), token varchar(63),
incidents int)

insert into @tokens (token, incidents)
select ts.token, sum(incidents)
from tbl_products_to kens t
right join dbo.func_tokeni ze_select(@sear chfor) ts on t.token like
ts.token
group by ts.token
order by sum(incidents) asc

set @mtoks = @@rowcount
set @mtok_id = 1

select * from @tokens

if @mtoks > 0
begin
select top 1 @token = token from @tokens where id = 1
insert into @m1 (id)
select id from dbo.func_search _token(@token)
set @mtok_id = 2
select @mcount = count(id) from @m1
end

while @mtoks >= @mtok_id
begin
select top 1 @token = token from @tokens where id = @mtok_id
if @mtok_id % 2 = 0
begin
insert into @m2
select m1.id from @m1 m1
join dbo.func_search _token(@token) m2 on m2.id = m1.id
delete @m1
select @mcount = count(id) from @m2
end
else
begin
insert into @m1
select m2.id from @m2 m2
join dbo.func_search _token(@token) m1 on m1.id = m2.id
delete @m2
select @mcount = count(id) from @m1
end

set @mtok_id = @mtok_id + 1

if @mcount = 0
break
end

create function func_search_tok en (@token varchar(63))
returns @matches table ([id] int)

as
begin
insert into @matches
select its.id_product
from tbl_products_to kensis its
join tbl_products_to kens it on it.id = its.id_token
where it.token = @token
return
end
Nov 16 '05 #7

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

1
1563
by: mark | r | last post by:
anyone know where i can find a free product catalogue system that has categorised products and administration of product data and images? thanks mark
0
1621
by: SoftComplete Development | last post by:
AlphaTIX is a powerful, fast, scalable and easy to use Full Text Indexing and Retrieval library that will completely satisfy your application's indexing and retrieval needs. AlphaTIX indexing technology provides you with highest indexing performance, possibility to index very large sets of data in minimal time even with memory constraints and...
8
1666
by: Steve | last post by:
I notice that search engines are now finding robots.txt files and catalogue their contents. Is this wise I wonder? Is it a possible security risk? I even found the White House robots.txt file on Google. Surely disclosing detains of the directory structure is an open invitation for hackers to 'take a look'? Does anyone else feel the same...
5
20375
by: Adam Clauss | last post by:
I am needing to automate some installation tasks. I first need to determine if there is already a version of a program installed. The program appears in the Add/Remove Programs in control panel, and I know this is all stored in the registry. But short of searching through the entire registry, I cannot figure out how to locate the program. ...
1
16483
by: www.web20developers.com | last post by:
http://www.web20developers.com http://www.web20developers.com/index.php?option=com_content&task=view... Ajallerix : AJAX, simple, fast Web image gallery demo ; at Novell AJAX - microlink pattern tutorial : A microlink is a link that opens up
6
1862
by: matt.b.williams | last post by:
I am writing a program for a handheld device. It is a simple C program and it runs in a DOS like environment. I'm compiling it in Microsoft Visual C v1.x. Anyway, as part of this application, I have to search a rather large lookup file. Simply looping on a flat file takes far too long and the processor isn't fast enough to compile an...
3
1556
by: lapkorts | last post by:
Hi, I am looking for product catalogue/shop, preferable for Joomla or known CMS with the following two features: 1. Multidimentional. I.e. each product could belong to several trees/ branches same time. For instance I want to find particular product via brand and classification accepted inside particular brand. It is the first dimension....
4
1570
by: =?Utf-8?B?MjJQb20=?= | last post by:
Hi All, This is all new to me so please be patient with me. What I have is a very large 'Al-In-One' program, not yet complete, that has over 70 Forms/Modules/Classes in it and needs to be broken up into smaller programs/modules that can be called from the main program. I understand that I would need to, via a Click Button to close one...
0
7468
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main...
0
7656
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. ...
0
7757
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the...
0
5972
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then...
1
5329
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes...
0
3443
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
1884
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
1
1014
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
704
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.