
search engine challenge

Hello,

I'm running a site with 20,000+ articles. The articles (HTML content) are
saved on the server as txt files. All other data (author, date, category
and so on) are in a MySQL db. We used to keep the articles in the db as
well and run SQL queries for the search engine, but that is no longer
feasible: there are too many articles and the db has gotten too big, so
every search hits the whole db and maxes out the server CPU.
I'm looking for a PHP-based search engine that automatically indexes the
txt files and produces one index file with every indexed word plus the IDs
of the articles containing it. That way the search script no longer has to
query all the articles (the whole db), only this one index file. It would
also be nice to have a blacklist of words (the, a, ...) and other admin
features.
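
To make the idea concrete, here is a minimal sketch of such a one-file
index. It assumes the articles are stored as articles/<id>.txt and that
serialize() is good enough as the index format (both are assumptions to
adjust for the real setup):

<?php
// build_index.php: one index file mapping each word to the article IDs
// that contain it. Assumes the articles live in articles/<id>.txt.

$stopwords = array_flip(array('the', 'a', 'an', 'and', 'of', 'to', 'in'));
$index = array();

foreach (glob('articles/*.txt') as $path) {
    $id = basename($path, '.txt');
    // strip markup, lowercase, split on anything that is not a letter/digit
    $text  = strtolower(strip_tags(file_get_contents($path)));
    $words = preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    foreach (array_unique($words) as $word) {
        if (isset($stopwords[$word]) || strlen($word) < 2) {
            continue;               // skip blacklisted and one-letter words
        }
        $index[$word][] = $id;
    }
}
file_put_contents('search_index.dat', serialize($index));

// search side: load the one index file and intersect the ID lists
$index = unserialize(file_get_contents('search_index.dat'));
$hits  = null;
foreach (array('php', 'search') as $word) {     // example query words
    $ids  = isset($index[$word]) ? $index[$word] : array();
    $hits = ($hits === null) ? $ids : array_intersect($hits, $ids);
}
print_r($hits);   // IDs of articles containing all query words
?>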

Does anyone have experience with this?

Greetz,
Frank.

Jul 17 '05 #1


Frank wrote:
> [original question snipped; see post #1]

If the site is public, have you thought about letting Google do the hard
work, and then using either Google site search or the Google Web API to
display the results? Google is getting _very_ fast at indexing large
amounts of data on a site. They picked up thousands of my pages recently
while I was playing around with the htaccess... too fast for my taste,
even, since I changed it again the next day...
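
One low-effort way to do the site-search part is an ordinary form that
just forwards the query to Google restricted to your own domain via the
site: operator. A minimal sketch (the domain is a placeholder, swap in
your own):

<?php
// Forward the query to Google, limited to this site via the site: operator.
// "www.example.com" is a placeholder; replace it with the real domain.
if (!empty($_GET['q'])) {
    $query = $_GET['q'] . ' site:www.example.com';
    header('Location: http://www.google.com/search?q=' . urlencode($query));
    exit;
}
?>
<form method="get" action="">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>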

--
Google Blogoscoped
http://blog.outer-court.com
Jul 17 '05 #2

I don't think it's possible to have Google index a MySQL db, is it? And the
HTML files on the server don't have the .html extension.

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:bv************@ID-203055.news.uni-berlin.de...
> [Philipp's reply quoted in full; snipped, see post #2]

Jul 17 '05 #3

Hello,

On 01/26/2004 10:26 AM, Frank wrote:
> [original question snipped; see post #1]

Real search engines do not use SQL. It may be usable for small sites, but
for a large site like yours it is very slow and will eat up your server
resources (disk space, memory, overall speed), as you have already noticed.

A better solution is to use a dedicated crawler that stores its data in
flat files optimized for full-text search. I use and recommend ht://Dig
on the phpclasses.org site. That is also what the php.net site and its
mirrors use.

ht://Dig is available at www.htdig.org. You may also want to take a look
at this class for interfacing with ht://Dig from PHP. It will save you a
lot of time and patience when configuring, indexing and searching your
site with ht://Dig:

http://www.phpclasses.org/htdiginterface
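
For reference, the plain ht://Dig integration (without the class above) is
just a search form that hands the query to the htsearch CGI. A minimal page
along those lines; the CGI path and the config name are assumptions to
adjust for your install:

<?php
// Minimal search page handing the query to ht://Dig's htsearch CGI.
// Both /cgi-bin/htsearch and the "mysite" config name are assumptions;
// point them at your own ht://Dig installation.
?>
<form method="get" action="/cgi-bin/htsearch">
  <input type="hidden" name="config" value="mysite">
  <input type="hidden" name="method" value="and">
  <input type="text" name="words" size="30">
  <input type="submit" value="Search">
</form>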
--

Regards,
Manuel Lemos

Free ready to use OOP components written in PHP
http://www.phpclasses.org/

MetaL - XML based meta-programming language
http://www.meta-language.net/

Jul 17 '05 #4

Frank wrote:
> I don't think it's possible to have Google index a MySQL db, is it? And
> the HTML files on the server don't have the .html extension.


The HTML files may not have the ".html" extension, but extensions do not
matter to most search engines these days, and certainly not to the most
important one, Google. As long as you serve the pages as text/html, that's
fine. If you don't expose session IDs as URL parameters and don't use a
dozen parameters, the pages get indexed fine. You can still use htaccess
to display the URLs as "....html", by the way (which might be nicer for
users, and for PageRank etc.)
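
For the ".html" URLs, a small mod_rewrite block in the .htaccess is
usually all it takes. A sketch, assuming the file sits in the web root and
the articles are served by a script (article.php and its id parameter are
placeholders):

# Map pretty .html URLs onto the real PHP script.
# "article.php" and "id" are placeholders for your own script and parameter.
RewriteEngine On
RewriteRule ^articles/([0-9]+)\.html$ article.php?id=$1 [L,QSA]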

--
Google Blogoscoped
http://blog.outer-court.com
Jul 17 '05 #5

