
search engine challenge

Hello,

I'm running a site with more than 20,000 articles. The articles (HTML files)
are saved on the server as txt files. All other data (author, date, category
and so on) are in a MySQL db. Previously we stored the articles in the db as
well and ran SQL queries against them for the search engine. But this is no
longer feasible since there are too many articles and the db has gotten too
big. Every search has to scan the whole db and the server CPU maxes out.
I'm looking for a PHP search engine that automatically indexes the txt
files and produces one index file with all indexed words plus the ids of the
articles containing those words. That way the search script no longer has to
query all the articles (the whole db), just this one index file. It would
also be nice to have a blacklist of words (the, a, ...) and other admin
features.

Does anyone have experience with this?

Greetz,
Frank.

Jul 17 '05 #1
Frank wrote:

I'm running a site with more than 20,000 articles. The articles (HTML
files) are saved on the server as txt files. All other data (author,
date, category and so on) are in a MySQL db. Previously we stored the
articles in the db as well and ran SQL queries against them for the
search engine. But this is no longer feasible since there are too many
articles and the db has gotten too big. Every search has to scan the
whole db and the server CPU maxes out. I'm looking for a PHP search
engine that automatically indexes the txt files and produces one index
file with all indexed words plus the ids of the articles containing
those words. That way the search script no longer has to query all the
articles (the whole db), just this one index file. It would also be
nice to have a blacklist of words (the, a, ...) and other admin
features.


If the site is public, have you thought about letting Google do the
hard work, and then either using the Google site search, or the Google
Web API to display results? Google is getting _very_ fast in indexing
large amounts of data on one's site. They picked up thousands of my
pages recently while I was playing around with the htaccess... even too
fast for my taste since I changed it again the next day...

--
Google Blogoscoped
http://blog.outer-court.com
Jul 17 '05 #2
I don't think it's possible to have Google index a MySQL db, is it? And the
HTML files on the server don't have an .html extension.

"Philipp Lenssen" <in**@outer-court.com> wrote in message
news:bv************@ID-203055.news.uni-berlin.de...
Frank wrote:

I'm running a site with more than 20,000 articles. The articles (HTML
files) are saved on the server as txt files. All other data (author,
date, category and so on) are in a MySQL db. Previously we stored the
articles in the db as well and ran SQL queries against them for the
search engine. But this is no longer feasible since there are too many
articles and the db has gotten too big. Every search has to scan the
whole db and the server CPU maxes out. I'm looking for a PHP search
engine that automatically indexes the txt files and produces one index
file with all indexed words plus the ids of the articles containing
those words. That way the search script no longer has to query all the
articles (the whole db), just this one index file. It would also be
nice to have a blacklist of words (the, a, ...) and other admin
features.


If the site is public, have you thought about letting Google do the
hard work, and then either using the Google site search, or the Google
Web API to display results? Google is getting _very_ fast in indexing
large amounts of data on one's site. They picked up thousands of my
pages recently while I was playing around with the htaccess... even too
fast for my taste since I changed it again the next day...

--
Google Blogoscoped
http://blog.outer-court.com

Jul 17 '05 #3
Hello,

On 01/26/2004 10:26 AM, Frank wrote:
I'm running a site with more than 20,000 articles. The articles (HTML files)
are saved on the server as txt files. All other data (author, date, category
and so on) are in a MySQL db. Previously we stored the articles in the db as
well and ran SQL queries against them for the search engine. But this is no
longer feasible since there are too many articles and the db has gotten too
big. Every search has to scan the whole db and the server CPU maxes out.
I'm looking for a PHP search engine that automatically indexes the txt
files and produces one index file with all indexed words plus the ids of the
articles containing those words. That way the search script no longer has to
query all the articles (the whole db), just this one index file. It would
also be nice to have a blacklist of words (the, a, ...) and other admin
features.

Does anyone have experience with this?


Real search engines do not use SQL. It may be usable for small sites, but
for a large site like yours it is very slow and will eat up your server
resources (disk space, memory, overall speed), as you have already noticed.

A better solution is to use a dedicated crawler that stores its index in
flat files optimized for full text search operations. I use and recommend
ht://Dig, which runs the search on the phpclasses.org site. That is also
what the php.net site and its mirrors use.

ht://Dig is available at www.htdig.org. You may also want to take a look
at this class for interfacing with ht://Dig from PHP. It will save you a
lot of time and patience when configuring, indexing and searching your
site with ht://Dig:

http://www.phpclasses.org/htdiginterface
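
To give a rough idea of what a flat-file word index of this kind looks like, here is a minimal sketch. It is written in Python for brevity and is not ht://Dig's format or code; the stop word list, the function names and the JSON index file are all assumptions made for illustration. It maps each word to the ids of the articles containing it, skips blacklisted words, and answers a query from the index alone.

```python
import json
import re
from collections import defaultdict

# Hypothetical blacklist; a real one would be longer and language-specific.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

TAG_RE = re.compile(r"<[^>]+>")     # crude HTML tag stripper, fine for a sketch
WORD_RE = re.compile(r"[a-z0-9]+")  # ASCII words only, an assumption


def tokenize(html):
    """Strip tags, lowercase, split into words, drop blacklisted words."""
    text = TAG_RE.sub(" ", html).lower()
    return [w for w in WORD_RE.findall(text) if w not in STOP_WORDS]


def build_index(articles):
    """Map each word to the sorted list of article ids that contain it."""
    index = defaultdict(set)
    for article_id, html in articles.items():
        for word in tokenize(html):
            index[word].add(article_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return ids of articles containing every non-blacklisted query word."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


def save_index(index, path):
    """Write the whole index to a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read the index back; the search script needs only this file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # Made-up sample data standing in for the article txt files.
    articles = {
        1: "<p>The quick brown fox</p>",
        2: "<p>A lazy brown dog</p>",
        3: "<p>Quick thinking</p>",
    }
    index = build_index(articles)
    print(search(index, "brown"))      # [1, 2]
    print(search(index, "the quick"))  # [1, 3]
```

The index would be rebuilt (or updated) whenever articles change, so a search only loads the index file and never touches the article text or the db. A production indexer would also need ranking, stemming and incremental updates, which this sketch leaves out.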
--

Regards,
Manuel Lemos

Free ready to use OOP components written in PHP
http://www.phpclasses.org/

MetaL - XML based meta-programming language
http://www.meta-language.net/

Jul 17 '05 #4
Frank wrote:
I don't think it's possible to have Google index a MySQL db, is it? And
the HTML files on the server don't have an .html extension.


The HTML files may not have the extension "html", but extensions do not
matter to most search engines these days (certainly not to the most
important one, Google). As long as you serve the pages as text/html,
that's fine. If you don't expose session IDs as URL parameters, and you
don't use a dozen parameters, the pages get indexed fine. You can still
use htaccess to display the URLs as "....html", by the way (which might
be nicer for users and for PageRank etc.)
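
A rough sketch of what that could look like in an .htaccess file, assuming Apache with mod_rewrite enabled and overrides allowed. The articles/<id>.txt layout is made up for illustration, so adjust the pattern to the real paths:

```apache
# Hypothetical layout: articles are stored as articles/<id>.txt
RewriteEngine On

# Map the public URL articles/123.html to the stored file articles/123.txt
RewriteRule ^articles/([0-9]+)\.html$ articles/$1.txt [L]

# Serve the stored .txt files as HTML rather than plain text
<FilesMatch "\.txt$">
    ForceType text/html
</FilesMatch>
```

With something like this in place, crawlers and users only ever see the .html URLs, while the files on disk keep their txt extension.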

--
Google Blogoscoped
http://blog.outer-court.com
Jul 17 '05 #5
