website doc search is extremely SLOW

Trying to use the 'search' in the docs section of PostgreSQL.org
is extremely SLOW. Considering this is a website for a database
and databases are supposed to be good for indexing content, I'd
expect a much faster performance.

I submitted my search over two minutes ago. I just finished this
email to the list. The results have still not come back. I only
searched for:

SECURITY INVOKER

Perhaps this should be worked on?

Dante

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 12 '05
I can modify mine to be client server if you want?

It is a java app, so we need to be able to run jdk1.3 at least?

Dave
On Wed, 2003-12-31 at 00:04, Marc G. Fournier wrote:
does anyone know anything better than mnogosearch, that works with
PostgreSQL, for doing indexing? the database server is a Dual Xeon 2.4G,
4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is
3x72G drive in a RAID5 configuration, and the database server is 7.4 ...
the mnogosearch folk use mysql for their development, so it's possible
there is something they are doing that is slowing this process down, to
compensate for a fault in mysql, but this is ridiculous ...

note that I have it setup with what the mnogosearch folk lists as being
'the fastest schema for large indexes' or 'crc-multi' ...

right now, we're running only 373k docs:

isvr5# indexer -S

Database statistics

Status    Expired      Total
-----------------------------
   415          0        311 Unsupported Media Type
   302          0       1171 Moved Temporarily
   502          0         43 Bad Gateway
   414          0          3 Request-URI Too Long
   301          0        307 Moved Permanently
   404          0       1960 Not found
   410          0          1 Gone
   401          0         51 Unauthorized
   304          0      16591 Not Modified
   200          0     373015 OK
   504          0         48 Gateway Timeout
   400          0          3 Bad Request
     0          2         47 Not indexed yet
-----------------------------
 Total          2     393551

and a vacuum analyze runs nightly ...

anyone with suggestions/ideas? has to be something client/server, like
mnogosearch, as we're dealing with multiple servers searching against the
same database ... so I don't *think* that ht/Dig is a solution, but may be
wrong there ...

On Wed, 30 Dec 2003, Dave Cramer wrote:
search for create index took 59 seconds ?

I've got a fairly (< 1 second for the same search) fast search engine on
the docs at

http://postgresintl.com/search?query=create index

if that link doesn't work, try

postgres.fastcrypt.com/search?query=create index

for now you will have to type it, I'm working on indexing it then making
it pretty

Dave

On Tue, 2003-12-30 at 22:39, D. Dante Lorenso wrote:
Marc G. Fournier wrote:

>On Mon, 29 Dec 2003, D. Dante Lorenso wrote:
>
>>Trying to use the 'search' in the docs section of PostgreSQL.org
>>is extremely SLOW. Considering this is a website for a database
>>and databases are supposed to be good for indexing content, I'd
>>expect a much faster performance.
>>
>>
>What is the full URL for the page you are looking at? Just the 'search
>link' at the top of the page?
>
>
>>Perhaps this should be worked on?
>>
>>
>Looking into it right now ...
>
>

http://www.postgresql.org/ *click Docs on top of page*
http://www.postgresql.org/docs/ * click PostgreSQL static
documentation *

Search this document set: [ SECURITY INVOKER ] Search!
http://www.postgresql.org/search.cgi...CURITY+INVOKER

I loaded that URL on IE and I wait like 2 minutes or more for a response.
then, it usually returns with 1 result. I click the Search! button again
to refresh and it came back a little faster with 0 results?

Searched again from the top and it's a little faster now:

* click search *
> date
Wed Dec 31 22:52:01 CST 2003

* results come back *
> date
Wed Dec 31 22:52:27 CST 2003

Still one result:

PostgreSQL 7.4 Documentation (SQL Key Words)
<http://www.postgresql.org/docs/7.4/static/sql-keywords-appendix.html>
[*0.087%*]
http://www.postgresql.org/docs/7.4/s...-appendix.html
Size: 65401 bytes, modified: Tue, 25 Nov 2003, 15:02:33 AST

However, the page that I SHOULD have found was this one:

http://www.postgresql.org/docs/curre...efunction.html

That page has SECURITY INVOKER in a whole section:

[EXTERNAL] SECURITY INVOKER
[EXTERNAL] SECURITY DEFINER

SECURITY INVOKER indicates that the function is to be executed with
the privileges of the user that calls it. That is the default.
SECURITY DEFINER specifies that the function is to be executed with
the privileges of the user that created it.

Dante

----------
D. Dante Lorenso
da***@lorenso.com

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to ma*******@postgresql.org so that your
message can get through to the mailing list cleanly

--
Dave Cramer
519 939 0336
ICQ # 1467551


----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #11
On Wed, 31 Dec 2003, Dave Cramer wrote:
I can modify mine to be client server if you want?

It is a java app, so we need to be able to run jdk1.3 at least?


jdk1.4 is available on the VMs ... does yours spider? for instance, you
mention that you have the docs indexed right now, but we are currently
indexing:

Server http://archives.postgresql.org/
Server http://advocacy.postgresql.org/
Server http://developer.postgresql.org/
Server http://gborg.postgresql.org/
Server http://pgadmin.postgresql.org/
Server http://techdocs.postgresql.org/
Server http://www.postgresql.org/

will it be able to handle:

186_archives=# select count(*) from url;
count
--------
393551
(1 row)

as fast as you are finding with just the docs?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 12 '05 #12
Hello,

Why are we not using Tsearch2?

Besides the obvious of getting everything into the database?

Sincerely,

Joshua D. Drake
On Tue, 2003-12-30 at 21:24, Marc G. Fournier wrote:
On Wed, 31 Dec 2003, Dave Cramer wrote:
Why are there multiple servers hitting the same db?

what servers are searching through the db?


www.postgresql.org and archives.postgresql.org both hit the same DB ...
the point is more that whatever alternative that someone can suggest, it
has to be able to be accessed centrally from several different machines
... when I just tried a search, I was the only one hitting the database,
and the search was dreadful, so it isn't a problem with multiple
connections :(

Just as an FYI, the database server has sufficient RAM on her, so it isn't
a swapping issue ... swap usage right now, after 77 days uptime:

Device       1K-blocks   Used    Avail Capacity  Type
/dev/da0s1b    8388480  17556  8370924     0%    Interleaved

>
Dave

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Nov 12 '05 #13
Marc,

At our website we had an "in database" search as well... It was terribly
slow (it was a custom-built vector space model implemented in mysql+php,
so that explains a bit).

We replaced it with the Xapian library (www.xapian.org), using its Omega
frontend as a middle end. I.e. our php scripts call the omega search
frontend and postprocess the results (some double checks of access
rights and so on); from the results we build a very simple
SELECT ... FROM documents ... WHERE docid IN implode($docids_array)
(you understand enough php to understand this, I suppose)
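
For readers who would rather see that last step spelled out, here is a rough Python sketch of it (not the PHP code described above). It assumes the external search engine has already returned a ranked list of document ids; the `documents` table and `docid` column come from the query above, while the `title` column, the `fetch_documents` helper and the sample rows are made up for illustration. It runs against an in-memory SQLite database only so that it is self-contained.

```python
import sqlite3


def fetch_documents(conn, docids):
    """Fetch rows for ids returned by an external search engine,
    keeping the engine's ranking order."""
    if not docids:
        return []
    # One placeholder per id, so the ids are bound as parameters
    # instead of being pasted into the SQL string.
    placeholders = ", ".join("?" for _ in docids)
    sql = f"SELECT docid, title FROM documents WHERE docid IN ({placeholders})"
    rows = {row[0]: row for row in conn.execute(sql, list(docids))}
    # IN (...) gives no ordering guarantee, so restore the ranking here.
    # Ids the database does not know (e.g. deleted documents) are skipped.
    return [rows[d] for d in docids if d in rows]


def demo():
    # Hypothetical sample data, only to make the sketch runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (docid INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO documents VALUES (?, ?)",
        [(1, "ext3 basics"), (2, "undelete howto"), (3, "raid5 notes")],
    )
    ranked_ids = [2, 1, 99]  # pretend these came back from the search engine
    return fetch_documents(conn, ranked_ids)
```

The point of the split is that the search engine only has to hand back ids; anything that needs the real rows (rights checks, titles, dates) stays a plain primary-key lookup in the database.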

With our 10GB of text, we have a 14GB (uncompressed, 9G compressed
or so) xapian database (the largest part is the 6.7G positional
table). I'm pretty sure that if we stored that information in something
like tsearch it would be more than that 14GB...

Searches take less than a second (unless you do phrase searches of
course, that takes a few seconds and sometimes a few minutes).

I did a query on 'ext3 undelete' just a few minutes ago and it did the
search in 827150 documents in only 0.027 seconds (0.006 on a second run);
ext3 was found in 753 and undelete in 360 documents. Of course that is
excluding the results parsing; the total time to create the webpage was
"much" longer (0.43 seconds or so) because the results
need to be transferred via xinetd and then extracted
from mysql (which is terrible with the "search supporting queries" we
issue :/ ). Our search machine is very similar to the machine you use as
database, but it doesn't do much heavy work apart from running the
xapian/omega search combination.

If you are interested in this, I can provide (much) more information
about our implementation. Since you don't need right-checks, you could
even get away with just the omega front end all by itself (it has a nice
scripting language, but can't interface with anything but xapian).

The main advantage of taking this out of your sql database is that it
runs on its own custom built storage system (and you could offload it to
another machine, like we did).
Btw, if you really need an "in database" solution, read back through the
postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's
working on integrating xapian into postgresql as an FTI)

Best regards,

Arjen van der Meijden

Nov 12 '05 #14
Marc,

No, it doesn't spider; it is a specialized tool for searching documents.

I'm curious, what value is there in being able to count the number of
URLs?

It does do things like query all documents where CREATE and TABLE are n
words apart, just as fast. I would think these are more valuable for
document searching?
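
To make the "n words apart" idea concrete, here is a small Python sketch of a proximity query over a positional index. It is only an illustration of the general technique and makes no claim about how the tool discussed in this thread implements it; the function names and the sample documents are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, keyed by a made-up document id.
SAMPLE_DOCS = {
    "create_table": "CREATE TABLE defines a new table",
    "create_index": "CREATE INDEX builds an index on a table",
    "vacuum": "VACUUM reclaims storage",
}


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} so distances can be compared."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][doc_id].append(pos)
    return index


def within_n_words(index, a, b, n):
    """Return ids of documents where words a and b occur at most n words apart."""
    docs_a = index.get(a.lower(), {})
    docs_b = index.get(b.lower(), {})
    hits = []
    # Only documents containing both words can match.
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(pa - pb) <= n
               for pa in docs_a[doc_id]
               for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)
```

With the sample data, `within_n_words(build_positional_index(SAMPLE_DOCS), "create", "table", 1)` matches only the `create_table` page, while a distance of 7 also picks up `create_index`. The word positions are what make this cheap, and they are also why a positional table tends to dominate the size of such an index.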

I think the challenge here is what do we want to search. I am betting
that folks use this page as they would man, i.e. what is the command for
create trigger?

As I said, my offer to help out stands, but I think if the goal is to
search the entire website, then this particular tool is not useful.

At this point I am working on indexing the SGML directly, as it has less
cruft in it. For instance, all the links that appear in every summary are
just noise.
Dave


Nov 12 '05 #15
I think that Oleg's new search offering looks really good and fast. (I
can't wait till I have some task that needs tsearch!).

I agree with Dave that searching the docs is more important for me than
the sites - but it would be really nice to have both, in one tool.

I built something similar for the Tate Gallery in the UK - here you can
select the type of content that you want returned, either static pages or
dynamic. You can see the idea at
http://www.tate.org.uk/search/defaul...oil&action=new

This is custom built (using java/Oracle), supports stemming, boolean
operators, exact phrase matching, relevancy and matched term highlighting.

You can switch on/off the types of documents that you are not interested
in. Using this analogy, a search facility that could offer you results
from i) the docs and/or ii) the postgres sites static pages would be very
useful.

John Sidney-Woollett


---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to ma*******@postgresql.org)

Nov 12 '05 #16
You should probably take a look at the Swish project. For a certain
project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
but with over 600,000 documents to index, both took too long to conduct
searches, especially as the database was swapped in and out of memory
based on search segment. MySQL full text was the most unusable.

Swish uses its own internal DB format, and comes with a simple spider as
well. You can make it search by category, date and other nifty criteria
also.
http://swish-e.org

You can take a look over at the project and do some searches to see what
I mean:
http://cbd-net.com

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+----------------------------+
| http://www.did-it.com | "When I'm paid, I always |
| er**@did-it.com | follow the job through. |
| 516-255-0500 | You know that." -Angel Eyes|
+-----------------------+----------------------------+


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 12 '05 #17
Wow, you're right - I could have probably saved myself a load of time! :)

Although you do learn a lot reinventing the wheel... ...or at least you
hit the same issues and insights others did before...

John


---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #18
The search engine I am using is Lucene:
http://jakarta.apache.org/lucene/docs/index.html

It too uses its own internal database format, optimized for searching.
It is quite flexible, and allows searching on arbitrary fields as well.
The section on querying explains more:

http://jakarta.apache.org/lucene/doc...sersyntax.html

It is even possible to index text data inside a database.

Dave
On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
Wow, you're right - I could have probably saved myself a load of time! :)

Although you do learn a lot reinventing the wheel... ...or at least you
hit the same issues and insights others did before...

John

Ericson Smith said:
You should probably take a look at the Swish project. For a certain
project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
but with over 600,000 documents to index, both took too long to conduct
searches, especially as the database was swapped in and out of memory
based on search segment. MySQL full text was the most unusable.

Swish uses its own internal DB format, and comes with a simple spider as
well. You can make it search by category, date and other nifty criteria
also.
http://swish-e.org

You can take a look over at the project and do some searches to see what
I mean:
http://cbd-net.com

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+----------------------------+
| http://www.did-it.com | "When I'm paid, I always |
| er**@did-it.com | follow the job through. |
| 516-255-0500 | You know that." -Angel Eyes|
+-----------------------+----------------------------+

John Sidney-Woollett wrote:
I think that Oleg's new search offering looks really good and fast. (I
can't wait till I have some task that needs tsearch!).

I agree with Dave that searching the docs is more important for me than
the sites - but it would be really nice to have both, in one tool.

I built something similar for the Tate Gallery in the UK - here you can
select the type of content that you want returned, either static pages or
dynamic. You can see the idea at
http://www.tate.org.uk/search/defaul...oil&action=new

It is custom built (using Java/Oracle) and supports stemming, boolean
operators, exact phrase matching, relevancy and matched-term
highlighting.

You can switch on/off the types of documents that you are not interested
in. Using this analogy, a search facility that could offer you results
from i) the docs and/or ii) the Postgres sites' static pages would be very
useful.

John Sidney-Woollett

Dave Cramer said:
Marc,

No, it doesn't spider; it is a specialized tool for searching documents.

I'm curious, what value is there in being able to count the number of
URLs?

It does do things like query all documents where CREATE and TABLE are n
words apart, just as fast. I would think these are more valuable for
document searching?
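
To make the "n words apart" kind of query concrete, here is a minimal sketch of a proximity search over a positional index, written in Python. It is an illustration only, not the tool described here and not Lucene's implementation. The sample documents and the names `build_positional_index` and `within` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def within(index, term_a, term_b, n):
    """Return ids of documents where term_a and term_b occur at most n words apart."""
    a_docs = index.get(term_a.lower(), {})
    b_docs = index.get(term_b.lower(), {})
    hits = set()
    # Only documents containing both terms can match.
    for doc_id in a_docs.keys() & b_docs.keys():
        if any(abs(pa - pb) <= n for pa in a_docs[doc_id] for pb in b_docs[doc_id]):
            hits.add(doc_id)
    return hits


# Hypothetical snippets standing in for documentation pages.
docs = {
    "create_table": "CREATE TABLE will create a new, initially empty table",
    "create_temp": "CREATE TEMPORARY TABLE makes a table that is dropped at session end",
    "trigger": "CREATE TRIGGER creates a new trigger associated with the specified table",
}
index = build_positional_index(docs)

print(sorted(within(index, "create", "table", 1)))  # prints ['create_table']
print(sorted(within(index, "create", "table", 2)))  # prints ['create_table', 'create_temp']
```

Because positions are stored at index time, the query only touches the postings for the two terms. The pairwise position comparison is the simplest correct check; a production engine would merge the two sorted position lists instead.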

I think the challenge here is deciding what we want to search. I am betting
that folks use this page as they would man, i.e. what is the command for
create trigger?

As I said, my offer stands to help out, but I think if the goal is to
search the entire website, then this particular tool is not useful.

At this point I am working on indexing the SGML directly, as it has less
cruft in it. For instance, all the links that appear in every summary are
just noise.
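
A minimal sketch of what pulling indexable text straight out of the SGML source could look like, in Python. This is an assumption about the general approach, not the actual indexing code. The function name `sgml_to_text` and the sample fragment are hypothetical, and the regular expressions only cope with simple, fully tagged markup (no `>` inside attribute values, no SGML short tags).

```python
import re

# Patterns for SGML comments and for any remaining tag.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def sgml_to_text(source):
    """Drop comments and tags, then collapse whitespace, leaving only the prose."""
    without_comments = COMMENT.sub(" ", source)
    without_tags = TAG.sub(" ", without_comments)
    return " ".join(without_tags.split())


# Hypothetical fragment in the style of a reference page.
sample = """
<!-- hypothetical fragment -->
<refentry id="SQL-CREATETRIGGER">
 <refnamediv>
  <refname>CREATE TRIGGER</refname>
  <refpurpose>define a new trigger</refpurpose>
 </refnamediv>
</refentry>
"""

print(sgml_to_text(sample))  # prints: CREATE TRIGGER define a new trigger
```

Working from the source means the navigation links that the rendered HTML pages repeat in every summary never reach the index in the first place.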
Dave

On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
>On Wed, 31 Dec 2003, Dave Cramer wrote:
>
>
>
>>I can modify mine to be client server if you want?
>>
>>It is a java app, so we need to be able to run jdk1.3 at least?
>>
>>
>jdk1.4 is available on the VMs ... does yours spider? For instance, you
>mention that you have the docs indexed right now, but we are currently
>indexing:
>
>Server http://archives.postgresql.org/
>Server http://advocacy.postgresql.org/
>Server http://developer.postgresql.org/
>Server http://gborg.postgresql.org/
>Server http://pgadmin.postgresql.org/
>Server http://techdocs.postgresql.org/
>Server http://www.postgresql.org/
>
>will it be able to handle:
>
>186_archives=# select count(*) from url;
> count
>--------
> 393551
>(1 row)
>
>as fast as you are finding with just the docs?
>
>----
>Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
>Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664
--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #19
Well, it appears there are quite a few solutions available, so the next
question should be: what are we trying to accomplish here?

One thing I think is that the documentation search should be
limited to the documentation.

Who is in a position to decide which solution to use?

Dave
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to ma*******@postgresql.org

Nov 12 '05 #20

