
website doc search is extremely SLOW

Trying to use the 'search' in the docs section of PostgreSQL.org
is extremely SLOW. Considering this is a website for a database
and databases are supposed to be good for indexing content, I'd
expect a much faster performance.

I submitted my search over two minutes ago. I just finished this
email to the list. The results have still not come back. I only
searched for:

SECURITY INVOKER

Perhaps this should be worked on?

Dante


Nov 12 '05 #1
On Mon, 29 Dec 2003, D. Dante Lorenso wrote:
Trying to use the 'search' in the docs section of PostgreSQL.org
is extremely SLOW. Considering this is a website for a database
and databases are supposed to be good for indexing content, I'd
expect a much faster performance.

I submitted my search over two minutes ago. I just finished this
email to the list. The results have still not come back. I only
searched for:

SECURITY INVOKER

Perhaps this should be worked on?
Your query takes 0.01 sec to complete (134 documents found) on my development
server, which I hope to present to the community soon after New Year. We've
crawled 27 postgresql-related sites. A screenshot is available at
http://www.sai.msu.su/~megera/postgres/pgsql.ru.gif



Regards,
Oleg
_______________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: ol**@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


Nov 12 '05 #2
On Mon, 29 Dec 2003, D. Dante Lorenso wrote:
Trying to use the 'search' in the docs section of PostgreSQL.org
is extremely SLOW. Considering this is a website for a database
and databases are supposed to be good for indexing content, I'd
expect a much faster performance.
What is the full URL for the page you are looking at? Just the 'search
link' at the top of the page?
Perhaps this should be worked on?


Looking into it right now ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #3
just ran it from archives.postgresql.org (security invoker) and it comes
back in 10 seconds ... I think it might be a problem with doing a search
while indexing is happening ... am looking at that ...
----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #4
When you go to docs and then click static, it has the ability to
search. It is slowwwwwwwww....
Sincerely,

Joshua D. Drake


--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Nov 12 '05 #5
Marc G. Fournier wrote:
What is the full URL for the page you are looking at? Just the 'search
link' at the top of the page?


http://www.postgresql.org/ *click Docs on top of page*
http://www.postgresql.org/docs/ * click PostgreSQL static
documentation *

Search this document set: [ SECURITY INVOKER ] Search!
http://www.postgresql.org/search.cgi...CURITY+INVOKER

I loaded that URL in IE and waited two minutes or more for a response.
Then it usually returns with 1 result. I clicked the Search! button again
to refresh and it came back a little faster, but with 0 results?

Searched again from the top and it's a little faster now:

* click search *
> date
Wed Dec 31 22:52:01 CST 2003

* results come back *
> date
Wed Dec 31 22:52:27 CST 2003

Still one result:

PostgreSQL 7.4 Documentation (SQL Key Words)
<http://www.postgresql.org/docs/7.4/static/sql-keywords-appendix.html>
[*0.087%*]
http://www.postgresql.org/docs/7.4/s...-appendix.html
Size: 65401 bytes, modified: Tue, 25 Nov 2003, 15:02:33 AST

However, the page that I SHOULD have found was this one:

http://www.postgresql.org/docs/curre...efunction.html

That page has SECURITY INVOKER in a whole section:

[EXTERNAL] SECURITY INVOKER
[EXTERNAL] SECURITY DEFINER

SECURITY INVOKER indicates that the function is to be executed with
the privileges of the user that calls it. That is the default.
SECURITY DEFINER specifies that the function is to be executed with
the privileges of the user that created it.
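
For reference, a minimal sketch of the syntax under discussion, using
PostgreSQL 7.4's string-literal function bodies (the payroll table and
both function names here are made up for illustration):

-- runs with the privileges of the calling user (the default):
CREATE FUNCTION whoami() RETURNS name
    AS 'SELECT current_user' LANGUAGE sql
    SECURITY INVOKER;

-- runs with the privileges of the function's owner, so callers can be
-- given controlled access to a table they cannot read directly:
CREATE FUNCTION lookup_salary(integer) RETURNS numeric
    AS 'SELECT salary FROM payroll WHERE emp_id = $1'
    LANGUAGE sql SECURITY DEFINER;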

Dante

----------
D. Dante Lorenso
da***@lorenso.com


Nov 12 '05 #6
A search for 'create index' took 59 seconds?

I've got a fairly fast search engine (< 1 second for the same search) on
the docs at

http://postgresintl.com/search?query=create index

if that link doesn't work, try

postgres.fastcrypt.com/search?query=create index

for now you will have to type it; I'm working on indexing it and then
making it pretty

Dave


--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #7

does anyone know anything better than mnogosearch, that works with
PostgreSQL, for doing indexing? the database server is a Dual Xeon 2.4G,
4G of RAM, and a load avg right now of a lowly 1.5 ... the file system is
3x72G drives in a RAID5 configuration, and the database server is 7.4 ...
the mnogosearch folk use mysql for their development, so it's possible
there is something they are doing that is slowing this process down, to
compensate for a fault in mysql, but this is ridiculous ...

note that I have it set up with what the mnogosearch folk list as being
'the fastest schema for large indexes', or 'crc-multi' ...

right now, we're running only 373k docs:

isvr5# indexer -S

Database statistics

Status   Expired      Total
-----------------------------
   415         0        311   Unsupported Media Type
   302         0       1171   Moved Temporarily
   502         0         43   Bad Gateway
   414         0          3   Request-URI Too Long
   301         0        307   Moved Permanently
   404         0       1960   Not found
   410         0          1   Gone
   401         0         51   Unauthorized
   304         0      16591   Not Modified
   200         0     373015   OK
   504         0         48   Gateway Timeout
   400         0          3   Bad Request
     0         2         47   Not indexed yet
-----------------------------
 Total         2     393551

and a vacuum analyze runs nightly ...

anyone with suggestions/ideas? it has to be something client/server, like
mnogosearch, as we're dealing with multiple servers searching against the
same database ... so I don't *think* that ht/Dig is a solution, but may be
wrong there ...



----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #8
Why are there multiple servers hitting the same db?

what servers are searching through the db?

Dave

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #9
On Wed, 31 Dec 2003, Dave Cramer wrote:
Why are there multiple servers hitting the same db?

what servers are searching through the db?
www.postgresql.org and archives.postgresql.org both hit the same DB ...
the point is more that whatever alternative someone can suggest, it
has to be able to be accessed centrally from several different machines
... when I just tried a search, I was the only one hitting the database,
and the search was dreadful, so it isn't a problem with multiple
connections :(

Just as an FYI, the database server has sufficient RAM on her, so it isn't
a swapping issue ... swap usage right now, after 77 days uptime:

Device          1K-blocks     Used     Avail  Capacity  Type
/dev/da0s1b       8388480    17556   8370924        0%  Interleaved




----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #10
I can modify mine to be client/server if you want?

It is a Java app, so we need to be able to run JDK 1.3 at least?

Dave

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #11
On Wed, 31 Dec 2003, Dave Cramer wrote:
I can modify mine to be client/server if you want?

It is a Java app, so we need to be able to run JDK 1.3 at least?


jdk1.4 is available on the VMs ... does yours spider? for instance, you
mention that you have the docs indexed right now, but we are currently
indexing:

Server http://archives.postgresql.org/
Server http://advocacy.postgresql.org/
Server http://developer.postgresql.org/
Server http://gborg.postgresql.org/
Server http://pgadmin.postgresql.org/
Server http://techdocs.postgresql.org/
Server http://www.postgresql.org/

will it be able to handle:

186_archives=# select count(*) from url;
 count
--------
 393551
(1 row)

as fast as you are finding with just the docs?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #12
Hello,

Why are we not using Tsearch2?

Besides the obvious of getting everything into the database?
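
As a rough sketch of what that could look like, assuming the stock
tsearch2 contrib module and a hypothetical content column on the url
table:

-- add a tsvector column, populate it, and index it with GiST:
ALTER TABLE url ADD COLUMN idxfti tsvector;
UPDATE url SET idxfti = to_tsvector('default', content);
CREATE INDEX url_idxfti ON url USING gist (idxfti);

-- a search then becomes an index-assisted query:
SELECT url FROM url
WHERE idxfti @@ to_tsquery('default', 'security & invoker');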

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL


Nov 12 '05 #13
Marc,

At our website we had an "in database" search as well... It was terribly
slow (it was a custom-built vector space model implemented in mysql+php,
so that explains a bit).

We replaced it with the Xapian library (www.xapian.org), with its Omega
frontend as a middle end. I.e., our php scripts call the omega search
frontend and postprocess the results (some rights double-checks and so
on); from the results we build a very simple
SELECT ... FROM documents ... WHERE docid IN implode($docids_array)
(you understand enough php to understand this, I suppose)
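
In SQL terms the generated statement is just an IN-list over the docids
Omega returned; a hypothetical illustration (table and column names made
up):

-- suppose Omega returned docids 101, 245 and 7733 for a query; the
-- supporting query then simply fetches those rows from the database:
SELECT docid, title, body
FROM documents
WHERE docid IN (101, 245, 7733);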

With our 10GB of text, we have a 14GB (uncompressed, 9G compressed
or so) xapian database (the largest part is for the 6.7G positional
table); I'm pretty sure that if we'd store that information in something
like tsearch it'd be more than that 14GB...

Searches take less than a second (unless you do phrase searches of
course; those take a few seconds and sometimes a few minutes).

I did a query on 'ext3 undelete' just a few minutes ago and it did the
search in 827150 documents in only 0.027 (a second run 0.006) seconds
(ext3 was found in 753 and undelete in 360 documents). Of course that is
excluding the results parsing; the total time to create the webpage was
"much" longer (0.43 seconds or so) due to the fact that the results
need to be transferred via xinetd and extracted
from mysql (which is terrible with the "search supporting queries" we
issue :/ ). Our search machine is very similar to the machine you use as
a database server, but it doesn't do much heavy work apart from running
the xapian/omega search combination.

If you are interested in this, I can provide (much) more information
about our implementation. Since you don't need rights checks, you could
even get away with just the omega front end all by itself (it has a nice
scripting language, but can't interface with anything but xapian).

The main advantage of taking this out of your sql database is that it
runs on its own custom-built storage system (and you could offload it to
another machine, like we did).
Btw, if you really need an "in database" solution, read back the
postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's
working on integrating xapian into postgresql as an FTI)

Best regards,

Arjen van der Meijden



Nov 12 '05 #14
Marc,

No, it doesn't spider; it is a specialized tool for searching documents.

I'm curious, what value is there to being able to count the number of
URLs?

It does do things like query for all documents where CREATE and TABLE are
n words apart, just as fast; I would think these are more valuable for
document searching.

I think the challenge here is what do we want to search. I am betting
that folks use this page as they would man, i.e. what is the command for
create trigger?

As I said, my offer stands to help out, but I think if the goal is to
search the entire website, then this particular tool is not useful.

At this point I am working on indexing the sgml directly, as it has less
cruft in it. For instance, all the links that appear in every summary are
just noise.
Dave


--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #15
I think that Oleg's new search offering looks really good and fast. (I
can't wait till I have some task that needs tsearch!).

I agree with Dave that searching the docs is more important for me than
the sites - but it would be really nice to have both, in one tool.

I built something similar for the Tate Gallery in the UK - here you can
select the type of content that you want returned, either static pages or
dynamic. You can see the idea at
http://www.tate.org.uk/search/defaul...oil&action=new

This is custom built (using java/Oracle), supports stemming, boolean
operators, exact phrase matching, relevancy and matched term highlighting.

You can switch on/off the types of documents that you are not interested
in. Using this analogy, a search facility that could offer you results
from i) the docs and/or ii) the postgres sites' static pages would be
very useful.

John Sidney-Woollett



Nov 12 '05 #16
You should probably take a look at the Swish project. For a certain
project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
but with over 600,000 documents to index, all took too long to conduct
searches, especially as the database was swapped in and out of memory
based on search segment. MySQL full text was the most unusable.

Swish uses its own internal DB format, and comes with a simple spider as
well. You can make it search by category, date and other nifty criteria
also.
http://swish-e.org

You can take a look over at the project and do some searches to see what
I mean:
http://cbd-net.com

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+----------------------------+
| http://www.did-it.com | "When I'm paid, I always |
| er**@did-it.com | follow the job through. |
| 516-255-0500 | You know that." -Angel Eyes|
+-----------------------+----------------------------+



Nov 12 '05 #17
Wow, you're right - I could have probably saved myself a load of time! :)

Although you do learn a lot reinventing the wheel... ...or at least you
hit the same issues and insights others did before...

John



Nov 12 '05 #18
The search engine I am using is lucene
http://jakarta.apache.org/lucene/docs/index.html

it too uses its own internal database format, optimized for searching;
it is quite flexible, and allows searching on arbitrary fields as well.
The section on querying explains more

http://jakarta.apache.org/lucene/doc...sersyntax.html

It is even possible to index text data inside a database.

Dave
On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
Wow, you're right - I could have probably saved myself a load of time! :)

Although you do learn a lot reinventing the wheel... ...or at least you
hit the same issues and insights others did before...

John

Ericson Smith said:
You should probably take a look at the Swish project. For a certain
project, we tried Tsearch2/Tsearch, even (gasp) MySQL fulltext search,
but with over 600,000 documents to index, both took too long to conduct
searches, especially as the database was swapped in and out of memory
based on search segment. MySQL full text was the most unusable.

Swish uses its own internal DB format, and comes with a simple spider as
well. You can make it search by category, date and other nifty criteria
also.
http://swish-e.org

You can take a look over at the project and do some searches to see what
I mean:
http://cbd-net.com

Warmest regards,
Ericson Smith
Tracking Specialist/DBA
+-----------------------+----------------------------+
| http://www.did-it.com | "When I'm paid, I always |
| er**@did-it.com | follow the job through. |
| 516-255-0500 | You know that." -Angel Eyes|
+-----------------------+----------------------------+

John Sidney-Woollett wrote:
I think that Oleg's new search offering looks really good and fast. (I
can't wait till I have some task that needs tsearch!).

I agree with Dave that searching the docs is more important for me than
the sites - but it would be really nice to have both, in one tool.

I built something similar for the Tate Gallery in the UK - here you can
select the type of content that you want returned, either static pages or
dynamic. You can see the idea at
http://www.tate.org.uk/search/defaul...oil&action=new

This is custom built (using java/Oracle), supports stemming, boolean
operators, exact phrase matching, relevancy and matched term
highlighting.

You can switch on/off the types of documents that you are not interested
in. Using this analogy, a search facility that could offer you results
from i) the docs and/or ii) the postgres sites static pages would be very
useful.

John Sidney-Woollett

Dave Cramer said:
Marc,

No it doesn't spider, it is a specialized tool for searching documents.

I'm curious, what value is there to being able to count the number of
url's ?

It does do things like query all documents where CREATE AND TABLE are n
words apart, just as fast, I would think these are more valuable to
document searching?

I think the challenge here is what do we want to search. I am betting
that folks use this page as they would man? ie. what is the command for
create trigger?

As I said my offer stands to help out, but I think if the goal is to
search the entire website, then this particular tool is not useful.

At this point I am working on indexing the sgml directly as it has less
cruft in it. For instance all the links that appear in every summary are
just noise.
Dave

On Wed, 2003-12-31 at 00:44, Marc G. Fournier wrote:
>On Wed, 31 Dec 2003, Dave Cramer wrote:
>
>
>
>>I can modify mine to be client server if you want?
>>
>>It is a java app, so we need to be able to run jdk1.3 at least?
>>
>>
>jdk1.4 is available on the VMs ... does your spider? for instance, you
>mention that you have the docs indexed right now, but we are currently
>indexing:
>
>Server http://archives.postgresql.org/
>Server http://advocacy.postgresql.org/
>Server http://developer.postgresql.org/
>Server http://gborg.postgresql.org/
>Server http://pgadmin.postgresql.org/
>Server http://techdocs.postgresql.org/
>Server http://www.postgresql.org/
>
>will it be able to handle:
>
>186_archives=# select count(*) from url;
> count
>--------
> 393551
>(1 row)
>
>as fast as you are finding with just the docs?
>
>----
>Marc G. Fournier Hub.Org Networking Services
>(http://www.hub.org)
>Email: sc*****@hub.org Yahoo!: yscrappy ICQ:
>7615664
>
>
>
--
Dave Cramer
519 939 0336
ICQ # 1467551

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #19
Well it appears there are quite a few solutions to use so the next
question should be what are we trying to accomplish here?

One thing that I think is that the documentation search should be
limited to the documentation.

Who is in a position to make the decision of which solution to use?

Dave
On Wed, 2003-12-31 at 08:44, John Sidney-Woollett wrote:
[snip - same quoted thread as in the previous message]

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #20
On Dec 31, 2003, at 5:40 AM, Arjen van der Meijden wrote:
The main advantage of taking this out of your sql database is that it
runs on its own custom built storage system (and you could offload it
to another machine, like we did).
Btw, if you really need an "in database" solution, read back the
postings of Eric Ridge at 26-12-2003 20:54 on the hackers list (he's
working on integrating xapian in postgresql as a FTI)


Hi, that's me! I'm working on it right now, and it's coming along really
well. I actually hope to have it integrated with Postgres' storage
subsystem by the end of the day and to have it returning useful
results.

eric

Nov 12 '05 #21
On Wed, 31 Dec 2003, Dave Cramer wrote:
Marc,

No it doesn't spider, it is a specialized tool for searching documents.

I'm curious, what value is there to being able to count the number of
url's ?


Sorry, that was just an example of the # of docs that have to be searched
through ... again, the *biggest* thing that is searched is the mailing
list archives, so without spidering, not sure how we'll be able to pull
that in ... ?
----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #22
On Tue, 30 Dec 2003, Joshua D. Drake wrote:
Hello,

Why are we not using Tsearch2?


Because nobody has built it yet? Oleg's stuff is nice, but we want
something that we can build into the existing web sites, not a standalone
site ...

I keep searching the web hoping someone has come up with a 'tsearch2'
based search engine that does the spidering, but, unless it's sitting right
in front of my eyes and I'm not seeing it, I haven't found it yet :(

Out of everything I've found so far, mnogosearch is one of the best ... I
just wish I could figure out where the bottleneck for it was, since, from
reading their docs, their method of storing the data doesn't appear to be
particularly off. I'm tempted to try their caching storage manager, and
getting away from SQL totally, but I *really* want to showcase PostgreSQL
on this :(

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #23
Marc G. Fournier wrote:
[snip]
Out of everything I've found so far, mnogosearch is one of the best ... I
just wish I could figure out where the bottleneck for it was, since, from
reading their docs, their method of storing the data doesn't appear to be
particularly off. I'm tempted to try their caching storage manager, and
getting away from SQL totally, but I *really* want to showcase PostgreSQL
on this :(


Well, PostgreSQL is being un-showcased in the current setup, that's for
sure. :-(

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #24
On Wed, 31 Dec 2003, Bruce Momjian wrote:
Out of everything I've found so far, mnogosearch is one of the best ... I
just wish I could figure out where the bottleneck for it was, since, from
reading their docs, their method of storing the data doesn't appear to be
particularly off. I'm tempted to try their caching storage manager, and
getting away from SQL totally, but I *really* want to showcase PostgreSQL
on this :(


Well, PostgreSQL is being un-showcased in the current setup, that's for
sure. :-(


Agreed ... I could install the MySQL backend, which it's designed for, and
advertise it as PostgreSQL? :)

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #25
Marc G. Fournier wrote:
On Wed, 31 Dec 2003, Bruce Momjian wrote:
Out of everything I've found so far, mnogosearch is one of the best ... I
just wish I could figure out where the bottleneck for it was, since, from
reading their docs, their method of storing the data doesn't appear to be
particularly off. I'm tempted to try their caching storage manager, and
getting away from SQL totally, but I *really* want to showcase PostgreSQL
on this :(


Well, PostgreSQL is being un-showcased in the current setup, that's for
sure. :-(


Agreed ... I could install the MySQL backend, which it's designed for, and
advertise it as PostgreSQL? :)


I would be curious to know if it is faster --- that would tell us if it
is tuned only for MySQL.

Have you tried CLUSTER? I think the MySQL ISAM files are
auto-clustered, and clustering is usually important for full-text
searches.
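
A minimal sketch of that, assuming an index on url.url exists (the index
name here is illustrative):

-- 7.4 syntax: physically reorder the table by the named index, then re-analyze
CLUSTER url_url_idx ON url;
ANALYZE url;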
--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #26
On Wed, 2003-12-31 at 18:43, Bruce Momjian wrote:
[snip]
Well, PostgreSQL is being un-showcased in the current setup, that's for
sure. :-(

In fact this is a very bad advertisement for postgres

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #27
On Wed, 31 Dec 2003, Bruce Momjian wrote:
[snip]

I would be curious to know if it is faster --- that would tell us if it
is tuned only for MySQL.

Have you tried CLUSTER? I think the MySQL ISAM files are
auto-clustered, and clustering is usually important for full-text
searches.


Actually, check out http://www.mnogosearch.com ... the way they do the
indexing doesn't (at least, as far as I can tell) make use of full-text
searching. Simplistically, it appears to take the web page, sort -u all
the words it finds, removes all 'stopwords' (and, the, in, etc) from the
result, and then dumps the resultant words to the database, link'd to the
URL ...

We're using crc-multi, so a CRC value of the word is what is stored in the
database, not the actual word itself ... the '-multi' part spreads the
words across several tables depending on the word size, to keep total # of
rows down ...

The slow part on the database is finding those words, as can be seen by
the following search on 'SECURITY INVOKER':

Jan 1 01:21:05 pgsql74 postgres[59959]: [44-1] LOG: statement: SELECT ndict8.url_id,ndict8.intag FROM ndict8, url WHERE ndict8.word_id=417851441 AND url.rec_id=ndict8.url_id
Jan 1 01:21:05 pgsql74 postgres[59959]: [44-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
Jan 1 01:22:00 pgsql74 postgres[59959]: [46-1] LOG: statement: SELECT ndict7.url_id,ndict7.intag FROM ndict7, url WHERE ndict7.word_id=-509484498 AND url.rec_id=ndict7.url_id
Jan 1 01:22:00 pgsql74 postgres[59959]: [46-2] AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%')
Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms

ndict8 looks like:

186_archives=# select count(1) from ndict8;
count
---------
6320380
(1 row)
186_archives=# select count(1) from ndict8 where word_id=417851441;
count
-------
15532
(1 row)

186_archives=# \d ndict8
Table "public.ndict8"
Column | Type | Modifiers
---------+---------+--------------------
url_id | integer | not null default 0
word_id | integer | not null default 0
intag | integer | not null default 0
Indexes:
"n8_url" btree (url_id)
"n8_word" btree (word_id)
and ndict7 looks like:

186_archives=# select count(1) from ndict7;
count
---------
8400616
(1 row)
186_archives=# select count(1) from ndict7 where word_id=-509484498;
count
-------
333
(1 row)

186_archives=# \d ndict7
Table "public.ndict7"
Column | Type | Modifiers
---------+---------+--------------------
url_id | integer | not null default 0
word_id | integer | not null default 0
intag | integer | not null default 0
Indexes:
"n7_url" btree (url_id)
"n7_word" btree (word_id)
The slowdown is the LIKE condition, as the ndict[78] word_id conditions
return near instantly when run individually, and when I run the 'url/LIKE'
condition, it takes "forever" ...

186_archives=# select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
 count
--------
 304811
(1 row)

186_archives=# explain analyze select count(1) from url where ((url.url || '') LIKE 'http://archives.postgresql.org/%%');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Aggregate (cost=93962.19..93962.19 rows=1 width=0) (actual time=5833.084..5833.088 rows=1 loops=1)
-> Seq Scan on url (cost=0.00..93957.26 rows=1968 width=0) (actual time=0.069..4387.378 rows=304811 loops=1)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 5833.179 ms
(4 rows)
Hrmmm ... I don't have much (any) experience with tsearch, but could it be
used to replace the LIKE? Then again, when it's returning 300k rows out of
393k, it wouldn't help much on the above, would it?
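
As an aside, one locale-independent rewrite of the prefix LIKE is a plain
btree range scan (just a sketch --- the upper bound is the prefix with its
trailing '/' bumped to the next character, '0'):

select count(1) from url
 where url.url >= 'http://archives.postgresql.org/'
   and url.url < 'http://archives.postgresql.org0';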

The full first query:

SELECT ndict8.url_id,ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id=417851441
AND url.rec_id=ndict8.url_id
AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');

returns 13415 rows, and explain analyze shows:

-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
Index Cond: (word_id = 417851441)
-> Index Scan using url_rec_id on url (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
Index Cond: (url.rec_id = "outer".url_id)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 1520.145 ms
(7 rows)

Which, of course, doesn't come close to matching what the duration showed
in the original, most likely due to caching :(

The server that the database is on rarely jumps above a loadavg of 1, isn't
using any swap (after 77 days up, used swap is 0% -or- 17meg) and the
database itself is on a strip'd file system ...

I'm open to ideas/things to try here ...

The whole 'process' of the search shows the following times for the
queries:

pgsql74# grep 59959 /var/log/pgsql | grep duration
Jan 1 01:21:05 pgsql74 postgres[59959]: [39-1] LOG: duration: 25.663 ms
Jan 1 01:21:05 pgsql74 postgres[59959]: [41-1] LOG: duration: 4.376 ms
Jan 1 01:21:05 pgsql74 postgres[59959]: [43-1] LOG: duration: 11.179 ms
Jan 1 01:22:00 pgsql74 postgres[59959]: [45-1] LOG: duration: 55015.644 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [47-1] LOG: duration: 1167.407 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [49-1] LOG: duration: 7.886 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [51-1] LOG: duration: 1.516 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [53-1] LOG: duration: 3.539 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [55-1] LOG: duration: 109.890 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [57-1] LOG: duration: 15.582 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [59-1] LOG: duration: 1.631 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [61-1] LOG: duration: 0.838 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [63-1] LOG: duration: 2.148 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [65-1] LOG: duration: 0.810 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [67-1] LOG: duration: 1.211 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [69-1] LOG: duration: 0.798 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [71-1] LOG: duration: 0.861 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [73-1] LOG: duration: 0.748 ms
Jan 1 01:22:01 pgsql74 postgres[59959]: [75-1] LOG: duration: 0.555 ms

With the two >1000ms queries being the above two ndict[78] queries ...

Doing two subsequent searches, on "setuid functions" and "privilege
rules", just so that caching isn't involved, shows pretty much the same
distribution:

grep 61697 /var/log/pgsql | grep duration
Jan 1 01:44:25 pgsql74 postgres[61697]: [41-1] LOG: duration: 1.244 ms
Jan 1 01:44:25 pgsql74 postgres[61697]: [43-1] LOG: duration: 21.868 ms
Jan 1 01:44:25 pgsql74 postgres[61697]: [45-1] LOG: duration: 17.956 ms
Jan 1 01:44:29 pgsql74 postgres[61697]: [47-1] LOG: duration: 4452.326 ms
Jan 1 01:44:57 pgsql74 postgres[61697]: [49-1] LOG: duration: 27992.581 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [51-1] LOG: duration: 357.158 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [53-1] LOG: duration: 1.338 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [55-1] LOG: duration: 11.438 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [57-1] LOG: duration: 63.389 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [59-1] LOG: duration: 134.941 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [61-1] LOG: duration: 0.570 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [63-1] LOG: duration: 0.489 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [65-1] LOG: duration: 0.477 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [67-1] LOG: duration: 0.470 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [69-1] LOG: duration: 0.471 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [71-1] LOG: duration: 0.468 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [73-1] LOG: duration: 0.473 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [75-1] LOG: duration: 0.466 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [77-1] LOG: duration: 0.469 ms
Jan 1 01:44:58 pgsql74 postgres[61697]: [79-1] LOG: duration: 0.515 ms

and:

grep 61869 /var/log/pgsql | grep duration
Jan 1 01:46:50 pgsql74 postgres[61869]: [41-1] LOG: duration: 19.776 ms
Jan 1 01:46:50 pgsql74 postgres[61869]: [43-1] LOG: duration: 58.352 ms
Jan 1 01:46:50 pgsql74 postgres[61869]: [45-1] LOG: duration: 0.897 ms
Jan 1 01:46:53 pgsql74 postgres[61869]: [47-1] LOG: duration: 2859.331 ms
Jan 1 01:47:47 pgsql74 postgres[61869]: [49-1] LOG: duration: 54774.241 ms
Jan 1 01:47:47 pgsql74 postgres[61869]: [51-1] LOG: duration: 14.926 ms
Jan 1 01:47:47 pgsql74 postgres[61869]: [53-1] LOG: duration: 1.502 ms
Jan 1 01:47:47 pgsql74 postgres[61869]: [55-1] LOG: duration: 3.865 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [57-1] LOG: duration: 110.435 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [59-1] LOG: duration: 0.646 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [61-1] LOG: duration: 0.503 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [63-1] LOG: duration: 0.498 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [65-1] LOG: duration: 0.484 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [67-1] LOG: duration: 0.487 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [69-1] LOG: duration: 0.478 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [71-1] LOG: duration: 0.479 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [73-1] LOG: duration: 0.480 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [75-1] LOG: duration: 0.478 ms
Jan 1 01:47:48 pgsql74 postgres[61869]: [77-1] LOG: duration: 0.477 ms

So it looks like its those joins that are really killing things ...

Note that I haven't made many changes to the postgresql.conf file, so
there might be something really obvious I've overlooked, but here are the
uncommented ones (ie. ones I've modified from defaults):

tcpip_socket = true
max_connections = 512
shared_buffers = 10000 # min 16, at least max_connections*2, 8KB each
sort_mem = 10240 # min 64, size in KB
vacuum_mem = 81920 # min 1024, size in KB
syslog = 2 # range 0-2; 0=stdout; 1=both; 2=syslog
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_connections = true
log_duration = false
log_statement = false
lc_messages = 'C' # locale for system error message strings
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #28
On Wed, 31 Dec 2003, Dave Cramer wrote:
In fact this is a very bad advertisement for postgres


I just posted a very very long email of what I'm seeing in the logs, as
well as various query runs ... it may just be something that I need to
tune that I'm overlooking:( the queries aren't particularly complex :(

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #29
What is the locale of the database?

LIKE won't use an index unless it is the 'C' locale, or you use 7.4 and
change the operator class of the index.
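
A sketch of the 7.4 approach, assuming url.url is a text column (the index
name is illustrative):

-- the pattern operator class makes LIKE 'prefix%' indexable in any locale
CREATE INDEX url_url_pattern ON url (url text_pattern_ops);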

Dave
On Wed, 2003-12-31 at 20:49, Marc G. Fournier wrote:
[snip - full message quoted above]

--
Dave Cramer
519 939 0336
ICQ # 1467551

Nov 12 '05 #30
On Wed, 31 Dec 2003, Dave Cramer wrote:
What is the locale of the database?

LIKE won't use an index unless it is the 'C' locale, or you use 7.4 and
change the operator class of the index.


one thing I failed to note ... this is all running on 7.4 ... under 7.3,
it was much much worse :)
----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #31
Marc G. Fournier wrote:
186_archives=# \d ndict7
Table "public.ndict7"
Column | Type | Modifiers
---------+---------+--------------------
url_id | integer | not null default 0
word_id | integer | not null default 0
intag | integer | not null default 0
Indexes:
"n7_url" btree (url_id)
"n7_word" btree (word_id)
The slowdown is the LIKE condition, as the ndict[78] word_id conditions
return near instantly when run individually, and when I run the 'url/LIKE'
condition, it takes "forever" ...


Does it help to CLUSTER url.url? Is your data being loaded in so
identical values used by LIKE are next to each other?

--
Bruce Momjian | http://candle.pha.pa.us
pg***@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


Nov 12 '05 #32
On Thu, 1 Jan 2004, Bruce Momjian wrote:
[snip]
Does it help to CLUSTER url.url? Is your data being loaded in so
identical values used by LIKE are next to each other?


Just tried CLUSTER, and no difference, but ... chat'd with Dave on ICQ
this evening, and was thinking of something ... and it comes back to
something that I mentioned a while back ...

Taking the ndict8 query that I originally presented, now post CLUSTER, and
an explain analyze looks like:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
Hash Cond: ("outer".url_id = "inner".rec_id)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Hash (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
-> Seq Scan on url (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 83578.572 ms
(8 rows)

Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
join'd to all the URLs that contain them, you get:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..30183.13 rows=3219 width=8) (actual time=0.299..1217.116 rows=15533 loops=1)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.144..458.891 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Index Scan using url_rec_id on url (cost=0.00..5.44 rows=1 width=4) (actual time=0.024..0.029 rows=1 loops=15533)
Index Cond: (url.rec_id = "outer".url_id)
Total runtime: 1286.647 ms
(6 rows)

So, there are 15533 URLs that contain that word ... now, what I want to
find out is how many of those 15533 URLs contain
'http://archives.postgresql.org/%%', which is 13415 ...

The problem is that right now, we look at the LIKE first, giving us ~300k
rows, and then search through those for those who have the word matching
... is there some way of reducing the priority of the LIKE part of the
query, as far as the planner is concerned, so that it will "resolve" the =
first, and then work the LIKE on the resultant set, instead of the other
way around? So that the query is only checking 15k records for the 13k
that match, instead of searching through 300k?

I'm guessing that the reason that the LIKE is taking precedence is
because the URL table has fewer rows in it than ndict8?
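
One known workaround (untested here, just a sketch) is to materialize the
= part first behind an "OFFSET 0" subquery, which the planner won't
flatten, so the LIKE is only applied to that result set:

SELECT t.url_id, t.intag
  FROM (SELECT ndict8.url_id, ndict8.intag
          FROM ndict8
         WHERE ndict8.word_id = 417851441
        OFFSET 0) AS t, url
 WHERE url.rec_id = t.url_id
   AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');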

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #33
Might be worth trying a larger statistics target (say 100), in the hope
that the planner then has better information to work with.

best wishes

Mark

Marc G. Fournier wrote:

The problem is that right now, we look at the LIKE first, giving us ~300k
rows, and then search through those for those who have the word matching
... is there some way of reducing the priority of the LIKE part of the
query, as far as the planner is concerned, so that it will "resolve" the =
first, and then work the LIKE on the resultant set, instead of the other
way around? So that the query is only checking 15k records for the 13k
that match, instead of searching through 300k?

I'm guessing that the reason that the LIKE is taking precedence is
because the URL table has fewer rows in it than ndict8?


Nov 12 '05 #34
Marc G. Fournier wrote:

Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
join'd to all the URLs that contain them, you get:


Can't you build separate databases for each domain you want to index?
Then you wouldn't need the like operator at all.

The like operator doesn't seem to make for a very scalable production
environment. And besides that point, I don't really believe a "record
per word/document couple" is very scalable (not in SQL, not anywhere).

Anyway, that doesn't help you much. Perhaps decreasing the size of the
index tables can help: are they with OIDs? If so, wouldn't it help to
recreate them without, so you save yourselves 4 bytes per word-document
couple, allowing the tables to fit in fewer pages and thereby speeding
up the seqscans.

Are _all_ your queries with the like on the url? Wouldn't it help to
create an index on both the wordid and the urlid for ndict8?

Perhaps you can create your own 'host table' (which could be filled
using a trigger or a slightly adjusted indexer), and a foreign key from
your url table to that, so you can search on url.hostid = X (or a join
with that host table) instead of the like that is used now?
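
A rough sketch of both ideas (the host table, its columns, and the new
index names are illustrative, and the indexer would need adjusting too):

-- recreate the dictionary tables without OIDs, saving 4 bytes per row
CREATE TABLE ndict8 (
    url_id  integer NOT NULL DEFAULT 0,
    word_id integer NOT NULL DEFAULT 0,
    intag   integer NOT NULL DEFAULT 0
) WITHOUT OIDS;

-- factor the host out of url.url, so the site filter becomes an equality
CREATE TABLE host (host_id serial PRIMARY KEY, hostname text UNIQUE);
ALTER TABLE url ADD COLUMN host_id integer REFERENCES host (host_id);

SELECT ndict8.url_id, ndict8.intag
  FROM ndict8, url, host
 WHERE ndict8.word_id = 417851441
   AND url.rec_id = ndict8.url_id
   AND url.host_id = host.host_id
   AND host.hostname = 'archives.postgresql.org';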

By the way, can a construction like (tablefield || '') ever use an index
in postgresql?

Best regards and good luck,

Arjen van der Meijden


Nov 12 '05 #35
On Thu, 1 Jan 2004, Arjen van der Meijden wrote:
Marc G. Fournier wrote:

Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
join'd to all the URLs that contain them, you get:
Can't you build separate databases for each domain you want to index?
Then you wouldn't need the like operator at all.


First off, that would make searching across multiple domains difficult,
no?

Second, the LIKE is still required ... the LIKE allows the search to
"group" URLs ... for instance, if I wanted to just search on the docs, the
LIKE would look for all URLs that contain:

http://www.postgresql.org/docs/%%

whereas searching the whole site would be:

http://www.postgresql.org/%%
Anyway, that doesn't help you much, perhaps decreasing the size of the
index-tables can help, are they with OIDs ? If so, wouldn't it help to
recreate them without, so you save yourselves 4 bytes per word-document
couple, therefore allowing it to fit in less pages and by that speeding
up the seqscans.
This one I hadn't thought about ... for some reason, I thought that
WITHOUT OIDs was now the default ... looking at that one now ...
Are _all_ your queries with the like on the url? Wouldn't it help to
create an index on both the wordid and the urlid for ndict8?
as mentioned in a previous email, the schema for ndict8 is:

186_archives=# \d ndict8
Table "public.ndict8"
Column | Type | Modifiers
---------+---------+--------------------
url_id | integer | not null default 0
word_id | integer | not null default 0
intag | integer | not null default 0
Indexes:
"n8_url" btree (url_id)
"n8_word" btree (word_id)
By the way, can a construction like (tablefield || '') ever use an index
in postgresql?


again, as shown in a previous email, the index is being used for the LIKE
query ... the big problem as I see it is that the result set from the LIKE
is ~20x larger than the result set for the = ... if there was some way
of telling the planner that going the LIKE route was the more expensive of
the two (even though table size seems to indicate the other way around), I
suspect that that would improve things also ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #36
Mark Kirkwood <ma****@paradise.net.nz> writes:
Might be worth trying a larger statistics target (say 100), in the hope
that the planner then has better information to work with.


I concur with that suggestion. Looking at Marc's problem:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=13918.23..26550.58 rows=17 width=8) (actual time=4053.403..83481.769 rows=13415 loops=1)
Hash Cond: ("outer".url_id = "inner".rec_id)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Hash (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3920.597..3920.597 rows=0 loops=1)
-> Seq Scan on url (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.837..2377.853 rows=304811 loops=1)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 83578.572 ms
(8 rows)

the slowness is not really in the LIKE, it's in the indexscan on
ndict8 (79 out of 83 seconds spent there). The planner probably would
not have chosen this plan if it hadn't been off by a factor of 5 on the
rows estimate. So try knocking up the stats target for ndict8.word_id,
re-analyze, and see what happens.

regards, tom lane


Nov 12 '05 #37
On Thu, 1 Jan 2004, Tom Lane wrote:
Mark Kirkwood <ma****@paradise.net.nz> writes:
Might be worth trying a larger statistics target (say 100), in the hope
that the planner then has better information to work with.


I concur with that suggestion. Looking at Marc's problem:

[snip - explain output quoted above]

the slowness is not really in the LIKE, it's in the indexscan on ndict8
(79 out of 83 seconds spent there). The planner probably would not have
chosen this plan if it hadn't been off by a factor of 5 on the rows
estimate. So try knocking up the stats target for ndict8.word_id,
re-analyze, and see what happens.


'k, and for todays question ... how does one 'knock up the stats target'?
This is stuff I've not played with yet, so a URL to read up on this would
be nice, vs just how to do it?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #38
"Marc G. Fournier" <sc*****@postgresql.org> writes:
'k, and for todays question ... how does one 'knock up the stats target'?


ALTER TABLE [ ONLY ] name [ * ]
ALTER [ COLUMN ] column SET STATISTICS integer

The default is 10; try 100, or even 1000 (don't think it will let you
go higher than 1000).

regards, tom lane


Nov 12 '05 #39
On Thu, 1 Jan 2004, Bruce Momjian wrote:
[snip]
Does it help to CLUSTER url.url? Is your data being loaded in so
identical values used by LIKE are next to each other?


I'm loading up a MySQL 4.1 database right now, alongside a PgSQL 7.4
one WITHOUT OIDs ... should take several days to fully load, but it will
be interesting to compare them all ...

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #40
On Thu, 1 Jan 2004, Tom Lane wrote:
"Marc G. Fournier" <sc*****@postgresql.org> writes:
'k, and for todays question ... how does one 'knock up the stats target'?


ALTER TABLE [ ONLY ] name [ * ]
ALTER [ COLUMN ] column SET STATISTICS integer

The default is 10; try 100, or even 1000 (don't think it will let you
go higher than 1000).


k, so:

186_archives=# alter table ndict8 alter column word_id set statistics 1000;
ALTER TABLE

followed by a 'vacuum verbose analyze ndict8', which showed an analyze
of:

INFO: analyzing "public.ndict8"
INFO: "ndict8": 34718 pages, 300000 rows sampled, 6354814 estimated total rows

vs when set at 10:

INFO: analyzing "public.ndict8"
INFO: "ndict8": 34718 pages, 3000 rows sampled, 6229711 estimated total rows

The query @ 1000:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=13918.23..76761.02 rows=81 width=8) (actual time=5199.443..5835.444 rows=13415 loops=1)
Hash Cond: ("outer".url_id = "inner".rec_id)
-> Index Scan using n8_word on ndict8 (cost=0.00..62761.60 rows=16075 width=8) (actual time=0.230..344.485 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Hash (cost=13913.31..13913.31 rows=1968 width=4) (actual time=5198.289..5198.289 rows=0 loops=1)
-> Seq Scan on url (cost=0.00..13913.31 rows=1968 width=4) (actual time=3.933..3414.657 rows=304811 loops=1)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 5908.778 ms
(8 rows)

Same query @ 10:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=13918.23..26502.18 rows=17 width=8) (actual time=3657.984..4293.529 rows=13415 loops=1)
Hash Cond: ("outer".url_id = "inner".rec_id)
-> Index Scan using n8_word on ndict8 (cost=0.00..12567.73 rows=3210 width=8) (actual time=0.239..362.375 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Hash (cost=13913.31..13913.31 rows=1968 width=4) (actual time=3657.480..3657.480 rows=0 loops=1)
-> Seq Scan on url (cost=0.00..13913.31 rows=1968 width=4) (actual time=2.646..2166.632 rows=304811 loops=1)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 4362.375 ms
(8 rows)

I don't see a difference between the two, other than time changes, but
that could just be that runA had a server a bit more idle than runB ...
something I'm not seeing here?

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664


Nov 12 '05 #41
Marc G. Fournier wrote:
On Thu, 1 Jan 2004, Arjen van der Meijden wrote:

Marc G. Fournier wrote:
Now, if I knock off the LIKE, so that I'm returning all rows from ndict8,
join'd to all the URLs that contain them, you get:
Can't you build seperate databases for each domain you want to index?
Than you wouldn't need the like operator at all.

First off, that would make searching across multiple domains difficult,
no?

If mnogosearch allowed searching in multiple databases: no. But it
doesn't seem to offer that, so indeed, it might become a bit
difficult.
It was something I thought of because our solution allows it, but that
is no solution for you; I checked the mnogosearch features after sending
that email, instead of before. Perhaps I should've turned that around.
Second, the LIKE is still required ... the LIKE allows the search to
"group" URLs ... for instance, if I wanted to just search on the docs, the
LIKE would look for all URLs that contain:

http://www.postgresql.org/docs/%%

whereas searching the whole site would be:

http://www.postgresql.org/%%

That depends. If it were possible, you could decide from the search
usage stats to split /docs from the rest of www.postgresql.org and
by that avoid quite a few LIKEs.
Anyway, that doesn't help you much. Perhaps decreasing the size of the
index tables can help: are they with OIDs? If so, wouldn't it help to
recreate them without, so you save yourselves 4 bytes per word-document
pair, allowing the tables to fit in fewer pages and thereby speeding
up the seqscans.

This one I hadn't thought about ... for some reason, I thought that
WITHOUT OIDs was now the default ... looking at that one now ...

No, it's still the default to do it with oids.
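
A minimal sketch of such a rebuild, assuming ndict8 has only the three
columns used in this thread plus the n8_word index, and no other
dependencies (check \d ndict8 first):

CREATE TABLE ndict8_new (
    url_id  integer NOT NULL,   -- assumed types
    word_id integer NOT NULL,
    intag   integer NOT NULL
) WITHOUT OIDS;
INSERT INTO ndict8_new SELECT url_id, word_id, intag FROM ndict8;
DROP TABLE ndict8;
ALTER TABLE ndict8_new RENAME TO ndict8;
CREATE INDEX n8_word ON ndict8 (word_id);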
By the way, can a construction like (tablefield || '') ever use an index
in postgresql?

again, as shown in a previous email, the index is being used for the LIKE
query ... the big problem as I see it is that the result set from the LIKE
is ~20x larger than the result set for the = ... if there were some way
of telling the planner that going the LIKE route was the more expensive of
the two (even though table size seems to indicate the other way around), I
suspect that would improve things also ...

Yeah, I noticed. Hopefully Tom's suggestion will work to achieve that.

I can imagine how you feel about all this; I had to do a similar job a
year ago, but was less restricted by a preference like "it'd be a
nice postgresql showcase". But then again, our search engine is loaded
with an average of 24 queries per minute (peaking to over 100/minute in the
afternoons and evenings) and we didn't have any working solution (not
even a slow one).

Good luck,

Arjen van der Meijden

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #42
"Marc G. Fournier" <sc*****@postgresql.org> writes:
I don't see a difference between the two, other than the time changes, but
that could just be that runA had a server a bit more idle than runB ...
is there something I'm not seeing here?


Well, the difference I was hoping for was a more accurate rows estimate
for the indexscan, which indeed we got (estimate went from 3210 to
16075, vs reality of 15533). But it didn't change the plan :-(.

Looking more closely, I see the rows estimate for the seqscan on "url"
is pretty awful too (1968 vs reality of 304811). I think it would get
better if you were doing just
AND (url.url LIKE 'http://archives.postgresql.org/%%');
without the concatenation of an empty string. Is there a reason for the
concatenation part of the expression?
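
For reference, the query with that change applied would read:

SELECT ndict8.url_id, ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id = 417851441
  AND url.rec_id = ndict8.url_id
  AND url.url LIKE 'http://archives.postgresql.org/%%';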

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 12 '05 #43
On Thu, 1 Jan 2004, Arjen van der Meijden wrote:
That depends. If it were possible, you could decide from the search
usage stats to split /docs from the "the rest" of www.postgresql.org and
by that avoiding quite a bit of like's.


then what if you want to search:

http://www.postgresql.org/docs/7.4/%%

vs

http://www.postgresql.org/docs/7.3/%%

:)

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #44
"Marc G. Fournier" <sc*****@postgresql.org> writes:
The full first query:

SELECT ndict8.url_id,ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id=417851441
AND url.rec_id=ndict8.url_id
AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');

returns 13415 rows, and explain analyze shows:

Nested Loop (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
Index Cond: (word_id = 417851441)
-> Index Scan using url_rec_id on url (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
Index Cond: (url.rec_id = "outer".url_id)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 1520.145 ms
(7 rows)


The more I look at it, the more it seems that this is the best plan for
the query. Since the URL condition is very unselective (and will
probably be so in most all variants of this query), it just doesn't pay
to try to apply it before doing the join. What we want is to make the
join happen quickly, and not even bother applying the URL test until
after we have a joinable url entry.

(In the back of my mind here is the knowledge that mnogosearch is
optimized for mysql, which is too stupid to do the query in any way
other than a plan like the above.)

I think Bruce's original suggestion of clustering was right on, except
he guessed wrong about what to cluster. The slow part is the scan on
ndict8, as we saw in the later message:

-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
Index Cond: (word_id = 417851441)

Presumably, the first EXPLAIN shows the behavior when this portion of
ndict8 and its index have been cached, while the second EXPLAIN shows
what happens when they're not in cache. So my suggestion is to CLUSTER
ndict8 on n8_word. It might also help to CLUSTER url on url_rec_id.
Make sure the plan goes back to the nested indexscan as above (you might
need to undo the statistics-target changes).
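
In 7.4 syntax, that clustering would be (a sketch, using the index names
shown in the plans above):

CLUSTER n8_word ON ndict8;    -- physically order ndict8 by word_id
CLUSTER url_rec_id ON url;    -- physically order url by rec_id
ANALYZE ndict8;               -- refresh statistics after the rewrite
ANALYZE url;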

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 12 '05 #45
On Thu, 1 Jan 2004, Tom Lane wrote:
"Marc G. Fournier" <sc*****@postgresql.org> writes:
I don't see a difference between the two, other than the time changes, but
that could just be that runA had a server a bit more idle than runB ...
is there something I'm not seeing here?


Well, the difference I was hoping for was a more accurate rows estimate
for the indexscan, which indeed we got (estimate went from 3210 to
16075, vs reality of 15533). But it didn't change the plan :-(.

Looking more closely, I see the rows estimate for the seqscan on "url"
is pretty awful too (1968 vs reality of 304811). I think it would get
better if you were doing just
AND (url.url LIKE 'http://archives.postgresql.org/%%');
without the concatenation of an empty string. Is there a reason for the
concatenation part of the expression?


Believe it or not, the concatenation was based on a discussion *way* back
(2 years, maybe?) when we first started using Mnogosearch, in which you
suggested going that route ... in fact, at the time (bear in mind, this is
back in 7.2 days), it actually sped things up ...

Ok, with statistics set to 10, we now have:

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..31672.49 rows=1927 width=8) (actual time=117.064..54476.806 rows=13415 loops=1)
-> Index Scan using n8_word on ndict8 (cost=0.00..12567.73 rows=3210 width=8) (actual time=80.230..47844.752 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Index Scan using url_rec_id on url (cost=0.00..5.94 rows=1 width=4) (actual time=0.392..0.398 rows=1 loops=15533)
Index Cond: (url.rec_id = "outer".url_id)
Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 54555.011 ms
(7 rows)

And, at 1000 (and appropriate vacuum analyze on ndict8):

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=91613.33..92959.41 rows=9026 width=8) (actual time=12834.316..16726.018 rows=13415 loops=1)
Merge Cond: ("outer".url_id = "inner".rec_id)
-> Sort (cost=59770.57..59808.18 rows=15043 width=8) (actual time=776.823..849.798 rows=15533 loops=1)
Sort Key: ndict8.url_id
-> Index Scan using n8_word on ndict8 (cost=0.00..58726.82 rows=15043 width=8) (actual time=0.296..680.139 rows=15533 loops=1)
Index Cond: (word_id = 417851441)
-> Sort (cost=31842.76..32433.09 rows=236133 width=4) (actual time=12056.594..14159.852 rows=311731 loops=1)
Sort Key: url.rec_id
-> Index Scan using url_url on url (cost=0.00..10768.79 rows=236133 width=4) (actual time=225.243..8353.024 rows=304811 loops=1)
Index Cond: ((url >= 'http://archives.postgresql.org/'::text) AND (url < 'http://archives.postgresql.org0'::text))
Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 16796.932 ms
(12 rows)

Closer to what you were looking/hoping for?

Second run, @1000, shows:

Total runtime: 12194.016 ms
(12 rows)

Second run, after knocking her back down to 10, shows:

Total runtime: 58119.150 ms
(7 rows)

so we're definitely improved ... if this is the kind of results you were
hoping to see, then I guess the next step would be to increase/reanalyze all
the word_id columns ... what about the url.url column? should that be
done as well? what does that setting affect, *just* the time it takes to
analyze the table? from the verbose output, it looks like it is scanning
more rows on an analyze than @ 10 ... is this something that can be set
database-wide, before loading data? and/or is the default
currently just too low?
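
One way to see which columns already carry a non-default target (a
diagnostic sketch; attstattarget is -1 where the default applies):

SELECT attrelid::regclass, attname, attstattarget
FROM pg_attribute
WHERE attstattarget > 0;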

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Nov 12 '05 #46
On Thu, 1 Jan 2004, Tom Lane wrote:
"Marc G. Fournier" <sc*****@postgresql.org> writes:
The full first query:

SELECT ndict8.url_id,ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id=417851441
AND url.rec_id=ndict8.url_id
AND ((url.url || '') LIKE 'http://archives.postgresql.org/%%');

returns 13415 rows, and explain analyze shows:

Nested Loop (cost=0.00..30199.82 rows=17 width=8) (actual time=0.312..1459.504 rows=13415 loops=1)
-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=0.186..387.673 rows=15532 loops=1)
Index Cond: (word_id = 417851441)
-> Index Scan using url_rec_id on url (cost=0.00..5.45 rows=1 width=4) (actual time=0.029..0.050 rows=1 loops=15532)
Index Cond: (url.rec_id = "outer".url_id)
Filter: ((url || ''::text) ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 1520.145 ms
(7 rows)


The more I look at it, the more it seems that this is the best plan for
the query. Since the URL condition is very unselective (and will
probably be so in most all variants of this query), it just doesn't pay
to try to apply it before doing the join. What we want is to make the
join happen quickly, and not even bother applying the URL test until
after we have a joinable url entry.

(In the back of my mind here is the knowledge that mnogosearch is
optimized for mysql, which is too stupid to do the query in any way
other than a plan like the above.)

I think Bruce's original suggestion of clustering was right on, except
he guessed wrong about what to cluster. The slow part is the scan on
ndict8, as we saw in the later message:

-> Index Scan using n8_word on ndict8 (cost=0.00..12616.09 rows=3219 width=8) (actual time=113.645..79163.431 rows=15533 loops=1)
Index Cond: (word_id = 417851441)

Presumably, the first EXPLAIN shows the behavior when this portion of
ndict8 and its index have been cached, while the second EXPLAIN shows
what happens when they're not in cache. So my suggestion is to CLUSTER
ndict8 on n8_word. It might also help to CLUSTER url on url_rec_id.
Make sure the plan goes back to the nested indexscan as above (you might
need to undo the statistics-target changes).


k, so return statistics to the default, and run a CLUSTER on n8_word and
url_rec_id ... now, a question I asked previously, but I think Bruce might
have overlooked it ...

what sort of impact does CLUSTER have on the system? For instance, an
indexing run happens nightly, so I'm guessing that I'll have to CLUSTER each
table right after? Will successive CLUSTERs take less time than the initial
one? I'm guessing so, since the initial one will have 100% to sort, while
subsequent ones will have a smaller set to work with, but figured I'd ask
... from the man page, all I figure I need to do (other than the initial
time) is:

VACUUM;
CLUSTER;

With 7.4, VACUUM FULL isn't a requirement, but is it one if I'm going to do a
CLUSTER afterwards?
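
For what it's worth, a bare CLUSTER only revisits tables that were
clustered before; a quick sketch to check what it would pick up
(indisclustered marks the index used by the last CLUSTER):

SELECT indexrelid::regclass AS idx, indrelid::regclass AS tbl
FROM pg_index
WHERE indisclustered;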

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 12 '05 #47
"Marc G. Fournier" <sc*****@postgresql.org> writes:
On Thu, 1 Jan 2004, Tom Lane wrote:
Is there a reason for the
concatenation part of the expression?
Believe it or not, the concatenation was based on a discussion *way* back
(2 years, maybe?) when we first started using Mnogosearch, in which you
suggested going that route ... in fact, at the time (bear in mind, this is
back in 7.2 days), it actually sped things up ...
Hmm, I vaguely remember that ... I think we were deliberately trying to
fool the planner at that time, because it was making some stupid
assumption about the selectivity of the LIKE clause. It looks like that
problem is now mostly fixed, since your second example shows estimate of
236133 vs reality of 304811 rows for the URL condition:
-> Index Scan using url_url on url (cost=0.00..10768.79 rows=236133 width=4) (actual time=225.243..8353.024 rows=304811 loops=1)
Index Cond: ((url >= 'http://archives.postgresql.org/'::text) AND (url < 'http://archives.postgresql.org0'::text))
Filter: (url ~~ 'http://archives.postgresql.org/%%'::text)
Total runtime: 16796.932 ms
(12 rows)

Closer to what you were looking/hoping for?
This probably says that we can stop using the concatenation hack, at
least. I'd still suggest clustering the two tables as per my later
message. (Note that clustering would help this mergejoin plan too,
so it could come out to be a win relative to the nestloop indexscan,
but we ought to try both and see.)
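
One way to try both without touching the statistics again is to toggle the
planner for a session; a sketch, using the de-concatenated query:

SET enable_mergejoin = off;   -- push the planner back toward the nestloop
EXPLAIN ANALYZE
SELECT ndict8.url_id, ndict8.intag
FROM ndict8, url
WHERE ndict8.word_id = 417851441
  AND url.rec_id = ndict8.url_id
  AND url.url LIKE 'http://archives.postgresql.org/%%';
RESET enable_mergejoin;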
what does that setting affect, *just* the time it takes to
analyze the table?
Well, it will also bloat pg_statistic and slow down planning a little.
Can you try 100 and see if that gives reasonable estimates? 1000 is a
rather extreme setting I think; I'd go for 100 to start with.
is this something that can be set database-wide,


Yeah, see default_statistics_target in postgresql.conf.
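
For example (the stock default in 7.4 is 10; it takes effect for
subsequent ANALYZE runs):

# postgresql.conf
default_statistics_target = 100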

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 12 '05 #48
"Marc G. Fournier" <sc*****@postgresql.org> writes:
what sort of impact does CLUSTER have on the system? For instance, an
indexing run happens nightly, so I'm guessing that I'll have to CLUSTER each
table right after?


Depends; what does the "index" process do --- are ndict8 and friends
rebuilt from scratch?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 12 '05 #49
On Thu, 1 Jan 2004, Tom Lane wrote:
"Marc G. Fournier" <sc*****@postgresql.org> writes:
what sort of impact does CLUSTER have on the system? For instance, an
indexing run happens nightly, so I'm guessing that I'll have to CLUSTER each
table right after?


Depends; what does the "index" process do --- are ndict8 and friends
rebuilt from scratch?


nope, but heavily updated ... basically, the indexer looks at the url table
for which URLs need to be 're-indexed' ... if one does, it removes all words
from the ndict# tables that belong to that URL, and re-adds them accordingly ...
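
so, roughly, each re-indexed URL amounts to something like this (a sketch
with hypothetical values; the real indexer batches it):

BEGIN;
DELETE FROM ndict8 WHERE url_id = 12345;     -- drop the stale words for the URL
INSERT INTO ndict8 (word_id, url_id, intag)
VALUES (417851441, 12345, 1);                -- re-add the current words
COMMIT;

.... which is also why any CLUSTERed ordering on these tables will drift
over time.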

----
Marc G. Fournier Hub.Org Networking Services (http://www.hub.org)
Email: sc*****@hub.org Yahoo!: yscrappy ICQ: 7615664

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 12 '05 #50
