New to the list; would this be an okay question?

Madison Kelly

Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is really
bad. I have been reading as much as I can on optimizing performance but
still it isn't very reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up but now it is
substantially slower and I can't seem to figure out what I am doing wrong.

Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes it's money back I think they will fully GPL
it (I am on my boss's case about it :p ).

Thanks all!

Madison Kelly

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #1

Subscribe Post Reply

2229

Martijn van Oosterhout

Standard questions:
- Have you VACUUMed?
- Have you VACUUM ANALYZEd?
- Have you done EXPLAIN ANALYZE on the complex queries?
- Have you put INDEXes on the appropriate columns.

You need to give more details is you want more detailed answers.

On Mon, Jun 21, 2004 at 09:38:14AM -0400, Madison Kelly wrote:

Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is really
bad. I have been reading as much as I can on optimizing performance but
still it isn't very reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up but now it is
substantially slower and I can't seem to figure out what I am doing wrong.

Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes it's money back I think they will fully GPL
it (I am on my boss's case about it :p ).

Thanks all!

Madison Kelly

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFA1ujOY5Twig3Ge+YRAvAqAKC1taBrD99lD9UpU+LX1J fHfW26CQCgoaSj
bfMFOqq/anFUVDdoDBREkRY=
=kX5D
-----END PGP SIGNATURE-----

Nov 23 '05 #2

Martijn van Oosterhout

Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is really
bad. I have been reading as much as I can on optimizing performance but
still it isn't very reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up but now it is
substantially slower and I can't seem to figure out what I am doing wrong.

Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes it's money back I think they will fully GPL
it (I am on my boss's case about it :p ).

Thanks all!

Madison Kelly

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

Nov 23 '05 #3

Richard Huxton

Madison Kelly wrote:

Hi all,

I am new to the list and I didn't want to seem rude at all so I wanted
to ask if this was okay first.
No problem. Reading your message below, you might want to try the
performance list, but general is a good place to start.
I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is really
bad. I have been reading as much as I can on optimizing performance but
still it isn't very reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up but now it is
substantially slower and I can't seem to figure out what I am doing wrong.
A few places to start:
1. VACUUM FULL
This will make sure any unused space is reclaimed
2. ANALYZE
This will recalculate stats for the tables
3. Basic performce tuning:
http://www.varlena.com/varlena/Gener...bits/index.php
There's also a good guide to the postgresql.conf file on varlena.com
Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes it's money back I think they will fully GPL
it (I am on my boss's case about it :p ).

No problem - what you licence your software under is your concern. Once
you've taken the basic steps described above, try to pick out a specific
query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables.
4. Output of EXPLAIN ANALYZE <query here>

The EXPLAIN ANALYZE runs the query and shows how much work PG thought it
would be and how much it actually turned out to be.

HTH
--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #4

Madison Kelly

Sorry; I didn't include details at first because I wanted to make sure
that was an approprate request for this list.

I have the program run 'VACUUM ANALYZE' after every major update/insert
job and I have in fact indexed the three columns that I search through
when I need to decide to update (if the record exists) or insert (if it
does not). I have read about the EXPLAIN option/tool but I haven't been
able to get my head around how to properly use it yet.

Here is what I am trying to do:

The program is a Linux backup program that uses a web front-end as it's
interface (so that a client can access it from any system, inc. MS
workstations). The program is also designed to allow the user to search
for a given file or file by spec (file size, date modified, etc) on
media that is offline. The program is designed with externally connected
USB2 and firewire drives so it is all partition-based.

In order to make the web front-end stateful and to allow for the ability
to search I needed to keep in the database detailed information on every
file and directory on a given partition. Some of the information also
needs to be maintained so I can't just clear and rescan a partition. For
example, I need to keep track of what directories and files a user has
selected or not selected to be backed up in a given partition. This
means that whenever I need to update the contents of a partition I need
to run 'ls' starting at the mount point for the partition and scanning
down through all sub directories.

As each file is scanned I check the database to see if the file name I
am looking at already exists in the database. I do this by searching for
the file_name, File_parent_dir (parent directory) and file_src_uuid (the
UUID [serial number] of the partition the file is on). If there is a
match I run an "UPDATE" where the backup state is not touched. If the
file is new then the file is added to the database along with all of
it's particular information such as owning user, group, permissions,
filesize and so on.

Given that some file systems have 250,000 files and directories I need
to make sure that the database calls are as optimized as I can make
them. I have verified that the lag is in the database by commenting out
the actual database calls and letting the program traverse the file
system. In that case a job that with the database calls in place and
took nearly 200 seconds finishes in roughly 2 seconds.

If this is an okay request I would be happy to post the schema I am
using and the perl code I am using to make the DB calls.

Thanks!!

Madison Kelly

Martijn van Oosterhout wrote:

Standard questions:
- Have you VACUUMed?
- Have you VACUUM ANALYZEd?
- Have you done EXPLAIN ANALYZE on the complex queries?
- Have you put INDEXes on the appropriate columns.

You need to give more details is you want more detailed answers.

On Mon, Jun 21, 2004 at 09:38:14AM -0400, Madison Kelly wrote:
Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is really
bad. I have been reading as much as I can on optimizing performance but
still it isn't very reasonable. At one point I had my program able to
process 175,000 records in 16min 10sec on a Pentium3 650MHz, 448MB RAM
test machine. Since then I got a Pentium3 1GHz, 512MB system and I have
tried a lot of things to get the performance up but now it is
substantially slower and I can't seem to figure out what I am doing wrong.

Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is for
my company but it will be released for free to home users and the source
code will be made available (similar to other split-license programs)
though once my company makes it's money back I think they will fully GPL
it (I am on my boss's case about it :p ).

Thanks all!

Madison Kelly

---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #5

Madison Kelly

Richard Huxton wrote:

Madison Kelly wrote:
Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

No problem. Reading your message below, you might want to try the
performance list, but general is a good place to start.
I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is
really bad. I have been reading as much as I can on optimizing
performance but still it isn't very reasonable. At one point I had my
program able to process 175,000 records in 16min 10sec on a Pentium3
650MHz, 448MB RAM test machine. Since then I got a Pentium3 1GHz,
512MB system and I have tried a lot of things to get the performance
up but now it is substantially slower and I can't seem to figure out
what I am doing wrong.

A few places to start:
1. VACUUM FULL
This will make sure any unused space is reclaimed
2. ANALYZE
This will recalculate stats for the tables
3. Basic performce tuning:
http://www.varlena.com/varlena/Gener...bits/index.php
There's also a good guide to the postgresql.conf file on varlena.com
Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is
for my company but it will be released for free to home users and the
source code will be made available (similar to other split-license
programs) though once my company makes it's money back I think they
will fully GPL it (I am on my boss's case about it :p ).

No problem - what you licence your software under is your concern. Once
you've taken the basic steps described above, try to pick out a specific
query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables.
4. Output of EXPLAIN ANALYZE <query here>

The EXPLAIN ANALYZE runs the query and shows how much work PG thought it
would be and how much it actually turned out to be.

HTH

Thank you very much!! I am using Psql 7.4 on a stock install of Fedora
Core 2 on my IBM thinkpad a22m (P3 1GHz, 512MB RAM, not the fastest
HDD). The drive carrier I am using is connected via USB2 and uses a few
different hard drives with the fastest being a couple of Barracuda
7200.7 drives (2MB cache, 7,200rpm). I described the program in my reply
to Martijn so here is some of the code (code not related to psql
snipped, let me know if posting it would help - sorry for the wrapping...):

=-[ Calling the database ]-=
# Open the connection to the database
my $DB = DBI->connect("DBI:Pg:dbname=$db_name","$user")|| die("Connect
error (Is PostgresSQL running?): $DBI::errstr");

# Prepare the select statements before using them for speed:
$select_sth = $DB->prepare("SELECT null FROM file_dir WHERE
file_src_uuid=? AND file_parent_dir=? AND file_name=?") || die
"$DBI::errstr";
$select_up = $DB->prepare("UPDATE file_dir SET file_perm=?,
file_own_user=?, file_own_grp=?, file_size=?, file_mod_date=?,
file_mod_time=?, file_mod_time_zone=?, file_exist=? WHERE
file_src_uuid=? AND file_parent_dir=? AND file_name=?") || die
"$DBI::errstr";
$select_in = $DB->prepare("INSERT INTO file_dir ( file_src_uuid,
file_name, file_dir, file_parent_dir, file_perm, file_own_user,
file_own_grp, file_size, file_mod_date, file_mod_time,
file_mod_time_zone, file_backup, file_restore, file_display, file_exist
) VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )") || die
"$DBI::errstr";

# Set the 'file_exist' flag to 'false' and reset exiting files to 'true'.
$DB->do("UPDATE file_dir SET file_exist='f' WHERE
file_src_uuid='$file_src_uuid'") || die "$DBI::errstr";

# Start scanning the drive
$num = $select_sth->execute($file_src_uuid,$relative_dir,$file_name ) ||
die "$DBI::errstr";
if ( $num > 0 )
{
$select_up->execute($file_perm,$file_own_user,$file_own_grp,$ file_size,$file_mod_date,$file_mod_time,$file_mod_ time_zone,$file_exist,$file_src_uuid,$file_parent_ dir,$file_name)
|| die "$DBI::errstr";
}
else
{

$select_in->execute($file_src_uuid,$file_name,$file_dir,$file _parent_dir,$file_perm,$file_own_user,$file_own_gr p,$file_size,$file_mod_date,$file_mod_time,$file_m od_time_zone,$file_backup,$file_restore,$file_disp lay,$file_exist)
|| die "$DBI::errstr";
}

# We need to grab the existing file settings for the special file '/.'
$DBreq=$DB->prepare("SELECT file_backup, file_restore, file_display FROM
file_dir WHERE file_parent_dir='/' AND file_name='.' AND
file_src_uuid='$file_src_uuid'") || die $DBI::errstr;
$file_backup=$DBreq->execute();
@file_backup=$DBreq->fetchrow_array();
$file_backup=@file_backup[0];
$file_restore=@file_backup[1];
$file_display=@file_backup[2];

# Jump into the re-entrant subroutine to scan directories and sub-dirs
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
$file_restore, $file_display);

# Inside the sub routine

# Does the directory/file/symlink already exist? (there are three of
these for each file type)
$num = $select_sth->execute($file_src_uuid,$relative_dir,$file_name ) ||
die "$DBI::errstr";
if ( $num > 0 )
{

$select_up->execute($file_perm,$file_own_user,$file_own_grp,$ file_size,$file_mod_date,$file_mod_time,$file_mod_ time_zone,$file_exist,$file_src_uuid,$file_parent_ dir,$file_name)
|| die "$DBI::errstr";
}
else
{
# The file did not exist so we will use the passed parent settings
for the 'file_backup' flag and leave the 'file_display' flag set to 'f'

$select_in->execute($file_src_uuid,$file_name,$file_dir,$file _parent_dir,$file_perm,$file_own_user,$file_own_gr p,$file_size,$file_mod_date,$file_mod_time,$file_m od_time_zone,$file_backup,$file_restore,$file_disp lay,$file_exist)
|| die "$DBI::errstr";
}

# If this was a file I would loop and process the next file in the
directory, if it was a directory itself I would now re-enter the
subroutine to process it's contents and when I fell back I would pick up
where I left off

# Returning from the final subroutine and finishing up

$DB->do("VACUUM ANALYZE");

=-[ finished DB related source code ]-=

Here is the schema for the 'file_dir' table which I hit repeatedly here:

=-[ file_dir table and index schemas ]-=

CREATE TABLE file_dir ( -- Used to store info
on every file on source partitions
file_id serial unique, -- make
this 'bigserial' if there may be more than 2 billion files in the database
file_src_uuid varchar(40) not null, -- the
UUID of the source partition hosting the original file
file_org_uuid varchar(40), -- the UUID
that the file came from (when the file was moved by TLE-BU)
file_name varchar(255) not null, -- Name of
the file or directory
file_dir bool not null, -- t = is
directory, f = file
file_parent_dir varchar(255) not null, -- if
directory '/foo/bar', parent is '/foo', if file '/foo/bar/file', parent
is '/foo/bar'. The mount directory is treated as '/' so any directories
below it will be ignored for this record.
file_perm varchar(10) not null, -- file or
directory permissions
file_own_user varchar(255) not null, -- The
file's owning user (by name, not UID!!)
file_own_grp varchar(255) not null, -- The
file's owning group (by name, not GID!!)
file_size bigint not null, -- File size in
bytes
file_mod_date varchar(12) not null, -- File's
last edited date
file_mod_time varchar(20) not null, -- File's
last edited time
file_mod_time_zone varchar(6) not null, -- File's
last edited time zone
file_backup boolean not null default 'f', --
't' = Include in backup jobs, 'f' = Do not include in backup jobs
file_restore boolean not null default 'f', --
't' = Include in restore jobs, 'f' = Do not include in restore jobs
file_display boolean not null default 'f', --
't' = display, 'f' = hide
file_exist boolean default 't' -- Used to
catch files that have been deleted since the last scan. Before rescan,
all files in a given src_uuid are set to 0 (deleted) and then as each
file is found or updated it is reset back to 1 (exists) and anything
left with a value of '0' at the end of the scan is deleted and we will
remove their record.
);

-- CREATE INDEX file_dir_idx ON file_dir
(file_src_uuid,file_name,file_parent_dir);

=-[ Finish file_dir table and index schemas ]-=

Thanks so much!!

Madison
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match

Nov 23 '05 #6

Richard Huxton

Madison Kelly wrote:

Richard Huxton wrote:
Madison Kelly wrote:
Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

No problem. Reading your message below, you might want to try the
performance list, but general is a good place to start.
I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is
really bad. I have been reading as much as I can on optimizing
performance but still it isn't very reasonable. At one point I had my
program able to process 175,000 records in 16min 10sec on a Pentium3
650MHz, 448MB RAM test machine. Since then I got a Pentium3 1GHz,
512MB system and I have tried a lot of things to get the performance
up but now it is substantially slower and I can't seem to figure out
what I am doing wrong.

A few places to start:
1. VACUUM FULL
This will make sure any unused space is reclaimed
2. ANALYZE
This will recalculate stats for the tables
3. Basic performce tuning:
http://www.varlena.com/varlena/Gener...bits/index.php
There's also a good guide to the postgresql.conf file on varlena.com
Would it be appropriate to ask for help on my program on this list?
Full disclosure: The program won't be initially GPL'ed because it is
for my company but it will be released for free to home users and the
source code will be made available (similar to other split-license
programs) though once my company makes it's money back I think they
will fully GPL it (I am on my boss's case about it :p ).

No problem - what you licence your software under is your concern.
Once you've taken the basic steps described above, try to pick out a
specific query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables.
4. Output of EXPLAIN ANALYZE <query here>

The EXPLAIN ANALYZE runs the query and shows how much work PG thought
it would be and how much it actually turned out to be.

HTH

Thank you very much!! I am using Psql 7.4 on a stock install of
Fedora Core 2 on my IBM thinkpad a22m (P3 1GHz, 512MB RAM, not the
fastest HDD). The drive carrier I am using is connected via USB2 and
uses a few different hard drives with the fastest being a couple of
Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program
in my reply to Martijn so here is some of the code (code not related to
psql snipped, let me know if posting it would help - sorry for the
wrapping...):

I'm not clear if the database is on the local disk or attached to the
USB2. Not sure it's important, since neither will be that fast.

If I understand, you scan thousands or millions of files for backup
purposes and then issue a select + update/insert for each.

Once a partition is scanned, a flag is cleared on all rows.

Once all selected files have been dealt with a vaccum/analyse is issued.
Some things to look at:
1. How many files are you handling per second? Are the disks involved in
the backup as well as the database?
2. What does the output of "vmstat 10" show when the system is running.
Is your I/O saturated? CPU?
3. Is your main index (file_src_uuid,file_name,file_parent_dir) being
used? Your best bet is to select from "pg_stat_indexes" before and after.
4. If you are updating several hundred thousand rows then you probably
don't have enought vacuum memory set aside - try a vacuum full after
each set of updates.
5. You might want to batch together queries into transactions of a few
hundred or even few thousand updates.
--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Nov 23 '05 #7

Christopher Browne

After a long battle with technology, de*@archonet.com (Richard Huxton), an earthling, wrote:

5. You might want to batch together queries into transactions of a few
hundred or even few thousand updates.

When this particular application got discussed on local LUG mailing
list, this emerged as being one of the factors most likely to be a Big
Deal.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www3.sympatico.ca/cbbrowne/lsf.html
"A hack is a terrible thing to waste, please give to the
implementation of your choice..." -- GJC

Nov 23 '05 #8

Richard Huxton

Christopher Browne wrote:

After a long battle with technology, de*@archonet.com (Richard Huxton), an earthling, wrote:
5. You might want to batch together queries into transactions of a few
hundred or even few thousand updates.

When this particular application got discussed on local LUG mailing
list, this emerged as being one of the factors most likely to be a Big
Deal.

Yep, except... Madison said a laptop was involved, so I'm guessing it's
an IDE drive lying about sync-ing. If fsync is effectively off that
shouldn't have such a huge effect should it?

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Nov 23 '05 #9

Martijn van Oosterhout

On Mon, Jun 21, 2004 at 08:29:54PM +0100, Richard Huxton wrote:

Christopher Browne wrote:
When this particular application got discussed on local LUG mailing
list, this emerged as being one of the factors most likely to be a Big
Deal.
Yep, except... Madison said a laptop was involved, so I'm guessing it's
an IDE drive lying about sync-ing. If fsync is effectively off that
shouldn't have such a huge effect should it?

The IDE drive lying about syncing is different from fsync being turned
off. What the drive thinks doesn't matter until after Postgres has
written the WAL, closed the transaction and written the pages out. The
fsync will still cause Linux to wait for all the data to be written to
the disk, which is still a finite amount of time, the disk buffer is
only a few MB. Turning fsync off means Linux will never wait, just
buffer in system memory. Similarly, putting it all in one transaction
means that within the transaction there is no waiting, only in
transaction commit.

With fsync on, in/not it transaction can make a really big difference.
--
Martijn van Oosterhout <kl*****@svana.org> http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQFA10RsY5Twig3Ge+YRAub8AKDdT8VvaF1ntIuYJICjgR GsAyPHHwCeOb3K
H+ZhAmqCE9e+BxKVSha1E5E=
=MOdb
-----END PGP SIGNATURE-----

Nov 23 '05 #10

Madison Kelly

Richard Huxton wrote:

Madison Kelly wrote:
Richard Huxton wrote:
Madison Kelly wrote:

Hi all,

I am new to the list and I didn't want to seem rude at all so I
wanted to ask if this was okay first.

No problem. Reading your message below, you might want to try the
performance list, but general is a good place to start.

I have a program I have written in perl which uses a postgresSQL
database as the backend. The program works but the performance is
really bad. I have been reading as much as I can on optimizing
performance but still it isn't very reasonable. At one point I had
my program able to process 175,000 records in 16min 10sec on a
Pentium3 650MHz, 448MB RAM test machine. Since then I got a Pentium3
1GHz, 512MB system and I have tried a lot of things to get the
performance up but now it is substantially slower and I can't seem
to figure out what I am doing wrong.

A few places to start:
1. VACUUM FULL
This will make sure any unused space is reclaimed
2. ANALYZE
This will recalculate stats for the tables
3. Basic performce tuning:
http://www.varlena.com/varlena/Gener...bits/index.php
There's also a good guide to the postgresql.conf file on varlena.com

Would it be appropriate to ask for help on my program on this
list? Full disclosure: The program won't be initially GPL'ed because
it is for my company but it will be released for free to home users
and the source code will be made available (similar to other
split-license programs) though once my company makes it's money back
I think they will fully GPL it (I am on my boss's case about it :p ).

No problem - what you licence your software under is your concern.
Once you've taken the basic steps described above, try to pick out a
specific query that you think is too slow and provide:

1. PostgreSQL version
2. Basic hardware info (as you have)
3. Sizes of tables.
4. Output of EXPLAIN ANALYZE <query here>

The EXPLAIN ANALYZE runs the query and shows how much work PG thought
it would be and how much it actually turned out to be.

HTH

Thank you very much!! I am using Psql 7.4 on a stock install of
Fedora Core 2 on my IBM thinkpad a22m (P3 1GHz, 512MB RAM, not the
fastest HDD). The drive carrier I am using is connected via USB2 and
uses a few different hard drives with the fastest being a couple of
Barracuda 7200.7 drives (2MB cache, 7,200rpm). I described the program
in my reply to Martijn so here is some of the code (code not related
to psql snipped, let me know if posting it would help - sorry for the
wrapping...):

I'm not clear if the database is on the local disk or attached to the
USB2. Not sure it's important, since neither will be that fast.

If I understand, you scan thousands or millions of files for backup
purposes and then issue a select + update/insert for each.

Once a partition is scanned, a flag is cleared on all rows.

Once all selected files have been dealt with a vaccum/analyse is issued.
Some things to look at:
1. How many files are you handling per second? Are the disks involved in
the backup as well as the database?
2. What does the output of "vmstat 10" show when the system is running.
Is your I/O saturated? CPU?
3. Is your main index (file_src_uuid,file_name,file_parent_dir) being
used? Your best bet is to select from "pg_stat_indexes" before and after.
4. If you are updating several hundred thousand rows then you probably
don't have enought vacuum memory set aside - try a vacuum full after
each set of updates.
5. You might want to batch together queries into transactions of a few
hundred or even few thousand updates.

Hi Richard,

The database is on the system's local disk and the destination drives
are on the USB-connected drives. It is possible to include the server's
local disk in a backup job. Without the DB calls I was able to process
~4000 files in ~2secs.

'vmstat 10' shows (system running):
r 0; b 0; swapd 3396; free 192356; buff 7084; cache 186508; si 0; so 0;
bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0

'vmstat 10' shows (while the program is running):
r 1; b 0; swapd 480; free 180252; buff 8600; cache 195412; si 0; so 0;
bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0

I'm afraid that I am pretty new to postgres (and programming period)
so I am not sure if the indexes are being used. I assumed they where
simple because I created them but then again assumptions always get me
in trouble. :p When you say "select from "pg_stat_indexes"" do you mean
select from there those three field instead of 'SELECT <blah> FROM
file_dir'?.

I currently have "VACUUM ANALYZE" set to run after every big set of
UPDATE/INSERTs but that isn't enough? Would it also help if I kept a
count of how many records had been processed and running "VACUUM
{ANALYZE:FULL}" every so many 'x' records?

I have been trying to bunch jobs into a single transaction but for
some reason whenever I try to do that PERL dies on me with an empty
"$DBI:errstr" error (it shows what line it died on but it is blank where
it normally says what went wrong). I was trying to do this like this:

$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
# Jump into the sub-routine:
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
$file_restore, $file_display);
$DB->do("COMMIT;") || die $DBI::errstr;

Inside the subroutine I kept a count of how much time went by and to
commit the current transaction every 250secs and start a new transaction
block:

if ( $rec_num > $rec_disp )
{
$rec_disp=($rec_disp+500);
$nowtime=time;
$processtime=($nowtime - $starttime);
if ( $processtime >= $committime )
{
$committime=$committime+250;
$DB->do("COMMIT") || die $DBI::errstr;
$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
system 'echo " |- Committed processed records to the database,
updating committime to \''.$committime.'\' and proceeding:" >> '.$log;
}
system 'echo " |- Processing record number \''.$rec_num.'\',
\''.$processtime.'\' seconds into operation." >> '.$log;
}
Is what I am doing wrong obvious at all? When I enter "BEGIN
TRANSACTION" directly into 'psql' it seems to work fine so I am sure the
syntax is right. Is it a perl prolem maybe?

Thank you so much for your help!!

Madison

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Nov 23 '05 #11

Richard Huxton

Madison Kelly wrote:

The database is on the system's local disk and the destination drives
are on the USB-connected drives. It is possible to include the server's
local disk in a backup job. Without the DB calls I was able to process
~4000 files in ~2secs.

'vmstat 10' shows (system running):
r 0; b 0; swapd 3396; free 192356; buff 7084; cache 186508; si 0; so 0;
bi 0; bo 5; in 1023; cs 82; us 1; sy 0; id 99; wa 0

'vmstat 10' shows (while the program is running):
r 1; b 0; swapd 480; free 180252; buff 8600; cache 195412; si 0; so 0;
bi 6; bo 163; in 1037; cs 281; us 97; sy 3; id 0; wa 0
Hmm - well, your CPU usage is up to 97% userland (us 97) and your disk
activity is up (bi 6; bo 163) as you'd expect.
I'm afraid that I am pretty new to postgres (and programming period)
so I am not sure if the indexes are being used. I assumed they where
simple because I created them but then again assumptions always get me
in trouble. :p When you say "select from "pg_stat_indexes"" do you mean
select from there those three field instead of 'SELECT <blah> FROM
file_dir'?.
No - there are a set of statistics views called pg_stat_xxx (see ch 23.2
of the online manuals). For example, I have a unique index on
content_core.cc_id:
SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup _fetch
FROM pg_stat_user_indexes
WHERE relname = 'content_core';
indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
content_core_pkey | 717 | 717 | 717
(1 row)

SELECT * FROM content_core WHERE cc_id=2;
....

SELECT relname,indexrelname,idx_scan,idx_tup_read,idx_tup _fetch FROM
pg_stat_user_indexes
WHERE relname = 'content_core';
indexrelname | idx_scan | idx_tup_read | idx_tup_fetch
-------------------+----------+--------------+---------------
content_core_pkey | 718 | 718 | 718
As you can see, the index was used for this query.

I currently have "VACUUM ANALYZE" set to run after every big set of
UPDATE/INSERTs but that isn't enough? Would it also help if I kept a
count of how many records had been processed and running "VACUUM
{ANALYZE:FULL}" every so many 'x' records?

I have been trying to bunch jobs into a single transaction but for
some reason whenever I try to do that PERL dies on me with an empty
"$DBI:errstr" error (it shows what line it died on but it is blank where
it normally says what went wrong). I was trying to do this like this:

$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
# Jump into the sub-routine:
&list_files($real_dir, $exclude_list_num, $relative_dir, $file_backup,
$file_restore, $file_display);
$DB->do("COMMIT;") || die $DBI::errstr;
You can do this, I think better practice is supposed to be:
$DB->begin_work;
...
$DB->commit;

You might want to turn on statement logging for PostgreSQL's logs -
details in the manuals (Ch 16.4)
Inside the subroutine I kept a count of how much time went by and to
commit the current transaction every 250secs and start a new transaction
block:

if ( $rec_num > $rec_disp )
{
$rec_disp=($rec_disp+500);
$nowtime=time;
$processtime=($nowtime - $starttime);
if ( $processtime >= $committime )
{
$committime=$committime+250;
$DB->do("COMMIT") || die $DBI::errstr;
$DB->do("BEGIN TRANSACTION") || die $DBI::errstr;
system 'echo " |- Committed processed records to the database,
updating committime to \''.$committime.'\' and proceeding:" >> '.$log;
}
system 'echo " |- Processing record number \''.$rec_num.'\',
\''.$processtime.'\' seconds into operation." >> '.$log;
}
Is what I am doing wrong obvious at all? When I enter "BEGIN
TRANSACTION" directly into 'psql' it seems to work fine so I am sure the
syntax is right. Is it a perl prolem maybe?

Nothing leaping out at me - although I'd do something like:
use IO::File;
my $log = new IO::File("> logfile.txt");
print $log "Processing record number $rec_num\n";
$log->close;
Or even just:
print STDERR "Processing...";
Note there is just a space between the filehandle and string to print.

Turn logging up in PostgreSQL (which will slow things down of course)
and then get the transaction blocks working, and we'll see what happens
then. Oh, and don't forget the tuning document on varlena.com

--
Richard Huxton
Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 23 '05 #12

New to the list; would this be an okay question?

Similar topics