
Good way to insert/update when you're not sure of duplicates?

I'm in a situation where I need to load an ASCII file into a database. No
sweat so far, right? The records are indexed by date and I have a unique index
on date. The ASCII files can overlap, meaning a file can contain duplicate
dates that have already been loaded.

I started out with this ingenious idea: just try to insert the record; if I
get an error about duplicate keys, do an update and switch to update mode. If
I'm updating and get zero rows updated, switch back to insert. Works fine,
except this whole thing is in a transaction, and any error aborts the
transaction so I can't commit it. So I decided I'd have to try to update
every single record and, if zero rows were updated, do the insert. This takes
a looooong time: many hours for just 86,000 records or so.

Is there a standard way to do this? I can't imagine I'm the only guy who needs
to do this kind of thing.

I'm using Java with the postgresql JDBC driver if that matters.
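For anyone reading this in the archives: modern PostgreSQL (9.5 and later) has a built-in answer to exactly this question, INSERT ... ON CONFLICT DO UPDATE. A minimal JDBC sketch, where the table measurements(obs_date date primary key, value double precision) and the connection details are made-up placeholders:

```java
import java.sql.*;

public class UpsertLoader {
    // ON CONFLICT (PostgreSQL 9.5+) resolves duplicates server-side,
    // so no duplicate-key error ever aborts the transaction.
    static final String UPSERT_SQL =
        "INSERT INTO measurements (obs_date, value) VALUES (?, ?) " +
        "ON CONFLICT (obs_date) DO UPDATE SET value = EXCLUDED.value";

    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust for your setup.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/mydb", "user", "secret");
             PreparedStatement ps = con.prepareStatement(UPSERT_SQL)) {
            con.setAutoCommit(false);
            ps.setDate(1, Date.valueOf("2003-09-25"));
            ps.setDouble(2, 42.0);
            ps.addBatch();          // queue each parsed record...
            ps.executeBatch();      // ...and send them in one round trip
            con.commit();
        }
    }
}
```

Because the insert-or-update decision happens inside the server, the whole file can be loaded in a single transaction with no error handling per row.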

Thanks for any suggestions.

Curtis Stanford
cu****@stanfordcomputing.com
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

Nov 12 '05 #1
9 Replies


On Thursday 25 September 2003 02:23, Curtis Stanford wrote:
I'm in a situation where I need to load an ASCII file into a database. [snip]

Is there a standard way to do this? I can't imagine I'm the only guy who
needs to do this kind of thing.


Try inserting a batch of 1024. If you get an error, drop down to 512 and try
that. Repeat until the batch works or you've reached a size of 1.
If the batch worked, try the next set of records and repeat. If you reached a
batch size of 1 with no success, switch to updating, and repeat the cycle,
increasing your batch size as you go.

You might find it quickest to halve the batch size while you're hitting
problems, then double it while things work. The right balance is going to
depend on how many insert vs. update rows you have.
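The halving/doubling loop above can be sketched without a database by simulating the unique index with an in-memory set: the set stands in for the table, and a duplicate key makes the whole "batch" fail, the way a batch insert inside a transaction would. All names here are hypothetical; in real code insertBatch would be a JDBC batch and the failure a duplicate-key SQLException.

```java
import java.util.*;

public class AdaptiveBatchLoader {
    // Simulated table: a set of keys with a "unique index". The insert is
    // all-or-nothing, like a batch inside a transaction: if any key already
    // exists, the whole batch fails and nothing is added.
    static boolean insertBatch(Set<Integer> table, List<Integer> batch) {
        for (int k : batch)
            if (table.contains(k)) return false;   // duplicate-key "error"
        table.addAll(batch);
        return true;
    }

    // Stand-in for "UPDATE ... WHERE key = ?" on a single known-duplicate row.
    static void updateBatch(Set<Integer> table, List<Integer> batch) {
        table.addAll(batch);
    }

    // Load all records: halve the batch size on failure, double it again
    // (up to maxBatch) after each success, update when stuck at size 1.
    static void load(Set<Integer> table, List<Integer> records, int maxBatch) {
        int size = maxBatch, i = 0;
        while (i < records.size()) {
            int end = Math.min(i + size, records.size());
            List<Integer> batch = records.subList(i, end);
            if (insertBatch(table, batch)) {
                i = end;
                size = Math.min(size * 2, maxBatch);   // it worked: grow
            } else if (size > 1) {
                size = Math.max(size / 2, 1);          // it failed: shrink
            } else {
                updateBatch(table, batch);             // single duplicate row
                i = end;
            }
        }
    }

    public static void main(String[] args) {
        Set<Integer> table = new HashSet<>(Arrays.asList(3, 7)); // pre-loaded
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) records.add(i);
        load(table, records, 16);
        System.out.println(table.size());  // prints 100: every key loaded once
    }
}
```

The few duplicates cost a handful of shrinking retries, while long runs of fresh records fly through at the full batch size.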

--
Richard Huxton
Archonet Ltd


Nov 12 '05 #2

On Thursday 25 September 2003 18:32, Curtis Stanford wrote:
On September 25, 2003 11:27 am, Richard Huxton wrote:

[snip]
You might find it quickest to halve batch-size while having problems then
doubling while it works. The balance is going to depend on how many
insert vs update rows you have.


Hey, thanks! I actually got the time down to around 1000 rows/sec. I was
passing one of the int arguments in the update WHERE clause as a double.
I don't know why, but changing it to an int, as it should be, drastically
reduced the time. Your solution is very interesting; I'll probably try it
and see which is faster.


That'll be the infamous PG typecast issue with indexes. PG is quite flexible
about letting you define your own types, operators, etc. To balance this, it's
also quite careful about not typecasting things unless it's told to. You tend
to notice this when function calls and indexes aren't used when they should be.

If the index had been on a double and you'd passed an int, it might have
figured it out, but I tend to be very strict about my types in any case.
Actually, that's probably a good thing most of the time.
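The mismatch can be seen in miniature in plain Java: a map keyed by Integer silently misses when probed with a double, much as the planner won't use an int-column index for a double comparison. (The HashMap is only an analogy for the lookup behaviour, not how PostgreSQL works internally.)

```java
import java.util.*;

public class TypeMismatchDemo {
    public static void main(String[] args) {
        Map<Integer, String> index = new HashMap<>();
        index.put(5, "row for key 5");

        // Probing with the matching type finds the entry.
        System.out.println(index.get(5));    // prints "row for key 5"

        // Probing with a double-typed key misses entirely, because
        // Integer.valueOf(5) never equals Double.valueOf(5.0).
        System.out.println(index.get(5.0));  // prints "null"
    }
}
```

The database analogue of the miss isn't a null result but a sequential scan: the index is skipped and every row is checked, which is why one mistyped parameter turned the load into a many-hour job.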

--
Richard Huxton
Archonet Ltd


Nov 12 '05 #3

Richard Huxton wrote:
On Thursday 25 September 2003 02:23, Curtis Stanford wrote:
I'm in a situation where I need to load an ASCII file into a database. [snip]

Try inserting a batch of 1024. If you get an error, drop down to 512 and try
that. Repeat until the batch works or you've reached a size of 1. [snip]


In addition to that, you can try inserting from multiple backends
simultaneously to speed up the whole process.

And I don't like the modes idea the OP gave. I would rather follow an
insert-then-update-on-error approach in a transaction for each record, and
fork say 20-40 parallel backends to achieve good speed.
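The parallel idea looks roughly like this: split the file into chunks and hand each chunk to a worker. In this sketch a ConcurrentHashMap stands in for the table (put() behaves like an upsert); in real code each worker would open its own JDBC connection and commit per record or per small batch, so one failure can't poison the rest.

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelLoader {
    // "Upsert" every record in a chunk: put() inserts or overwrites,
    // standing in for insert-or-update against the real table.
    static void loadChunk(ConcurrentMap<Integer, Double> table,
                          List<Integer> chunk) {
        for (int key : chunk)
            table.put(key, key * 1.0);
    }

    // Split the records across a fixed pool of workers, one chunk each,
    // and wait for all of them to finish.
    static void loadParallel(ConcurrentMap<Integer, Double> table,
                             List<Integer> records, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunkSize = (records.size() + workers - 1) / workers;
        for (int i = 0; i < records.size(); i += chunkSize) {
            final List<Integer> chunk =
                records.subList(i, Math.min(i + chunkSize, records.size()));
            pool.submit(() -> loadChunk(table, chunk));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ConcurrentMap<Integer, Double> table = new ConcurrentHashMap<>();
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++)
            records.add(i % 500);          // overlapping keys on purpose
        loadParallel(table, records, 8);
        System.out.println(table.size());  // prints 500: one row per key
    }
}
```

With real connections, 20-40 workers as suggested may overwhelm a small server; starting near the number of CPU cores or disks and measuring is a safer bet.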

Shridhar

Nov 12 '05 #4

On September 25, 2003 11:27 am, Richard Huxton wrote:
On Thursday 25 September 2003 02:23, Curtis Stanford wrote:
I'm in a situation where I need to load an ASCII file into a database. [snip]

Try inserting a batch of 1024. If you get an error, drop down to 512 and
try that. [snip] The balance is going to depend on how many insert vs.
update rows you have.


Hey, thanks! I actually got the time down to around 1000 rows/sec. I was
passing one of the int arguments in the update WHERE clause as a double. I
don't know why, but changing it to an int, as it should be, drastically
reduced the time. Your solution is very interesting; I'll probably try it
and see which is faster.

Nov 12 '05 #5

Shridhar Daithankar wrote:

In addition to that, you can try inserting from multiple backends
simultaneously to speed up the whole process. [snip]

I didn't think multiple backends worked.

Nov 12 '05 #6

On Friday 26 September 2003 20:48, Dennis Gearon wrote:
Shridhar Daithankar wrote:
In addition to that, you can try inserting from multiple backends
simultaneously to speed up the whole process. [snip]

I didn't think multiple backends worked.


Why? Why can't you insert using multiple backends? I mean, is there a problem?

Shridhar

Nov 12 '05 #7

On Friday 26 September 2003 21:08, Dennis Gearon wrote:
corruption of the databases, if I'm correct.

Shridhar Daithankar wrote:
On Friday 26 September 2003 20:48, Dennis Gearon wrote:
I didn't think multiple backends worked.

Why? Why can't you insert using multiple backends? I mean, is there a
problem?


How does the database get corrupted? If you have 20 backends inserting their
own chunks of data in separate transactions, where does database corruption
come into the picture?

That's the whole idea of transactions.

Shridhar

Nov 12 '05 #8

Well, I may have my merms tixed up!

There is ONE component of Postgres that can only access the database by
itself. Maybe it's not the 'backend'; it's probably the postmaster. Are they
the same? I don't know.

Shridhar Daithankar wrote:
[snip]

How does the database get corrupted? If you have 20 backends inserting their
own chunks of data in separate transactions, where does database corruption
come into the picture?

That's the whole idea of transactions.

Shridhar


Nov 12 '05 #9

On Fri, 26 Sep 2003, Shridhar Daithankar wrote:
[snip]

How does the database get corrupted? If you have 20 backends inserting their
own chunks of data in separate transactions, where does database corruption
come into the picture?

That's the whole idea of transactions.


I think Dennis is confusing multiple backends (legal and safe; each
connection gets its own backend) with multiple postmasters (not legal, very
unsafe, causes corruption, etc.).

Nov 12 '05 #10
