Bytes IT Community

Read from database, write to another database, simultaneously

I am working on a simple script to read from one database (oracle) and
write to another (postgresql). I retrieve the data from oracle in
chunks and drop the data to postgresql continuously. The author of one
of the python database clients mentioned that using one thread to
retrieve the data from the oracle database and another to insert the
data into postgresql with something like a pipe between the two threads
might make sense, keeping both IO streams busy. Any hints on how to
get started?
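The two-thread idea with a pipe (queue) between reader and writer can be sketched minimally like this; the chunk data and the `reader`/`writer` names are stand-ins for illustration, not the real database calls:

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream on the queue

def reader(q):
    # Stand-in for the Oracle side: a real version would loop over
    # cursor.fetchmany() until it returns an empty list.
    for chunk in (["row1", "row2"], ["row3"]):
        q.put(chunk)
    q.put(SENTINEL)

def writer(q, sink):
    # Stand-in for the PostgreSQL side: a real version would insert
    # each chunk instead of appending it to a list.
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        sink.extend(chunk)

rows = []
q = queue.Queue(maxsize=4)  # bounded, so the reader blocks if the writer lags
t_read = threading.Thread(target=reader, args=(q,))
t_write = threading.Thread(target=writer, args=(q, rows))
t_read.start(); t_write.start()
t_read.join(); t_write.join()
print(rows)  # ['row1', 'row2', 'row3']
```

The bounded queue is the important design point: it keeps both IO streams busy while preventing the reader from buffering the whole result set in memory.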

Thanks,
Sean

Jan 10 '07 #1
12 Replies


Sean Davis wrote:
> The author of one of the python database clients mentioned that
> using one thread to retrieve the data from the oracle database and
> another to insert the data into postgresql with something like a
> pipe between the two threads might make sense, keeping both IO
> streams busy.
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,
Björn

--
BOFH excuse #194:

We only support a 1200 bps connection.

Jan 10 '07 #2

Bjoern Schliessmann wrote:
> Sean Davis wrote:
>> The author of one of the python database clients mentioned that
>> using one thread to retrieve the data from the oracle database and
>> another to insert the data into postgresql with something like a
>> pipe between the two threads might make sense, keeping both IO
>> streams busy.
>
> IMHO he's wrong. Network interaction is quite slow compared with CPU
> performance, so there's no gain (maybe even overhead due to thread
> management and locking stuff). That's true even on multiprocessor
> machines, not only because there's almost nothing to compute but
> only IO traffic. CMIIW.
>
> Using multiplexing, you'll get good results with simple code without
> the danger of deadlocks. Have a look at asyncore (standard library)
> or the Twisted framework -- personally, I prefer the latter.
>
> Regards,
> Björn
Sean, you can't win - everyone has a different idea! You need to explain
that Oracle has millions of records and ask whether it's possible to keep
a pipe open to feed the Postgres side.

One thing I didn't get - is this a one-time transfer or something that is
going to happen often?

For a one-time transfer, the timing issue matters much less.

Johnf
Jan 11 '07 #3

Bjoern Schliessmann wrote:
> Sean Davis wrote:
>> The author of one of the python database clients mentioned that
>> using one thread to retrieve the data from the oracle database and
>> another to insert the data into postgresql with something like a
>> pipe between the two threads might make sense, keeping both IO
>> streams busy.
>
> IMHO he's wrong. Network interaction is quite slow compared with CPU
> performance, so there's no gain (maybe even overhead due to thread
> management and locking stuff). That's true even on multiprocessor
> machines, not only because there's almost nothing to compute but
> only IO traffic. CMIIW.

Not so sure: there is little CPU work in the Python script, but there may
be CPU + disk activity on the database sides [with cache management and
other optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on one
database performed in parallel with an insert on the other database.
After that, he must know whether the two databases use the same disks,
same controller, same host... or not.

But, if it's only a do-once job, maybe the optimization is not really
necessary.
Jan 11 '07 #4

On Jan 10, 9:27 pm, johnf <jfabi...@yolo.com> wrote:
> Bjoern Schliessmann wrote:
>> Sean Davis wrote:
>>> The author of one of the python database clients mentioned that
>>> using one thread to retrieve the data from the oracle database and
>>> another to insert the data into postgresql with something like a
>>> pipe between the two threads might make sense, keeping both IO
>>> streams busy.
>> IMHO he's wrong. Network interaction is quite slow compared with CPU
>> performance, so there's no gain (maybe even overhead due to thread
>> management and locking stuff). That's true even on multiprocessor
>> machines, not only because there's almost nothing to compute but
>> only IO traffic. CMIIW.
>> Using multiplexing, you'll get good results with simple code without
>> the danger of deadlocks. Have a look at asyncore (standard library)
>> or the Twisted framework -- personally, I prefer the latter.
>> Regards,
>
> Sean, you can't win - everyone has a different idea! You need to explain
> that Oracle has millions of records and ask whether it's possible to keep
> a pipe open to feed the Postgres side.
>
> One thing I didn't get - is this a one-time transfer or something that is
> going to happen often?
Yes, some detail about the problem is definitely in order!

We have a collaborator that is going to maintain a large genome
database that is a component of a postgresql database that we currently
maintain. There are going to be consumers of the oracle data using
both mysql and postgresql. The oracle database is LARGE with around
100,000,000 rows spread over some 50-70 tables in multiple schemas.
The idea is that as publicly available data (from various datasources
on the web) become available, the oracle team will update the oracle
database, doing all the parsing and necessary data cleanup of the raw
data. We then want to be able to update postgres with these oracle
data. So the process may be done only once per month on some tables,
but as often as once a day on others.

As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the psycopg2 copy_from
function, which expects an open file-like object (one that has read and
readline functionality) and is quite fast for loading data. Note the
disconnect here--Oracle is coming in in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so it will need
to be buffered.
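One way to bridge that disconnect without a temporary file is a small file-like adapter over a Queue. This is only a sketch -- the class and method names here are invented, not part of psycopg2 -- and it assumes the producer thread puts complete, newline-terminated lines:

```python
import queue

class QueueFile:
    """File-like adapter over a Queue of text lines (illustrative only).

    read() and readline() are the two methods psycopg2's copy_from()
    calls on its file argument; the producer thread is expected to
    put_line() whole newline-terminated lines and finish with
    close_input().
    """

    def __init__(self, maxsize=1000):
        self.q = queue.Queue(maxsize=maxsize)  # bounds memory use
        self.done = False

    def put_line(self, line):
        self.q.put(line)            # called from the Oracle reader thread

    def close_input(self):
        self.q.put(None)            # sentinel: no more data is coming

    def readline(self, size=-1):
        if self.done:
            return ""
        item = self.q.get()         # blocks until the producer supplies a line
        if item is None:
            self.done = True
            return ""
        return item

    def read(self, size=-1):
        # Naive: drain whole lines until EOF or roughly `size` characters.
        chunks, total = [], 0
        while size < 0 or total < size:
            line = self.readline()
            if not line:
                break
            chunks.append(line)
            total += len(line)
        return "".join(chunks)

# usage sketch
f = QueueFile()
f.put_line("1\talpha\n")
f.put_line("2\tbeta\n")
f.close_input()
print(repr(f.read()))  # '1\talpha\n2\tbeta\n'
```

The bounded queue gives the buffering the caveat asks for: the Oracle reader blocks once the queue is full, so memory stays capped while both sides run.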

I hope that clarifies things.

Thanks,
Sean

Jan 11 '07 #5

On Jan 11, 3:20 am, Laurent Pointal <laurent.poin...@limsi.fr> wrote:
> Bjoern Schliessmann wrote:
>> Sean Davis wrote:
>>> The author of one of the python database clients mentioned that
>>> using one thread to retrieve the data from the oracle database and
>>> another to insert the data into postgresql with something like a
>>> pipe between the two threads might make sense, keeping both IO
>>> streams busy.
>> IMHO he's wrong. Network interaction is quite slow compared with CPU
>> performance, so there's no gain (maybe even overhead due to thread
>> management and locking stuff). That's true even on multiprocessor
>> machines, not only because there's almost nothing to compute but
>> only IO traffic. CMIIW.
>
> Not so sure, there is low CPU in the Python script, but there may be
> CPU+disk activity on the database sides [with cache management and other
> optimizations on disk access].
> So, with a reader thread and a writer thread, he can have a select on a
> database performed in parallel with an insert on the other database.
> After, he must know if the two databases use same disks, same
> controller, same host... or not.
Some more detail:

The machine running the script is distinct from the Oracle machine,
which is distinct from the Postgresql machine. So, CPU usage is low,
and because of the independent machines for the database end, it is
definitely possible to read from one database while writing to the
other. That is the solution that I am looking for, and Dennis's post
seems pretty close to what I need. I will have to use some kind of
buffer. A Queue isn't quite right as it stands, as the data is coming
in as records, but for postgres loading, a file-like stream is what I
need, so there will need to be a wrapper around the Queue on the
get() side. Or is there a better way to go about this detail? What
seems to make sense to me is to stringify the incoming oracle data into
some kind of buffer and then read on the postgresql side.
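Stringifying a record might look like the following; note this is a simplified hand-rolled rendering of the COPY text format (tab-separated, \N for NULL), not psycopg2's own serializer, and `row_to_copy_line` is an invented name:

```python
def row_to_copy_line(row):
    """Render one DB-API row as a line of PostgreSQL COPY text format.

    Simplified: tabs separate columns, \\N marks NULL, and backslash,
    tab, and newline characters inside values are escaped.
    """
    fields = []
    for value in row:
        if value is None:
            fields.append(r"\N")
        else:
            text = str(value)
            text = (text.replace("\\", "\\\\")
                        .replace("\t", "\\t")
                        .replace("\n", "\\n"))
            fields.append(text)
    return "\t".join(fields) + "\n"

print(repr(row_to_copy_line((1, "alpha", None))))  # '1\talpha\t\\N\n'
```

Lines produced this way could be fed into the queue on the Oracle side and read out as a stream on the postgresql side.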

Thanks,
Sean

Jan 11 '07 #6


Sean Davis wrote:
> at the same time that the data is coming in? So, I am actually looking
> for a solution to this problem that doesn't require an intermediate
> file and allows simultaneous reading and writing, with the caveat that
> the data cannot all be read into memory simultaneously, so will need to
> be buffered.

IMO the problem that you need to solve is not well suited for the
Python DB-API, as this API is meant to support programming and
interacting with the database, not streaming large quantities of data
from one database into another.

I agree with another opinion in this thread.

All you need is a simple way to pipe the output from Oracle into
Postgresql. Just run the oracle client and start dumping to the
standard output. Pipe it through sed (or a python program) to apply
whatever minor reformatting fixes you need (you might not even need
this step), then continue piping it right into psql.
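The pipeline shape being suggested, sketched with the dump stage faked by printf so the cleanup step actually runs (the real endpoints would be the Oracle client and psql; cleaned.tsv is just an illustrative filename):

```shell
# Real shape (not runnable here):
#   sqlplus ... | tr '|' '\t' | psql theNewDatabase
# Runnable demo of the middle (cleanup) stage only:
printf '1|alpha\n2|beta\n' | tr '|' '\t' > cleaned.tsv
cat cleaned.tsv
```

tr stands in for sed here since the fix is a single character substitution; anything heavier would justify the python cleanup script.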

i.

Jan 11 '07 #7

Sean Davis wrote:
> As for the specifics, Oracle data is going to be coming in as a DB-API
> 2 cursor in manageable chunks (and at a relatively slow pace). On the
> postgres loading side, I wanted to use the psycopg2 copy_from
> function, which expects an open file-like object (that has read and
> readline functionality) and is quite fast for loading data.

And which seems to use the COPY FROM command in PostgreSQL...

> Note the disconnect here--Oracle is coming in in discrete chunks, while
> postgresql is looking for a file object. I solved this problem by
> creating a temporary file as an intermediary, but why wait for Oracle
> to finish dumping data when I can potentially be loading into postgres
> at the same time that the data is coming in?
My experience along with the PostgreSQL documentation tells me that you
shouldn't worry about this problem too much: using COPY FROM is *far*
faster than performing many inserts or updates. The only thing you'd
need to worry about is data being copied into the database that
duplicates existing data, causing constraint violations, but since
you're already using this method I imagine that this is not a likely
problem.

Paul

Jan 11 '07 #8

Laurent Pointal wrote:
> Not so sure, there is low CPU in the Python script,

Yes.

> but there may be CPU+disk activity on the database sides [with
> cache management and other optimizations on disk access].

That's it. So data queues up on the database side and you won't get
much value from faked concurrency with CPU cycles.

> So, with a reader thread and a writer thread, he can have a select
> on a database performed in parallel with an insert on the other
> database.

Explain. Remember, threads aren't really working concurrently. Even
on a multiprocessor machine you have constraints for IO traffic.
(And the GIL exists too.)

> But, if it's only a do-once job, maybe the optimization is not
> really necessary.

I still don't understand how threads would help optimize a task
that largely depends on IO and will probably be executed on one
CPU.

Regards,
Björn

--
BOFH excuse #14:

sounds like a Windows problem, try calling Microsoft support

Jan 11 '07 #9

Sean Davis wrote:
> I solved this problem by creating a temporary file as an
> intermediary, but why wait for Oracle to finish dumping data when
> I can potentially be loading into postgres at the same time that
> the data is coming in? So, I am actually
> looking for a solution to this problem that doesn't require an
> intermediate file and allows simultaneous reading and writing,
> with the caveat that the data cannot all be read into memory
> simultaneously, so will need to be buffered.

The functions you use don't seem very suited for such a
streaming-like task.

Regards,
Björn

--
BOFH excuse #282:

High altitude condensation from U.S.A.F prototype aircraft has
contaminated the primary subnet mask. Turn off your computer for 9
days to avoid damaging it.

Jan 11 '07 #10

On 1/11/07, Dennis Lee Bieber <wl*****@ix.netcom.com> wrote:
> On 11 Jan 2007 04:49:21 -0800, "Sean Davis" <se******@gmail.com>
> declaimed the following in comp.lang.python:
>
>> The machine running the script is distinct from the Oracle machine
>> which is distinct from the Postgresql machine. So, CPU usage is low
>
> Which somewhat cancels the idea of just using pipes to pass data
> directly... Unless the command-line clients can be run "locally" and
> connect to the remote servers.
>
>> buffer. A Queue isn't quite right as it stands, as the data is coming
>> in as records, but for postgres loading, a file-like stream is what I
>> need, so there will need to be either a wrapper around the Queue on the
>
> "need"? No... the "copy ... from ..." statement "needs" such, but do
> you really "need" to use "copy ... from ..." to load the data -- or is
> this just a concept that caught your fancy?
>
> If you can get Oracle to dump the data in the form of SQL insert
> statements, maybe with some clean-up done by a script, and feed them to
> PostgreSQL, the entire job starts to look something like:
>
>     odump theDatabase | python cleanup.py | psql theNewDatabase
>
> {note: command line tool names are made up for the example}
>
> "copy ... from ..." appears to be designed to work with formatted
> text files... Something that might have been produced as a tabular
> report by almost any tool. As such, you are basically bypassing most of
> the DB-API capability. If Oracle has a "select" variant that can specify
> a start-offset and length (or a server-side cursor in the DB-API so the
> full data is not transferred in one "chunk"), and the PostgreSQL DB-API
> supports an "executemany" operation, the threading pseudocode I showed
> could still be used. Instead of queueing single records, one could queue
> a multiple-record "chunk" (a "fetchmany" specifying 100, 500, or
> whatever, records); the data would be a list of tuples, so no concerns
> about parsing fields from a text record. Then an "executemany" would
> submit the chunk at one time.
>
>> get() side. Or is there a better way to go about this detail? What
>> seems to make sense to me is to stringify the incoming oracle data into
>> some kind of buffer and then read on the postgresql side.
>
> My view: avoid using any commands that result in formatting the data
> as text reports... Use DB-API interfaces to keep the data as fields.
>
>     ocrsr.execute("select * from table")
>     while True:
>         batch = ocrsr.fetchmany(100)
>         if not batch:
>             break
>         pcrsr.executemany("insert into table values (%s, %s)", batch)
>
> {this is a sequential sample -- but using threads just means the ocrsr
> operations are in one thread, batch is the data relayed via a queue, and
> pcrsr operations are in the other thread}
> --
> Wulfraed Dennis Lee Bieber KD6MOG
> wl*****@ix.netcom.com wu******@bestiaria.com
> HTTP://wlfraed.home.netcom.com/
> (Bestiaria Support Staff: we******@bestiaria.com)
> HTTP://www.bestiaria.com/
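The threaded fetchmany/executemany variant described above can be made concrete; here is a runnable sketch in which two in-memory SQLite databases stand in for Oracle and PostgreSQL (the table, row count, and batch size are invented for the demo):

```python
import queue
import sqlite3
import threading

# Two in-memory databases stand in for Oracle (src) and PostgreSQL (dst).
src = sqlite3.connect(":memory:", check_same_thread=False)
dst = sqlite3.connect(":memory:", check_same_thread=False)
src.execute("create table t (id integer, name text)")
src.executemany("insert into t values (?, ?)",
                [(i, "row%d" % i) for i in range(250)])
dst.execute("create table t (id integer, name text)")

q = queue.Queue(maxsize=4)   # bounded: the reader blocks if the writer lags
DONE = object()

def reader():
    # Fetch multi-record chunks from the source and relay them via the queue.
    cur = src.cursor()
    cur.execute("select id, name from t order by id")
    while True:
        batch = cur.fetchmany(100)
        if not batch:
            break
        q.put(batch)
    q.put(DONE)

def writer():
    # Drain chunks from the queue and submit each with executemany.
    cur = dst.cursor()
    while True:
        batch = q.get()
        if batch is DONE:
            break
        cur.executemany("insert into t values (?, ?)", batch)
    dst.commit()

t_read = threading.Thread(target=reader)
t_write = threading.Thread(target=writer)
t_read.start(); t_write.start()
t_read.join(); t_write.join()
print(dst.execute("select count(*) from t").fetchone()[0])  # 250
```

Since the data stays as lists of tuples end to end, there is no text formatting or parsing anywhere in the relay, which is the point of the DB-API approach.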

Using the DB-API for this is the wrong approach. COPY FROM, which is the
PostgreSQL equivalent of Oracle's SQL*Loader, is orders of magnitude
faster than sequential inserts, and it's really what you want for bulk
data loads. What you're trying to do is read from and write to the file
at the same time, but across a network interface; that's just asking for
trouble. In particular, I doubt that postgres expects to have
something writing to the end of the file it is loading and will
probably error out if it hits an unexpected EOF, which is what will
happen if the writer lags behind.

Just dump the files from oracle, copy them, and load them. It'll
probably be faster in the long run when you count all the time you
lose re-starting the processing and troubleshooting your
"optimization".
Jan 11 '07 #11

Dennis Lee Bieber wrote:
> "need"? No... the "copy ... from ..." statement "needs" such, but do
> you really "need" to use "copy ... from ..." to load the data -- or is
> this just a concept that caught your fancy?
In chapter 13 of the PostgreSQL 8.1 documentation ("Performance Tips"),
descending into section 13.4 ("Populating a Database"), whose
introduction may not admittedly coincide precisely with the inquirer's
use case ("first populating a database" doesn't strictly apply, but it
may yet describe a situation that is close enough given the kind of
data involved), and within section 13.4.2 ("Use COPY"), the text reads
as follows:

"Use COPY to load all the rows in one command, instead of using a
series of INSERT commands."

http://www.postgresql.org/docs/8.1/s...LATE-COPY-FROM
> If you can get Oracle to dump the data in the form of SQL insert
> statements, maybe with some clean-up done by a script, and feed them to
> PostgreSQL the entire job starts to look something like:
>
>     odump theDatabase | python cleanup.py | psql theNewDatabase
The problem with this approach (as the manual notes) is that COPY is a
lot faster than lots of INSERT statements:

"Note that loading a large number of rows using COPY is almost always
faster than using INSERT, even if PREPARE is used and multiple
insertions are batched into a single transaction."

My experience with inserting large numbers of records into PostgreSQL
suggests that this advice should not be readily ignored.

Paul

Jan 11 '07 #12

Dennis Lee Bieber wrote:
> The problem though, is that the original poster claimed the Oracle
> data was being emitted in multiple chunks, not as a single output -- and
> they want to avoid collecting the data in a temporary file...
I think he wanted to know whether concurrent reads from Oracle and
writes to PostgreSQL would be faster or more efficient. I'd argue that
the temporary file approach is likely to be faster and more efficient
(disk space usage for temporary files disregarded). Sure, if you can
have two concurrent processes with minimal contention, one reading from
Oracle, one writing to PostgreSQL, where the PostgreSQL one never lags
behind and thus completes shortly after the Oracle data has been
completely downloaded, then that might be the ideal solution, but I'm
inclined to believe that unless the stream of data from Oracle were
arriving really slowly, it wouldn't work out that way.
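For comparison, the temporary-file approach being defended here is only a few lines. A sketch: spool the rows to a tab-separated temp file, then hand the rewound file to the loader (with psycopg2 the load stage would be a copy_from call on the file; the sample rows are made up):

```python
import csv
import tempfile

rows = [(1, "alpha"), (2, "beta")]      # stand-in for the Oracle cursor

with tempfile.TemporaryFile("w+", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    f.seek(0)                           # rewind before handing f to the loader
    dumped = f.read()                   # the loader would read from here

print(repr(dumped))  # '1\talpha\n2\tbeta\n'
```

The temp file decouples the two stages completely, so a slow PostgreSQL load can never stall the Oracle fetch, at the cost of disk space and total elapsed time.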

Anyway, that's what my experiences with PostgreSQL suggest so far.
Further commentary on the topic is, as noted, available in the manual.

Paul

Jan 12 '07 #13

This discussion thread is closed

Replies have been disabled for this discussion.