Bytes IT Community

Read from database, write to another database, simultaneously

I am working on a simple script to read from one database (oracle) and
write to another (postgresql). I retrieve the data from oracle in
chunks and drop the data to postgresql continuously. The author of one
of the python database clients mentioned that using one thread to
retrieve the data from the oracle database and another to insert the
data into postgresql with something like a pipe between the two threads
might make sense, keeping both IO streams busy. Any hints on how to
get started?
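The two-thread idea with a pipe (queue) between reader and writer can be sketched minimally like this; the chunk data and the `reader`/`writer` names are stand-ins for illustration, not the real database calls:

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream on the queue

def reader(q):
    # Stand-in for the Oracle side: a real version would loop over
    # cursor.fetchmany() until it returns an empty list.
    for chunk in (["row1", "row2"], ["row3"]):
        q.put(chunk)
    q.put(SENTINEL)

def writer(q, sink):
    # Stand-in for the PostgreSQL side: a real version would insert
    # each chunk instead of appending it to a list.
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            break
        sink.extend(chunk)

rows = []
q = queue.Queue(maxsize=4)  # bounded, so the reader blocks if the writer lags
t_read = threading.Thread(target=reader, args=(q,))
t_write = threading.Thread(target=writer, args=(q, rows))
t_read.start(); t_write.start()
t_read.join(); t_write.join()
print(rows)  # ['row1', 'row2', 'row3']
```

The bounded queue is the important design point: it keeps both IO streams busy while preventing the reader from buffering the whole result set in memory.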

Thanks,
Sean

Jan 10 '07 #1
12 Replies


Sean Davis wrote:
> The author of one of the python database clients mentioned that
> using one thread to retrieve the data from the oracle database and
> another to insert the data into postgresql with something like a
> pipe between the two threads might make sense, keeping both IO
> streams busy.
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,
Björn

--
BOFH excuse #194:

We only support a 1200 bps connection.

Jan 10 '07 #2

Bjoern Schliessmann wrote:
> Sean Davis wrote:
>> The author of one of the python database clients mentioned that
>> using one thread to retrieve the data from the oracle database and
>> another to insert the data into postgresql with something like a
>> pipe between the two threads might make sense, keeping both IO
>> streams busy.
>
> IMHO he's wrong. Network interaction is quite slow compared with CPU
> performance, so there's no gain (maybe even overhead due to thread
> management and locking stuff). That's true even on multiprocessor
> machines, not only because there's almost nothing to compute but
> only IO traffic. CMIIW.
>
> Using multiplexing, you'll get good results with simple code without
> the danger of deadlocks. Have a look at asyncore (standard library)
> or the Twisted framework -- personally, I prefer the latter.
>
> Regards,
> Björn
Sean, you can't win - everyone has a different idea! You need to explain
that Oracle has millions of records and ask whether it's possible to keep
a pipe open to feed the Postgres side.

One thing I didn't get - is this a one-time transfer or something that is
going to happen often?

For a one-time transfer, the timing issue matters much less.

Johnf
Jan 11 '07 #3

Bjoern Schliessmann wrote:
> Sean Davis wrote:
>> The author of one of the python database clients mentioned that
>> using one thread to retrieve the data from the oracle database and
>> another to insert the data into postgresql with something like a
>> pipe between the two threads might make sense, keeping both IO
>> streams busy.
>
> IMHO he's wrong. Network interaction is quite slow compared with CPU
> performance, so there's no gain (maybe even overhead due to thread
> management and locking stuff). That's true even on multiprocessor
> machines, not only because there's almost nothing to compute but
> only IO traffic. CMIIW.

Not so sure: there is little CPU work in the Python script, but there may
be CPU + disk activity on the database sides [with cache management and
other optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on one
database performed in parallel with an insert on the other database.
After that, he must know whether the two databases use the same disks,
same controller, same host... or not.

But, if it's only a do-once job, maybe the optimization is not really
necessary.
Jan 11 '07 #4

On Jan 10, 9:27 pm, johnf <jfabi...@yolo.com> wrote:
> Bjoern Schliessmann wrote:
>> Sean Davis wrote:
>>> The author of one of the python database clients mentioned that
>>> using one thread to retrieve the data from the oracle database and
>>> another to insert the data into postgresql with something like a
>>> pipe between the two threads might make sense, keeping both IO
>>> streams busy.
>> IMHO he's wrong. Network interaction is quite slow compared with CPU
>> performance, so there's no gain (maybe even overhead due to thread
>> management and locking stuff). That's true even on multiprocessor
>> machines, not only because there's almost nothing to compute but
>> only IO traffic. CMIIW.
>> Using multiplexing, you'll get good results with simple code without
>> the danger of deadlocks. Have a look at asyncore (standard library)
>> or the Twisted framework -- personally, I prefer the latter.
>> Regards,
>
> Sean, you can't win - everyone has a different idea! You need to explain
> that Oracle has millions of records and ask whether it's possible to keep
> a pipe open to feed the Postgres side.
>
> One thing I didn't get - is this a one-time transfer or something that is
> going to happen often?
Yes, some detail about the problem is definitely in order!

We have a collaborator that is going to maintain a large genome
database that is a component of a postgresql database that we currently
maintain. There are going to be consumers of the oracle data using
both mysql and postgresql. The oracle database is LARGE with around
100,000,000 rows spread over some 50-70 tables in multiple schemas.
The idea is that as publicly available data (from various datasources
on the web) become available, the oracle team will update the oracle
database, doing all the parsing and necessary data cleanup of the raw
data. We then want to be able to update postgres with these oracle
data. So the process may be done only once per month on some tables,
but as often as once a day on others.

As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the psycopg2 copy_from
function, which expects an open file-like object (one that has read and
readline functionality) and is quite fast for loading data. Note the
disconnect here--Oracle is coming in in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so it will need
to be buffered.
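One way to bridge that disconnect without a temporary file is a small file-like adapter over a Queue. This is only a sketch -- the class and method names here are invented, not part of psycopg2 -- and it assumes the producer thread puts complete, newline-terminated lines:

```python
import queue

class QueueFile:
    """File-like adapter over a Queue of text lines (illustrative only).

    read() and readline() are the two methods psycopg2's copy_from()
    calls on its file argument; the producer thread is expected to
    put_line() whole newline-terminated lines and finish with
    close_input().
    """

    def __init__(self, maxsize=1000):
        self.q = queue.Queue(maxsize=maxsize)  # bounds memory use
        self.done = False

    def put_line(self, line):
        self.q.put(line)            # called from the Oracle reader thread

    def close_input(self):
        self.q.put(None)            # sentinel: no more data is coming

    def readline(self, size=-1):
        if self.done:
            return ""
        item = self.q.get()         # blocks until the producer supplies a line
        if item is None:
            self.done = True
            return ""
        return item

    def read(self, size=-1):
        # Naive: drain whole lines until EOF or roughly `size` characters.
        chunks, total = [], 0
        while size < 0 or total < size:
            line = self.readline()
            if not line:
                break
            chunks.append(line)
            total += len(line)
        return "".join(chunks)

# usage sketch
f = QueueFile()
f.put_line("1\talpha\n")
f.put_line("2\tbeta\n")
f.close_input()
print(repr(f.read()))  # '1\talpha\n2\tbeta\n'
```

The bounded queue gives the buffering the caveat asks for: the Oracle reader blocks once the queue is full, so memory stays capped while both sides run.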

I hope that clarifies things.

Thanks,
Sean

Jan 11 '07 #5

On Jan 11, 3:20 am, Laurent Pointal <laurent.poin...@limsi.fr> wrote:
> Bjoern Schliessmann wrote:
>> Sean Davis wrote:
>>> The author of one of the python database clients mentioned that
>>> using one thread to retrieve the data from the oracle database and
>>> another to insert the data into postgresql with something like a
>>> pipe between the two threads might make sense, keeping both IO
>>> streams busy.
>> IMHO he's wrong. Network interaction is quite slow compared with CPU
>> performance, so there's no gain (maybe even overhead due to thread
>> management and locking stuff). That's true even on multiprocessor
>> machines, not only because there's almost nothing to compute but
>> only IO traffic. CMIIW.
>
> Not so sure, there is low CPU in the Python script, but there may be
> CPU+disk activity on the database sides [with cache management and other
> optimizations on disk access].
> So, with a reader thread and a writer thread, he can have a select on a
> database performed in parallel with an insert on the other database.
> After, he must know if the two databases use same disks, same
> controller, same host... or not.
Some more detail:

The machine running the script is distinct from the Oracle machine,
which is distinct from the Postgresql machine. So, CPU usage is low,
and because of the independent machines for the database end, it is
definitely possible to read from one database while writing to the
other. That is the solution that I am looking for, and Dennis's post
seems pretty close to what I need. I will have to use some kind of
buffer. A Queue isn't quite right as it stands, as the data is coming
in as records, but for postgres loading, a file-like stream is what I
need, so there will need to be a wrapper around the Queue on the
get() side. Or is there a better way to go about this detail? What
seems to make sense to me is to stringify the incoming oracle data into
some kind of buffer and then read on the postgresql side.
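Stringifying a record might look like the following; note this is a simplified hand-rolled rendering of the COPY text format (tab-separated, \N for NULL), not psycopg2's own serializer, and `row_to_copy_line` is an invented name:

```python
def row_to_copy_line(row):
    """Render one DB-API row as a line of PostgreSQL COPY text format.

    Simplified: tabs separate columns, \\N marks NULL, and backslash,
    tab, and newline characters inside values are escaped.
    """
    fields = []
    for value in row:
        if value is None:
            fields.append(r"\N")
        else:
            text = str(value)
            text = (text.replace("\\", "\\\\")
                        .replace("\t", "\\t")
                        .replace("\n", "\\n"))
            fields.append(text)
    return "\t".join(fields) + "\n"

print(repr(row_to_copy_line((1, "alpha", None))))  # '1\talpha\t\\N\n'
```

Lines produced this way could be fed into the queue on the Oracle side and read out as a stream on the postgresql side.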

Thanks,
Sean

Jan 11 '07 #6


Sean Davis wrote:
> at the same time that the data is coming in? So, I am actually looking
> for a solution to this problem that doesn't require an intermediate
> file and allows simultaneous reading and writing, with the caveat that
> the data cannot all be read into memory simultaneously, so will need to
> be buffered.

IMO the problem that you need to solve is not well suited for the
Python DB-API, as this API is meant to support programming and
interacting with the database, not streaming large quantities of data
from one database into another.

I agree with another opinion in this thread.

All you need is a simple way to pipe the output from Oracle into
Postgresql. Just run the oracle client and start dumping to the
standard output. Pipe it through sed (or a python program) to apply
whatever minor reformatting fixes you need (you might not even need
this step), then continue piping it right into psql.
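The pipeline shape being suggested, sketched with the dump stage faked by printf so the cleanup step actually runs (the real endpoints would be the Oracle client and psql; cleaned.tsv is just an illustrative filename):

```shell
# Real shape (not runnable here):
#   sqlplus ... | tr '|' '\t' | psql theNewDatabase
# Runnable demo of the middle (cleanup) stage only:
printf '1|alpha\n2|beta\n' | tr '|' '\t' > cleaned.tsv
cat cleaned.tsv
```

tr stands in for sed here since the fix is a single character substitution; anything heavier would justify the python cleanup script.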

i.

Jan 11 '07 #7

Sean Davis wrote:
> As for the specifics, Oracle data is going to be coming in as a DB-API
> 2 cursor in manageable chunks (and at a relatively slow pace). On the
> postgres loading side, I wanted to use the psycopg2 copy_from
> function, which expects an open file-like object (that has read and
> readline functionality) and is quite fast for loading data.

And which seems to use the COPY FROM command in PostgreSQL...

> Note the disconnect here--Oracle is coming in in discrete chunks, while
> postgresql is looking for a file object. I solved this problem by
> creating a temporary file as an intermediary, but why wait for Oracle
> to finish dumping data when I can potentially be loading into postgres
> at the same time that the data is coming in?
My experience along with the PostgreSQL documentation tells me that you
shouldn't worry about this problem too much: using COPY FROM is *far*
faster than performing many inserts or updates. The only thing you'd
need to worry about is data being copied into the database that
duplicates existing data, causing constraint violations, but since
you're already using this method I imagine that this is not a likely
problem.

Paul

Jan 11 '07 #8

Laurent Pointal wrote:
> Not so sure, there is low CPU in the Python script,

Yes.

> but there may be CPU+disk activity on the database sides [with
> cache management and other optimizations on disk access].

That's it. So data queues up on the database side and you won't get
much value from faked concurrency with CPU cycles.

> So, with a reader thread and a writer thread, he can have a select
> on a database performed in parallel with an insert on the other
> database.

Explain. Remember, threads aren't really working concurrently. Even
on a multiprocessor machine you have constraints for IO traffic.
(And the GIL exists too.)

> But, if it's only a do-once job, maybe the optimization is not
> really necessary.

I still don't understand how threads would help optimize a task
that largely depends on IO and will probably be executed on one
CPU.

Regards,
Björn

--
BOFH excuse #14:

sounds like a Windows problem, try calling Microsoft support

Jan 11 '07 #9

Sean Davis wrote:
> I solved this problem by creating a temporary file as an
> intermediary, but why wait for Oracle to finish dumping data when
> I can potentially be loading into postgres at the same time that
> the data is coming in? So, I am actually
> looking for a solution to this problem that doesn't require an
> intermediate file and allows simultaneous reading and writing,
> with the caveat that the data cannot all be read into memory
> simultaneously, so will need to be buffered.

The functions you use don't seem very suited for such a
streaming-like task.

Regards,
Björn

--
BOFH excuse #282:

High altitude condensation from U.S.A.F prototype aircraft has
contaminated the primary subnet mask. Turn off your computer for 9
days to avoid damaging it.

Jan 11 '07 #10

On 1/11/07, Dennis Lee Bieber <wl*****@ix.netcom.com> wrote:
> On 11 Jan 2007 04:49:21 -0800, "Sean Davis" <se******@gmail.com>
> declaimed the following in comp.lang.python:
>
>> The machine running the script is distinct from the Oracle machine
>> which is distinct from the Postgresql machine. So, CPU usage is low
>
> Which somewhat cancels the idea of just using pipes to pass data
> directly... Unless the command-line clients can be run "locally" and
> connect to the remote servers.
>
>> buffer. A Queue isn't quite right as it stands, as the data is coming
>> in as records, but for postgres loading, a file-like stream is what I
>> need, so there will need to be either a wrapper around the Queue on the
>
> "need"? No... the "copy ... from ..." statement "needs" such, but do
> you really "need" to use "copy ... from ..." to load the data -- or is
> this just a concept that caught your fancy?
>
> If you can get Oracle to dump the data in the form of SQL insert
> statements, maybe with some clean-up done by a script, and feed them to
> PostgreSQL, the entire job starts to look something like:
>
>     odump theDatabase | python cleanup.py | psql theNewDatabase
>
> {note: command line tool names are made up for the example}
>
> "copy ... from ..." appears to be designed to work with formatted
> text files... Something that might have been produced as a tabular
> report by almost any tool. As such, you are basically bypassing most of
> the DB-API capability. If Oracle has a "select" variant that can specify
> a start-offset and length (or a server-side cursor in the DB-API so the
> full data is not transferred in one "chunk"), and the PostgreSQL DB-API
> supports an "executemany" operation, the threading pseudocode I showed
> could still be used. Instead of queueing single records, one could queue
> a multiple-record "chunk" (a "fetchmany" specifying 100, 500, or
> whatever, records); the data would be a list of tuples, so no concerns
> about parsing fields from a text record. Then an "executemany" would
> submit the chunk at one time.
>
>> get() side. Or is there a better way to go about this detail? What
>> seems to make sense to me is to stringify the incoming oracle data into
>> some kind of buffer and then read on the postgresql side.
>
> My view: avoid using any commands that result in formatting the data
> as text reports... Use DB-API interfaces to keep the data as fields.
>
>     ocrsr.execute("select * from table")
>     while True:
>         batch = ocrsr.fetchmany(100)
>         if not batch:
>             break
>         pcrsr.executemany("insert into table values (%s, %s)", batch)
>
> {this is a sequential sample -- but using threads just means the ocrsr
> operations are in one thread, batch is the data relayed via a queue, and
> pcrsr operations are in the other thread}
> --
> Wulfraed Dennis Lee Bieber KD6MOG
> wl*****@ix.netcom.com wu******@bestiaria.com
> HTTP://wlfraed.home.netcom.com/
> (Bestiaria Support Staff: we******@bestiaria.com)
> HTTP://www.bestiaria.com/
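The threaded fetchmany/executemany variant described above can be made concrete; here is a runnable sketch in which two in-memory SQLite databases stand in for Oracle and PostgreSQL (the table, row count, and batch size are invented for the demo):

```python
import queue
import sqlite3
import threading

# Two in-memory databases stand in for Oracle (src) and PostgreSQL (dst).
src = sqlite3.connect(":memory:", check_same_thread=False)
dst = sqlite3.connect(":memory:", check_same_thread=False)
src.execute("create table t (id integer, name text)")
src.executemany("insert into t values (?, ?)",
                [(i, "row%d" % i) for i in range(250)])
dst.execute("create table t (id integer, name text)")

q = queue.Queue(maxsize=4)   # bounded: the reader blocks if the writer lags
DONE = object()

def reader():
    # Fetch multi-record chunks from the source and relay them via the queue.
    cur = src.cursor()
    cur.execute("select id, name from t order by id")
    while True:
        batch = cur.fetchmany(100)
        if not batch:
            break
        q.put(batch)
    q.put(DONE)

def writer():
    # Drain chunks from the queue and submit each with executemany.
    cur = dst.cursor()
    while True:
        batch = q.get()
        if batch is DONE:
            break
        cur.executemany("insert into t values (?, ?)", batch)
    dst.commit()

t_read = threading.Thread(target=reader)
t_write = threading.Thread(target=writer)
t_read.start(); t_write.start()
t_read.join(); t_write.join()
print(dst.execute("select count(*) from t").fetchone()[0])  # 250
```

Since the data stays as lists of tuples end to end, there is no text formatting or parsing anywhere in the relay, which is the point of the DB-API approach.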

Using the DB-API for this is the wrong approach. COPY FROM, which is the
PostgreSQL equivalent of Oracle's SQL*Loader, is orders of magnitude
faster than sequential inserts, and it's really what you want for bulk
data loads. What you're trying to do is read from and write to the file
at the same time, but across a network interface; that's just asking for
trouble. In particular, I doubt that postgres expects to have
something writing to the end of the file it is loading and will
probably error out if it hits an unexpected EOF, which is what will
happen if the writer lags behind.

Just dump the files from oracle, copy them, and load them. It'll
probably be faster in the long run when you count all the time you
lose re-starting the processing and troubleshooting your
"optimization".
Jan 11 '07 #11

Dennis Lee Bieber wrote:
> "need"? No... the "copy ... from ..." statement "needs" such, but do
> you really "need" to use "copy ... from ..." to load the data -- or is
> this just a concept that caught your fancy?
In chapter 13 of the PostgreSQL 8.1 documentation ("Performance Tips"),
descending into section 13.4 ("Populating a Database"), whose
introduction may not admittedly coincide precisely with the inquirer's
use case ("first populating a database" doesn't strictly apply, but it
may yet describe a situation that is close enough given the kind of
data involved), and within section 13.4.2 ("Use COPY"), the text reads
as follows:

"Use COPY to load all the rows in one command, instead of using a
series of INSERT commands."

http://www.postgresql.org/docs/8.1/s...LATE-COPY-FROM
> If you can get Oracle to dump the data in the form of SQL insert
> statements, maybe with some clean-up done by a script, and feed them to
> PostgreSQL the entire job starts to look something like:
>
>     odump theDatabase | python cleanup.py | psql theNewDatabase
The problem with this approach (as the manual notes) is that COPY is a
lot faster than lots of INSERT statements:

"Note that loading a large number of rows using COPY is almost always
faster than using INSERT, even if PREPARE is used and multiple
insertions are batched into a single transaction."

My experience with inserting large numbers of records into PostgreSQL
suggests that this advice should not be readily ignored.

Paul

Jan 11 '07 #12

Dennis Lee Bieber wrote:
> The problem though, is that the original poster claimed the Oracle
> data was being emitted in multiple chunks, not as a single output -- and
> they want to avoid collecting the data in a temporary file...
I think he wanted to know whether concurrent reads from Oracle and
writes to PostgreSQL would be faster or more efficient. I'd argue that
the temporary file approach is likely to be faster and more efficient
(disk space usage for temporary files disregarded). Sure, if you can
have two concurrent processes with minimal contention, one reading from
Oracle, one writing to PostgreSQL, where the PostgreSQL one never lags
behind and thus completes shortly after the Oracle data has been
completely downloaded, then that might be the ideal solution, but I'm
inclined to believe that unless the stream of data from Oracle were
arriving really slowly, it wouldn't work out that way.
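For comparison, the temporary-file approach being defended here is only a few lines. A sketch: spool the rows to a tab-separated temp file, then hand the rewound file to the loader (with psycopg2 the load stage would be a copy_from call on the file; the sample rows are made up):

```python
import csv
import tempfile

rows = [(1, "alpha"), (2, "beta")]      # stand-in for the Oracle cursor

with tempfile.TemporaryFile("w+", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    f.seek(0)                           # rewind before handing f to the loader
    dumped = f.read()                   # the loader would read from here

print(repr(dumped))  # '1\talpha\n2\tbeta\n'
```

The temp file decouples the two stages completely, so a slow PostgreSQL load can never stall the Oracle fetch, at the cost of disk space and total elapsed time.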

Anyway, that's what my experiences with PostgreSQL suggest so far.
Further commentary on the topic is, as noted, available in the manual.

Paul

Jan 12 '07 #13

This discussion thread is closed

Replies have been disabled for this discussion.