473,890 Members | 1,754 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Read from database, write to another database, simultaneously

I am working on a simple script to read from one database (oracle) and
write to another (postgresql). I retrieve the data from oracle in
chunks and drop the data to postgresql continuously. The author of one
of the python database clients mentioned that using one thread to
retrieve the data from the oracle database and another to insert the
data into postgresql with something like a pipe between the two threads
might make sense, keeping both IO streams busy. Any hints on how to
get started?

Thanks,
Sean

Jan 10 '07 #1
12 5478
Sean Davis wrote:
The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,
Björn

--
BOFH excuse #194:

We only support a 1200 bps connection.

Jan 10 '07 #2
Bjoern Schliessmann wrote:
Sean Davis wrote:
>The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.

IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.

Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.

Regards,
Björn
Sean you can't win - everyone has a different idea! You need to explain
that oracle has millions of records and it's possible to a pipe open to
feed the Postgres side.

One thing I didn't get - is this a one time transfer or something that is
going to happen often.

One time transfer live to the time issue.

Johnf
Jan 11 '07 #3
Bjoern Schliessmann a écrit :
Sean Davis wrote:
>The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.

IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.

Not so sure, there is low CPU in the Python script, but there may be
CPU+disk activity on the database sides [with cache management and other
optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on a
database performed in parallel with an insert on the other database.
After, he must know if the two databases use same disks, same
controller, same host... or not.

But, if its only a do-once job, maybe the optimization is net really
necessary.
Jan 11 '07 #4


On Jan 10, 9:27 pm, johnf <jfabi...@yolo. comwrote:
Bjoern Schliessmann wrote:
Sean Davis wrote:
The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.
Using multiplexing, you'll get good results with simple code without
the danger of deadlocks. Have a look at asyncore (standard library)
or the Twisted framework -- personally, I prefer the latter.
Regards,

Sean you can't win - everyone has a different idea! You need to explain
that oracle has millions of records and it's possible to a pipe open to
feed the Postgres side.

One thing I didn't get - is this a one time transfer or something that is
going to happen often.
Yes, some detail about the problem is definitely in order!

We have a collaborator that is going to maintain a large genome
database that is a component of a postgresql database that we currently
maintain. There are going to be consumers of the oracle data using
both mysql and postgresql. The oracle database is LARGE with around
100,000,000 rows spread over some 50-70 tables in multiple schemas.
The idea is that as publicly available data (from various datasources
on the web) become available, the oracle team will update the oracle
database, doing all the parsing and necessary data cleanup of the raw
data. We then want to be able to update postgres with these oracle
data. So the process may be done only once per month on some tables,
but as often as once a day on others.

As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the pscycopg2 copy_from
function, which expects an open file-like object (that has read and
readline functionality) and is quite fast for loading data. Note the
disconnect here--Oracle is coming in in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so will need to
be buffered.

I hope that clarifies things.

Thanks,
Sean

Jan 11 '07 #5


On Jan 11, 3:20 am, Laurent Pointal <laurent.poin.. .@limsi.frwrote :
Bjoern Schliessmann a écrit :
Sean Davis wrote:
The author of one of the python database clients mentioned that
using one thread to retrieve the data from the oracle database and
another to insert the data into postgresql with something like a
pipe between the two threads might make sense, keeping both IO
streams busy.
IMHO he's wrong. Network interaction is quite slow compared with CPU
performance, so there's no gain (maybe even overhead due to thread
management and locking stuff). That's true even on multiprocessor
machines, not only because there's almost nothing to compute but
only IO traffic. CMIIW.Not so sure, there is low CPU in the Python script, but there may be
CPU+disk activity on the database sides [with cache management and other
optimizations on disk access].
So, with a reader thread and a writer thread, he can have a select on a
database performed in parallel with an insert on the other database.
After, he must know if the two databases use same disks, same
controller, same host... or not.
Some more detail:

The machine running the script is distinct from the Oracle machine
which is distinct from the Postgresql machine. So, CPU usage is low
and because of the independent machines for the database end, it is
definitely possible to read from one database while writing to the
other. That is the solution that I am looking for, and Dennis's post
seems pretty close to what I need. I will have to use some kind of
buffer. A Queue isn't quite right as it stands, as the data is coming
in as records, but for postgres loading, a file-like stream is what I
need, so there will need to be either a wrapper around the Queue on the
get() side. Or is there a better way to go about this detail? What
seems to make sense to me is to stringify the incoming oracle data into
some kind of buffer and then read on the postgresql side.

Thanks,
Sean

Jan 11 '07 #6

Sean Davis wrote:
at the same time that the data is coming in? So, I am actually looking
for a solution to this problem that doesn't require an intermediate
file and allows simultaneous reading and writing, with the caveat that
the data cannot all be read into memory simultaneously, so will need to
be buffered.
IMO the problem that you need to solve is not well suited for the
python DBAPI as this API is meant to support programming and
interacting with the database not streaming large quantities from one
database into another.

I agree with another opinion in this thread.

All you need is a simple way to pipe the output from Oracle into
Postgresql. Just run the oracle client and start dumping to the
standard output. Pipe it through sed (or a python program) to reformat
the output for whatever minor fixes (you might not even need this step)
then continue piping it right into psql.

i.

Jan 11 '07 #7
Sean Davis wrote:
>
As for the specifics, Oracle data is going to be coming in as a DB-API
2 cursor in manageable chunks (and at a relatively slow pace). On the
postgres loading side, I wanted to use the pscycopg2 copy_from
function, which expects an open file-like object (that has read and
readline functionality) and is quite fast for loading data.
And which seems to use the COPY FROM command in PostgreSQL...
Note the disconnect here--Oracle is coming in in discrete chunks, while
postgresql is looking for a file object. I solved this problem by
creating a temporary file as an intermediary, but why wait for Oracle
to finish dumping data when I can potentially be loading into postgres
at the same time that the data is coming in?
My experience along with the PostgreSQL documentation tells me that you
shouldn't worry about this problem too much: using COPY FROM is *far*
faster than performing many inserts or updates. The only thing you'd
need to worry about is data being copied into the database that
duplicates existing data, causing constraint violations, but since
you're already using this method I imagine that this is not a likely
problem.

Paul

Jan 11 '07 #8
Laurent Pointal wrote:
Not so sure, there is low CPU in the Python script,
Yes.
but there may be CPU+disk activity on the database sides [with
cache management and other optimizations on disk access].
That's it. So data queues up on the database side and you won't get
much value from faked concurrency with CPU cycles.
So, with a reader thread and a writer thread, he can have a select
on a database performed in parallel with an insert on the other
database.
Explain. Remember, threads aren't really working concurrently. Even
on a multiprocessor machine you have constraints for IO traffic.
(And the GIL exists too)
But, if its only a do-once job, maybe the optimization is net
really necessary.
I still don't understand how threads would help optimizing a task
that largely depends on IO and will probably be executed on one
CPU.

Regards,
Björn

--
BOFH excuse #14:

sounds like a Windows problem, try calling Microsoft support

Jan 11 '07 #9
Sean Davis wrote:
I solved this problem by creating a temporary file as an
intermediary, but why wait for Oracle to finish dumping data when
I can potentially be loading into postgres at the same time that
the data is coming in? So, I am actually
looking for a solution to this problem that doesn't require an
intermediate file and allows simultaneous reading and writing,
with the caveat that the data cannot all be read into memory
simultaneously, so will need to be buffered.
The functions you use don't seem very suited for such a
streaminglike task.

Regards,
Björn

--
BOFH excuse #282:

High altitude condensation from U.S.A.F prototype aircraft has
contaminated the primary subnet mask. Turn off your computer for 9
days to avoid damaging it.

Jan 11 '07 #10

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
2424
by: Mudge | last post by:
Hi, My hosting provider only allows me to use 50 connections to my MySQL database that my Web site will use. I don't know what this 50 connections means exactly. Does this mean that only 50 visitors to my Web site can access my database through my Web site at one time? Or does this mean that in my code I can only use 50 connections? and like
11
3592
by: HolaGoogle | last post by:
hi all, can you please tell me what i should do to avoid session timeout when displaying my database info in my asp form (DisplayUserDatabase.asp)??? ** actualy it does load and display the database randomly....otherwise it's always runs into a session time out.... **Also when i save new account info (You'll notice that i've been asked to put a save button near each record...so, it's not one global save it's by record), i'd like to be...
7
12664
by: Graham Taylor | last post by:
I've tried posting this in the 'microsoft.public.access' but I will post it here also, as I think it might be the webserver which is causing my problem. --------- I have an Access 2003 database which is in the "fpdb" folder of my webserver. Its located there so that I can use asp to build a web-based front-end for users to read the database - http://www.nist.ac.th/maths/test1.asp The MBD file is edited using Access (2003) and opening...
1
6775
by: cnu | last post by:
My program generates a log file for every event that happens in the program. So, I open the file and keep it open till the end. This is how I open the file for writing: <CODE> public CLogHandler() { this.m_fsLog = new FileStream(strTodaysLogFile, System.IO.FileMode.Append, System.IO.FileAccess.Write, System.IO.FileShare.Read); this.m_swLog = new StreamWriter(this.m_fsLog);
6
2389
by: g | last post by:
Hi, I have an MS Access 2000 Database on a network. When ever 2 people use it at the same time it crashes. The database size is 14 megs. I have another database just like it...and it has not experienced any problems like this. It is about 5 MB. I don't think that size will matter? Anyhow the crashes occur usually when someone is using a form, and another person opends up a database and then in a given ammount of time closes it. Afterwords...
1
2556
by: Tito | last post by:
For an internet telephone application, I need to be able to read and write data to and from /dev/dsp simultaneously. I wrote some code and its not working. Anyone have any working code to do this? I am assuming my card is full duplex, it is a built-in sound card on a new dell 600m laptop, but I am not sure how to tell for sure. But I think the problem is not so much my sound card, but that I am making some fundamentally wrong...
35
4866
by: Terry Jolly | last post by:
Web Solution Goal: Have a global database connection Why: (There will be 30+ tables, represented by 30+ classes) I only want to reference the database connection once. I put the connection string in the web.config. I created a class with a static database connection and the class opens and closes the database.
4
2286
by: onecorp | last post by:
I have a SQL table comprised of 31 columns. The first column is simply an id column, the next 30 columns are labelled ,.... The numerical columns have a tinyint type and the data stored is either 1 or null. I wish to count the number of times a one appears in one column simultaneously with another column: eg count the number of times 1 appears in column and 1 also appears in column in the same row:
4
3162
by: farhaaad | last post by:
Hi all, i wanted to know how can i make my mdb file read only recommended, i mean when the user opens the database with holding the SHIFT key it shouldn't be read only and when opening the datababse whith the SHIFT key it should be readonly. i thank you for your tips in advance, farhad
0
9976
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
9815
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
11215
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...
1
10908
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
9618
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...
0
6035
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
4665
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
4259
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
3267
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.