Using python to delta-load files into a central DB

Hello everyone,

I have a challenging issue I need to overcome and was hoping I might gain
some insights from this group.

I am trying to speed up the process I am using, which is as follows:

1) I have roughly 700 files that are modified throughout the day by users,
within a separate application

2) As modifications are made to the files, I use a polling service and mimic
the lock-file strategy used by the separate software application

3) I generate a single 'load' file and bulk insert into a load table

4) I update/insert/delete from the load table
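The polling-and-load loop in steps 1-3 might be sketched roughly as below. This is a minimal illustration, not the poster's actual code: the directory, the `.lck` sidecar convention, and the file extension are all assumptions standing in for whatever the separate application really uses.

```python
import os

def is_locked(path):
    # Assumption: mimic the application's lock-file strategy via a
    # sidecar "<name>.lck" file that exists while a user holds the file.
    return os.path.exists(path + ".lck")

def poll_once(watch_dir, last_seen):
    """Return paths whose mtime changed since the previous pass.

    last_seen maps path -> mtime from the prior poll and is updated in
    place, so repeated calls report only new modifications.
    """
    changed = []
    for name in sorted(os.listdir(watch_dir)):
        if name.endswith(".lck"):
            continue                      # skip the lock files themselves
        path = os.path.join(watch_dir, name)
        if not os.path.isfile(path) or is_locked(path):
            continue                      # in use by the other application
        mtime = os.path.getmtime(path)
        if last_seen.get(path) != mtime:
            last_seen[path] = mtime
            changed.append(path)
    return changed
```

Each polling pass would then feed `changed` into the single 'load' file and bulk insert of step 3.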

This is just too time consuming, in my opinion.

At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

What I would like to do is spawn separate processes and load only the delta
data. The data must be 100% reliable, so I'm leery of using something like
difflib. I also want to make sure that my code scales since the number of
files is ever-increasing.

I would be grateful for any feedback you could provide.
Thank you,

Chris Nethery
Apr 12 '07 #1
On Thu, 12 Apr 2007 14:05:15 -0300, Chris Nethery <gi******@earthlink.net>
wrote:
> At present, users of the separate application can run recalculation
> functions that modify all 700 files at once, causing my code to take the
> whole ball of wax, rather than just the data that has changed.
Are they text files, or something else?
What kind of modifications? Some lines changed/deleted/added? A column
recalculated along the whole file?
> What I would like to do is spawn separate processes and load only the
> delta data. The data must be 100% reliable, so I'm leery of using
> something like difflib. I also want to make sure that my code scales
> since the number of files is ever-increasing.
Why don't you like difflib? AFAIK it has no known bugs.

--
Gabriel Genellina

Apr 13 '07 #2
Gabriel,

Thank you for your reply.

Yes, they are tab-delimited text files that will change very little
throughout the day.

But this is messy, antiquated '80s junk, nonetheless.

Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.

Generally rows will either be added or changed, but sometimes rows will be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows and
sometimes certain column values are recalculated.

Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.
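The comment-tagging step described here could be sketched as follows. The tuple layout and the `'COMMENT'` marker are assumptions for illustration; the real rows are tab-delimited with more columns.

```python
def tag_comment_rows(rows):
    """Attach the last-seen row-type identity to comment rows, and
    number the comments so their order survives the bulk insert.

    Each input row is (row_type, identity, text); comment rows are
    assumed to arrive with row_type == 'COMMENT' and identity None.
    Output rows gain a comment-sequence column (0 for non-comments).
    """
    tagged = []
    last_identity = None
    comment_seq = 0
    for row_type, identity, text in rows:
        if row_type == 'COMMENT':
            comment_seq += 1
            # Inherit the identity of the last occurring row type.
            tagged.append(('COMMENT', last_identity, comment_seq, text))
        else:
            last_identity = identity
            comment_seq = 0
            tagged.append((row_type, identity, 0, text))
    return tagged
```

The generated sequence column is what preserves comment ordering once the rows land in the load table.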
Thank you,

Chris Nethery

"Gabriel Genellina" <ga*******@yahoo.com.arwrote in message
news:ma***************************************@pyt hon.org...
En Thu, 12 Apr 2007 14:05:15 -0300, Chris Nethery <gi******@earthlink.net>
escribió:
>At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

Are they text files, or what?
What kind of modifications? some lines changed/deleted/added? a column
recalculated along the whole file?
>What I would like to do is spawn separate processes and load only the
delta
data. The data must be 100% reliable, so I'm leary of using something
like
difflib. I also want to make sure that my code scales since the number
of
files is ever-increasing.

Why don't you like difflib? AFAIK it has no known bugs.

--
Gabriel Genellina

Apr 13 '07 #3
On Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gi******@earthlink.net>
wrote:
> Yes, they are tab-delimited text files that will change very little
> throughout the day.
> But, this is messy, antiquated 80s junk, nonetheless.
Ugh...
> Rows are designated either by a row type or they contain a comment. Each
> row type has an identity value, but the 'comment' rows do not. The comment
> rows, however, are logically associated with the last occurring row type.
> When I generate my bulk insert file, I add the identity of the last
> occurring row type to the comment rows, and generate and populate an
> additional identity column in order to retain the order of the comments.
> Generally rows will either be added or changed, but sometimes rows will be
> removed. Typically, only 1-5 new rows will be added to a file in a given
> day, but users sometimes make manual corrections/deletions to older rows
> and sometimes certain column values are recalculated.
http://tgolden.sc.sabren.com/python/...r_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing all from scratch each time...
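That yesterday-copy comparison might look roughly like this. It assumes, as stated above, that both versions of a file fit in memory; the snapshot paths are placeholders.

```python
import difflib
import filecmp

def file_delta(old_path, new_path):
    """Classify line-level changes between yesterday's snapshot and
    today's file as database operations."""
    # filecmp.cmp with shallow=False compares file contents, so
    # unchanged files are skipped without any line-by-line diffing.
    if filecmp.cmp(old_path, new_path, shallow=False):
        return {'insert': [], 'delete': []}
    with open(old_path) as f:
        old_lines = f.readlines()
    with open(new_path) as f:
        new_lines = f.readlines()
    inserts, deletes = [], []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ('replace', 'delete'):
            deletes.extend(old_lines[i1:i2])   # rows to remove
        if op in ('replace', 'insert'):
            inserts.extend(new_lines[j1:j2])   # rows to add
    return {'insert': inserts, 'delete': deletes}
```

A changed line shows up in both lists (delete the old row, insert the new one), which maps naturally onto an update against the load table.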
> Did I mention that the header contains another implied hierarchy?
> Fortunately, I can just ignore it and strip it off.

Good - I imagine it's enough work as it is now...

--
Gabriel Genellina

Apr 13 '07 #4
Gabriel,

I think that would work well. Also, thank you for suggesting the use of
filecmp. I have never used this module, but it looks like a much better
solution than what I had been doing previously: using os.stat and performing
a DB lookup to verify that the filename and timestamp existed in a
'file update' table. Also, if the only limitation of difflib is that both
files must reside in memory, I should be fine; the largest of these files
is just over 200k, and if memory serves, I can't use more than 4MB anyway.
And if I spawn separate processes for generating the delta files, I should
be able to speed things up even more.
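For the separate-processes idea, fanning the per-file diffs out over a process pool is one shape it could take. The sketch below uses multiprocessing, which only entered the standard library in Python 2.6 (at the time of this thread the equivalent was the third-party "processing" package), and `diff_one_file` is a hypothetical stand-in for the real per-file delta routine.

```python
from multiprocessing import Pool

def diff_one_file(path):
    # Placeholder for the real per-file work: compare the file against
    # its snapshot copy and return the rows to insert/update/delete.
    return (path, [])

def diff_in_parallel(changed_paths, workers=4):
    """Run the per-file diffs in worker processes.

    The files are independent of one another, so the work partitions
    cleanly and should scale with the number of cores.
    """
    with Pool(processes=workers) as pool:
        return dict(pool.map(diff_one_file, changed_paths))
```

The parent process would then collect the per-file deltas and issue a single batch of database operations, keeping the DB write path serialized.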

Thanks again for your help!
Best Regards,

Christopher Nethery

"Gabriel Genellina" <ga*******@yahoo.com.arwrote in message
news:ma***************************************@pyt hon.org...
En Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gi******@earthlink.net>
escribió:
>Yes, they are tab-delimited text files that will change very little
throughout the day.
But, this is messy, antiquated 80s junk, nonetheless.

Ugh...
>Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The
comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.
Generally rows will either be added or changed, but sometimes rows will
be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows
and
sometimes certain column values are recalculated.

http://tgolden.sc.sabren.com/python/...r_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing all from scratch each time...
>Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.

good - I imagine it's enough work as it is now...

--
Gabriel Genellina

Apr 13 '07 #5
