Using python to delta-load files into a central DB

Hello everyone,

I have a challenging issue I need to overcome and was hoping I might gain
some insights from this group.

I am trying to speed up the process I am using, which is as follows:

1) I have roughly 700 files that are modified throughout the day by users,
within a separate application

2) As modifications are made to the files, I use a polling service and mimic
the lock-file strategy used by the separate software application

3) I generate a single 'load' file and bulk insert into a load table

4) I update/insert/delete from the load table
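The polling-and-load loop in steps 1-3 might be sketched roughly as below. This is a minimal illustration, not the poster's actual code: the directory, the `.lck` sidecar convention, and the file extension are all assumptions standing in for whatever the separate application really uses.

```python
import os

def is_locked(path):
    # Assumption: mimic the application's lock-file strategy via a
    # sidecar "<name>.lck" file that exists while a user holds the file.
    return os.path.exists(path + ".lck")

def poll_once(watch_dir, last_seen):
    """Return paths whose mtime changed since the previous pass.

    last_seen maps path -> mtime from the prior poll and is updated in
    place, so repeated calls report only new modifications.
    """
    changed = []
    for name in sorted(os.listdir(watch_dir)):
        if name.endswith(".lck"):
            continue                      # skip the lock files themselves
        path = os.path.join(watch_dir, name)
        if not os.path.isfile(path) or is_locked(path):
            continue                      # in use by the other application
        mtime = os.path.getmtime(path)
        if last_seen.get(path) != mtime:
            last_seen[path] = mtime
            changed.append(path)
    return changed
```

Each polling pass would then feed `changed` into the single 'load' file and bulk insert of step 3.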

This is just too time consuming, in my opinion.

At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

What I would like to do is spawn separate processes and load only the delta
data. The data must be 100% reliable, so I'm leery of using something like
difflib. I also want to make sure that my code scales since the number of
files is ever-increasing.

I would be grateful for any feedback you could provide.
Thank you,

Chris Nethery
Apr 12 '07 #1
On Thu, 12 Apr 2007 14:05:15 -0300, Chris Nethery <gi******@earthlink.net>
wrote:
> At present, users of the separate application can run recalculation
> functions that modify all 700 files at once, causing my code to take the
> whole ball of wax, rather than just the data that has changed.
Are they text files, or something else?
What kind of modifications? Some lines changed/deleted/added? A column
recalculated along the whole file?
> What I would like to do is spawn separate processes and load only the
> delta data. The data must be 100% reliable, so I'm leery of using
> something like difflib. I also want to make sure that my code scales
> since the number of files is ever-increasing.
Why don't you like difflib? AFAIK it has no known bugs.

--
Gabriel Genellina

Apr 13 '07 #2
Gabriel,

Thank you for your reply.

Yes, they are tab-delimited text files that will change very little
throughout the day.

But this is messy, antiquated '80s junk, nonetheless.

Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.

Generally rows will either be added or changed, but sometimes rows will be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows and
sometimes certain column values are recalculated.

Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.
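The comment-tagging step described here could be sketched as follows. The tuple layout and the `'COMMENT'` marker are assumptions for illustration; the real rows are tab-delimited with more columns.

```python
def tag_comment_rows(rows):
    """Attach the last-seen row-type identity to comment rows, and
    number the comments so their order survives the bulk insert.

    Each input row is (row_type, identity, text); comment rows are
    assumed to arrive with row_type == 'COMMENT' and identity None.
    Output rows gain a comment-sequence column (0 for non-comments).
    """
    tagged = []
    last_identity = None
    comment_seq = 0
    for row_type, identity, text in rows:
        if row_type == 'COMMENT':
            comment_seq += 1
            # Inherit the identity of the last occurring row type.
            tagged.append(('COMMENT', last_identity, comment_seq, text))
        else:
            last_identity = identity
            comment_seq = 0
            tagged.append((row_type, identity, 0, text))
    return tagged
```

The generated sequence column is what preserves comment ordering once the rows land in the load table.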
Thank you,

Chris Nethery

"Gabriel Genellina" <ga*******@yahoo.com.arwrote in message
news:ma***************************************@pyt hon.org...
En Thu, 12 Apr 2007 14:05:15 -0300, Chris Nethery <gi******@earthlink.net>
escribió:
>At present, users of the separate application can run recalculation
functions that modify all 700 files at once, causing my code to take the
whole ball of wax, rather than just the data that has changed.

Are they text files, or what?
What kind of modifications? some lines changed/deleted/added? a column
recalculated along the whole file?
>What I would like to do is spawn separate processes and load only the
delta
data. The data must be 100% reliable, so I'm leary of using something
like
difflib. I also want to make sure that my code scales since the number
of
files is ever-increasing.

Why don't you like difflib? AFAIK it has no known bugs.

--
Gabriel Genellina

Apr 13 '07 #3
On Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gi******@earthlink.net>
wrote:
> Yes, they are tab-delimited text files that will change very little
> throughout the day.
> But, this is messy, antiquated 80s junk, nonetheless.
Ugh...
> Rows are designated either by a row type or they contain a comment. Each
> row type has an identity value, but the 'comment' rows do not. The comment
> rows, however, are logically associated with the last occurring row type.
> When I generate my bulk insert file, I add the identity of the last
> occurring row type to the comment rows, and generate and populate an
> additional identity column in order to retain the order of the comments.
> Generally rows will either be added or changed, but sometimes rows will be
> removed. Typically, only 1-5 new rows will be added to a file in a given
> day, but users sometimes make manual corrections/deletions to older rows
> and sometimes certain column values are recalculated.
http://tgolden.sc.sabren.com/python/...r_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing all from scratch each time...
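That yesterday-copy comparison might look roughly like this. It assumes, as stated above, that both versions of a file fit in memory; the snapshot paths are placeholders.

```python
import difflib
import filecmp

def file_delta(old_path, new_path):
    """Classify line-level changes between yesterday's snapshot and
    today's file as database operations."""
    # filecmp.cmp with shallow=False compares file contents, so
    # unchanged files are skipped without any line-by-line diffing.
    if filecmp.cmp(old_path, new_path, shallow=False):
        return {'insert': [], 'delete': []}
    with open(old_path) as f:
        old_lines = f.readlines()
    with open(new_path) as f:
        new_lines = f.readlines()
    inserts, deletes = [], []
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ('replace', 'delete'):
            deletes.extend(old_lines[i1:i2])   # rows to remove
        if op in ('replace', 'insert'):
            inserts.extend(new_lines[j1:j2])   # rows to add
    return {'insert': inserts, 'delete': deletes}
```

A changed line shows up in both lists (delete the old row, insert the new one), which maps naturally onto an update against the load table.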
> Did I mention that the header contains another implied hierarchy?
> Fortunately, I can just ignore it and strip it off.

Good - I imagine it's enough work as it is now...

--
Gabriel Genellina

Apr 13 '07 #4
Gabriel,

I think that would work well. Also, thank you for suggesting the use of
filecmp. I have never used this module, but it looks like a much better
solution than what I had been doing previously: using os.stat and performing
a DB lookup to verify that the filename and timestamp existed in a
'file update' table. Also, if the only limitation of difflib is that both
files must reside in memory, I should be fine; the largest of these files
is just over 200k, and if memory serves, I can't use more than 4MB anyway.
And if I spawn separate processes for generating the delta files, I should
be able to speed things up even more.
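For the separate-processes idea, fanning the per-file diffs out over a process pool is one shape it could take. The sketch below uses multiprocessing, which only entered the standard library in Python 2.6 (at the time of this thread the equivalent was the third-party "processing" package), and `diff_one_file` is a hypothetical stand-in for the real per-file delta routine.

```python
from multiprocessing import Pool

def diff_one_file(path):
    # Placeholder for the real per-file work: compare the file against
    # its snapshot copy and return the rows to insert/update/delete.
    return (path, [])

def diff_in_parallel(changed_paths, workers=4):
    """Run the per-file diffs in worker processes.

    The files are independent of one another, so the work partitions
    cleanly and should scale with the number of cores.
    """
    with Pool(processes=workers) as pool:
        return dict(pool.map(diff_one_file, changed_paths))
```

The parent process would then collect the per-file deltas and issue a single batch of database operations, keeping the DB write path serialized.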

Thanks again for your help!
Best Regards,

Christopher Nethery

"Gabriel Genellina" <ga*******@yahoo.com.arwrote in message
news:ma***************************************@pyt hon.org...
En Thu, 12 Apr 2007 23:51:22 -0300, Chris Nethery <gi******@earthlink.net>
escribió:
>Yes, they are tab-delimited text files that will change very little
throughout the day.
But, this is messy, antiquated 80s junk, nonetheless.

Ugh...
>Rows are designated either by a row type or they contain a comment. Each
row type has an identity value, but the 'comment' rows do not. The
comment
rows, however, are logically associated with the last occurring row type.
When I generate my bulk insert file, I add the identity of the last
occurring row type to the comment rows, and generate and populate an
additional identity column in order to retain the order of the comments.
Generally rows will either be added or changed, but sometimes rows will
be
removed. Typically, only 1-5 new rows will be added to a file in a given
day, but users sometimes make manual corrections/deletions to older rows
and
sometimes certain column values are recalculated.

http://tgolden.sc.sabren.com/python/...r_changes.html

You could keep a copy of all files - let's say, as they were yesterday.
When you want to process the changes, iterate over all files and see if
they are newer than your copy. You could use the filecmp module:
http://docs.python.org/lib/module-filecmp.html
For each modified file: load it, and process the comments adding the
associated row type and the identity. Just do the same with the
"yesterday" file. (I assume they're not so big that you can keep both in
memory). You have then two lists of lines; then, use the functions in
module difflib to detect the changed lines; based on those results,
generate your database inserts/deletes/updates.

This way you will not process the unchanged files, and inside each file,
you will ignore unchanged lines. At least in principle it should be faster
than redoing all from scratch each time...
>Did I mention that the header contains another implied hierarchy?
Fortunately, I can just ignore it and strip it off.

good - I imagine it's enough work as it is now...

--
Gabriel Genellina

Apr 13 '07 #5
