Join Bytes! (Alex) wrote in message news:<b270c055.0312050946.59a179c0@posting.google.com>...
[color=blue]
> I suppose what i'm asking is what would be the quickest way of getting
> the data into the database. Stick with perl/DBI ?, java prog to
> process the input (doesn't feel as if this would be the quickest way
> of doing things) piping a data file through a db2 cli interface? or
> something else?[/color]
I'd recommend a typical warehouse etl solution. However, rather than
go the route of commercial etl tools (especially for such a small
project), I'd implement with simple commodity components:
- sftp the gzipped files to the warehouse server
- transform the files from extract to load images using an
easily-maintained
language such as python (or whatever works best for you)
- use the db2 load utility for loading into base fact tables
- archive both extract and load files in a gzipped format in case you
want to
recover or change your data model.
If you go this route then there are only two challenges initially:
- learning the ins & outs of the db2 load utility
- process management
The load utility isn't that bad - just need to get familiar with how
it locks the table, load recovery, etc. And it's very fast - a small
box using load can easily hit 20,000 rows / sec - far faster than
anything you could do with java, perl, etc.
Process management is trickier. But if you only need to do loads once
a day it isn't too tough. I normally just keep the extract,
transport, transform, and load processes completely separate - each
run once a minute via cron, each ensuring that only a single copy is
running, and each interfacing to the other processes only via a flat
file. Need to ensure that files aren't picked up until they're
complete - so a signally file, or a file rename should be performed at
the conclusion of a successful process.
However, an initial implemention (given just 5 files a day) could be
far simplier - based on a static schedule, and an alerting process in
the event that those files aren't available when the downstream
processes get kicked off. Developed in this fashion, a simple log
fact table solution could be developed in 1-3 days, and then easily
enhanced later if you wanted.
ken