Bytes IT Community

Import/Normalize strategy - More issues - New thread

I'm restarting this thread with a different focus.

The project I'm working on now is coming along and will be made to work, and
it's too late to start over with a new strategy. Still, I'm not coming to a
firm conclusion over whether it was the better approach, and wonder if I
should do it differently the next time I'm faced with a similar issue.

I needed an app to automatically import from spreadsheets with a semi-dynamic
structure, and I needed to handle 2 different, but similar spreadsheet formats
implying different normalized target structures.

The obvious part of the design, which is what I did do and would do again, is
to create a set of data dictionary tables that describe the target tables and
fields, and how they extend from a central table representing the fixed
import->row mapped part. Then, I created a rules table that users can update
over time to drive the importer correctly, telling it the meanings of new or
differently named columns.
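
For anyone curious, here's roughly the shape of that dictionary/rules schema.
This is only a sketch - the table and field names are invented for
illustration, not my actual schema:

Sub CreateImportDictionarySketch()
    ' Sketch only: invented names, roughly the shape of the dictionary
    ' and rules tables described above.
    Dim db As DAO.Database
    Set db = CurrentDb

    ' One row per target table, noting how it extends the central
    ' row-mapped table.
    db.Execute "CREATE TABLE tblDictTable (" & _
        "TableID AUTOINCREMENT CONSTRAINT pkDictTable PRIMARY KEY, " & _
        "TableName TEXT(64), " & _
        "ExtendsCentral YESNO)", dbFailOnError

    ' One row per target field, with its owning table and destination type.
    db.Execute "CREATE TABLE tblDictField (" & _
        "FieldID AUTOINCREMENT CONSTRAINT pkDictField PRIMARY KEY, " & _
        "TableID LONG, " & _
        "FieldName TEXT(64), " & _
        "TargetType TEXT(20))", dbFailOnError

    ' User-maintainable rules: which spreadsheet column heading feeds which
    ' field, and what text stands in for Null this month.
    db.Execute "CREATE TABLE tblImportRule (" & _
        "RuleID AUTOINCREMENT CONSTRAINT pkImportRule PRIMARY KEY, " & _
        "ColumnHeading TEXT(255), " & _
        "FieldID LONG, " & _
        "NullAlias TEXT(20))", dbFailOnError
End Sub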

Now, the part I'm not so certain about, and the real question I'm mulling over
is whether I was right to design this as query driven or whether it should
have been loop driven. Today, during debugging, I realized a whole new can of
worms I opened when I chose to do it primarily by building and running dynamic
insert queries rather than looping.
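
By "query driven" I mean roughly the following: assemble the column list and
expressions from the rules table and run one big INSERT. This is only a sketch
with the same invented names as above, not my actual code:

Sub RunDynamicImportSketch(strStagingTable As String, strTargetTable As String)
    ' Sketch of the query-driven approach: build one INSERT..SELECT from the
    ' rules table and let the database engine do the whole transformation.
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim strFields As String
    Dim strExprs As String

    Set db = CurrentDb
    Set rs = db.OpenRecordset( _
        "SELECT f.FieldName, r.ColumnHeading, r.NullAlias " & _
        "FROM tblImportRule AS r INNER JOIN tblDictField AS f " & _
        "ON r.FieldID = f.FieldID", dbOpenSnapshot)

    Do Until rs.EOF
        strFields = strFields & ", [" & rs!FieldName & "]"
        ' Convert the current Null alias ('NA', '-', ...) to a real Null.
        strExprs = strExprs & ", IIf([" & rs!ColumnHeading & "]='" & _
            rs!NullAlias & "', Null, [" & rs!ColumnHeading & "])"
        rs.MoveNext
    Loop
    rs.Close

    db.Execute "INSERT INTO [" & strTargetTable & "] (" & Mid(strFields, 3) & _
        ") SELECT " & Mid(strExprs, 3) & " FROM [" & strStagingTable & "]", _
        dbFailOnError
End Sub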

So - I already brought up the issue that the pure query approach didn't nicely
handle our strange case of some columns with long strings of delimited entries
that needed to be split out into 1-M branches, and we've covered that somewhat
in the other thread. Today, though, I realized the other issue, which is
error handling.

When the new data doesn't conform to the old rules, you have to fix the rules
in the rules table, but you often discover the broken rules only when the
import fails (such as when the alias for Null in the input is now a dash
instead of "NA"), and the error description and code information aren't
specific enough to determine which column and which row or rows had what
problem!
So as I'm testing it myself, I have to debug the code, rip out the offending
SQL, test it in vitro, and diagnose the problem. That's not hard for -me-,
but users can't do that every month when they run an import! So, with the
approach I took on this, it looks like I'll have to have my code generate even
more queries and pre-checks to catch and meaningfully report error conditions
before running the actual import queries, and hope I guess all the common,
general cases correctly.
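
By pre-checks I mean generated checks along these lines, one per column, run
before the import query so the report can say which column and which values
are the problem. Again, just a sketch with invented names:

Function ListBadValues(strStagingTable As String, strColumn As String, _
                       strTargetType As String) As String
    ' Sketch of a generated pre-check: list the distinct values in a source
    ' column that won't convert to the destination type.  The real version
    ' would be driven by the dictionary tables.
    Dim rs As DAO.Recordset
    Dim strTest As String
    Dim strResult As String

    Select Case strTargetType
        Case "LONG", "DOUBLE", "CURRENCY"
            strTest = "Not IsNumeric([" & strColumn & "])"
        Case "DATETIME"
            strTest = "Not IsDate([" & strColumn & "])"
        Case Else
            Exit Function   ' text columns: nothing to check here
    End Select

    Set rs = CurrentDb.OpenRecordset( _
        "SELECT DISTINCT [" & strColumn & "] FROM [" & strStagingTable & "] " & _
        "WHERE [" & strColumn & "] Is Not Null AND " & strTest, dbOpenSnapshot)

    Do Until rs.EOF
        strResult = strResult & "'" & rs.Fields(0).Value & "' "
        rs.MoveNext
    Loop
    rs.Close

    ListBadValues = strResult   ' empty string means the column looks clean
End Function

Anything that comes back non-empty would get reported to the user before the
real import query ever runs.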

Based on this issue, the cards seem to be stacking more in favor of a simple,
loop-based system if I ever do it again.

Now, I wouldn't bother bringing this issue to this forum if I didn't still
have a nagging discomfort with the conclusion. After all, if I looked at
someone else's code for this kind of system, and saw nested loops, my thought
would usually be that the programmer was incompetent at database programming.
After all, every database programmer knows you try to let the database engine
handle most all data transformations.

So, am I questioning my own conclusions simply because I've grown accustomed
to thinking there's only one right way to handle relational data operations,
or is there some real reason for my discomfort that I have simply forgotten
with my conscious mind because it's so long since I've attempted to do
anything like this the "wrong" way with loops?

Your thoughts?
Nov 12 '05 #1
4 Replies


Steve Jorgensen wrote:
[...]
After all, every database programmer knows you try to let the database engine
handle most all data transformations.
I like forms to handle my data transformations. Lots of good validation.

You have to look at what you are doing. Instead of doing validation one record
at a time like a form, you are performing a batch update. That's nice if you have
clean data or are changing all dates from today to tomorrow. But you have dirty
data....perhaps lots of it.

I further doubt that if any of us were to follow you and maintain what you've
done we'd say "Damn, look at what Steve did, he used a loop! Can you believe he
ever helped others on CDMA?"
So, am I questioning my own conclusions simply because I've grown accustomed
to thinking there's only one right way to handle relational data operations,
or is there some real reason for my discomfort that I have simply forgotten
with my conscious mind because it's so long since I've attempted to do
anything like this the "wrong" way with loops?

Your thoughts?


In the old days you might perform a validation check first. If the data was
dirty, the entire job was rejected....to be resubmitted when the data was clean.
That is my first impression. It appears there are few validation checks in
your Excel sheets when data is entered, and you are discovering the errors during
the import process. Can't Excel be set up to have better data entry checks?

In your case you may be better off appending the records to a temp table and
validating each record. Tag those records that pass and those that fail. You
would export the bad rows back to Excel. If the entire set needs to be clean,
tell them the import process in Access failed. You could even provide them
with a report of all records that failed and the reason.
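
Something along these lines - just a sketch, with made-up table and field
names:

Sub TagBadRecords()
    ' Loop the temp table, tag each record pass/fail and note the reason,
    ' so the failure report can be specific.
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset("tblImportTemp", dbOpenDynaset)

    Do Until rs.EOF
        rs.Edit
        If IsNull(rs!CustomerName) Then
            rs!ImportOK = False
            rs!FailReason = "CustomerName is blank"
        ElseIf Not IsNumeric(rs!Amount & "") Then
            rs!ImportOK = False
            rs!FailReason = "Amount is not numeric: " & rs!Amount
        Else
            rs!ImportOK = True
            rs!FailReason = Null
        End If
        rs.Update
        rs.MoveNext
    Loop
    rs.Close
End Sub

The tagged rows then drive the report back to the user, or the export of the
bad rows to Excel.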

I would think that looping records gives you the control you want and need. You
can add your validation checks easily and they can be modified easily. Once I
have a complex query set up I hate to go back and modify the thing.

In your case I would be more concerned about accuracy than the speed of the
process.....or what some non-existent programmer is going to think of you years
from now because you opted to use a loop. Once he futzes around with some queries
he may think "Steve was a genius!"

One last thing. If you do need to use queries, you can have your queries perform
validation checks via functions. You can pass your fields to the function(s) and
return true/false. The only negative I've encountered using functions is that
the datasheets have a slower repaint.
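
For example, something like this - a sketch, the field and the rule are made
up:

Public Function IsValidScore(varScore As Variant) As Boolean
    ' A function the query calls per row.  Null is allowed; anything else
    ' must be a number between 1 and 10.
    If IsNull(varScore) Then
        IsValidScore = True
    ElseIf IsNumeric(varScore) Then
        IsValidScore = (CDbl(varScore) >= 1 And CDbl(varScore) <= 10)
    Else
        IsValidScore = False
    End If
End Function

Then a query like "SELECT * FROM tblImportTemp WHERE IsValidScore([Score]) =
False" pulls up the failing rows.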

One other thing. Are you importing from Excel or linking? If linking, you could
query the sheet on your bad data. If you have the file name, I'd think you could
set up a temp link and run some queries to pinpoint the problem to the user. Or
do the updates right then and there. Ex: Anything not between 1 and 10 is NA.
Any rows where the column is blank are not updated to Unknown. Once you do the
update queries on the linked spreadsheet, import it and kill the link.
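
In rough form, something like this - a sketch, where the path, sheet and field
names are made up:

Sub CleanLinkedSheetSketch()
    ' Temporary link to the workbook, pinpoint or fix the bad data with
    ' queries, then import and kill the link.
    Dim db As DAO.Database
    Set db = CurrentDb

    DoCmd.TransferSpreadsheet acLink, acSpreadsheetTypeExcel9, _
        "lnkRawSheet", "C:\Imports\NewData.xls", True

    ' Pinpoint the problem rows for the user...
    Debug.Print DCount("*", "lnkRawSheet", "[Score] < 1 Or [Score] > 10") & _
        " rows have an out-of-range Score"

    ' ...or just do the update right then and there.
    db.Execute "UPDATE lnkRawSheet SET [Score] = Null " & _
        "WHERE [Score] < 1 OR [Score] > 10", dbFailOnError

    ' Import the cleaned rows, then kill the link.
    db.Execute "INSERT INTO tblImportTemp SELECT * FROM lnkRawSheet", dbFailOnError
    DoCmd.DeleteObject acTable, "lnkRawSheet"
End Sub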


Nov 12 '05 #2

rkc
Steve Jorgensen wrote:
Your thoughts?


Ditch the static notions of competent and incompetent and base your
decisions on an analysis of the problem at hand.

Sounds to me like you made up your mind how you were going to approach
the solution before you looked at the problem.
Nov 12 '05 #3

Steve Jorgensen <no****@nospam.nospam> wrote in
news:83********************************@4ax.com:
So, am I questioning my own conclusions simply because I've grown
accustomed to thinking there's only one right way to handle
relational data operations, or is there some real reason for my
discomfort that I have simply forgotten with my conscious mind
because it's so long since I've attempted to do anything like this
the "wrong" way with loops.


Well, there's nothing wrong with looping where it is appropriate to
the task. It's just that many people use it when it's not, because
it's the only way they can think of (usually because they're coming
from procedural database environments like xBase).

That said, I still think you should stay with your present method
and simply insert a validation stage before the actual import.

I designed an application for a client who was importing data
exported from the UNOS organ transplant database. I did it like
this:

1. link to the text file.

2. check a few things in it to see that it's valid (has the expected
column names and so forth).

3. if it passes the structural tests of #2, import the text file
into the buffer import table.

4. now run tests on the data in the fields that are problematic.

5. create discrepancy report records for all the failures.

6. import the data and retain the attachment to the discrepancy
report record.

7. the user can then go through all the discrepancy records and fix
the problems manually.
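
In skeleton form it amounted to something like this (a from-memory sketch; the
linked-table name and the three helper routines are placeholders, not the real
objects from that application):

Sub ImportExtractSketch(strFile As String)
    ' Skeleton of the steps above.  StructureLooksRight,
    ' CreateDiscrepancyRecords and ImportBufferedRows are placeholder names
    ' for routines not shown here.

    ' 1. link to the text file.
    DoCmd.TransferText acLinkDelim, , "lnkExtract", strFile, True

    ' 2. structural tests: expected column names and so forth.
    If StructureLooksRight("lnkExtract") Then

        ' 3. import into the buffer table.
        CurrentDb.Execute "INSERT INTO tblImportBuffer SELECT * FROM lnkExtract", _
            dbFailOnError

        ' 4./5. test the problematic fields and write discrepancy records.
        CreateDiscrepancyRecords "tblImportBuffer"

        ' 6./7. import, keeping each row attached to its discrepancy records
        ' so the user can work through and resolve them.
        ImportBufferedRows "tblImportBuffer"
    End If

    DoCmd.DeleteObject acTable, "lnkExtract"
End Sub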

Actually, this was a system where the client's application was
capturing some data that overlapped with the imported data (to
insure accuracy of some crucial fields) and many of the
discrepancies were between the two data sources -- one of the most
common such discrepancies was reported date of birth! The
discrepancy resolution system included a system for creating an
email to the data supervisor of the institution where the data came
from so they could check their records and provide a definitive
answer as to the correct value.

Maybe that gives you some ideas.

I can't remember if I built in a threshold for number of
discrepancies that would abort the import if, say 90% of the records
had a discrepancy in a single field. But you could consider such a
mechanism.

All that said, if I were in your position, I'd not do anything of
the sort. If you're chasing a moving target, I don't know that
there's any way to ever program for anything, so it's probably
cheaper to fix problems as they occur than it is to attempt to build
a system that will accommodate the errors.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #4

On Sat, 31 Jan 2004 16:55:28 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Steve Jorgensen <no****@nospam.nospam> wrote in
news:83********************************@4ax.com :
So, am I questioning my own conclusions simply because I've grown
accustomed to thinking there's only one right way to handle
relational data operations, or is there some real reason for my
discomfort that I have simply forgotten with my conscious mind
because it's so long since I've attempted to do anything like this
the "wrong" way with loops.
Well, there's nothing wrong with looping where it is appropriate to
the task. It's just that many people use it when it's not, because
it's the only way they can think of (usually because they're coming
from procedural database environments like xBase).

That said, I still think you should stay with your present method
and simply insert a validation stage before the actual import.


Well, for the current project, certainly. I certainly can't get a budget to
code 1/2 the app a second time!
I designed an application for a client who was importing data
exported from the UNOS organ transplant database. I did it like
this:

1. link to the text file.

2. check a few things in it to see that it's valid (has the expected
column names and so forth).
Right, in my case, that's making sure every column has exactly one rule that
matches on its name.
3. if it passes the structural tests of #2, import the text file
into the buffer import table.
Yup. First, I copy the spreadsheet, check how many header rows there are,
mush them together, insert a row under the header with "!!!" in each cell, so
Access won't try and fail to treat columns as something other than text, then
create a temporary staging database, import the sheet copy into there, and
link from my importer app to the staging database. After that, I do some
basic pre-cleanup like changing blank strings to Nulls, changing Null alias
text to Nulls, etc.
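
The pre-cleanup amounts to generated update queries along these lines, one per
column (a sketch; the real column names and aliases come out of the rules):

Sub PreCleanColumn(strStagingTable As String, strColumn As String, _
                   strNullAlias As String)
    ' Sketch of the pre-cleanup pass - the real version is generated for
    ' every column from the dictionary/rules tables.
    Dim db As DAO.Database
    Set db = CurrentDb

    ' Blank strings become Nulls.
    db.Execute "UPDATE [" & strStagingTable & "] " & _
        "SET [" & strColumn & "] = Null " & _
        "WHERE [" & strColumn & "] = ''", dbFailOnError

    ' The current Null alias text ('NA', '-', ...) becomes Null too.
    db.Execute "UPDATE [" & strStagingTable & "] " & _
        "SET [" & strColumn & "] = Null " & _
        "WHERE [" & strColumn & "] = '" & strNullAlias & "'", dbFailOnError
End Sub
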
4. now run tests on the data in the fields that are problematic.
Here's the fun part, eh? One example of an issue I had was that a field with
mostly numeric data had a special code in just 3 rows. Until now, I had been
making no attempt to validate type conversions in columns prior to import
because the import query does type conversions, and I'd been assuming I would
just trap errors there. Obviously, that idea should have been tested in a
prototype. Since this system is table-driven, I guess my code will have to
inspect each destination column type, then run a check on the source column
that will feed it, applying the same transformation expression that will be
used in the import stage, and see if there's a problem.
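
The per-column check will probably look something like this - a row-by-row
sketch with invented names; a loop happens to be the easy way to trap the
conversion errors and report the exact rows:

Sub CheckColumnConversion(strStagingTable As String, strColumn As String, _
                          strTargetType As String)
    ' Try the same conversion the import query will apply, row by row, and
    ' report exactly which rows fail.  The destination type really comes
    ' from the dictionary.
    Dim rs As DAO.Recordset
    Dim lngRow As Long
    Dim varConverted As Variant

    Set rs = CurrentDb.OpenRecordset( _
        "SELECT [" & strColumn & "] FROM [" & strStagingTable & "]", dbOpenSnapshot)

    Do Until rs.EOF
        lngRow = lngRow + 1
        If Not IsNull(rs.Fields(0).Value) Then
            On Error Resume Next
            Select Case strTargetType
                Case "LONG"
                    varConverted = CLng(rs.Fields(0).Value)
                Case "DOUBLE"
                    varConverted = CDbl(rs.Fields(0).Value)
                Case "DATETIME"
                    varConverted = CDate(rs.Fields(0).Value)
            End Select
            If Err.Number <> 0 Then
                Debug.Print strColumn & ", row " & lngRow & ": can't convert '" & _
                    rs.Fields(0).Value & "' to " & strTargetType
                Err.Clear
            End If
            On Error GoTo 0
        End If
        rs.MoveNext
    Loop
    rs.Close
End Sub
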
5. create discrepancy report records for all the failures.
Clearly
6. import the data and retain the attachment to the discrepancy
report record.
Actually, we'll want to fail if there are any discrepancies, and have the user
fix the import rules and/or target schema to eliminate the discrepancies.
Apply, lather, rinse, repeat.
7. the user can then go through all the discrepancy records and fix
the problems manually.
We keep the old export files, and it's nice if the process is repeatable in
case we find any important errors or omissions in the import process later.
Actually, this was a system where the client's application was
capturing some data that overlapped with the imported data (to
insure accuracy of some crucial fields) and many of the
discrepancies were between the two data sources -- one of the most
common such discrepancies was reported data of birth! The
discrepancy resolution system included a system for creating an
email to the data supervisor of the institution where the data came
from so they could check their records and provide a definitive
answer as to the correct value.

Maybe that gives you some ideas.

I can't remember if I built in a threshold for number of
discrepancies that would abort the import if, say 90% of the records
had a discrepancy in a single field. But you could consider such a
mechanism.

All that said, if I were in your position, I'd not do anything of
the sort. If you're chasing a moving target, I don't know that
there's any way to ever program for anything, so it's probably
cheaper to fix problems as they occur than it is to attempt to build
a system that will accomodate the errors.


Well, what I did was try to make a system that could handle all expected
categories of discrepancies by changing only the schema, the dictionary data,
and the rules, without needing to deal with the VBA code. I think, except for
validation, I've succeeded pretty well. I've batch imported a sequence of 4
months so far, and that included a month in which they changed the alias for
Null from "NA" to "-" and changed a few column names. There are bigger
changes in later data, but the way my code is structured, if the known cases
of these were going to be problematic, my code would not be working now, so
I'm pretty confident.
Nov 12 '05 #5
