Bytes IT Community

Import/Normalize approach - column-DML, or loop

Hi all,

I've just finished almost all of what has turned out to be a real bear of a
project. It has to import data from a monthly spreadsheet export from another
program, and convert that into normalized data. The task is made more
difficult by the fact that the structure itself can vary from month to month
(in well defined ways).

So, I used the SQL-centric approach, taking vertical stripes at a time so
that, for instance, for each field with simple, repeating text data, I make a
group-by pass, inserting into the destination lookup table, then I do another
query, joining from the input table to the text fields in the lookups to get
the foreign keys for the main destination table. When I have multiple columns
that need to become 1-M, I make a pass for each of those columns, inserting
lookup record that identifies the split, then inserting the rows into the
many-side table, yadda, yadda, yadda.

All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query to
pull out, in successive query slices, argument 1, argument 2, etc. I was
going to use a where clause to exclude null results (input doesn't have
arguments n or above), and quit after the first pass with .RecordsAffected =
0.
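The plan, sketched in Python (the get_arg function is a stand-in for the UDF, which isn't shown here, and the "/" delimiter is an assumption):

```python
# Sketch of the successive-slice plan: each pass extracts argument n
# from every row; quit when a pass yields no non-null results,
# mirroring the .RecordsAffected = 0 check.

def get_arg(text, n):
    """Stand-in for the query UDF: the nth delimited item, or None."""
    parts = text.split("/")
    return parts[n] if n < len(parts) else None

def split_passes(values):
    results = []  # (source_index, item) pairs, as the INSERT passes would produce
    n = 0
    while True:
        affected = 0
        for i, text in enumerate(values):
            item = get_arg(text, n)
            if item is not None:          # the WHERE clause excluding nulls
                results.append((i, item))
                affected += 1
        if affected == 0:                 # .RecordsAffected = 0 -> quit
            break
        n += 1
    return results

# split_passes(["a/b", "c"]) -> [(0, "a"), (1, "c"), (0, "b")]
```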

Sounds good, but with a mere 8000 input rows, I had to cancel the first pass
after waiting about 20 minutes. Note that a whole import process takes about
3 minutes if this part is omitted, and that includes about 40 vertical split
passes. The UDF is very simple, and performs quite fast from VB, but I guess
the overhead of calling this function from a query is VERY SEVERE. Just to
get this thing out the door, I've decided to just not split these for now
since we're not doing queries of that data yet.

So, my next thought is that, perhaps this step is better done by simply,
brute-force, cycling through a recordset of the input table, and inserting
rows into the destination tables. If I'm doing that, though, then am I really
getting any benefit worth speaking of by doing everything -else- as SQL-DML in
vertical slices, or would I have been much better off, just doing it the
procedural way, walking through the input table, and creating destination rows
one row at a time? I know it would have been easier to write, but here I was
trying to do things the "right" way.

If I do something like this again, would my code end up performing just as
well, and being easier to write and maintain using simple iteration rather
than SQL-DML?

Thanks for any opinions,

- Steve J
Nov 12 '05 #1
20 Replies


> If I do something like this again, would my code end up performing just as
well, and being easier to write and maintain using simple iteration rather
than SQL-DML?
My guess is that you can build a fairly specialized optimization here.
After all, the Jet engine is not the heaviest in the world, and its
query optimizers can do only so much for true DML.
I have a query that calculates several values with simple IIFs, and it
takes three minutes on a simple 20,000-record table!

Somewhere down the line, any SQL statement must be executed
procedurally; SQL was invented only so that we humans can focus attention
on the what, not the how.

I'd say: use the procedure. Bonus: you can display your own progress
indicator. A slow function with feedback is often perceived as quicker
than a slow function without feedback, even if it runs several seconds
quicker not having to update the display.
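A cheap way to get that feedback without paying for it on every row is to update the display only every N rows; a Python sketch (names are illustrative):

```python
# Sketch: report progress only every `step` rows, so display updates
# don't dominate the runtime of a row-by-row import loop.
def process_with_progress(rows, handle_row, step=500, report=print):
    total = len(rows)
    for i, row in enumerate(rows, start=1):
        handle_row(row)
        if i % step == 0 or i == total:   # throttle the display updates
            report(f"{i}/{total} rows")
```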

Steve Jorgensen wrote:
All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query to
pull out, in successive query slices, argument 1, argument 2, etc. I was
going to use a where clause to exclude null results (input doesn't have
arguments n or above), and quit after the first pass with .RecordsAffected =
0.


Sometimes you can invent a preprocessor (this time inside the
spreadsheet); is that possible? I'd like to explore the point but see
little value doing that in the group (eating up everyone's bandwidth,
verbose as I tend to be). If you have time, mail me. My domain is not
org but nl.

--
Bas Cost Budde

Nov 12 '05 #2

On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <ba*@heuveltop.org> wrote:
If I do something like this again, would my code end up performing just as
well, and being easier to write and maintain using simple iteration rather
than SQL-DML?
My guess is that you can build a fairly specialized optimization here.
After all, the Jet engine is not the heaviest in the world, and its
query optimizers can do only so much for true DML.
I have a query that calculates several values with simple IIFs, and it
takes three minutes on a simple 20,000-record table!


I generally have much better performance than that. In fact, if we're not
talking about a multi-processor or multi-user system, I generally find that
JET is faster than most SQL Servers since it has fewer contingencies to deal
with.
I'd say: use the procedure. Bonus: you can display your own progress
indicator. A slow function with feedback is often perceived as quicker
than a slow function without feedback, even if it runs several seconds
quicker not having to update the display.
Actually, since most of my vertical slice insert queries run in under 2 or 3
seconds, I have a pretty good progress bar already - until I get to the
multi-value split, that is.

Steve Jorgensen wrote:

All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query to
pull out, in successive query slices, argument 1, argument 2, etc. I was
going to use a where clause to exclude null results (input doesn't have
arguments n or above), and quit after the first pass with .RecordsAffected =
0.
Sometimes you can invent a preprocessor (this time inside the
spreadsheet); is that possible? I'd like to explore the point but see


Actually, I do a small amount of pre-processing in the spreadsheet (just in
the top few rows) to make it into something Access will import properly, then I
import that into a staging database, and link to that from the Access importer
database app (also linked to the target back-end). The transformation,
import, and link steps take about 6 or 7 seconds altogether. From there, it's
an ordinary table (albeit, with all Text columns), so I have my choice of DML
or looping to do the complicated part from there.
little value doing that in the group (eating up everyone's bandwidth,
verbose as I tend to be). If you have time, mail me. My domain is not
org but nl.


I don't need that kind of detail. I was really just looking for people's
experience/opinions of the value of DML vs. looping for this kind of job.
Nov 12 '05 #3

On Thu, 29 Jan 2004 09:44:01 +0100, Bas Cost Budde <ba*@heuveltop.org> wrote:

Ah - forgot a point...
Somewhere down the line, any SQL statement must be executed
procedurally; SQL was invented only so that we humans can focus attention
on the what, not the how.


Well, it's more than that, though. Using procedural code, every single
read/write must go through several code layers. The SQL engine is designed to
handle set operations optimally without crossing all those layers, and it can
use covering indexes (a concept hidden from VB code), etc. Still, it may be
that for a data transformation operation, those advantages are not realized,
or worse.
Nov 12 '05 #4

Steve,

I'm assuming you need to separate the delimited fields into a set of
fields. Below is code that would separate them out in a query, which
should run fairly quickly. (If you need to create multiple rows for
each source string, the approach won't help.)

You could use InStr() to locate the delimiters and then generate the
fields based on that knowledge.

Field1: Left([SourceField],(InStr(1,[SourceField],',')-1))
Balance1: Mid([SourceField],(InStr(1,[SourceField],',')+1))
Field2: Left([Balance1],(InStr(1,[Balance1],',')-1))
Balance2: Mid([Balance1],(InStr(1,[Balance1],',')+1))
Field3: IIf(InStr(1,[Balance2],',')=0,Null,Left([Balance2],(InStr(1,[Balance2],',')-1)))
Balance3: IIf(IsNull([Field3]),'',Mid([Balance2],(InStr(1,[Balance2],',')+1)))
Field4: IIf(InStr(1,[Balance3],',')=0,'',Left([Balance3],(InStr(1,[Balance3],',')-1)))
I think you get the idea.

If the number of delimited components varies between records, you will
need to add code to test for IIf( InStr(xxx) = 0, '', ... as above). I
set up Field3, Balance3, and Field4 with the needed logic.

I admit it's a bit of a pain to write, but once the pattern is set, the
statements are easy to derive. I think it should run more quickly
than a user-defined function in a module.

Bruce Pick

Steve Jorgensen wrote:
Hi all,
. . .
All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query to
pull out, in successive query slices, argument 1, argument 2, etc.
. .
Sounds good, but with a mere 8000 input rows, I had to cancel the first pass
after waiting about 20 minutes.
.. .
Thanks for any opinions,

- Steve J

Nov 12 '05 #5

No, that's not it at all. I need to split the items out into rows in another
table with a 1-M relationship, and the number of items varies from row to row.
I have certainly written code like your example before, and could generate that
pattern for a series of queries to extract one argument at a time, but I
believe I would hit the "Expression too complex" error condition long before I
got to the last argument (try about 40).

On Thu, 29 Jan 2004 09:57:53 -0500, Bruce Pick <br********@comcast.net> wrote:
Steve,

I'm assuming you need to separate the delimited fields into a set of
fields. Below is code that would separate them out in a query, which
should run fairly quickly. (If you need to create multiple rows for
each source string, the approach won't help.)

You could use InStr() to locate the delimiters and then generate the
fields based on that knowledge.

Field1: Left([SourceField],(InStr(1,[SourceField],',')-1))
Balance1: Mid([SourceField],(InStr(1,[SourceField],',')+1))
Field2: Left([Balance1],(InStr(1,[Balance1],',')-1))
Balance2: Mid([Balance1],(InStr(1,[Balance1],',')+1))
Field3: IIf(InStr(1,[Balance2],',')=0,Null,Left([Balance2],(InStr(1,[Balance2],',')-1)))
Balance3: IIf(IsNull([Field3]),'',Mid([Balance2],(InStr(1,[Balance2],',')+1)))
Field4: IIf(InStr(1,[Balance3],',')=0,'',Left([Balance3],(InStr(1,[Balance3],',')-1)))
I think you get the idea.

If the number of delimited components varies between records, you will
need to add code to test for IIf( InStr(xxx) = 0, '', ... as above). I
set up Field3, Balance3, and Field4 with the needed logic.

I admit it's a bit of a pain to write, but once the pattern is set, the
statements are easy to derive. I think it should run more quickly
than a user-defined function in a module.

Bruce Pick

Steve Jorgensen wrote:
Hi all,
. . .
All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query to
pull out, in successive query slices, argument 1, argument 2, etc.
. .
Sounds good, but with a mere 8000 input rows, I had to cancel the first pass
after waiting about 20 minutes.
.. .
Thanks for any opinions,

- Steve J


Nov 12 '05 #6

Steve Jorgensen <no****@nospam.nospam> wrote in
news:oa********************************@4ax.com:
Sounds good, but with a mere 8000 input rows, I had to cancel the
first pass after waiting about 20 minutes. Note that a whole
import process takes about 3 minutes if this part is omitted, and
that includes about 40 vertical split passes. The UDF is very
simple, and performs quite fast from VB, but I guess the overhead
of calling this function from a query is VERY SEVERE. Just to get
this thing out the door, I've decided to just not split these for
now since we're not doing queries of that data yet.


I did this about two years ago with multiple fields with
comma-separated values.

My solution was this:

1. first, determine the distribution of multiple values. That is,
what percent of the table has one value? What percent, two? Three?
And so on.

2. in my case, 95% of the records had 4 or fewer values, so I
created four additional fields in my processing table.

3. I then copied the values that have no commas (only one value)
into the first of the processing columns.

4. then, in a query that eliminates the rows without commas in the
original column, copy the value before the first comma into the
second column.

5. then, in a query that eliminates the rows that have no commas in
the source column and a value in the second target column, copy the
values between the 1st and 2nd commas into the third target column.

And so forth.

This is not too hard to do because you're doing it all in relatively
small sets, rather than trying to process the whole thing with a
single multi-purpose subroutine.

If this isn't clear, let me know and I'll outline it more clearly.
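Sketched in Python (the comma delimiter matches the example above; the rest of the names are illustrative):

```python
# Sketch of the column-peeling approach: first inspect the distribution
# of value counts, then peel the first k values of each row into separate
# columns, leaving any overflow in a still-multi-valued remainder column.
from collections import Counter

def value_distribution(values):
    """Step 1: how many rows have 1 value, 2 values, ..."""
    return Counter(v.count(",") + 1 for v in values)

def peel(values, k):
    """Steps 2-5: split each row's first k values into columns."""
    table = []
    for v in values:
        parts = v.split(",", k)           # at most k splits
        cols = parts[:k] + [None] * (k - len(parts[:k]))
        remainder = parts[k] if len(parts) > k else None
        table.append((cols, remainder))
    return table

dist = value_distribution(["a", "a,b", "a,b,c,d,e"])
# dist -> {1: 1, 2: 1, 5: 1}
```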

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #7

On Thu, 29 Jan 2004 20:29:55 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Steve Jorgensen <no****@nospam.nospam> wrote in
news:oa********************************@4ax.com :
Sounds good, but with a mere 8000 input rows, I had to cancel the
first pass after waiting about 20 minutes. Note that a whole
import process takes about 3 minutes if this part is omitted, and
that includes about 40 vertical split passes. The UDF is very
simple, and performs quite fast from VB, but I guess the overhead
of calling this function from a query is VERY SEVERE. Just to get
this thing out the door, I've decided to just not split these for
now since we're not doing queries of that data yet.


I did this about two years ago with multiple fields with
comma-separated values.

My solution was this:

1. first, determine the distribution of multiple values. That is,
what percent of the table has one value? What percent, two? Three?
And so on.

2. in my cases, 95% of the records had 4 or fewer values, so I
created four additional fields in my processing table.

3. I then copied the values that have no commas (only one value)
into the first of the processing columns.

4. then, in a query that eliminates the rows without commas in the
original column, copy the value before the first comma into the
second column.

5. then, in a query that eliminates the rows that have no commas in
the source column and a value in the second target column, copy the
values between the 1st and 2nd commas into the third target column.

And so forth.

This is not too hard to do because you're doing it all in relatively
small sets, rather than trying to process the whole thing with a
single multi-purpose subroutine.

If this isn't clear, let me know and I'll outline it more clearly.


Well, I'm not sure that's much help in this case. It's not a field that was
originally intended to be single-valued that then had some multi-valued data
put in; it is an export from a system with no other way to represent 1-to-many
data in a spreadsheet than to put it in a multi-valued field. Out of 8000
records, there are about 2500 unique combinations (using GROUP BY).

I think I've figured out that a hybrid approach might be best for my current
situation, given that I already have all the infrastructure to build the DML
parts. The idea is that I query a recordset with all the unique combinations,
cycle through that, and parse out the arguments in each one, build a
collection of unique values (2500 combinations, but probably only a few
hundred unique items), then insert the collection items into the lookup table.
From there, I should be able to join from the source table to the lookup using
a Like expression ("/" + combination + "/" Like "*/" + delimiter + item-text +
delimiter + "/*") to populate the junction table.

I think/hope this will do the job.
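In Python terms, the hybrid plan looks roughly like this (the slash delimiter matches the Like expression above; the names are made up):

```python
# Sketch of the hybrid plan: parse the unique combinations once to build
# the lookup, then match rows to lookup items the way the Like expression
# would: wrap both sides in the delimiter so "b" matches "/a/b/c/" but
# not "/ab/c/".
def build_lookup(combinations, delim="/"):
    items = {}
    for combo in combinations:
        for item in combo.split(delim):
            if item and item not in items:
                items[item] = len(items) + 1
    return items

def junction_rows(combinations, lookup, delim="/"):
    rows = []
    for row_id, combo in enumerate(combinations, start=1):
        wrapped = delim + combo + delim
        for item, item_id in lookup.items():
            if delim + item + delim in wrapped:   # the Like "*/item/*" test
                rows.append((row_id, item_id))
    return rows

combos = ["a/b", "b/c"]
lk = build_lookup(combos)          # {"a": 1, "b": 2, "c": 3}
jr = junction_rows(combos, lk)     # [(1, 1), (1, 2), (2, 2), (2, 3)]
```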

What is your opinion if I do another project like this in the future? DML, or
loops?
Nov 12 '05 #8

Steve Jorgensen <no****@nospam.nospam> wrote in
news:o6********************************@4ax.com:
Well, I'm not sure that's much help in this case. It's not a
field that was originally intended to be single-valued that then had
some multi-valued data put in; it is an export from a system with
no other way to represent 1-to-many data in a spreadsheet than to
put it in a multi-valued field. Out of 8000 records, there are
about 2500 unique combinations (using GROUP BY).
With 8000->2500, I'm not sure there's any real benefit in processing
unique values.

Either way, though, I'd first process the values into columns and
then process the columns into rows. But if more than half the
records have more than 10 values in them, then it would become more
unwieldy, I think, than just walking the table and creating the new
records.

I'd do that in two steps of course, as you're likely to want a
lookup here, so you'll need both a list of the values connected to
the source record and a list of the unique values to pick from in
new records. So, I'd create the 1:N records (FK + Value), then
create the lookup table with a DISTINCT, then join on the value to
populate the lookup key.

But with only 8000 records, I think I'd do the record creation by
walking the recordset and then walking the field and parsing.
I think I've figured out that a hybrid approach might be best for
my current situation, given that I already have all the
infrastructure to build the DML parts. . . .
DML? Perhaps I should not have ignored the fact that I don't know
what DML means?
. . . The idea is that I query a
recordset with all the unique combinations, cycle through that,
and parse out the arguments in each one, build a collection of
unique values (2500 combinations, but probably only a few hundred
unique items), then insert the collection items into the lookup
table. From there, I should be able to join from the source table to
the lookup using a Like expression ("/" + combination + "/" Like
"*/" + delimiter + item-text + delimiter + "/*") to populate the
junction table.
I think/hope this will do the job.
Why join in that fashion?

Why not do this: walk the source table, parse each value and insert
a record with the source PK and the value into a temporary table,
then from that table, create another table with the unique values
(this will be your lookup table, with a PK), then join these two new
tables on the value field and insert the PK from the source table
and the PK from the lookup table into your junction table.

With very large numbers of records, I'd break the first step into 2
parts, first parsing the values into columns and then inserting the
records from the columns.
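That pipeline, sketched end-to-end in Python (the names and the slash delimiter are illustrative):

```python
# End-to-end sketch of the plan above: parse each source value into
# (source PK, item) pairs, derive the distinct-value lookup, then join
# the pairs to the lookup to produce junction-table rows.
def parse_pairs(source, delim="/"):
    """source: {pk: multi-valued text}. Step 1: the temporary table."""
    return [(pk, item) for pk, text in source.items()
            for item in text.split(delim) if item]

def make_lookup(pairs):
    """Step 2: SELECT DISTINCT value -> lookup table with its own PK."""
    lookup = {}
    for _, item in pairs:
        lookup.setdefault(item, len(lookup) + 1)
    return lookup

def make_junction(pairs, lookup):
    """Step 3: join the pairs to the lookup on the value field."""
    return [(pk, lookup[item]) for pk, item in pairs]

src = {101: "a/b", 102: "b"}
pairs = parse_pairs(src)               # [(101, "a"), (101, "b"), (102, "b")]
junction = make_junction(pairs, make_lookup(pairs))
# junction -> [(101, 1), (101, 2), (102, 2)]
```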
What is your opinion if I do another project like this in the
future? DML, or loops?


Depends on the number of records and the number of values in the
multi-value field. With larger numbers of records and/or larger
numbers of values in the multi-value field, I'd tend to process the
values into columns first and then create records from the columns.
With smaller recordcounts (as in your case), I might do it directly,
especially if the number of values stored in each field is large.

The last time I did this I was processing a phone number field that
had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
a handful had 4 and very small number had more than that. I
processed into 4 columns, with the 4th column still being
multi-valued, but with only a small number of records to process
(indeed, it was such a small number that I believe I created the
real records manually!). My experience with this kind of data is
that you have lots of records with 1 or 2 values, about half as many
with 3, and then a quick falloff from there, with 10% or less having
more than that. But it depends entirely on the kind of data. Phone
numbers have pretty much an upper limit, but other kinds of data
will not.

I guess what I'm saying is that with large numbers of records, I'd
separate the process of parsing the multi-value field from the
process of creating the records, for performance purposes.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #9

On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Steve Jorgensen <no****@nospam.nospam> wrote in
news:o6********************************@4ax.com :
Well, I'm not sure that's much help in this case. It's not a
field that was originally intended to be single-valued that then had
some multi-valued data put in; it is an export from a system with
no other way to represent 1-to-many data in a spreadsheet than to
put it in a multi-valued field. Out of 8000 records, there are
about 2500 unique combinations (using GROUP BY).
With 8000->2500, I'm not sure there's any real benefit in processing
unique values.


Well, the Group By seems to run pretty fast. Intuitively, it seems like
walking more than 3x as many rows to parse would waste more time than the
Group By. Of course, it's trivial to try it both ways, and time it.
Either way, though, I'd first process the values into columns and
then process the columns into rows. But if more than half the
records have more than 10 values in them, then it would become more
unwieldy, I think, than just walking the table and creating the new
records.
I think it's borderline on that one. I'm not certain I know what you are
advocating, though. Are you saying to add columns to the import table, and
step-wise run queries to peel off argument 1 to the first added column,
removing it from the original source column (leaving argument 2 as argument
1), then query again to peel off argument 1 to the second added column, etc.?
I'd do that in two steps of course, as you're likely to want a
lookup here, so you'll need both a list of the values connected to
the source record and a list of the unique values to pick from in
new records. So, I'd create the 1:N records (FK + Value), then
create the lookup table with a DISTINCT, then join on the value to
populate the lookup key.
Right, that's pretty much how I'm handling all the other single-valued fields
with repeated text already. It's working great.
But with only 8000 records, I think I'd do the record creation by
walking the recordset and then walking the field and parsing.
Yeah. If I get what you're saying, I don't think ripping arguments into 10 or
40 new columns will be a good thing, and anything short of the 40 or so means
writing more code to handle the arguments past #10 or so differently than the
rest.
I think I've figured out that a hybrid approach might be best for
my current situation, given that I already have all the
infrastructure to build the DML parts. . . .


DML? Perhaps I should not have ignored the fact that I don't know
what DML means?


Oh, DML (Data Manipulation Language) is just the SQL that isn't DDL (Data
Definition Language); SQL consists of DDL and DML. To me, use of the term DML
implies a focus on INSERT/UPDATE queries rather than simple SELECTs, though I
believe SELECT is still considered part of DML. I welcome anyone's
corrections on these facts.
. . . The idea is that I query a
recordset with all the unique combinations, cycle through that,
and parse out the arguments in each one, build a collection of
unique values (2500 combinations, but probably only a few hundred
unique items), then insert the collection items into the lookup
table. From there, I should be able to join from the source table to
the lookup using a Like expression ("/" + combination + "/" Like
"*/" + delimiter + item-text + delimiter + "/*") to populate the
junction table.
I think/hope this will do the job.


Why join in that fashion?

Why not do this: walk the source table, parse each value and insert
a record with the source PK and the value into a temporary table,
then from that table, create another table with the unique values
(this will be your lookup table, with a PK), then join these two new
tables on the value field and insert the PK from the source table
and the PK from the lookup table into your junction table.


I'm guessing that this join, even though it would be an extra query that will
not benefit from any indexes, could be quicker than executing 10 to 40
individual inserts from code per source row as I go. I dunno, though. Your
way is probably easier to write.
With very large numbers of records, I'd break the first step into 2
parts, first parsing the values into columns and then inserting the
records from the columns.
What is your opinion if I do another project like this in the
future? DML, or loops?


Depends on the number of records and the number of values in the
multi-value field. With larger numbers of records and/or larger
numbers of values in the multi-value field, I'd tend to process the
values into columns first and then create records from the columns.
With smaller recordcounts (as in your case), I might do it directly,
especially if the number of values stored in each field is large.

The last time I did this I was processing a phone number field that
had multiple values in it. Most of the fields had 1, 2 or 3 numbers,
a handful had 4 and very small number had more than that. I
processed into 4 columns, with the 4th column still being
multi-valued, but with only a small number of records to process
(indeed, it was such a small number that I believe I created the
real records manually!). My experience with this kind of data is
that you have lots of records with 1 or 2 values, about half as many
with 3, and then a quick falloff from there, with 10% or less having
more than that. But it depends entirely on the kind of data. Phone
numbers have pretty much an upper limit, but other kinds of data
will not.

I guess what I'm saying is that with large numbers of records, I'd
separate the process of parsing the multi-value field from the
process of creating the records, for performance purposes.


I guess that sounds reasonable. I think I would have done pretty much the
same thing for a field with mostly one or 2 values per row. It was only this
strange case of having a 1-M stuffed into a column that brought me this
headache to begin with.

The other issue I'm seeing, though, is that the code would probably have been
simpler, cost less to write, and would now be simpler to maintain if I had
used loops instead of queries. Since this is a monthly import, if the
processing took less than twice as long as now, perhaps, the simplicity alone
would have been sufficient reason to just do the loops, eh? Of course, a
slower process means testing less often during development, and that means
more difficult debugging sessions each time, so perhaps it would not have cost
less to write, but I still think I could have made it more legible for
maintenance that way.
Nov 12 '05 #10

Steve Jorgensen <no****@nospam.nospam> wrote in
news:0v********************************@4ax.com:
On Fri, 30 Jan 2004 16:34:18 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Steve Jorgensen <no****@nospam.nospam> wrote in
news:o6********************************@4ax.co m:
Well, I'm not sure that's much help in this case. It's not a
field that was originally intended to be single-valued that then had
some multi-valued data put in; it is an export from a system
with no other way to represent 1-to-many data in a spreadsheet
than to put it in a multi-valued field. Out of 8000 records,
there are about 2500 unique combinations (using GROUP BY).


With 8000->2500, I'm not sure there's any real benefit in
processing unique values.


Well, the Group By seems to run pretty fast. Intuitively, it
seems like walking more than 3x as many rows to parse would waste
more time than the Group By. Of course, it's trivial to try it
both ways, and time it.


But you'll end up with unique values that have no meaning, and
really only save you a tiny amount of time. On the other hand, it
doesn't necessarily take very much time to do it, so perhaps it's a
wash.
Either way, though, I'd first process the values into columns and
then process the columns into rows. But if more than half the
records have more than 10 values in them, then it would become
more unwieldy, I think, than just walking the table and creating
the new records.


I think it's borderline on that one. I'm not certain I know what
you are advocating, though. Are you saying to add columns to the
import table, and step-wise run queries to peel off argument 1 to
the first added column, removing it from the original source
column (leaving argument 2 as argument 1), then query again to
peel off argument 1 to the second added column, etc.?


That's one way to do it. The other way is to have a single query do
it. Or to write code that will write the SQL for you.
I'd do that in two steps of course, as you're likely to want a
lookup here, so you'll need both a list of the values connected to
the source record and a list of the unique values to pick from in
new records. So, I'd create the 1:N records (FK + Value), then
create the lookup table with a DISTINCT, then join on the value to
populate the lookup key.


Right, that's pretty much how I'm handling all the other
single-valued fields with repeated text already. It's working
great.
But with only 8000 records, I think I'd do the record creation by
walking the recordset and then walking the field and parsing.


Yeah. If I get what you're saying, I don't think ripping
arguments into 10 or 40 new columns will be a good thing, and
anything short of the 40 or so means writing more code to handle
the arguments past #10 or so differently than the rest.


Well, I'm rather shocked at the idea that there'd be 40 separate
values stored in a single field. I know you said it was a
spreadsheet, but what kind of morons would write a spreadsheet that
complex when they obviously need a database?
I think I've figured out that a hybrid apprach might be best for
my current situation, given that I already have all the
infrastructure to build the DML parts. . . .


DML? Perhaps I should not have ignored the fact that I don't know
what DML means?


Oh, it's just SQL that's not DDL. SQL consists of DDL and DML.
To me, use of the term DML implies a focus on INSERT/UPDATE
queries rather than simple SELECTS, though I believe SELECT is
still considered part of DML. I welcome anyone's corrections on
these facts.


I've never heard that terminology before.

Of course, I never use DDL, either.
. . . The idea is that I query a
recordset with all the unique combinations, cycle through that,
and parse out the arguments in each one, build a collection of
unique values (2500 combinations, but probably only a few
hundred unique items), then insert the collection items into the
lookup table. From there, I should be able to join from the source
table to the lookup using a Like expression ("/" + combination +
"/" Like "*/" + delimiter + item-text + delimier + "/*") to
populate the junction table.
I think/hope this will do the job.
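The Like-join described above can be sketched like this (Python with SQLite as a stand-in for Jet; SQLite's `%` wildcard plays the role of Jet's `*`, and all names are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE tblSource (SourceID INTEGER PRIMARY KEY, Codes TEXT);
INSERT INTO tblSource (Codes) VALUES ('A/B'), ('B/C/D');
CREATE TABLE tblLookup (LookupID INTEGER PRIMARY KEY, Code TEXT);
INSERT INTO tblLookup (Code) VALUES ('A'), ('B'), ('C'), ('D');
CREATE TABLE tblJunction (SourceID INTEGER, LookupID INTEGER);
""")

# Pad the combination with the delimiter on both ends so every item is
# delimiter-bounded, then match each lookup item as "%/item/%".
cur.execute("""
INSERT INTO tblJunction (SourceID, LookupID)
SELECT s.SourceID, l.LookupID
FROM tblSource AS s, tblLookup AS l
WHERE '/' || s.Codes || '/' LIKE '%/' || l.Code || '/%'
""")
con.commit()
```

Note the cross join: every source row is tested against every lookup item, which is why this cannot use an index — the trade-off being discussed below.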


Why join in that fashion?

Why not do this: walk the source table, parse each value and
insert a record with the source PK and the value into a temporary
table, then from that table, create another table with the unique
values (this will be your lookup table, with a PK), then join
these two new tables on the value field and insert the PK from the
source table and the PK from the lookup table into your junction
table.
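That suggested walk, sketched in plain Python (the dict stands in for the source table, with invented PKs and a "/"-delimited field):

```python
# Hypothetical source rows: PK -> multi-valued field.
rows = {1: "A/B", 2: "B/C/D"}

# Walk the source, parse each value into (source PK, value) pairs --
# this list plays the role of the temporary table.
detail = [(pk, value) for pk, field in rows.items()
          for value in field.split("/")]

# The unique values become the lookup table, with generated PKs.
lookup = {v: i + 1 for i, v in enumerate(sorted({v for _, v in detail}))}

# "Join" the two on the value field to build the junction rows
# (source PK, lookup PK).
junction = [(pk, lookup[v]) for pk, v in detail]
```

In Access this would be a DAO loop inserting into a real temp table, but the shape of the work is identical.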


I'm guessing that this join, even though it would be an extra
query that will not benefit from any indexes, could be quicker
than executing 10 to 40 individual inserts from code per source
row as I go. I dunno, though. Your way is probably easier to
write.


A join on an expression faster than individual queries? I don't
think so!
With very large numbers of records, I'd break the first step into
2 parts, first parsing the values into columns and then inserting
the records from the columns.
What is your opinion if I do another project like this in the
future? DML, or loops?


Depends on the number of records and the number of values in the
multi-value field. With larger numbers of records and/or larger
numbers of values in the multi-value field, I'd tend to process
the values into columns first and then create records from the
columns. With smaller recordcounts (as in your case), I might do
it directly,
especially if the number of values stored in each field is large.

The last time I did this I was processing a phone number field
that had multiple values in it. Most of the fields had 1, 2 or 3
numbers, a handful had 4 and very small number had more than that.
I processed into 4 columns, with the 4th column still being
multi-valued, but with only a small number of records to process
(indeed, it was such a small number that I believe I created the
real records manually!). My experience with this kind of data is
that you have lots of records with 1 or 2 values, about half as
many with 3, and then a quick falloff from there, with 10% or less
having more than that. But it depends entirely on the kind of
data. Phone numbers have pretty much an upper limit, but other
kinds of data will not.

I guess what I'm saying is that with large numbers of records, I'd
separate the process of parsing the multi-value field from the
process of creating the records, for performance purposes.
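A sketch of that separation (Python; the field contents, delimiter, and column count are all made up for illustration): pass 1 parses the multi-value field into a fixed set of columns, pass 2 makes one set-based sweep per column.

```python
# Hypothetical source rows: PK -> multi-valued phone field.
rows = {1: "555-0100;555-0101", 2: "555-0200"}
N = 3  # assumed maximum number of values worth a dedicated column

# Pass 1: parse into N columns, padding short lists with None.
parsed = {pk: (field.split(";") + [None] * N)[:N]
          for pk, field in rows.items()}

# Pass 2: one sweep per column, skipping the empty slots.
records = [(pk, cols[i]) for i in range(N)
           for pk, cols in parsed.items() if cols[i] is not None]
```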


I guess that sounds reasonable. I think I would have done pretty
much the same thing for a field with mostly one or 2 values per
row. It was only this strange case of having a 1-M stuffed into a
column that bought me this headache to begin with.

The other issue I'm seeing, though, is that the code would
probably have been simpler, cost less to write, and would now be
simpler to maintain if I had used loops instead of queries. Since
this is a monthly import, if the processing took less than twice
as long as now, perhaps, the simplicity alone would have been
sufficient reason to just do the loops, eh? Of course, a slower
process means testing less often during development, and that
means more difficult debugging sessions each time, so perhaps it
would not have cost less to write, but I still think I could have
made it more legible for maintenance that way.


What kind of source application is built around such a poorly
designed spreadsheet?

Is there any plan to get away from such a bad design?

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #11

On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:

....
Yeah. If I get what you're saying, I don't think ripping
arguments into 10 or 40 new columns will be a good thing, and
anything short of the 40 or so means writing more code to handle
the arguments past #10 or so differently than the rest.


Well, I'm rather shocked at the idea that there'd be 40 separate
values stored in a single field. I know you said it was a
spreadsheet, but what kind of morons would write a spreadsheet that
complex when they obviously need a database?


It probably is stored in a database inside the proprietary system it comes
from. What I get is the spreadsheet exported from that system. Also, the
items in the export are very short codes.
I think I've figured out that a hybrid approach might be best for
my current situation, given that I already have all the
infrastructure to build the DML parts. . . .
....
. . . The idea is that I query a
recordset with all the unique combinations, cycle through that,
and parse out the arguments in each one, build a collection of
unique values (2500 combinations, but probably only a few
hundred unique items), then insert the collection items into the
lookup table. From there, I should be able to join from the source
table to the lookup using a Like expression ("/" + combination +
"/" Like "*/" + delimiter + item-text + delimiter + "/*") to
populate the junction table.
I think/hope this will do the job.

Why join in that fashion?

Why not do this: walk the source table, parse each value and
insert a record with the source PK and the value into a temporary
table, then from that table, create another table with the unique
values (this will be your lookup table, with a PK), then join
these two new tables on the value field and insert the PK from the
source table and the PK from the lookup table into your junction
table.


I'm guessing that this join, even though it would be an extra
query that will not benefit from any indexes, could be quicker
than executing 10 to 40 individual inserts from code per source
row as I go. I dunno, though. Your way is probably easier to
write.


A join on an expression faster than individual queries? I don't
think so!


Huh? Each query has to go through the whole VBA/DAO/JET layer 10-40 times per
input row. I should expect that to be quicker than running one query, and
letting JET do the whole thing at the JET engine level?
With very large numbers of records, I'd break the first step into
2 parts, first parsing the values into columns and then inserting
the records from the columns.

What is your opinion if I do another project like this in the
future? DML, or loops?
....
I guess what I'm saying is that with large numbers of records, I'd
separate the process of parsing the multi-value field from the
process of creating the records, for performance purposes.


I guess that sounds reasonable. I think I would have done pretty
much the same thing for a field with mostly one or 2 values per
row. It was only this strange case of having a 1-M stuffed into a
column that bought me this headache to begin with.

The other issue I'm seeing, though, is that the code would
probably have been simpler, cost less to write, and would now be
simpler to maintain if I had used loops instead of queries. Since
this is a monthly import, if the processing took less than twice
as long as now, perhaps, the simplicity alone would have been
sufficient reason to just do the loops, eh? Of course, a slower
process means testing less often during development, and that
means more difficult debugging sessions each time, so perhaps it
would not have cost less to write, but I still think I could have
made it more legible for maintenance that way.


What kind of source application is built around such a poorly
designed spreadsheet?

Is there any plan to get away from such a bad design?


No. It's the export format provided from the application we need to get the
data from. The vendor provides another version of the program with a querying
interface for analysis, but it's missing much of the data we need that is
exported in the spreadsheet.
Nov 12 '05 #12

On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote:

[snip]
All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query

[snip]

Did you consider using code to generate SQL statements, then executing
them? (Possibly within explicit transactions?)

--
Mike Sherrill
Information Management Systems
Nov 12 '05 #13

Steve Jorgensen <no****@nospam.nospam> wrote in
news:75********************************@4ax.com:
On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:

...
Yeah. If I get what you're saying, I don't think ripping
arguments into 10 or 40 new columns will be a good thing, and
anything short of the 40 or so means writing more code to handle
the arguments past #10 or so differently than the rest.


Well, I'm rather shocked at the idea that there'd be 40 separate
values stored in a single field. I know you said it was a
spreadsheet, but what kind of morons would write a spreadsheet
that complex when they obviously need a database?


It probably is stored in a database inside the proprietary system
it comes from. What I get is the spreadsheet exported from that
system. Also, the items in the export are very short codes.


So, you're normalizing data that has been denormalized for export?

How stupid is that?

Wouldn't it be better to have someone skip the spreadsheet and have
a normalized export process instead?

[]
What kind of source application is built around such a poorly
designed spreadsheet?

Is there any plan to get away from such a bad design?


No. It's the export format provided from the application we need
to get the data from. The vendor provides another version of the
program with a querying interface for analysis, but it's missing
much of the data we need that is exported in the spreadsheet.


Are you sure there's absolutely no access to the underlying data
structures?

This is the kind of thing that drives me crazy, having to program
something to undo something that has been extensively programmed
already. Any changes to the export will break your import routine,
for instance.

I recently replaced a client's system for importing data from MYOB
with direct connections via the MYOB ODBC, and vastly improved the
quality of data (previously, certain kinds of data were just not
available). I don't know if the application in question has any such
capability, but I would certainly let the client know that anything
you program is heavily contingent on there being no changes in the
output format at all.

--
David W. Fenton http://www.bway.net/~dfenton
dfenton at bway dot net http://www.bway.net/~dfassoc
Nov 12 '05 #14

On Fri, 30 Jan 2004 18:08:32 -0500, Mike Sherrill
<MS*************@compuserve.com> wrote:
On Thu, 29 Jan 2004 08:18:26 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote:

[snip]
All that was going swimmingly, and performing pretty well until I got to the
fields containing multiple, delimited values. My whole design is based on
using SQL/DML passes for everything, but the only way I could figure out to
make that work was to call a user defined function from within the query

[snip]

Did you consider using code to generate SQL statements, then executing
them? (Possibly within explicit transactions?)


Yes, that's precisely what I am doing.
Nov 12 '05 #15

On Sat, 31 Jan 2004 00:06:28 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Steve Jorgensen <no****@nospam.nospam> wrote in
news:75********************************@4ax.com:
On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:

...
Yeah. If I get what you're saying, I don't think ripping
arguments into 10 or 40 new columns will be a good thing, and
anything short of the 40 or so means writing more code to handle
the arguments past #10 or so differently than the rest.

Well, I'm rather shocked at the idea that there'd be 40 separate
values stored in a single field. I know you said it was a
spreadsheet, but what kind of morons would write a spreadsheet
that complex when they obviously need a database?


It probably is stored in a database inside the proprietary system
it comes from. What I get is the spreadsheet exported from that
system. Also, the items in the export are very short codes.


So, you're normalizing data that has been denormalized for export?

How stupid is that?

Wouldn't it be better to have someone skip the spreadsheet and have
a normalized export process instead?


Of course, that was the first thing I asked the client after they gave me the
requirements. It's a totally closed system. Since it provides valuable
market data on a subscription basis, presumably, they think giving customers
too much access would compete with their own analysis consulting business - so
we end up working around them.

It's not the first time I've seen this sort of thing, and it probably won't be
the last.

Nov 12 '05 #16


Steve,

On Sat, 31 Jan 2004 04:03:07 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote in comp.databases.ms-access:
Of course, that was the first thing I asked the client after they gave me the
requirements. It's a totally closed system. Since it provides valuable
market data on a subscription basis, presumably, they think giving customers
too much access would compete with their own analysis consulting business - so
we end up working around them.

It's not the first time I've seen this sort of thing, and it probably won't be
the last.


I've run into this before, under similar circumstances.

I would suggest the following. Just because the vendor only provides
a largely unsuitable export routine and doesn't expose the data
through standard means (ODBC, a proprietary driver, an embedded
programming language, etc.) does not mean that the data is not
accessible by more reliable means.

Consider the following. If there is any logical construction to the
data stored by this other app (and if there is any complexity at all
to the app, its data storage must be logically structured or it would
have been unmanageable) then the vendor probably used a third party
solution for storage. In other words, the data files probably are
some recognizable format, even if this is not acknowledged by the
vendor. Open the data file in an editor. What does it look like?
Jet? dBASE? SQL Server? MySQL? Are there any processes that need
to be run in order for this other app to work (i.e., is there a db
server of some sort)? If you can determine what they used for
storage, you can access the data directly without using their
poorly-designed export interface. I would recommend strongly though
that this export be enforced as an offline read-only process. You do
not want to impact performance of the other app, or interfere with
your licensing or provision for vendor support of the existing app.

But that said, in most cases, although the vendor protects the means
of obtaining and using the information (ie the protocols for updating
and the 'program' itself) this does not necessarily mean that the data
itself is owned or protected legally by the vendor. They may have
obfuscated its storage format and provided a controlled export process
to limit your ability to access the data, but in most cases, a
licensed user would not be prohibited from using the data via an
external routine, should you be able to devise one for them.

That said, everything is, of course, dependent upon the actual license
involved here. But in numerous cases where the license wasn't
prohibitive in this regard, and the data storage architecture was
obfuscated and the programs for accessing it limited, I have written
external routines for directly reading data from such systems for
clients who needed faster and more flexible access to their data. I'd
encourage you to take a closer look at both the license involved and
the proprietary (but almost inevitably non-custom) data storage
formats and decide for yourself whether there isn't a better way,
without taking your client's word for it. You may well find a lot
better solution than the one you are putting together now.

HTH

Peter Miller
____________________________________________________________
PK Solutions -- Data Recovery for Microsoft Access/Jet/SQL
Free quotes, Guaranteed lowest prices and best results
www.pksolutions.com 1.866.FILE.FIX 1.760.476.9051
Nov 12 '05 #17

On Fri, 30 Jan 2004 19:39:25 GMT, "David W. Fenton"
<dX********@bway.net.invalid> wrote:
Well, I'm rather shocked at the idea that there'd be 40 separate
values stored in a single field. I know you said it was a
spreadsheet, but what kind of morons would write a spreadsheet that
complex when they obviously need a database?


"Obvious" is in the eye of the beholder.

One of my first system conversions involved building a SQL database
out of 30 megs of WordPerfect (not a typo) tables. Only forty values
in a cell would have been a real treat.

--
Mike Sherrill
Information Management Systems
Nov 12 '05 #18

On Sat, 31 Jan 2004 04:00:27 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote:
Did you consider using code to generate SQL statements, then executing
them? (Possibly within explicit transactions?)


Yes, that's precisely what I am doing.


I got the impression your SQL statements were executing VBA functions.
I was talking (unclearly) of building SQL statements that contain only
literal values, no function calls.
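A sketch of what I take that to mean (Python; the table and column names are hypothetical): parse in code, then emit INSERT statements containing only literal values, so Jet never has to evaluate a UDF per row.

```python
def literal(text):
    """Quote a value as a SQL string literal, doubling embedded quotes."""
    return "'" + text.replace("'", "''") + "'"

# Pairs already parsed out of the multi-value field in code.
parsed = [(1, "A"), (1, "B"), (2, "O'Brien")]

# Each statement contains only literals -- no function calls for the
# engine to evaluate; these would be executed inside one transaction.
statements = [
    f"INSERT INTO tblDetail (SourceID, Value) VALUES ({pk}, {literal(v)});"
    for pk, v in parsed
]
```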

--
Mike Sherrill
Information Management Systems
Nov 12 '05 #19

On Sat, 31 Jan 2004 17:12:35 -0500, Mike Sherrill
<MS*************@compuserve.com> wrote:
On Sat, 31 Jan 2004 04:00:27 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote:
Did you consider using code to generate SQL statements, then executing
them? (Possibly within explicit transactions?)


Yes, that's precisely what I am doing.


I got the impression your SQL statements were executing VBA functions.
I was talking (unclearly) of building SQL statements that contain only
literal values, no function calls.


Oh, right. I see what you're getting at.

The issue was that the only way I could figure out to use a query to split up
the multi-valued fields was to call a UDF. That's because expressions that
can be built to do that using native functions (FWICS) would become longer and
more deeply nested for each successive argument number, and would quickly
exceed the limits of what JET could reasonably be expected to parse (long
before argument 40). If this were SQL Server, I think I could use PATINDEX to
find, say, the text between the 12th and 13th delimiter characters.
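The UDF in question reduces to something like this (sketched in Python; the real one was VBA called per row from the query, which is where the time went):

```python
def nth_item(text, n, delim="/"):
    """Return the nth delimited item (1-based), or None when the row has
    fewer than n values -- the rows the WHERE clause would then exclude."""
    parts = text.split(delim)
    return parts[n - 1] if n <= len(parts) else None
```

A one-line function, but evaluated once per row per pass through the 8000-row table — and with no way for Jet to optimize around it.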
Nov 12 '05 #20

On Sat, 31 Jan 2004 22:35:32 GMT, Steve Jorgensen
<no****@nospam.nospam> wrote:
I got the impression your SQL statements were executing VBA functions.
I was talking (unclearly) of building SQL statements that contain only
literal values, no function calls.


Oh, right. I see what you're getting at.

The issue was that the only way I could figure out to use a query to split up
the multi-valued fields was to call a UDF.


And you were trying to approach this in a set-based way. Sorry; I
lost the context somewhere.

Did you consider using third-party software like DataJunction? I had
a project that involved moving loads of data between a SQL system and
Lotus Notes, and a third-party program really made my life easier.
(It wasn't DataJunction, but it was the same kind of thing.)

--
Mike Sherrill
Information Management Systems
Nov 12 '05 #21
