Bytes IT Community

Need something similar to JOIN, but not exactly.

Expert 5K+
P: 8,435
Ok, this is a question I've asked here before, but never found the answer I was looking for. However, at that time I was investigating ways of approaching something, so I'll try simplifying the question to the technical details of one specific approach. Anyway...

Let's say I have a table Table1 with field DateTime1 (date/time).
And I have a table Table2 with fields Started and Finished (both date/time).
(Actually, the field formats should be irrelevant to the discussion).

I want to build some sort of query, if possible, which will group by the values in DateTime1 and, for each value, return a count of the records in Table2 with (Started <= DateTime1 and Finished >= DateTime1).
In other words, kind of like a simple join, but where two fields "bracket" or "surround" the original field, rather than one field matching it.

Can do?
Nov 15 '06 #1
44 Replies


NeoPa
Expert Mod 15k+
P: 31,186
Using Between (DateTime1 Between Started And Finished) in your WHERE clause, you can determine if there is a match.
SELECT DateTime1, Count(DateTime1) AS MatchCount
FROM Table1, Table2
WHERE DateTime1 Between Started And Finished
GROUP BY DateTime1
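For anyone wanting to try this outside Access, the query can be sanity-checked in Python with SQLite standing in; the table and field names are from the thread, the sample rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (DateTime1 TEXT);
    CREATE TABLE Table2 (Started TEXT, Finished TEXT);
    INSERT INTO Table1 VALUES ('2006-11-01 10:00'), ('2006-11-01 11:00');
    INSERT INTO Table2 VALUES
        ('2006-11-01 09:00', '2006-11-01 10:30'),
        ('2006-11-01 09:00', '2006-11-01 12:00');
""")
# The cross join pairs every Table1 row with every Table2 row; the
# BETWEEN filter keeps only the pairs where the interval brackets the time.
rows = con.execute("""
    SELECT DateTime1, Count(DateTime1) AS MatchCount
    FROM Table1, Table2
    WHERE DateTime1 BETWEEN Started AND Finished
    GROUP BY DateTime1
""").fetchall()
```

One caveat worth noting: with this implicit cross join, any DateTime1 value that matches no interval at all simply drops out of the result rather than showing a zero count.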
Nov 16 '06 #2

Expert 5K+
P: 8,435
Using Between (DateTime1 Between Started And Finished) in your WHERE clause, you can determine if there is a match.
SELECT DateTime1, Count(DateTime1) AS MatchCount
FROM Table1, Table2
WHERE DateTime1 Between Started And Finished
GROUP BY DateTime1
Just heading off home now, but I'll mull this over.

Thanks.
Nov 16 '06 #3

MMcCarthy
Expert Mod 10K+
P: 14,534
Ok, this is a question I've asked here before, but never found the answer I was looking for. However, at that time I was investigating ways of approaching something, so I'll try simplifying the question to the technical details of one specific approach. Anyway...

Let's say I have a table Table1 with field DateTime1 (date/time).
And I have a table Table2 with fields Started and Finished (both date/time).
(Actually, the field formats should be irrelevant to the discussion).

I want to build some sort of query, if possible, which will group by the values in DateTime1, and for each value, return a count of the number of records in Table2 with (Started <= DateTime1 and Finished >= DateTime1).
In other words, kind of like a simple join, but where two fields "bracket" or "surround" the original field, rather than one field matching it.

Can do?
You could also try ...

SELECT DateTime1, Sum(IIf(DLookup("[Started]","Table2","[Started]<=" & [DateTime1]) AND DLookup("[Finished]","Table2","[Finished]>=" & [DateTime1]),1,0))
FROM Table1
GROUP BY DateTime1;

I don't know which will have the better performance, but I'm guessing that because Adrian's has a full outer join, mine will be better. Maybe I'll win this one. Although with my luck it won't work.

Don't tell me you're winning the battle on this one Killer.

Mary
Nov 16 '06 #4

NeoPa
Expert Mod 15k+
P: 31,186
SELECT DateTime1, Sum(IIf(DLookup("[Started]","Table2","[Started]<=" & [DateTime1]) AND DLookup("[Finished]","Table2","[Finished]>=" & [DateTime1]),1,0))
FROM Table1
GROUP BY DateTime1;
Actually, because OUTER JOINs don't need either recordset to be sorted to match the join, I would expect it to run quite quickly.
However, if that doesn't prove true then this version of Mary's code might be easier to use (excuse this Mary - I'm not trying to be critical but this area is a strength of mine that most people struggle with to a certain extent - this processes through Table2 only once per Table1 record).
SELECT DateTime1,
   DCount('[Started]','Table2',
   Format([DateTime1],'\#m/d/yyyy\#') &
   ' Between [Started] And [Finished]')
FROM Table1
GROUP BY DateTime1
Nov 16 '06 #5

MMcCarthy
Expert Mod 10K+
P: 14,534
Actually, because OUTER JOINs don't need either recordset to be sorted to match the join, I would expect it to run quite quickly.
However, if that doesn't prove true then this version of Mary's code might be easier to use (excuse this Mary - I'm not trying to be critical but this area is a strength of mine that most people struggle with to a certain extent - this processes through Table2 only once per Table1 record).
Adrian

What have I told you about criticising me. How very DOUBLE DARE you.

But really ...



You know query performance is not one of my areas of strength. I love these discussions because I learn more about it all the time.

Feel free to correct my code anytime.

Mary
Nov 16 '06 #6

PEB
Expert 100+
P: 1,418
PEB
He he he
Where I see DLookup, DCount, DSum or anything else that begins with a D,

it drives me mad... And don't talk to me about performance for a query that has this kind of function in it...

So Killer, if you want to treat a lot of data with this SQL, replace the D-function with your own! It will run better, without giving wrong results and stale data!

See if you can use subqueries! But in this case it seems not possible!
Nov 18 '06 #7

NeoPa
Expert Mod 15k+
P: 31,186
Hee Hee.

Yes, Domain Aggregate functions (DThis & DThat) should probably come with a Government Health Warning ;).

However :
They can be useful to provide individual items of data (generally processing through a dataset just the once).
They can be easier to understand when SQL logic gets complicated (if you've never seen really complicated SQL logic - trust me, it can be brain-scrambling).
Generally, it's a very good rule of thumb that it's not a good idea to use them relative to each record in a recordset.
This is partly because the setting up and processing of the dataset is done over and over again for each record.

Obviously there are exceptions to this - many of which we've seen in these forums - where it's not practical or possible to do it any other way.

When there's a choice - don't use them.
Nov 18 '06 #8

Expert 5K+
P: 8,435
Thanks for all the input, people. Despite what I said on Friday, I haven't had a chance to mull it all over yet.

But when I find the time, I'll definitely be having a go, using the techniques expounded here.

Depending on the performance I can get out of all this (assuming for the moment that I can get it to work), I'll either use the stats directly from the query, or dump them into another table for future reference. Given that the data is static once loaded, the latter may be the way to go, it's just a bit of a nuisance that I'll have to add yet another step or two to the daily load procedure.

Anyway, I'll let you know how it goes, when it goes.
Nov 19 '06 #9

Expert 5K+
P: 8,435
Using Between (DateTime1 Between Started And Finished) in your WHERE clause, you can determine if there is a match.
SELECT DateTime1, Count(DateTime1) AS MatchCount
FROM Table1, Table2
WHERE DateTime1 Between Started And Finished
GROUP BY DateTime1
I've just tried this query. My CPU is currently redlining, cooling fan going berserk. This has been going on for 5 - 10 minutes now, and this is with only a small sample (maybe two months) of the date/time range.

I guess I'd better have a look at Mary's version. :)

(This was a quick first attempt, and obviously I may have interpreted your suggestion incorrectly - will check it out later.)
Nov 23 '06 #10

MMcCarthy
Expert Mod 10K+
P: 14,534
I've just tried this query. My CPU is currently redlining, cooling fan going berserk. This has been going on for 5 - 10 minutes now, and this is with only a small sample (maybe two months) of the date/time range.

I guess I'd better have a look at Mary's version. :)

(This was a quick first attempt, and obviously I may have interpreted your suggestion incorrectly - will check it out later.)
Let me know if I win!!

Mary
Nov 23 '06 #11

Expert 5K+
P: 8,435
Let me know if I win!!
Ok. But don't forget, the last code posted was NeoPa's enhanced version of your code. So it looks as though it'll be more of a team effort.

Sorry about that. :) I know how you like to hog the glory.

Still running, by the way. Let's see, that's about 1hr 10min, take off 25 minutes to compensate for theScripts adding it:(, so it has been going for about 45 minutes. Given that this is on a relatively small sample of my data, it might be time to interrupt it and try your version.
Nov 23 '06 #12

Expert 5K+
P: 8,435
Sorry, just to clarify...

(Showing my translation of field names etc, not originally posted code)

NeoPa posted some code (which I'm testing now)
SELECT StartTime, Count(StartTime) AS MatchCount
FROM DateTime2, Log
WHERE StartTime Between Started And Ended
GROUP BY StartTime
Then Mary posted an alternative which I will be trying out next
SELECT StartTime, Sum(IIf(DLookup("[Started]","Log","[Started]<=" & [StartTime])
 AND DLookup("[Ended]","Log","[Ended]>=" & [StartTime]),1,0))
FROM DateTime2
GROUP BY StartTime;
Then NeoPa suggested a modified version of that
SELECT StartTime,
   DCount('[Started]','Log',
   Format([StartTime],'\#m/d/yyyy\#') &
   ' Between [Started] And [Ended]')
FROM DateTime2
GROUP BY StartTime
Nov 23 '06 #13

MMcCarthy
Expert Mod 10K+
P: 14,534
Ok. But don't forget, the last code posted was NeoPa's enhanced version of your code. So it looks as though it'll be more of a team effort.

Sorry about that. :) I know how you like to hog the glory.

Still running, by the way. Let's see, that's about 1hr 10min, take off 25 minutes to compensate for theScripts adding it:(, so it has been going for about 45 minutes. Given that this is on a relatively small sample of my data, it might be time to interrupt it and try your version.
You know if my version runs faster I'll never let him live it down.

Hopefully.... :)

Mary
Nov 23 '06 #14

MMcCarthy
Expert Mod 10K+
P: 14,534
Damn, I knew it was too good to be true.

But, there is still a small glimmer of hope .... ;)

Mary
Nov 23 '06 #15

Expert 5K+
P: 8,435
Damn, I knew it was too good to be true.
But, there is still a small glimmer of hope .... ;)
Ok, I think I managed to interrupt it (or perhaps it just finished). Produced 72 lines of output. Which may or may not be right - I'll check later. The results look reasonable, though.

Now to try Mary's version...

By the way, I've had to modify this one a few times to get it to start (such as inserting # symbols around the date). Here's the latest...
SELECT DateTime2.StartTime,
Sum(IIf(DLookUp("[Started]","Log","[Started]<=#" & [StartTime] & "#")
And DLookUp("[Ended]","Log","[Ended]>=#" & [StartTime] & "#"),1,0)) AS Expr1
FROM DateTime2
GROUP BY DateTime2.StartTime;
Oh. It finished quite quickly, but produced a few "1" entries, then the rest zero. Hm...

Oh well, on to version 3...
Nov 23 '06 #16

Expert 5K+
P: 8,435
By the way, I probably should mention that I was originally using a VB program to produce these stats. The problem is that it's very slow. I mean, producing the stats for the last 3 years or so, the program ran for quite a few days.

Then when I found that I had needed to make a slight change and re-run it, I decided enough was enough:(. That's why I'm searching for a quicker (and hopefully simpler) way to generate these numbers.

The nice thing is, it means I have (or can easily generate) a complete set of results for comparison.
Nov 23 '06 #17

MMcCarthy
Expert Mod 10K+
P: 14,534
Now you're modifying my code, what is this, a conspiracy?

HEEEELP, the humans are revolting.

I have no idea what I was doing. I can't follow my own code. Now that is bad.

Mary
Nov 23 '06 #18

Expert 5K+
P: 8,435
Now you're modifying my code, what is this, a conspiracy?

HEEEELP, the humans are revolting.
Yeah, they are pretty disgusting... :)

I have no idea what I was doing. I can't follow my own code. Now that is bad.
No comment ;)

I'm considering giving up on the Access approach and going back to VB. But I'm thinking of modifying the technique used there. Originally I was doing something like
For Each Hour over the desired date/time range
  Generate a query to count the matching records
  Write the results to another table
  Commit
Loop
Note, this was actually complicated by the fact that I was generating a number of different statistics based on relationships between request/start/finish times. In other words, number waiting to start, number active, etc etc etc. I've been simplifying somewhat so far in this discussion.

I'm thinking I come at it from the other direction. Something along these lines
For every existing record
  Work out which hour it belongs to
  Update (or create) that result record
  Commit (periodically, not every iteration)
Loop
I might even build up the results in a big array so that the majority of the time I'm only reading the database, not writing. Then dump them at the end.
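The record-at-a-time bucketing above can be sketched in Python, with a dict standing in for the in-memory array of results (all names invented):

```python
from collections import defaultdict
from datetime import datetime

def hour_of(t):
    """Bucket a timestamp to the hour it belongs to."""
    return t.replace(minute=0, second=0, microsecond=0)

def per_hour(events):
    """One pass over the records, accumulating counts in memory;
    only written out (to a table, file, etc.) at the end."""
    stats = defaultdict(int)
    for t in events:
        stats[hour_of(t)] += 1
    return stats

events = [datetime(2006, 11, 1, 10, 5), datetime(2006, 11, 1, 10, 50),
          datetime(2006, 11, 1, 11, 20)]
counts = per_hour(events)
```

The appeal of this direction is exactly what the post describes: the data is read sequentially once, and the database is only written at the end.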
Nov 23 '06 #19

MMcCarthy
Expert Mod 10K+
P: 14,534
Ok, this should work but it is only a reworking of Adrian's last effort. So I can't take credit.

Damnit.

Mary

SELECT StartTime,
DCount("[Started]","Log","[Started]<=#" & [StartTime] & "# And [Ended]>=#" & [StartTime] & "#")
FROM DateTime2
GROUP BY StartTime;
Nov 23 '06 #20

MMcCarthy
Expert Mod 10K+
P: 14,534
Assumption:
You are trying to work out, by its date, who or what machine was logged on at the time that event took place - or something following this logic.

So all you really need to know is:

Find all log records prior to this StartTime: order the log by Started ascending and return only records where Started is prior to this date, then check only those records for Ended being after this date.

Or, my last attempt:

SELECT StartTime, Sum(IIf([Ended] >= StartTime, 1, 0))
FROM DateTime2, Log
WHERE [Started] <= StartTime
GROUP BY StartTime;
Nov 24 '06 #21

Expert 5K+
P: 8,435
Assumption:
You are trying to work out, by its date, who or what machine was logged on at the time that event took place - or something following this logic.
'Fraid not.

These are a bunch of log records indicating (as you're probably aware) the date/time that various tasks were requested, started and finished. I am generating stats on the number of tasks which were in various states once each hour, minute or whatever. For example:
  • At dd/mm/yyyy hh:mm, how many tasks had been requested but not yet started?
  • During some particular hour (or minute, day, month, etc), how many tasks were active at any time? Regardless of whether they started during that hour, ended, or ran right through, they need to be counted.
  • And so on.
Each of the individual questions is relatively simple to answer, and I can easily throw together a select query to return the answer for a given entry (time, or range). The tough part is generating a list showing the breakdown per hour, minute, day or whatever.
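The individual questions above can be written down as simple predicates; this is only my reading of the post (invented task tuples of requested/started/ended times, not tested against the real data):

```python
from datetime import datetime

def waiting_at(t, tasks):
    """Tasks requested but not yet started at instant t."""
    return sum(1 for req, sta, end in tasks if req <= t < sta)

def active_during(p_start, p_end, tasks):
    """Tasks active at any time during [p_start, p_end] - whether they
    started in it, ended in it, or ran right through it."""
    return sum(1 for req, sta, end in tasks if sta <= p_end and end >= p_start)

tasks = [
    (datetime(2006, 11, 1, 9), datetime(2006, 11, 1, 10, 30), datetime(2006, 11, 1, 13)),
    (datetime(2006, 11, 1, 9), datetime(2006, 11, 1, 9, 15),  datetime(2006, 11, 1, 9, 45)),
]
```

Each predicate is trivial for one instant or range; the hard part, as the post says, is evaluating them for every hour or minute in the whole span.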

Some ideas are starting to form as I discuss it. I have to leave it for now, but I'll take a copy home over the weekend and play with it. It's not really urgent - at this point the stats are really just for my own entertainment and edification, though I suspect they will have some use.
Nov 24 '06 #22

MMcCarthy
Expert Mod 10K+
P: 14,534
What's the basic log field format?


'Fraid not.

These are a bunch of log records indicating (as you're probably aware) the date/time that various tasks were requested, started and finished. I am generating stats on the number of tasks which were in various states once each hour, minute or whatever. For example:
  • At dd/mm/yyyy hh:mm, how many tasks had been requested but not yet started?
  • During some particular hour (or minute, day, month, etc), how many tasks were active at any time? Regardless of whether they started during that hour, ended, or ran right through, they need to be counted.
  • And so on.
Each of the individual questions is relatively simple to answer, and I can easily throw together a select query to return the answer for a given entry (time, or range). The tough part is generating a list showing the breakdown per hour, minute, day or whatever.

Some ideas are starting to form as I discuss it. I have to leave it for now, but I'll take a copy home over the weekend and play with it. It's not really urgent - at this point the stats are really just for my own entertainment and edification, though I suspect they will have some use.
Nov 24 '06 #23

Expert 5K+
P: 8,435
What's the basic log field format?
Well, I don't think I'm allowed to say too much about the layout. But for the purposes of these stats, I'm simply counting records. So the only relevant fields are the three I mentioned. They are called Submitted, Started and Ended. All Date/Time format.

In a sense, we've come full-circle again. Because it was this statistical thing which originally prompted me to look into the possibility of UNIONing them all together into one big "table", to make it easier to generate the stats. :)

I may yet end up doing that, and reading through it in one fell swoop rather than doing searches/counts. We'll see. On the other hand, I may even extract just the date/time fields from all the different databases and stuff them into one big table. Without the rest of the info, they may fit. All of that is fairly irrelevant at this point, anyway.

Like I said, I'll take some time over the weekend to play around with it. I hope.
Nov 24 '06 #24

NeoPa
Expert Mod 15k+
P: 31,186
Just to comment.
The performance of processing through two unconnected tables, while it may be better than including a DLookUp() within the output processing, will nevertheless be expected to be poor compared with linking two JOINed tables. Probably by some way.
If you're looking at doing this with some of your mega-tables, then you can certainly expect some delays.
I hope you find an efficient way around your problem, but I would expect it to be a non-trivial matter in terms of performance.
BTW My ideas are just worked out on theory and my understanding of how it should work. Don't take any of it as gospel. It's always possible that I've completely left out a factor which could make all the difference. Unlikely to be to such an extent that it's slower than Mary's option though - j/k ;).
Nov 24 '06 #25

Expert 5K+
P: 8,435
The performance of processing through two unconnected tables, ...
I didn't really follow that very well. But it's Friday afternoon and I'm tired. Maybe I'll re-read it at home and it'll sink in. :)

I hope you find an efficient way around your problem, but I would expect it to be a non-trivial matter in terms of performance.
I'm still thinking a sequential pass through the data is the way to go, spitting out stats as I go. Or better still, building up stats in memory and spitting them out at the end. Could mean a lot of memory, though. Let's see, 1440 minutes a day, times 365.25 days a year, that's 525,960 array elements per year. Hm... might have to set some reasonable limits and roll old stuff out of memory.

We'll see...

...Unlikely to be to such an extent that it's slower than Mary's option though - j/k ;).
Ouch!

Lucky you're a moderator.
Nov 24 '06 #26

NeoPa
Expert Mod 15k+
P: 31,186
The performance of processing through two unconnected tables, ...
I didn't really follow that very well. But it's Friday afternoon and I'm tired. Maybe I'll re-read it at home and it'll sink in.
I was referring to the code I posted earlier (post #2) which included two tables without any JOIN at all.
Essentially, this produces output records for every combination of all the records of both tables.

Assume you have two tables, each with an ID field: TableA with ID values A, B, C and D, and Table2 with ID values 1, 2 and 3.
The result set of
SELECT TableA.ID, Table2.ID
FROM TableA, Table2
would be :
A    1
A    2
A    3
B    1
B    2
B    3
C    1
C    2
C    3
D    1
D    2
D    3
This is obviously a large dataset to come out of two tables, the count of records being the product of the counts of each table.
You can restrict the output using a WHERE clause.
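The same blow-up can be demonstrated in Python - `itertools.product` is exactly this unrestricted combination of two tables:

```python
from itertools import product

table_a = ['A', 'B', 'C', 'D']
table_2 = [1, 2, 3]

# Every combination of a TableA row with a Table2 row - a cross join
# with no WHERE clause to restrict it.
rows = list(product(table_a, table_2))
```

The record count is the product of the two table sizes (here 4 × 3 = 12), which is why this approach scales so badly on large tables.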
Nov 24 '06 #27

MMcCarthy
Expert Mod 10K+
P: 14,534
It's always possible that I've completely left out a factor which could make all the difference. Unlikely to be to such an extent that it's slower than Mary's option though - j/k ;).
I'm currently planning my revenge which will be when you least expect it. Keep looking over your shoulder and shaking in your boots.

I NEVER forget!

Mary
Nov 25 '06 #28

Expert 5K+
P: 8,435
I was referring to the code I posted earlier ... the count of records being the product of the counts of each table.
Um...

It might interest you to know that I've just counted my data records. There are a bit over 55 million. The other (postulated) table was the minutes for the time range. So if we assume one year, it was around half a million. The product comes out to...

29,018,947,290,120 records.

Hm... 29 trillion records. Per year.

Or would that be 29 billion in the UK? Either way, I think it's fair to say that's a pretty large result set.

For now though, I might look into the different approach I've been thinking over.
Nov 25 '06 #29

NeoPa
Expert Mod 15k+
P: 31,186
Maaarryyy,

I'm not scared of you! :Help someone:
That quiver in my voice is simply due to the cold - really it is.

Killer,

I hear what you're saying.
That IS quite a few records.
A point to bear in mind though, that's going to be a mega operation no matter how it's coded.
I would expect a Dblah() call for every one of the 55 million records to be worse, however.
Do you fancy testing the hypothesis? (Shouldn't take long.)

That's a good point about the English Billion.
I believe it predated the American Billion but nevertheless, is very rarely quoted nowadays. Everyone seems to have adopted the American version.
Nov 25 '06 #30

MMcCarthy
Expert Mod 10K+
P: 14,534
Maaarryyy,

I'm not scared of you! :Help someone:
That quiver in my voice is simply due to the cold - really it is.
Be Afraid, Be Very Very Afraid. (HA HA HA HA - Igor tone)

That's a good point about the English Billion.
I believe it predated the American Billion but nevertheless, is very rarely quoted nowadays. Everyone seems to have adopted the American version.
What's an American Billion?

I'm so old I'm only familiar with the English version. ;)

Mary
Nov 25 '06 #31

Expert 5K+
P: 8,435
Be Afraid, Be Very Very Afraid. (HA HA HA HA - Igor tone)
What's an American Billion?
I'm so old I'm only familiar with the English version. ;)
In American...

1,000,000 = One Million (duh!)
1,000,000,000 = One billion
1,000,000,000,000 = One trillion
Nov 25 '06 #32

NeoPa
Expert Mod 15k+
P: 31,186
You and me both, Mary.
We should form an oldies club of some kind (blatantly trying to squirm onto her good side).

For anyone else who may really be confused :
English billion = 1,000,000,000,000 or million ^ 2.
American billion = 1,000,000,000 or a thousand million.
Nov 25 '06 #33

MMcCarthy
Expert Mod 10K+
P: 14,534
In American...

1,000,000 = One Million (duh!)
1,000,000,000 = One billion
1,000,000,000,000 = One trillion
That's just stupid. Sorry Yanks! :)

A million million is a billion. I remember the rhyme. So all those American Billionaires aren't really billionaires. They just made up a number so they could pretend they were.

Mary
Nov 25 '06 #34

Expert 5K+
P: 8,435
A point to bear in mind though, that's going to be a mega operation no matter how it's coded.
I would expect a Dblah() call for every one of the 55 million records would be worse, however.
Do you fancy testing the hypothesis (shouldn't take long).
:D Bwahahaha...

I'm still leaning toward the sequential run through the 55 million, building up the stats in memory as I go. Have started coding in VB6, so I'm a bit reluctant to change direction now. Especially since I can only get back to it from time to time between doing other things. Like endless conversations on thescripts, for a start. ;)

It's Saturday afternoon here, and too hot to think.

(Also, keep in mind I have to be able to process each day's new records, without taking a week for each.)
Nov 25 '06 #35

MMcCarthy
Expert Mod 10K+
P: 14,534
You and me both, Mary.
We should form an oldies club of some kind (blatantly trying to squirm onto her good side).
You think calling me an oldie is going to do that ?

Mary :)
Nov 25 '06 #36

Expert 5K+
P: 8,435
That's just stupid. Sorry Yanks! :)
There seem to be surprisingly few here. Or perhaps I just get that impression because I largely deal with you and Adrian.
A million million is a billion. I remember the rhyme. So all those American Billionaires aren't really billionaires. They just made up a number so they could pretend they were.
Well, I must admit that "billionaire" sounds better than "thousand-millionaire".
Nov 25 '06 #37

MMcCarthy
Expert Mod 10K+
P: 14,534

It's Saturday afternoon here, and too hot to think.
What have I told you about complaining about the weather.

It's been raining here ALL day. How do you think we get all that grass in the Fair Green Isle.

Mary
Nov 25 '06 #38

NeoPa
Expert Mod 15k+
P: 31,186
It's Saturday afternoon here, and too hot to think.
BITCH (and I don't care WHAT gender you are)!
It's the middle of the night here and I'm cold to the bone.
I had a bike accident in some atrocious weather last week.
...And you're suffering from too much sunshine!
Ouch.
Nov 25 '06 #39

MMcCarthy
Expert Mod 10K+
P: 14,534
I know this sounds terrible but I'm not in the humour of giving coding advice tonight. I had to run two one-to-one training sessions today on Access, but with two different projects.

I think I'm adviced out. (If either of you correct my grammer here you're DEAD)

Mary
Nov 25 '06 #40

NeoPa
Expert Mod 15k+
P: 31,186
I think I'm adviced out. (If either of you correct my grammer here you're DEAD)
Not Grammar Mary, spelling!

Ow, Ouch, Oooooowwww!

(Picture NeoPa sprinting down the road (past Pat!) with Mary running close behind (belying her age) with a rolling pin in her outstretched right hand).

(OK - I know I'm dead :( )
Nov 25 '06 #41

MMcCarthy
Expert Mod 10K+
P: 14,534
Not Grammar Mary, spelling!

Ow, Ouch, Oooooowwww!

(Picture NeoPa sprinting down the road (past Pat!) with Mary running close behind (belying her age) with a rolling pin in her outstretched right hand).

(OK - I know I'm dead :( )
Not only dead but WRONG

I meant Advice not Advise which would indicate I'd been receiving rather than imparting said Advice.

Mary (Sticking her tongue out while running down the road with the rolling pin)

How did you know I was right handed?
Nov 25 '06 #42

NeoPa
Expert Mod 15k+
P: 31,186
Sorry Mary, but I was referring to the spelling of 'Grammer' :S
Nov 25 '06 #43

MMcCarthy
Expert Mod 10K+
P: 14,534
Sorry Mary, but I was referring to the spelling of 'Grammer' :S
I HATE you! :)

I didn't even see that.

Mary

STOP Grinning, I can see you ....
Nov 25 '06 #44

Expert 5K+
P: 8,435
Well, I managed to get my stats generated from the almost-four-years of data - you know, the 55 million records. Did it in VB6 again.

The old way
The process I had used before with some success was to somewhat arbitrarily set a start and end value for the date/time range I was interested in. Then I ran a loop through that range, incrementing by one hour each time. For each time, I generated a bunch of queries to count the records which matched various rules. For example, to find the tasks which were outstanding (not started) at that point in time, I would execute
SELECT Count(<somefield>) AS TheCount
  FROM Log
  WHERE [Submitted] <= #sometime#
  AND [Started] >= #sometime#
and similarly for other combinations. Given that it was reading each record in turn, and performing a number of individual queries for each one, the performance was not too bad. But to run through all the millions of records took quite a few days.

All-new and improved :D
The new technique is to examine the data and get the highest and lowest date/time values, then set up an array and populate it with every hour between those two extremes. The fun part is that to produce stats per hour, I have... oh I forget, but tens of thousands of array elements. To do stats per minute, the array size is around two million entries. :) I'm glad I didn't attempt this back when arrays were limited to 64KB or whatever.
So, we have our array of hours, made up of a user-defined type like so...
Private Type StatsType
  StartTime As Date    ' Start of the period (hour, or whatever)
  EndTime As Date      ' End of the period (hour, or whatever)

  Submitted As Long    ' Tasks which were submitted during the period.
  Started As Long      ' Tasks which started during the period.
  Ended As Long        ' Tasks which ended during the period.
  Waiting As Long      ' Outstanding tasks (snapshot, start of period)
  InProgress As Long   ' Tasks in progress (snapshot, start of period)
  Active As Long       ' Total tasks active any time over the period
End Type
Populating the array initially is done like so...
StatsCount = DateDiff("h", MinDate, MaxDate)
ReDim Stats(Zero To StatsCount)
Stats(Zero).StartTime = MinDate
Stats(Zero).EndTime = DateAdd("s", 3599, MinDate)
For I = One To StatsCount
  With Stats(I)
    .StartTime = DateAdd("h", One, Stats(I - One).StartTime)
    .EndTime = DateAdd("s", 3599, .StartTime)
  End With
Next
Then I scan through the data once, using a UNION query I created in Access, containing just the three relevant date/time fields. For each record, I determine the first and last entries in the array which could be affected. Then...
Stats(Entry_Sub).Submitted = Stats(Entry_Sub).Submitted + One
Stats(Entry_Sta).Started = Stats(Entry_Sta).Started + One
Stats(Entry_End).Ended = Stats(Entry_End).Ended + One
For I = Entry_Sub To Entry_End
  With Stats(I)
    If Value_Sub <= .StartTime And Value_Sta > .StartTime Then
      .Waiting = .Waiting + One
    End If
    If Value_Sta <= .StartTime And Value_End >= .StartTime Then
      .InProgress = .InProgress + One
    End If
    If Value_Sta <= .EndTime And Value_End >= .StartTime Then
      .Active = .Active + One
    End If
  End With
Next
I hope this is making some kind of sense.

Anyway, the upshot is that I am producing what look like reasonable stats (still to be verified), and the entire 55 million records took about 34 minutes to process. And that's on my home PC, which is a bit slower than the one at work.

For now I'm just writing it out to a tab-delimited text file, and will eventually import the file to Access. What I still have to decide is how I'm going to generate the stats on an ongoing basis for each day's new data. Oh well, plenty of time for that. I probably won't have a chance to look into it until I'm on holidays. Only a couple of weeks left... :)
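For what it's worth, the VB6 approach above translates fairly directly into Python. This is only a sketch: the field names Submitted/Started/Ended are from the post, everything else (function names, sample records) is invented:

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)

def floor_hour(t):
    return t.replace(minute=0, second=0, microsecond=0)

def build_stats(records):
    """records: (Submitted, Started, Ended) tuples; one pass, stats in memory."""
    lo = floor_hour(min(r[0] for r in records))   # earliest Submitted
    hi = floor_hour(max(r[2] for r in records))   # latest Ended
    n = int((hi - lo) / HOUR)
    stats = [{'StartTime': lo + i * HOUR, 'Submitted': 0, 'Started': 0,
              'Ended': 0, 'Waiting': 0, 'InProgress': 0, 'Active': 0}
             for i in range(n + 1)]
    for sub, sta, end in records:
        i_sub = int((floor_hour(sub) - lo) / HOUR)
        i_sta = int((floor_hour(sta) - lo) / HOUR)
        i_end = int((floor_hour(end) - lo) / HOUR)
        stats[i_sub]['Submitted'] += 1
        stats[i_sta]['Started'] += 1
        stats[i_end]['Ended'] += 1
        # Only the buckets between submission and completion can be affected.
        for i in range(i_sub, i_end + 1):
            t0 = stats[i]['StartTime']             # start of the hour
            t1 = t0 + HOUR - timedelta(seconds=1)  # end of the hour
            if sub <= t0 < sta:
                stats[i]['Waiting'] += 1           # snapshot: submitted, not started
            if sta <= t0 <= end:
                stats[i]['InProgress'] += 1        # snapshot at start of hour
            if sta <= t1 and end >= t0:
                stats[i]['Active'] += 1            # overlapped the hour at all
    return stats

records = [
    (datetime(2006, 11, 1, 10, 15), datetime(2006, 11, 1, 10, 45), datetime(2006, 11, 1, 12, 30)),
    (datetime(2006, 11, 1, 10, 30), datetime(2006, 11, 1, 11, 20), datetime(2006, 11, 1, 11, 40)),
]
stats = build_stats(records)
```

The three If-tests mirror the Waiting/InProgress/Active conditions in the VB6 loop; like the original, each record touches only the buckets from its submission hour to its completion hour.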
Nov 26 '06 #45
