
Importance of primary keys in UNION

Expert 5K+
P: 8,434
Hi all.

As has been mentioned here recently, I have a front-end database which just includes some queries and a bunch of (rather large) linked tables. For reasons of (very poor) performance, I had to do quite a bit of fiddling around to get my VBA code to build queries which would query each table individually, then UNION the results. Doing a single UNION query produced performance I probably could have bettered with a pencil and paper.
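To give an idea of what the code builds, here's a simplified sketch (the table names tblLog2003/tblLog2004 and the field RefNo are made up for the example; my real code is more involved). Each SELECT carries its own WHERE clause, so Jet can use the index on each linked table before the rows are combined. I've shown UNION ALL, which also skips the duplicate-removal sort that plain UNION performs.

Public Function BuildUnionSQL(ByVal strCriteria As String) As String
    ' Build the "filter each year's table, then UNION the results" SQL.
    Dim varYear As Variant
    Dim strSQL As String

    For Each varYear In Array(2003, 2004, 2005)
        If Len(strSQL) > 0 Then strSQL = strSQL & " UNION ALL "
        strSQL = strSQL & "SELECT * FROM tblLog" & varYear & _
                 " WHERE " & strCriteria
    Next varYear

    BuildUnionSQL = strSQL & ";"
End Function

' e.g. BuildUnionSQL("RefNo = 'ABC123'")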

Something mentioned in this forum (by either NeoPa or mmccarthy, I think) inspired me to do some experimentation (which I had previously been putting off due to time pressure). This has led to the conclusion that the performance of a UNION query is hugely improved if the tables involved have primary keys. This may not come as a surprise to you, but I just hadn't had time to check it out.

My big tables don't have a primary key because (a) there isn't room (they're right up against the 2GB limit) and (b) it wasn't necessary when working with one table at a time - performance was quite acceptable. I've only just moved to the split design to try and consolidate everything into a single interface.

What I'd like to know is, have I stuffed up my testing, or am I really likely to see an enormous speed improvement when doing simple queries against a bunch of "unioned" tables by adding a primary key to the tables? (The primary key that I added was totally unrelated to my test queries or to the real ones I'll be using.)
Oct 31 '06 #1
21 Replies


PEB
Expert 100+
P: 1,418
The PK is very important, and take care to use only a Long Integer PK! Index the fields you search on!

It's very fast afterwards!

:)

No PK - no hope for fast performance!
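For example, something like this, run from inside the back-end database itself - you can't ALTER a linked table from the front end (tblLog is just a placeholder name):

' COUNTER is Jet's AutoNumber type - a long integer that Access maintains for you.
CurrentDb.Execute _
    "ALTER TABLE tblLog ADD COLUMN ID COUNTER CONSTRAINT PK_tblLog PRIMARY KEY;", _
    dbFailOnError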
Oct 31 '06 #2

MMcCarthy
Expert Mod 10K+
P: 14,534

What I'd like to know is, have I stuffed up my testing, or am I really likely to see an enormous speed improvement when doing simple queries against a bunch of "unioned" tables by adding a primary key to the tables? (The primary key that I added was totally unrelated to my test queries or to the real ones I'll be using.)
You'll definitely see an improvement, as PEB says. In cases where it's just not possible to add a primary key, try to index at least one field in the table. Even an indexed field with duplicates should give some improvement.
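For example (table and field names are only placeholders), a plain non-unique index takes a single statement run in the back-end:

' No UNIQUE keyword, so duplicates are still allowed.
CurrentDb.Execute "CREATE INDEX idxRefNo ON tblLog (RefNo);", dbFailOnError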
Oct 31 '06 #3

NeoPa
Expert Mod 15k+
P: 31,442
I'm open to correction here but I'm quite happy this is along the right tracks.
Firstly, FYI, The size limit of Access 2K and beyond is 4GB. 97 was the last one limited to 2GB.
Beware though, a Unicode feature was installed for text fields which can double your text data unless you use 'Unicode Compression', which should be safe for most databases in the English speaking world.
Indices - what a subject!
When an RDBMS (Access for instance) processes a recordset, it will look for an index that will help the processing it's currently on. So, if you run
SELECT [CName], [Surname], [DoB]
FROM [tblStaff]
ORDER BY [Surname]
then having an index based on [Surname] will enable that to be used.
If there is no index that will help, it will have to sort the data first before use.
This is where the small PK comes in
The PK is very important, and take care to use only a Long Integer PK! Index the fields you search on!
If you can't provide keys for all the various queries that use the table, then those that don't have keys will sort. And here's the nub: if there is a PK then it will sort via that - if there's no PK at all, it needs to create a temp table and sort the whole dataset :(.

Having said that, my recommendation would be to use the PK for your main unique index - though there is clearly something to be said for PEB's approach, especially where you need to process the recordset from many different directions in different queries etc - as it were.
Oct 31 '06 #4

Expert 5K+
P: 8,434
Hi all.

Don't have time to read and respond to everything just yet, but I wanted to make a couple of points.
  • I'm running Access 2003, but my databases are in Access 2000 format in case I need to work on them at home.
  • I have run into the 2GB limit, plenty of times. The last time was probably less than a fortnight ago. If I remember correctly, you get an "Invalid command" message, or something similar. It doesn't give any reasonable indication as to what's wrong.
  • All my search fields are indexed (allowing duplicates)
  • Performance is fine as long as I'm dealing with each 2GB database individually. They're around 12-15 million records each, and searching is nice and quick (couple of seconds). Once I UNION a couple of them, and do a simple search on a field which is indexed in both, it takes hours.
  • The "PK" I added was not integer, and had nothing directly to do with the search. What makes the difference appears to be simply whether there is a PK.
NeoPa, I'll read your lengthy message some time today (it's morning here now).
Oct 31 '06 #5

Expert 5K+
P: 8,434
I'm open to correction here but I'm quite happy this is along the right tracks.
Firstly, FYI, The size limit of Access 2K and beyond is 4GB. 97 was the last one limited to 2GB.
Beware though, a Unicode feature was installed for text fields which can double your text data unless you use 'Unicode Compression', which should be safe for most databases in the English speaking world.
Indices - what a subject!
When an RDBMS (Access for instance) processes a recordset, it will look for an index that will help the processing it's currently on. So, if you run
SELECT [CName], [Surname], [DoB]
FROM [tblStaff]
ORDER BY [Surname]
then having an index based on [Surname] will enable that to be used.
If there is no index that will help, it will have to sort the data first before use.
This is where the small PK comes in

If you can't provide keys for all the various queries that use the table, then those that don't have keys will sort. And here's the nub: if there is a PK then it will sort via that - if there's no PK at all, it needs to create a temp table and sort the whole dataset :(.

Having said that, my recommendation would be to use the PK for your main unique index - though there is clearly something to be said for PEB's approach, especially where you need to process the recordset from many different directions in different queries etc - as it were.
Phew! Finally found the time to read all of this. Here's my response...
  • All of the fields I'm likely to want to search on are indexed (none unique). The indexes are exactly the same in all the databases. In fact, apart from the data, the DBs are identical.
  • I don't have any unique field on the records. Used to have a PK called ID (automatically added by Access, I think) but removed it to save space.
  • As I said, performance was wonderful on all (alright, most) queries until I started "unioning" the databases.
  • Some of the databases had the records in a different sequence to what I really prefer, so the new PK I'm adding is made up of a couple of fields and returns them in the preferred order.
  • It is obviously not doing any major amount of sorting now that my sample database has a PK, as the response is back to a second or two, and I'm searching (and returning) millions of records.
  • As for creating a temp table and so on, I can believe it may have been doing that when I didn't have the PK, based on the ridiculous length of time and amount of disk activity.
  • Anyone know why Access can't make use of the keys on the unioned tables? It's a very simple structure, and query.
I'll go ahead and add a PK to all my tables (which will involve splitting each into two parts, since they are up against the size limit). Should be able to tell you in a few days whether it worked. Not much to do on my part, but dealing with these volumes, Access tends to chug along all day in the background. :) I've had a delete query running for 3-4 hours now, and it's probably around 70-80% finished.
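For the record, the DDL I'll be running is roughly this - EventDate and EventSeq are stand-ins for the real pair of fields that give the preferred order, and tblLog2004 for the real table name:

CurrentDb.Execute _
    "ALTER TABLE tblLog2004 ADD CONSTRAINT PK_tblLog2004 " & _
    "PRIMARY KEY (EventDate, EventSeq);", dbFailOnError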
Nov 1 '06 #6

MMcCarthy
Expert Mod 10K+
P: 14,534
  • Anyone know why Access can't make use of the keys on the unioned tables? It's a very simple structure, and query.
Essentially, Access treats each index correctly when it's on an individual table or query.

The minute you perform a UNION, Access can't reconcile the indexes in the various sets and pretty much ignores them.

One solution to your problem which I have used in the past is to create a VBA routine to take in each of the tables from the various databases and to create a new recordset based on a new table structure and run a DoCmd.RunSQL INSERT query to append each record. The new table would of course have an autonumber primary key.

It works much faster than the UNION query and your only problem is to work out how to dynamically program it to move through the ODBC and recordset connections. I used a table with each database path and table name as field values.
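Roughly what I have in mind is below - just a sketch, so the driving table tblSources (DBPath and TableName fields), the target tblCombined and the RefNo/LogText fields are examples, not your actual structure. I've used Execute rather than DoCmd.RunSQL here simply to avoid the confirmation prompts.

Public Sub ConsolidateTables()
    Dim dbLocal As DAO.Database
    Dim dbRemote As DAO.Database
    Dim rsList As DAO.Recordset
    Dim rsSource As DAO.Recordset

    Set dbLocal = CurrentDb
    Set rsList = dbLocal.OpenRecordset("tblSources", dbOpenSnapshot)

    Do Until rsList.EOF
        ' Open each back-end directly rather than through linked tables.
        Set dbRemote = DBEngine.OpenDatabase(rsList!DBPath)
        Set rsSource = dbRemote.OpenRecordset(rsList!TableName, dbOpenForwardOnly)

        Do Until rsSource.EOF
            ' Append one record at a time; tblCombined has an AutoNumber PK.
            dbLocal.Execute "INSERT INTO tblCombined (RefNo, LogText) VALUES ('" & _
                rsSource!RefNo & "', '" & Replace(Nz(rsSource!LogText), "'", "''") & "');", _
                dbFailOnError
            rsSource.MoveNext
        Loop

        rsSource.Close
        dbRemote.Close
        rsList.MoveNext
    Loop

    rsList.Close
End Sub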

Any questions on this approach let me know.

Mary
Nov 1 '06 #7

MMcCarthy
Expert Mod 10K+
P: 14,534
BTW


I did this using DAO rather than ADO, which gives problems when changing connections during a process.
Nov 1 '06 #8

Expert 5K+
P: 8,434
Essentially, Access treats each index correctly when it's on an individual table or query.

The minute you perform a UNION, Access can't reconcile the indexes in the various sets and pretty much ignores them.

One solution to your problem which I have used in the past is to create a VBA routine to take in each of the tables from the various databases and to create a new recordset based on a new table structure and run a DoCmd.RunSQL INSERT query to append each record. The new table would of course have an autonumber primary key.

It works much faster than the UNION query and your only problem is to work out how to dynamically program it to move through the ODBC and recordset connections. I used a table with each database path and table name as field values.

Any questions on this approach let me know.

Mary
Um...

As far as I can see, my main problem would still be the same - the data is too big to fit in one table. So how does this help?

Anyway, I don't really need an alternative method that badly, now. I have the single table from each of the databases linked to the front-end. It comes down to two choices. If a PK provides acceptable performance, I'll go back to using a UNION query which connects up the data for all of the years (it's one year per database). If not, I'll just have to union the results of the queries on each database. I'm doing that now, but don't like it.
Nov 1 '06 #9

MMcCarthy
Expert Mod 10K+
P: 14,534
Um...

As far as I can see, my main problem would still be the same - the data is too big to fit in one table. So how does this help?
Using this method you never actually link the tables or run any queries, so the only thing you have is the new table. How many records are we talking about here? I've done it with a couple of million.
Nov 1 '06 #10

Expert 5K+
P: 8,434
Using this method you never actually link the tables or run any queries, so the only thing you have is the new table. How many records are we talking about here? I've done it with a couple of million.
I'm afraid you've lost me completely as to what "this method" is.

I have a separate database for each year's data, with typically 12 to 15 million records, each database approaching 2GB in size (lucky coincidence). Each consists of a single table. They have identical structure and indexes. It's simply the same database copied, cleaned out and repopulated for each year.

I'd like to have a front-end database which just includes a union query to pull them all together, then play with that. It's not all that important - I can just continue to use them separately, but it's sometimes a little inconvenient.
Nov 1 '06 #11

NeoPa
Expert Mod 15k+
P: 31,442
I posted this elsewhere, but it really should have come in here.
Killer,

You should seriously consider porting your back end database to SQL Server 2005.
I believe the free version is now much more flexible than the 2K MSDE.
It will handle your databases (/tables) without strain.
Unfortunately, the SQL Server forums are not as active as these here in the Access section (No MM of course).

PS. Sorry about the 2GB limit - I thought it had gone from 2GB to 4GB. It must have been from 1GB to 2GB.
Nov 1 '06 #12

MMcCarthy
Expert Mod 10K+
P: 14,534
I'm afraid you've lost me completely as to what "this method" is.

I have a separate database for each year's data, with typically 12 to 15 million records, each database approaching 2GB in size (lucky coincidence). Each consists of a single table. They have identical structure and indexes. It's simply the same database copied, cleaned out and repopulated for each year.

I'd like to have a front-end database which just includes a union query to pull them all together, then play with that. It's not all that important - I can just continue to use them separately, but it's sometimes a little inconvenient.
Sorry Killer

I didn't realise how many records you were talking about.
Nov 1 '06 #13

Expert 5K+
P: 8,434
I didn't realise how many records you were talking about.
Just for the record, the error message when Access tries to go beyond 2GB was "Invalid argument". Why do I bring this up now? I'll give you one guess... :(
Nov 3 '06 #14

NeoPa
Expert Mod 15k+
P: 31,442
I don't think Access checks whether or not it's reached its ceiling - it just crashes (apparently randomly) wherever its internal code fails :(.
Nov 3 '06 #15

PEB
Expert 100+
P: 1,418
Hi NeoPa
That's interesting, what you mention about Access 2K supporting 4GB!
In fact, my source of information about the 2GB limit is the Access Specifications in the Access documentation...

Really, I think it isn't logical to limit an Access database file to 2GB if you use NTFS instead of FAT32...

But if Microsoft says.....

:)

I'm open to correction here but I'm quite happy this is along the right tracks.
Firstly, FYI, The size limit of Access 2K and beyond is 4GB. 97 was the last one limited to 2GB.
Beware though, a Unicode feature was installed for text fields which can double your text data unless you use 'Unicode Compression', which should be safe for most databases in the English speaking world.
Indices - what a subject!
When an RDBMS (Access for instance) processes a recordset, it will look for an index that will help the processing it's currently on. So, if you run
SELECT [CName], [Surname], [DoB]
FROM [tblStaff]
ORDER BY [Surname]
then having an index based on [Surname] will enable that to be used.
If there is no index that will help, it will have to sort the data first before use.
This is where the small PK comes in

If you can't provide keys for all the various queries that use the table, then those that don't have keys will sort. And here's the nub: if there is a PK then it will sort via that - if there's no PK at all, it needs to create a temp table and sort the whole dataset :(.

Having said that, my recommendation would be to use the PK for your main unique index - though there is clearly something to be said for PEB's approach, especially where you need to process the recordset from many different directions in different queries etc - as it were.
Nov 4 '06 #16

NeoPa
Expert Mod 15k+
P: 31,442
I think I was actually mistaken there PEB.
I think it's 2GB as Killer42 says.
2GB & 4GB are natural limits though, if you think about the size of the Intel 32-bit word.
It can only count from -2GB to +2GB (if signed) or from 0 to 4GB if unsigned.
MS SQL Server obviously uses extra registers to handle addressing, rather than a simple address in one register.
64-bit processors can handle a little more ;) (4GB squared - quite big).
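To put numbers on that: 2^31 = 2,147,483,648 bytes is the 2GB signed ceiling, 2^32 = 4,294,967,296 bytes is 4GB unsigned, and a 64-bit word gives 2^64 ≈ 1.8 × 10^19 bytes - roughly 16 exabytes, which is indeed '4GB squared'.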
Nov 4 '06 #17

PEB
Expert 100+
P: 1,418
I've never hit the 2GB, man! In fact, I create databases that use IDs as numbers... The strings are located in a few tables and aren't repeated...

So I have databases with more than 2 billion records, but in size they are still only 200 MB

And MS Access works well...

Because years ago I used Strings as primary keys, and at 300-400 thousand records Access was terrible to use... mind you, it was an earlier version - Access 2.0 - but it was a problem... :)
Nov 4 '06 #18

PEB
Expert 100+
P: 1,418
Maybe you can do it, but using SQL Server or Oracle as the database...

As the front-end application, Access may be able to handle the information...

If not...

This is for strategic goals, so something could be developed on SQL Server or Oracle to treat this information...

But really, if there is no problem with memory, a UNION query based on the linked tables can be created... Not sure whether it will work, or how long it will take ;)

:)

I'm afraid you've lost me completely as to what "this method" is.

I have a separate database for each year's data, with typically 12 to 15 million records, each database approaching 2GB in size (lucky coincidence). Each consists of a single table. They have identical structure and indexes. It's simply the same database copied, cleaned out and repopulated for each year.

I'd like to have a front-end database which just includes a union query to pull them all together, then play with that. It's not all that important - I can just continue to use them separately, but it's sometimes a little inconvenient.
Nov 5 '06 #19

Expert 5K+
P: 8,434
...But really, if there is no problem with memory, a UNION query based on the linked tables can be created... Not sure whether it will work, or how long it will take ;)
I should be able to tell you in a couple of days. I'm re-copying all my data to fresh new databases which are a copy of a template, to ensure everything is 100% identical - that is, keys and everything.

Then I'll plug them into the big UNION query and see what happens.

This is more for my own info than anything else I suppose, since I have my queries working alright now by VBA code building the appropriate string to query multiple tables, then UNION the results.

The reason for the data size is that this is a yearly archive of a log of things that happen very frequently (tens of thousands of times per day). There is some textual info included in there, which I need to be able to filter on, so there's not a lot I can do about the size.

I have tried pulling the text, which is a bit repetitive, into a separate table (or even database) and linking with a long integer key. The minor size improvement was not worth the hassles.

Anyway, as I said it's really no biggy for now - things are working well enough and in a couple of days I'll see how well I can get the big UNION to perform.
Nov 5 '06 #20

P: 1
I'm sure this is a dead topic, but in the process of killing time I ended up reading the entire thread and thought I'd throw out a couple of random ideas that might have helped.

Have the following techniques been considered:
A:
Add a COUNTER numeric primary key (as mentioned, a Long Integer will probably suffice); ignoring overhead, the 4-byte field * 15 million records would only add roughly 60MB to the DB size.

In code, loop over your DBs selecting ONLY those primary key numbers into a temporary array variable in your application, and then use the result array to grab the full combined row result set via a WHERE primary_key IN (array-values) clause.

-This does hit the databases twice, so it warrants some testing (the performance effects may vary considerably depending on the average rows/result-subset ratios, complexity of the where & order clauses, Access's query optimization engine, etc.); but the second query should be very quick (hopefully approaching the limit of memory access/read speed), being based solely on the primary (unique & indexed) keys. You also gain considerable control & flexibility on the application side, and may even be able to identify specific DBs in your set that are particular performance problems and the reason for the speed discrepancy, and make problem-DB-specific index/query adjustments or even break up the DBs differently to get better overall performance.
-Taking it further (assuming access to the insert/update/delete interface), with this new performance meta-information an advanced solution might actually track the destination DB names in the counter-generating (global primary key) DB and intelligently decide, based on the submitted data and/or historical speed statistics, which of several active DBs to insert each new row into, to ensure that the separate DBs each contain data that is conducive to search speed.
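A rough VBA/DAO sketch of the two-phase fetch in suggestion A (every name here - tblLog2004, the ID key, etc. - is invented purely for illustration; in practice you'd loop it over each year's back-end):

Public Function FetchByKeys(ByVal strCriteria As String) As DAO.Recordset
    Dim db As DAO.Database
    Dim rsKeys As DAO.Recordset
    Dim strKeys As String

    Set db = CurrentDb

    ' Phase 1: pull back ONLY the matching primary-key values.
    Set rsKeys = db.OpenRecordset( _
        "SELECT ID FROM tblLog2004 WHERE " & strCriteria, dbOpenForwardOnly)
    Do Until rsKeys.EOF
        If Len(strKeys) > 0 Then strKeys = strKeys & ","
        strKeys = strKeys & rsKeys!ID
        rsKeys.MoveNext
    Loop
    rsKeys.Close

    If Len(strKeys) = 0 Then Exit Function   ' nothing matched

    ' Phase 2: fetch the full rows purely by primary key.
    Set FetchByKeys = db.OpenRecordset( _
        "SELECT * FROM tblLog2004 WHERE ID IN (" & strKeys & ");")
End Function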

B:
If you have access to the insert/update side of the application, consider slightly altering the DB structure, such as creating a DB/table with JUST your 'searched-on' fields & using numeric keys to represent highly repeated values. You could essentially create a much smaller & faster 'search/hint' DB/table that is updated whenever an insert/update/delete is made on another DB; your search queries then need only search that highly efficient conglomerate table, find the keys, and use them to pull from the 'bulkier' DBs, by primary key, only the rows that matched the search params.

-This option can completely eliminate your table scans & sorts of the massive DBs and replace them with a single search on a single 'purposely optimized' table. The sacrifice is slightly slower inserts/updates/deletes, but this is usually preferred, as most DB-based applications have a very high read-to-write ratio. (And of course it's significantly more work than suggestion A.)
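And a bare-bones sketch of suggestion B's 'search/hint' table (again, every name and type here is hypothetical):

' Just the searched-on fields plus enough to find the source row again.
CurrentDb.Execute _
    "CREATE TABLE tblSearchHint (" & _
    "SourceYear LONG, SourceID LONG, RefNo TEXT(20), CategoryID LONG, " & _
    "CONSTRAINT PK_Hint PRIMARY KEY (SourceYear, SourceID));", dbFailOnError
CurrentDb.Execute "CREATE INDEX idxHintRefNo ON tblSearchHint (RefNo);", dbFailOnError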

I guess to condense that novel into a single sentence: is normalizing your DB (or DB of your DBs) and/or creating a foreign-key relationship or two not possible? Reducing the data exchange/overhead of your current setup would surely help.
Jan 18 '07 #21

Expert 5K+
P: 8,434
Hi junke101.

Thanks for that. What can I say but... wow! :)

This sounds like an excellent idea, at least to try. However, I did eventually resolve (with lots of help from TheScripts) the performance problems in my big database, so it will have to go under the category of "interesting ideas to try out when you can find time".

Would love to try it, but can't really justify it for now. :(

I ended up with my database split up by year - the data for 2003 in one database, 2004 in another, and so on. The code behind the search form dynamically builds a UNION query which provides quite reasonable performance. It depends a lot on what criteria are used, of course.

I never did actually manage to get the big UNION of everything to work acceptably. The SQL that my code builds actually queries the appropriate years, and UNIONs the results. I was quite keen to do it without creating any temp tables and so on, though I did end up having to store a querydef named for the user who ran the search.
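For anyone curious, the querydef part looks roughly like this (a sketch rather than my exact code - the qrySearch_ prefix is just an example):

Public Sub SaveUserQuery(ByVal strSQL As String)
    Dim db As DAO.Database
    Dim strName As String

    Set db = CurrentDb
    strName = "qrySearch_" & Environ("USERNAME")   ' one saved query per user

    On Error Resume Next        ' ignore "not found" if this is the first run
    db.QueryDefs.Delete strName
    On Error GoTo 0

    db.CreateQueryDef strName, strSQL
End Sub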

Still, thanks a lot for the input. And of course, one of the nice things is that it will "always" be there, for anyone else who comes along and searches for something similar.
Jan 18 '07 #22
