On Thu, 15 Jan 2004 00:04:52 GMT,
dXXXfenton@bway.net.invalid (David
W. Fenton) wrote in comp.databases.ms-access:
[color=blue][color=green]
>>Actually, you're leaving out the third option, which is that pages
>>can be updated.[/color]
>
>Well, I'm assuming certain things about the PK index (which is the
>one that matters in this scenario):
>
>1. it's an autonumber.
>
>2. it can't be updated, because of 1).[/color]
Ah, I was not at all assuming (1), since it is certainly not an
inherent characteristic of a pk, but ok, let's go with that.
[color=blue]
>So, the index page is not updated for existing records, unless they
>are deleted.[/color]
OK, and that's another assumption I wasn't making (ie, no deletes
occur). This also seems to make the example a more specialized case,
since pk's must certainly expect and handle record deletions, but if
this is the example that we are to work with, then ok, so be it. An
autonumber field for the pk, and no record deletions. I'm with you
so far.
[color=blue]
>I don't know how index pages work, but data pages
>don't delete the data, just mark it deleted.[/color]
This is not true. I know I've posted on this many times, but to recap
quickly, Jet doesn't simply flag deleted data like some other desktop
dbms's. Jet copies the first deleted record on a data page over all
other deleted records on a data page. This destroys all but the first
deleted record on each page. If only one record fit on each 4kb data
page, then undeletion would be trivial, and what you say would be
true. But typically, several dozen or more records exist on a given
data page, so assuming just 20 recs per page, undeletion can only
bring back 5% of the original data (ie, one in every twenty records).
Its true that record deletion does not result in reduction in size of
the database, but it very much does result in deletion (or erasure) of
the vast majority of the deleted records.
Again, this is trivial to test. Create a table. Add a dozen records
with distinct text values. Close the db and open it in a hex editor
(or a text editor if no hex editor is available). The file contents
will look like gibberish, but that's ok. Search for the first text
value entered. You'll then see the other eleven close by, with some
misc bytes amongst them. Now close the editor, open the db, delete
the twelve records, close the db, and re-open it in the editor.
Again, search for the first value. You will find it, but you will
also find that the eleven other text values are gone, and that this
first deleted value is repeated twelve times.
So there's proof of what I'm saying.
Now consider why Msft does this. It takes time to write the first
deleted record over the other deleted records, so it is not for
efficiency's sake. It doesn't provide security (if that were the
goal, the page would be erased rater than some data duplicated). It
also prevents undeletion (but not because security is a priority!).
In other words, it's stupid. But that's how it is.
[color=blue]
>I'd assume that for
>performance purposes, there'd be something similar.[/color]
Since the former is false, the latter assumption doesn't follow.
[color=blue]
>An update, then, could occur only on an insert. But if you start
>with a compacted file, the index pages are filled, no? And,
>therefore, there's no room to add new data.[/color]
[color=blue]
>Or are index pages written differently, for example, new pages are
>allocated when only 1/2 filled, leaving plenty of room for
>additions to that leaf of the tree?[/color]
Remember that there are leaves and branches (or at least there are if
there are enough records in the table to justify branches). Even with
a compacted database, it is not required that there are one or fewer
partially full branch pages, or leaf pages.
[color=blue]
>I can't quite see, though, how this could be with an Autonumber,
>though, unless it's random.[/color]
AFAIK, indexes doesn't treat autonumbers as a special case. They are
simply long integer pk fields. Autonumber fields only have special
meaning when it comes to populating them for the first time, or
attempting to update them.
[color=blue][color=green]
>>Huh? While this is in fact not how things work (in my
>>understanding) it doesn't follow that (a) fragmentation poses a
>>problem or (b) incomplete index (I take it that's what you mean
>>when you say data) pages do either.[/color]
>
>Perhaps I wasn't clear in what I was talking about, Autonumber
>primary keys, exclusively, as this is what really has the greatest
>impact in an application where you're retrieving single records by
>PK.[/color]
I'm not sure what you mean here. Whether a pk is an autonumber field
or not really has no impact on pk maintenance, storage, etc. I was
assuming we were talking about pulling records solely by pk usage, but
as I mentioned above, I was not assuming the pk must be an autonumber
field, but then it is irrelevant whether the pk field is autonumber or
not for index purposes.
[color=blue]
>If you're retrieving by other data, the PK index is involved
>secondarily, since it's what's used to find the data pages, right?[/color]
Well, in a sense, but not really. If you pulled records by values in
other fields, then those fields were either indexed or not. If they
were indexed, then there indexes are sued INSTEAD of the pk index, but
if these other fields were not indexed at all, then the retrieval uses
the pk index since it is the primary means to navigate the table's
data. So in these other cases, either the pk isn't used, or it is
used as the primary index. Its only really used as a secondary index
if both pk and non-pk-but-otherwise-indexed fields are used for
identifying records to be retrieved, and only then under certain
circumstances.
[color=blue]
>I have always assumed that non-PK indexes are keyed not to the
>original data pages but to the PK index -- perhaps that was a wrong
>assumption?[/color]
They are keyed to the pk, but remember that the pk has its values
stored twice, once in the actual data pages and once in the index
pages. This, incidentally, is why it is usually ill-advised to
construct a multi-text-field pk, as it results in a lo of extra
overhead and duplication.
[color=blue][color=green][color=darkred]
>>>I have always assumed (perhaps wrongly) that additions to the
>>>index are in new data pages that are only merged into the full
>>>index tree at compact.[/color]
>>
>>Indexes are not required to fill their page, and indexing works
>>even for non-compacted databases. Think about what you are
>>saying. You are suggesting that not only is the index not optimal
>>after data entry, but that it is largely unused (for newly added
>>data). Of course, whatever cache you are assuming is used for
>>indexing this newly added data would have some structure
>>presumably, but not the advantages of a full tree structure that
>>exists for data added prior to the last compact. . . .[/color]
>
>I would have assumed two tree structures, just as we have for data
>pages.[/color]
Huh? Tree structures exist for indexes. There is no reason why they
should exist for data pages.
[color=blue][color=green]
>> . . . What benefit
>>would such a system have? Well, the number one, far and away,
>>benefit is that appending new records requires no significant
>>performance hit, because the tree is only adjusted at compact.
>>Try this out. Add 10,000 records to a 100,000 record indexed
>>table. These 10,000 recs should have pk values that fall
>>throughout the existing range to really make the case clearer. . .[/color]
>
>But that will *not* be the case with the implicit scenario I was
>thinking of, unless your Autonumber PK is random.[/color]
Yes, we were not on the same page (no pun intended) when I wrote that.
[color=blue]
> I have
>occasionally thought that a random Autonumber PK could improve
>concurrency.[/color]
This is precisely so. Look at replication identifiers. They are
large and random.
[color=blue][color=green]
>> . . . If
>>you were right, this should not take much longer than adding them
>>to a table with no records, or to one with 100K existing records,
>>but no pk. Of course, the results will show otherwise.[/color]
>
>I'm not sure how your answer applies to the scenario I was thinking
>of.
>
>You should learn to read my mind! :)[/color]
It just shows that this index-cache idea isn't really applicable to
Jet.
[color=blue][color=green]
>>I'm not knocking the idea of caching new data entry and postponing
>>index updates - it is a technique used in many rdbms and can work
>>well. But I am telling you that in general, Jet has a tighter
>>integration of data-entry and index updates. Index optimization
>>is a benefit of compacting, but indexes are updated at data entry
>>time, and even if you never compact your file, index pages are
>>updated and written all the time.[/color]
>
>Can you draw this out for three kinds of indexes:
>
>1. non-random Autonumber PK.
>
>2. alphanumeric non-PK index (say, on LastName).
>
>3. random Autonumber PK.[/color]
Well, the three are treated similarly. Remember that Jet doesn't
anticipate pk value usage. It simply works with what it gets.
Numeric pks work better than text ones (in my understanding), but
autonumber or non-autonumber is beside the point, as is random or
non-random.
[color=blue]
>I'm working from my experience with how quickly and efficiently Jet
>manages to retrieve small sets of records from relatively large
>data tables. I'm trying to figure out why that should be,[/color]
Lets assume that 20 records fit on a data page (this is a low
estimate) and that there are one million records in a table (this is a
large estimate), and that we wish to pull one randomly chosen record
by its id. Since index pages store more pointers than data pages
store data, let's assume that there are 100 pk values stored per index
page (this is a very low estimate). Each estimate we've made has made
the task much more difficult than would normally be the case. How
many index and data pages does the jet engine need to pull over the
wire to get this record? Four. That's 16kb. Two branches, a leaf
and a data page. That's pretty easy stuff.
[color=blue]
>and I can
>only assuming that caching of the indexes is why it works so well.[/color]
It would be lightning fast without any cache, be it within jet or at
the o/s level.
Peter Miller
__________________________________________________ __________
PK Solutions -- Data Recovery for Microsoft Access/Jet/SQL
Free quotes, Guaranteed lowest prices and best results
www.pksolutions.com 1.866.FILE.FIX 1.760.476.9051