469,636 Members | 1,527 Online
Bytes | Developer Community
New Post

Home Posts Topics Members FAQ

Post your question to a community of 469,636 developers. It's quick & easy.

Concurrency issues with "Tree" structures.

Hi,

I'm currently implementing a database with a tree structure in a table. The
nodes in the tree are stored as records with a column called "Parent". The
root of the tree has a "NULL" parent. The path to each node is stored in
the column "Path" and is of the form "\000001\000002\000003\" etc. The
latter enabling me to fetch subtrees using the "LIKE" predicate. I also
have created the relation "ID" <-> "ID_Parent, effectively the table is
related to itself. I did this so that an attempt to remove a parent when
that parent still has children will fail due to referential integrity.
Unfortunately, in order to delete subtrees, I have to first set all of the
nodes in the subtree to point to the "NULL" parent so that I don't get
caught with integrity errors during deletes.

Unlike a typical linear system, any given node record is related to more
than one of the other records. What I mean is it is possible to follow a
chain from any given node back to the root (ancestors) or collect a series
of branches and leaves (descendants) and so it is not reasonable to consider
any given node in isolation. Changes to any given descendant can trigger
changes which are propagated to it's ancestors. I am not using recursion to
do this, rather, I am using an iterative approach to select and update the
parent, then the parents parent, etc. according to it's new state (given by
the state of it's immediate descendants), right back up to the root.

Now, the problem comes when I consider concurrency with respect to this
scheme. It seems to me, that locking the record I am updating is not
sufficient to ensure clients are kept synchronised or the integrity of the
tree structure is correct. I think I need to lock all ancestors of the tree
(HOLDLOCK) before performing any operation on a given node. Is this
reasonable? Also, consider the "delete" problem given above. I really
should HOLDLOCK on the entire subtree of any node I wish to delete as I am
going to set the entire subtrees parent values to NULL. I don't want
another client to perform a read on part of the subtree while the nodes are
"parentless" pending deletion.

Secondly, I am not sure how to handle synchronization of the tree for each
client. How does each client know when a change has been made to the tree?
Consider a client holding a copy of the tree in memory. Another client
deletes a subtree. The first client attempts to update one of the deleted
subtree nodes and fails because the node no longer exists. At this point,
in the eyes of the first client, all of the ancestors and all of the
descendants of the node in question must become suspect. Should the
software now attempt to re-build this part of the tree? It seems that any
operation on any of the nodes in the tree will potentially make a lot of
other nodes suspect and so my application will be doing a lot of updating.

I would be interested to hear any insights people have on these issues,
particularly those implementing a "file system" structure in their database
or similar. How do you deal with these concurrency issues when manipulating
trees?

Thanks


Robin

Jul 23 '05 #1
4 1754
I believe that Joe Celko has a book that is devoted to tree structures
in a SQl environment. Do a search on Amazon for Celko and you should
see it. I'm sure that Joe will probably chime in here as well. My
thoughts on the subject though...
The path to each node is stored in the column "Path" and is of the form "\000001\000002\000003\" Ack, yuck, ick! I hate that method of storing tree information myself.
Check Joe's book for several different ways to store trees in a RDBMS.
also have created the relation "ID" <-> "ID_Parent, effectively the table is related to itself. I did this so that an attempt to remove a parent when that parent still has children will fail due to referential integrity. Unfortunately, in order to delete subtrees, I have to first set all of the nodes in the subtree to point to the "NULL" parent so that I don't get caught with integrity errors during deletes. So, what you seem to be saying is that you want RI so that you can't
delete any rows by mistake, but you want to be able to easily delete
rows. You can't have it both ways... you either need to explicitly
delete the children (or set the FK's to NULL) or you need to remove the
RI.
Changes to any given descendant can trigger changes which are propagated to it's ancestors

This sounds like a design problem with regards to normalization to me,
but without knowing the specifics I really can't say.

As for the client application tracking changes to the tree, that is
really dependent on the business requirements for your application. You
can reload the client every X seconds or you could wait for the client
to take an action then check to see if the action is still valid on the
tree in its current form. To limit the number of client refreshes that
you have to perform, each node could use a last_updated datetime column
to track changes and you could retrieve only those nodes and their
subtrees where the last_updated column has a value that is greater than
the last time that you refreshed your tree.

My suggestion would be to check out Joe's book (or Google past posts by
Joe) to find alternative tree representations. Some of them are rather
ingenious and easy to work with.

HTH,
-Tom.

Jul 23 '05 #2
Robin Tucker (id*************************@reallyidont.com) writes:
Now, the problem comes when I consider concurrency with respect to this
scheme. It seems to me, that locking the record I am updating is not
sufficient to ensure clients are kept synchronised or the integrity of
the tree structure is correct. I think I need to lock all ancestors of
the tree (HOLDLOCK) before performing any operation on a given node. Is
this reasonable?
UPDLOCK would be better. Else you could run into conversion deadlocks.
An UPDLOCK is a shared lock, so other processes can read. But only one
can have an UPDLOCK.
Also, consider the "delete" problem given above. I
really should HOLDLOCK on the entire subtree of any node I wish to
delete as I am going to set the entire subtrees parent values to NULL.
I don't want another client to perform a read on part of the subtree
while the nodes are "parentless" pending deletion.
I'm not really sure why you need to do this set NULL thing. In fact
that is something I would avoid like the plague. But then I know
very little of your actual business problem.

I think Joe Celko's trick for tress is to number each node in a way so
that a subtree is a contiguous range. Then you can blow away to whole
subtree in one delete.

Then again, you already had that path. Would not:

DELETE tbl WHERE path = 'a/b/c' or path LIKE 'a/b/c/%'

work?

Of course, even with set-based statements, you can have interesting
effects if two clients are it at the same time.
Secondly, I am not sure how to handle synchronization of the tree for
each client. How does each client know when a change has been made to
the tree?


You will have to ask you tech lead about that. :-) Seriously, with
no knowledge of the requirements etc, it is very difficult to answer.
As Thomas said, you can refresh automatically with some frequency.
For more bells and whistle you could push the change by activating
some signaling mechanism from a trigger. But you make have to ask for
a bigger budget to do this.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #3
>> I would be interested to hear any insights people have on these
issues, .. <<

Get a copy of TREES & HIERARCHIES IN SQL and you can see several
different approaches.
in order to delete subtrees, I have to first set all of the nodes in

the subtree to point to the "NULL" parent so that I don't get caught
with integrity errors during deletes. <<

Look up the nested sets model; this one atomic DELETE FROM statement.

Jul 23 '05 #4

Hi Tom,
Ack, yuck, ick! I hate that method of storing tree information myself.
Check Joe's book for several different ways to store trees in a RDBMS.

Ok, I don't want to get into a religious war about the best representation
of trees. I am aware of Celkos nested sets model but prefer the clarify of
the path string approach (my system isn't going to be operating with 1,000
users or anything, so I don't mind if it's performance isn't so good); of
course when I refer to clarity, I mean my own personal ability to visualise
whats going on and write code to perform the various manipulations required!
However, I will look once again at Celkos methods because I rejected them a
year ago for various reasons (the potential renumbering of large numbers of
nodes during inserts for example) and am now a bit wiser. I also tried out
nested sets with Tropashenkos' (apologies if I misspelled the name)
materialised path (binary fractions) and found this to be unbelievably slow
and restrictive (for various reasons mainly based on numeric accuracy).
So, what you seem to be saying is that you want RI so that you can't
delete any rows by mistake, but you want to be able to easily delete
rows. You can't have it both ways... you either need to explicitly
delete the children (or set the FK's to NULL) or you need to remove the
RI.
You are right of course, I need to throw away the RI on this, then deletion
of subtrees will be atomic and I will not have to set parents to NULL.
Changes to any given descendant can trigger changes which are propagated to it's
ancestors

This sounds like a design problem with regards to normalization to me,
but without knowing the specifics I really can't say.


I will solve this by locking the ancestors (as another person says in his
post, using UPDLOCK rather than HOLDLOCK) and then processing them in any
transaction that changes state. Thinking about it though, it will still be
possible for a client to read a node that has not yet been processed while
another client is currently processing. So that first client may well
receive some processed and some unprocessed nodes back after a read
operation (is this true though? If the processing occurs within a
transaction, can other users read the uncommitted data?). The reason for
this is because the algorithm for state propagation cannot be atomic (as
fetching "ancestors" of any given node requires iteration).

However, I suppose it is possible for the client to detect problems like
this, by checking ancestors itself for consistency and forcing a refresh if
it detects an inconsistency when it reads back parts of the tree. Extra
work I think but bearing in mind the above (Celko nested sets), I suppose it
may be possible to UPDATE all ancestors atomically. If this is indeed the
case, then I might not need an iterative approach. But I doubt this is
possible, as I would have to ensure updates on the set of ancestors were
performed in order of their "depth" (leaves first of course). I don't think
SQL has such a mechanism (tables are sets after all, not sequences). I
would love to know how, using a path approach it is possible to fetch all
ancestors of a given node. I don't think it can be done with just a path
string and ID_Parent relation.

The specifics of this is each node has a "condition" (green, yellow, red,
undefined) according to the condition of it's children. For nodes of type
0, it's condition is the highest condition of each of it's children, for
nodes of type 1, it's condition is the condition of it's "most recently
added child". For nodes of type 2, the condition is fixed by the client
when the record is created. Condition propagation basically is tasked with
ensuring these conditions are consistent after deletes, updates or inserts.
As for the client application tracking changes to the tree, that is
really dependent on the business requirements for your application. You
can reload the client every X seconds or you could wait for the client
to take an action then check to see if the action is still valid on the
tree in its current form. To limit the number of client refreshes that
you have to perform, each node could use a last_updated datetime column
to track changes and you could retrieve only those nodes and their
subtrees where the last_updated column has a value that is greater than
the last time that you refreshed your tree.

I have another idea about this (I will timestamp the nodes). I have a
worker thread that serializes access to the database from the various client
controls, firing callbacks when it performs an operation. This thread can
check through the tree in it's idle time to ensure it is synchronized. Of
course, if I execute an update and fail (ROWCOUNT = 0) or a delete come to
think of it, I can force a refresh of the entire subtree and it's ancestors.
But I would consider the entire subtree and all it's ancestors to be suspect
if I detected any given node had a later timestamp than that expected.
My suggestion would be to check out Joe's book (or Google past posts by
Joe) to find alternative tree representations. Some of them are rather
ingenious and easy to work with.

HTH,
-Tom.


Thanks for all of your ideas.

Robin
Jul 23 '05 #5

This discussion thread is closed

Replies have been disabled for this discussion.

Similar topics

3 posts views Thread by imani_technology_spam | last post: by
1 post views Thread by Jonathan Wilson | last post: by
2 posts views Thread by Robin Tucker | last post: by
1 post views Thread by Richard | last post: by
21 posts views Thread by Helge Jensen | last post: by
reply views Thread by Tom Bower | last post: by
77 posts views Thread by M.B | last post: by
30 posts views Thread by josh | last post: by
reply views Thread by gheharukoh7 | last post: by
By using this site, you agree to our Privacy Policy and Terms of Use.