
Ye Olde GUI display/synchronization problem...



I feel like the answer to this should be blindingly obvious. I also
feel like it's probably an exercise in an undergraduate database
design course. I'm off to google for an answer, but I figure I'd
throw it out here to see if there's a quick and dirty solution.

I have in my database a table of every position we have here called
Holdings. It would be equivalent to an Inventory table in every
basic design course.

There are many users who could update that table, and also many users
whose display programs wish to have correct values. Here's what
I currently do:

There's a timestamp column in that file. My display apps keep track
of the max timestamp value they've seen. Every time they want to do
a sweep of the database, they go and "get everything that's newer
than timestamp X", update their quantities, and display.
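In rough SQL, a sweep is just this (a sketch only; tstamp and @lastStamp
are stand-ins for my real column name and the remembered value):

-- sketch of one sweep; names are illustrative, not my real schema
SELECT *
FROM Holdings
WHERE tstamp > @lastStamp   -- @lastStamp = max timestamp seen so far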

Now, the procedure "get everything newer than timestamp X" is where
my question lies.....

It requires (of course) a scan of the entire Holdings table, since
having an index on a timestamp field would be (I think) suicidal.
There's got to be a better way to do that.

I've played around with another solution, which is to have a trigger
on Holdings for insert/update, and in that trigger, I save into
another table the PK of the Holdings which change. That "updated"
table also has an identity field, which is the PK of "updated".

Then, instead of keeping track of the max timestamp and scanning
the Holdings table, I keep track of the max identity in "updated",
and just grab the new rows from there, which are usually only a few
(and are easy to find using the PK clustered index), and then grab
individual rows from Holdings using the stored PK.
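In sketch form, the pieces look roughly like this (illustration only --
assuming Holdings has an int primary key called HoldingID, and @lastUpdateID
is the max identity value the app remembered from its previous sweep; my
real names differ):

-- "updated" log table: identity PK, plus the PK of the changed Holdings row
CREATE TABLE Updated (
    UpdateID  int IDENTITY PRIMARY KEY,
    HoldingID int NOT NULL
)
GO

-- trigger logs the PK of every inserted/updated Holdings row
CREATE TRIGGER trg_Holdings_Log ON Holdings
FOR INSERT, UPDATE
AS
    INSERT Updated (HoldingID)
    SELECT HoldingID FROM inserted
GO

-- each sweep: read only the log entries past the last identity seen,
-- then fetch the matching Holdings rows by PK (DISTINCT, since the same
-- holding may have been logged more than once since the last sweep)
SELECT DISTINCT h.*
FROM Updated u
JOIN Holdings h ON h.HoldingID = u.HoldingID
WHERE u.UpdateID > @lastUpdateID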

As I said in the intro, I feel this should be a completely solved
problem. I have a "correct" solution, but I'm just wondering what
the "correct, and most efficient" solution is.

Thanks,

Eric.
Jul 23 '05 #1
A question for you:
a) user inserts a record into parent Holdings table
b) trigger inserts same record into child Updates table
c) user deletes the record he just entered into parent Holdings table

How do you inform your app that record has been deleted?

Jul 23 '05 #2
On Thu, 10 Feb 2005 14:17:40 -0000, Eric J. Holtman wrote:

(snip)
As I said in the intro, I feel this should be a completely solved
problem. I have a "correct" solution, but I'm just wondering what
the "correct, and most efficient" solution is.
Hi Eric,

I don't know about "correct", nor about "most" efficient, but wouldn't the
easiest (and also pretty efficient) solution be to just grab the data
that's about to be displayed on the screen? If a user is displaying
details about the position for Wilkinson, there's no need to check if the
data for Burroughs has been changed, as it won't be displayed anyway.
And if your requirement is actually to display only the data
that has recently been changed, I'd say that
having an index on a timestamp field would be (I think) suicidal.


is quite untrue - if you need to retrieve the rows with the most recent
changedate, an index on changedate would actually be very beneficial.

(Note - if you expect hundreds of updates per second, this index might
hamper performance, but in that case, displaying all rows that were
changed since the last screen refresh is silly in itself. Even if the end
user would refresh every 30 seconds, he'd be faced with thousands of rows
to inspect and I doubt he'd be able to digest all that information before
the next 30-second refresh interval...)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Jul 23 '05 #3
Hugo Kornelis <hugo@pe_NO_rFact.in_SPAM_fo> wrote in
news:1r********************************@4ax.com:

I don't know about "correct", nor about "most" efficient, but wouldn't
the easiest (and also pretty efficient) solution be to just grab the
data that's about to be displayed on the screen? If a user is
displaying details about the position for Wilkinson, there's no need
to check if the data for Burroughs has been changed, as it won't be
displayed anyway.

I wish. Heh.

The display app is very much "drill down" oriented, and actually
everyone that's watching *does* need to see all the positions. Of
course, they're not looking directly at 5,000 stock positions, but
they do see overall totals and such, and therefore need to be informed
about every change.

is quite untrue - if you need to retrieve the rows with the most
recent changedate, an index on changedate would actually be very
beneficial.

I suppose I can try this. I had just assumed, based on reading,
that indices on timestamps were discouraged because they tend to
impact performance. I guess the tradeoff is that if the server spends
a little bit of time updating the index, I don't need to have 20
user apps spending time inspecting the table.

of rows to inspect and I doubt he'd be able to digest all that
information before the next 30-second refresh interval...)


Thanks for the input, but again, because of the way we do business,
everyone does need to know about all the positions. In addition
to "overall totals" and such, the GUI does other interesting things
like "examine *all* positions, but only display ones that are
'interesting'", for user configurable values of "interesting".
Obviously, in that case, I need to keep all the Holdings up to date,
so that I can decide on a per-user basis, which ones are interesting.

Jul 23 '05 #4
"louis" <lo************@gmail.com> wrote in news:1108053598.035823.36780
@c13g2000cwb.googlegroups.com:
A question for you:
a) user inserts a record into parent Holdings table
b) trigger inserts same record into child Updates table
c) user deletes the record he just entered into parent Holdings table

How do you inform your app that record has been deleted?


A legitimate question, which I will finesse by saying "that
never happens". I'm only concerned with either new, or updated
Holdings. I'm not concerned (in the day to day operation of the
User application) about Holdings disappearing.

Jul 23 '05 #5
I think the standard recommendation is to create a clustered index
containing your timeStamp column (which is really a datetime). This
way, when you do a select * from myTbl where timeStamp >= @date -- the
rows are already sorted and retrieval should be fast. As this is a
heavy OLTP table, you probably should set the fillfactor at 50 or less
-- to incur fewer page splits.
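Something like this, as a sketch (myTbl and timeStamp are placeholder
names):

-- clustered index on the change column, with pages left half empty
-- to reduce page splits under heavy insert/update traffic
CREATE CLUSTERED INDEX IX_myTbl_timeStamp
    ON myTbl (timeStamp)
    WITH FILLFACTOR = 50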

Jul 23 '05 #6
"louis" <lo************@gmail.com> wrote in news:1108061331.548005.196620
@o13g2000cwo.googlegroups.com:
I think the standard recommendation is to create a clustered index
containing your timeStamp column (which is really a datetime).
Has this changed since SQL Server 7.0, which is what I am running?
In SS7.0, timestamp is just a binary(8) which is guaranteed unique
and increasing across the database?

This
way, when you do a select * from myTbl where timeStamp >= @date -- the
rows are already sorted and retrieval should be fast. As this is a
heavy OLTP table, you probably should set the fillfactor at 50 or less
-- to incur less page splits.


I'm going to dust off my stress-testing program and try it. I already
have a PK and therefore clustered index on that table, and my gut tells
me I need that since it's heavily used in joins. So I think a
non-clustered index is my only option.

Of course, having done this for years, I should know better than to
trust my gut, and I should run some tests to be sure.
Jul 23 '05 #7
> Has this changed since SQL Server 7.0, which is what I am running?
In SS7.0, timestamp is just a binary(8) which is guaranteed unique
and increasing across the database?
Sorry, I got confused. (In mssql2000 the synonym 'rowversion' refers
to the binary TS datatype. In ansi sql92, timestamp is a datetime
datatype).
I'm going to dust off my stress-testing program and try it. I already
have a PK and therefore clustered index on that table, and my gut tells
me I need that since it's heavily used in joins. So I think a
non-clustered index is my only option.

Of course, having done this for years, I should know better than to
trust my gut, and I should run some tests to be sure.


I think a CL index would work better, because this is a "range"
operation. But the only way to know for sure is to test it out.

Jul 23 '05 #8

"Eric J. Holtman" <ej*@ericholtman.com> wrote in message
news:Xn*******************************@216.168.3.3 0...
"louis" <lo************@gmail.com> wrote in news:1108053598.035823.36780
@c13g2000cwb.googlegroups.com:
A question for you:
a) user inserts a record into parent Holdings table
b) trigger inserts same record into child Updates table
c) user deletes the record he just entered into parent Holdings table

How do you inform your app that record has been deleted?


A legitimate question, which I will finesse by saying "that
never happens". I'm only concerned with either new, or updated
Holdings. I'm not concerned (in the day to day operation of the
User application) about Holdings disappearing.


----------------------------------------------------

I've used a trigger that updates a marker row with both the change date and
the row count of the table or group that is being monitored.

That way you catch new, modified, and deleted rows.

Your 'has anything changed' query is a short select of that one row, and a
comparison of both values with the ones you got the last time you checked.
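As a sketch (TableStats and its columns are made-up names, not a real
schema):

-- one marker row per monitored table/group
CREATE TRIGGER trg_base_table_marker ON base_table
FOR INSERT, UPDATE, DELETE
AS
    UPDATE TableStats
    SET LastChange = getdate(),
        RowCnt = (SELECT COUNT(*) FROM base_table)  -- rescans on every change; fine for a sketch
    WHERE TableName = 'base_table'
GO

-- the 'has anything changed' check is then a single-row read:
SELECT LastChange, RowCnt
FROM TableStats
WHERE TableName = 'base_table'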

If you want to bypass the separate table indicating freshness, do a select
count(*), max( UpdatedDate ) from base_table where GroupColumn = @x
Assuming you have GroupColumn indexed, performance should be OK, but you
will still need a trigger to set the UpdatedDate column to getdate().

You may want to look at checksum() and binary_checksum() functions (sql 2k
and greater).
I don't know how they perform, as I haven't used them in the 'real world'.


Jul 23 '05 #9
[posted and mailed]

Eric J. Holtman (ej*@ericholtman.com) writes:
I feel like the answer to this should be blindingly obvious. I also
feel like it's probably an exercise in an undergraduate database
design course. I'm off to google for an answer, but I figure I'd
throw it out here to see if there's a quick and dirty solution.
It isn't obvious. In our shop, we have an entire group working with
this. OK, not exactly this problem only, but a whole lot of what that
group does has to do with pushing data to clients.
There's a timestamp column in that file. My display apps keep track
of what the max timestamp value they've seen is. Every time they
want to do a sweep of the database, they go and "get everything
that's newer than timestamp X", update their quantities, and display.
Sounds very familiar...
It requires (of course) a scan of the entire Holdings table, since
having an index on a timestamp field would be (I think) suicidal.
There's got to be a better way to do that.
Hm, in fact we have several of our timestamp columns indexed. The
positions table is an exception, though; there we use a different
technique, since this table is so frequently updated, and not all
updates concern columns that need to be pushed. In fact our solution
is similar to:
I've played around with another solution, which is to have a trigger
on Holdings for insert/update, and in that trigger, I save into
another table the PK of the Holdings which change. That "updated"
table also has an identity field, which is the PK of "updated".


Although our "updated" table is a delta table with the changes, and
it's updated through a stored procedures - all relevant updates to
positions happens in two stored procedures only. I should add here
we have a signaling system to tell the client-pusher that something
has happened.

Now, back to timestamps - scanning for changed timestamps is by no
means trivial in a busy environment! Here are some lessons that we
have learnt. It seems that you are doing:

SELECT * FROM Holdings WHERE tstamp > @laststamp

Assume now that while this SELECT is scanning, row 5 is updated when
you have scanned rows 1-6 (numbering in scan order). You scan a few
more rows, and when you scan row 14, row 25 is updated. Your result
set will include row 25, but not row 5, since it had not yet been
updated when the scan passed it.

Next time you run this SELECT, you will pass the timestamp from row 25,
so you will not get the update on row 5 this time either!

Things improve a lot if you use the SERIALIZABLE isolation level, but
that will kill your throughput, not least if the timestamp column is
not indexed. The remedy we tried was to get the max timestamp first,
and then query the interval. But doing some devilish stress testing,
my colleagues found that this was not safe either. The current scheme,
if I remember correctly, is to have a non-clustered index on tstamp,
and first run:

INSERT @tbl (clustercol)
SELECT clustercol FROM tbl
WHERE tstamp > @laststamp AND tstamp <= @maxstamp

And then join the table variable with the target table. The point here
is that the timestamp query uses the NC index only and therefore has a
much smaller window where it is subject to simultaneous updates. It
seems that it has been working for them so far.
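Spelled out a bit more, the whole sweep would be along these lines (a
sketch only; tbl, clustercol and the @-variables are placeholders):

DECLARE @tbl TABLE (clustercol int NOT NULL)

-- covered entirely by the NC index on tstamp, so the window for
-- simultaneous updates is small
INSERT @tbl (clustercol)
SELECT clustercol FROM tbl
WHERE tstamp > @laststamp AND tstamp <= @maxstamp

-- the bookmark lookups happen here, after the index scan is done
SELECT t.*
FROM @tbl b
JOIN tbl t ON t.clustercol = b.clustercol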
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #10
Erland Sommarskog <es****@sommarskog.se> wrote in
news:Xn**********************@127.0.0.1:

Now, back to timestamps - scanning for changed timestamps is by no
means trivial in a busy environment! Here are some lessons that we
have learnt. It seems that you are doing:

SELECT * FROM Holdings WHERE tstamp > @laststamp

Assume now that while this SELECT is scanning, row 5 is updated when
you have scanned rows 1-6 (numbering in scan order). You scan a few
more rows, and when you scan row 14, row 25 is updated. Your result
set will include row 25, but not row 5, since it had not yet been
updated when the scan passed it.

Next time you run this SELECT, you will pass the timestamp from row 25,
so you will not get the update on row 5 this time either!

Yes... yes... yes.... I found this bug about a year ago, after
pulling my hair out trying to track down missed updates. I couldn't
figure out *why* it was happening. I'm handicapped by having learned
SQL by "the seat of my pants", without formal training, so when I get
to esoterica like this, I'm admittedly baffled.
Things improve a lot if you use the SERIALIZABLE isolation level, but
that will kill your throughput, not least if the timestamp column
is not indexed. The remedy we tried was to get the max timestamp first,
and then query the interval.


This is exactly what I do now. If I have timestamp X as my max,
I first run "select max(timestamp) from Holdings", return that
value, then do "select * from Holdings where ts > X and ts <= max".

Then I do *not* set my max seen to be the max from that second returned
group, since that's the same bug as before, but just set it to the
max from the first returned statement. Since adopting that logic,
I haven't missed an update, and I've been able to convince myself
it is slow, but correct.
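In code, the sweep is roughly this (a sketch; ts stands in for my real
timestamp column, and @lastStamp is the bound carried over from the
previous sweep):

DECLARE @maxStamp binary(8)

-- step 1: pin down the upper bound first
SELECT @maxStamp = MAX(ts) FROM Holdings

-- step 2: fetch only the window between the last sweep and that bound
SELECT * FROM Holdings
WHERE ts > @lastStamp AND ts <= @maxStamp

-- step 3: carry @maxStamp forward as @lastStamp for the next sweep
-- (NOT the max ts of the returned rows, which may have moved)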

Do you have a technical understanding of why that, as well, might
not be technically correct? Because I'd love to hear it, and be
educated!

Eric.
Jul 23 '05 #11
Eric J. Holtman (ej*@ericholtman.com) writes:
Yes... yes... yes.... I found this bug about a year ago, after
pulling my hair out trying to track down missed updates. I couldn't
figure out *why* it was happening. I'm handicapped by having learned
SQL by "the seat of my pants", without formal training, so when I get
to esoterica like this, I'm admittedly baffled.
This is by no means trivial material. I was recently in a discussion
that involved two other persons with a very good understanding of SQL,
who were baffled by the fact that:

SELECT COUNT(*) FROM t UNION SELECT COUNT(*) FROM t

would return two rows when t was subject to constant insertion while
the UNION query was running.

These are issues that SQL programmers rarely have to consider, but when
doing timestamp searches like we are discussing here, it becomes
essential.
This is exactly what I do now. If I have timestamp X as my max,
I first run "select max(timestamp) from Holdings", return that
value, then do "select * from Holdings where ts > X and ts <= max".

Then I do *not* set my max seen to be the max from that second returned
group, since that's the same bug as before, but just set it to the
max from the first returned statement. Since adopting that logic,
I haven't missed an update, and I've been able to convince myself
it is slow, but correct.
Yes, that is also a problem that we found. Or rather, my colleagues
identified and addressed this weakness, without my assistance.
Do you have a technical understanding of why that, as well, might
not be technically correct? Because I'd love to hear it, and be
educated!


Actually, I'm not sure that anything is technically correct. Real-time
issues are very difficult to understand, not least because they
are difficult to test and debug. All you can do is set up a testing
framework and keep it running for some time. And if you don't have
any lost updates, you can chalk up a good probability that your scheme
works, but not that it's foolproof.

That said, as long as you don't index the timestamp column, you may be
safe (but slow). As I mentioned, my colleagues found that this scheme was
not good enough, and they were losing updates. They brought me in on
the case, and through some philosophical pondering, I arrived at a
suspicion that the difference in time between an index hit and a
bookmark lookup was important. But I don't even remember in detail
how that reasoning went.

SQL 2005 offers a novelty: snapshot isolation. With snapshot isolation
you get an atomic SELECT of how the table looked when the query started
running. This could be very interesting for these timestamp queries. As
far as I can see the simple-minded "WHERE tstamp > @laststamp" would
work then. But the question is what impact the overhead for snapshot
has on overall performance.
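For what it's worth, as I understand the current SQL 2005 builds, the
knobs look like this (a sketch; MyDb is a placeholder):

-- one-time database setting
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON

-- then a reader can do:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT
SELECT * FROM Holdings WHERE tstamp > @laststamp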

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #12
Erland Sommarskog <es****@sommarskog.se> wrote in
news:Xn**********************@127.0.0.1:

Actually, I'm not sure that anything is technically correct. Real-time
issues are very difficult to understand, not least because they
are difficult to test and debug. All you can do is set up a testing
framework and keep it running for some time. And if you don't have
any lost updates, you can chalk up a good probability that your scheme
works, but not that it's foolproof.

That said, as long as you don't index the timestamp column, you may be
safe (but slow). As I mentioned, my colleagues found that this scheme
was not good enough, and they were losing updates. They brought me in
on the case, and through some philosophical pondering, I arrived at a
suspicion that the difference in time between an index hit and a
bookmark lookup was important. But I don't even remember in detail
how that reasoning went.

I need to do some more research, but I really can't think of a path
where I would miss an update now. I mean, I know the last one I saw
was L. I know the max in the file is M. If I select all rows between
L+1 and M, I don't much care if one of those rows gets updated while
I'm scanning, because it would then have timestamp >M. Since I'm
setting my new max to M, I'll just get it "the next time around".

But.......

Maybe that's what you meant by a missed update: You have to wait until
the next cycle. To me, that's not a huge ordeal. I was really concerned
with my original solution, because if a row got updated, and I missed
it in a sweep, I'd never see it.

Now, if I catch that "race condition", I realize I'll miss it *this*
sweep, but I know I'll get it next sweep.

Is that what *you* mean by "miss", that you should have pushed something
to the client during sweep S, but you don't get to it until sweep S+1??
If so, I understand your point of view, but for my users, we can live
with that limitation, as long as we see the update in a "reasonable
time".
SQL 2005 offers a novelty: snapshot isolation. With snapshot isolation
you get an atomic SELECT of how the table looked when the query started
running. This could be very interesting for these timestamp queries.
As far as I can see the simple-minded "WHERE tstamp > @laststamp"
would work then. But the question is what impact the overhead for
snapshot has on overall performance.


I think that's what I naively thought SQL server did, by default. I now
see that while you can configure SQL to do this, you don't want to, in
a fast moving OLTP environment.
Jul 23 '05 #13
Eric J. Holtman (ej*@ericholtman.com) writes:
I need to do some more research, but I really can't think of a path
where I would miss an update now. I mean, I know the last one I saw
was L. I know the max in the file is M. If I select all rows between
L+1 and M, I don't much care if one of those rows gets updated while
I'm scanning, because it would then have timestamp >M. Since I'm
setting my new max to M, I'll just get it "the next time around".

But.......

Maybe that's what you meant by a missed update: You have to wait until
the next cycle. To me, that's not a huge ordeal. I was really concerned
with my original solution, because if a row got updated, and I missed
it in a sweep, I'd never see it.

Now, if I catch that "race condition", I realize I'll miss it *this*
sweep, but I know I'll get it next sweep.
I don't remember all the details, but it was not a case of an update
coming too late, or an update not being picked up before the next update
came. The result was permanently outdated data in the cache.

But as I said, you may be safe, since you are not indexing the timestamp
column. My theory was that something happened between the scan of the
index and reading the actual row. Ah, now I think I remember! While
seeking the index you find that row 7 qualifies. But before you read
it from the data pages, it gets updated and will have a max timestamp
that you will use for the next sweep. And you lose updates between the
actual max and this new false max. (But this was just something we
arrived at after discussion. We never verified that this could actually
happen.)

As I said, what they did to cure this was to get only the key and
tstamp into a table variable. Since that's an index-covering query, it
should be safer.

Also, I should repeat that they were really hammering the table with
updates. If your update frequency is somewhat modest, you would maybe
only see this once in a blue moon.
I think that's what I naively thought SQL server did, by default.


Well, given that SQL 2000 does not have snapshot isolation, you should
be very glad that it doesn't, as your throughput would have been
non-existent if it did!
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #14
Erland Sommarskog <es****@sommarskog.se> wrote in
news:Xn**********************@127.0.0.1:
Eric J. Holtman (ej*@ericholtman.com) writes:

But as I said, you may be safe, since you are not indexing the
timestamp column. My theory was that something happened between the
scan of the index and reading the actual row. Ah, now I think I
remember! While seeking the index you find that row 7 qualifies. But
before you read it from the data pages, it gets updated and will have
a max timestamp that you will use for the next sweep. And you lose
updates between the actual max and this new false max. (But this was
just something we arrived at after discussion. We never verified that
this could actually happen.)

I'll have to play around with this, but I still don't see how
that causes a problem. If I do the select max() first, and use
that as my max when I'm done, I still think I'm immune to this behavior.

I.E. I do select max(), it says max ts = 37. I start scanning
Holdings for all ones >20 and <=37. I note that row 7 qualifies,
with a ts of 25. Before I get around to actually fetching the
whole row, it is updated and has a timestamp of 40.

My first (and I think, from your description, your first) attempt
at solving this problem would
have started my next scan at >40, since I saw a 40. And I'd miss
38 and 39. But I don't do that, I start at 37.

I think this results (as in this case) in me fetching row 7
twice, and sometimes results in an update going out one sweep late,
but I can't see how I ever get a permanently outdated cache this way.

I understand your thought of "the only way to know is to stress
test", but I'm still convinced there has to be a way to prove one
solution or another is correct. Although I guess "correct" might
involve isolation levels which just make the process run too slowly
to be usable.


Also, I should repeat that they were really hammering the table with
updates. If your update frequency is somewhat modest, you would maybe
only see this once in a blue moon.


Yeah, but I hate things that happen "Once In A Blue Moon". They're
hard to explain, especially since "computers aren't supposed to
behave like that".

Jul 23 '05 #15
Eric J. Holtman (ej*@ericholtman.com) writes:
I'll have to play around with this, but I still don't see how
that causes a problem. If I do the select max() first, and use
that as my max when I'm done, I still think I'm immune to this behavior.

I.E. I do select max(), it says max ts = 37. I start scanning
Holdings for all ones >20 and <=37. I note that row 7 qualifies,
with a ts of 25. Before I get around to actually fetching the
whole row, it is updated and has a timestamp of 40.

My first (and I think, from your description, your first) attempt
at solving this problem would
have started my next scan at >40, since I saw a 40. And I'd miss
38 and 39. But I don't do that, I start at 37.
There might have been some more twists to my colleagues' test case
that I have forgotten.

I should also add that this was not something they found in our
own system. But they sold the framework to some customers outside
our company, and it was at one of these sites that this problem became
painfully apparent. Thus, with moderate volumes, you might find
your current setup good enough.
I understand your thought of "the only way to know is to stress
test", but I'm still convinced there has to be a way to prove one
solution or another is correct. Although I guess "correct" might
involve isolation levels which just make the process run too slowly
to be usable.


I'm not even sure that serializable is waterproof. Snapshot, on the
other hand, sounds very promising...
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #16
