By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,142 Members | 910 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 435,142 IT Pros & Developers. It's quick & easy.

MS SQL 6.5 Unable to connect.

P: n/a
Let me start by saying that, yes we know v6.5 is no longer supported by
Microsoft and that moving to SQL2000 might resolve our problem.
Actually the migration to SQL2000 is 1 to 2 months out and in the QA
phase now. Unfortunately we need to resolve the current 6.5 issue now
as it heavily impacts a major revenue stream for the company with every
outage we encounter.

The Problem:
At least once every 24 hours the SQL Server begins disallowing new user
connections. When this happens the 2 CPUs begin to thrash. About 5
minutes later the error log begins to print out the following error
message: "Unable to connect. The maximum number of '500' configured user
connections are already connected. System Administrator can configure to
a higher value with sp_configure." No new connections can be made to the
server; however, the existing connections continue to function normally.
We never see anywhere near 500 user connections in the system (it tends
to average around 350 connections). If we begin to disconnect users the
server continues reporting that the maximum number of users is
connected. Eventually running queries thru an open connection will hang
and we have to resort to a hard reboot of the server as neither SQL
Server will not shutdown nor will the operating system.

The Server:
Compaq Proliant DL380 with dual 863 MHz. processors (x86 Family 6 Model
8 Stepping 3 GenuineIntel), 917,020 KB of physical memory, Microsoft
Windows 2000 Server (Version 5.0.2195 Service Pack 4 Build 2195), Total
Virtual Memory 3,138,688 KB, Page File Space 2,221,668 KB

SQL Server:
Microsoft SQL Server 6.50 - 6.50.479 (Intel X86). Some pertinent
configuration settings: memory - 244100 (in 2K units), user connections
500, RA worker threads 3, max worker threads 255

We found one reference to the above error message in the Microsoft
Knowledge Base and that refers to a condition where the server has 2GB
or more of physical memory with 1.5GB assigned to SQL Server. This does
not pertain to our situation. Have any of you ever encountered this
problem?

I appreciate your insights.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #1
Share this Question
Share on Google+
12 Replies


P: n/a
InProcess (in*******@devdex.com) writes:
Let me start by saying that, yes we know v6.5 is no longer supported by
Microsoft and that moving to SQL2000 might resolve our problem.
Actually the migration to SQL2000 is 1 to 2 months out and in the QA
phase now. Unfortunately we need to resolve the current 6.5 issue now
as it heavily impacts a major revenue stream for the company with every
outage we encounter.
There may be all reason to speed up that SQL2000 migration, given the
current situation. Sounds like you are living on borrowed time.
At least once every 24 hours the SQL Server begins disallowing new user
connections. When this happens the 2 CPUs begin to thrash. About 5
minutes later the error log begins to print out the following error
message: "Unable to connect. The maximum number of '500' configured user
connections are already connected. System Administrator can configure to
a higher value with sp_configure." No new connections can be made to the
server; however, the existing connections continue to function normally.
We never see anywhere near 500 user connections in the system (it tends
to average around 350 connections). If we begin to disconnect users the
server continues reporting that the maximum number of users is
connected. Eventually running queries thru an open connection will hang
and we have to resort to a hard reboot of the server as neither SQL
Server will not shutdown nor will the operating system.
It's of course difficult to say that much on a distance, but here are
some things I like to know:

o If you run an sp_lock in such a situation, do you see any funny spids
(that is negative?) You could also use my aba_lockinfo to see if
there are any locks that does not really seem to correlate with the
spids. (See http://www.sommarskog.se/sqlutil/aba_lockinfo.html.)
o Is there anything interesting in the SQL Server error log beside the
"Unable to connect" messages? (Stack dumps, other error messages etc.)
o What sort of application do you have? (Web/custombased, client library
etc.)
o Have you examined whether the state where the CPUs goes mad are
correlated with any particular function/query? Possibly SQL Trace
could be used to find out.
o Have you checked the databases for corruption with DBCC NEWALLOC and
CHECKDB?
SQL Server:
Microsoft SQL Server 6.50 - 6.50.479 (Intel X86). Some pertinent
configuration settings: memory - 244100 (in 2K units), user connections
- 500, RA worker threads - 3, max worker threads - 255


It seems to me that you could up the memory number with 50%. If you have
the "procedure cache" at its default setting of 30, decrease it to 5-10.
(The number is a percentage of the total memory.) If you have tempdb in
RAM or have pinned tables, stop doing this.

Not that I believe that more memory to SQL Server will really help. But
possibly it could delay the bad state to occur.
--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #2

P: n/a
Thanks for your response, Erland. Here are the answers to your
questions:

I don't see anything unusual going on in terms of locks within the
database when we experience the symptoms described. There are other
unusual symptoms such as the server reporting that the tran log is
nearly full one moment and then seconds later it registers as being
almost empty. I take this to be just another erroneous message created
as a result of the server panicking.

Besides the "maximum number of users connected" message in the error
log, we do see multiple stack dumps created when we try to stop the SQL
Server. The dump reports an Exception Address at a specific address and
"Access Violation occurred" reading the same address. The stack dumps
go on and on until we force the server to reboot by powering down.

Since the last occurrence I have been running a SQLTrace on the
application that I most suspect. It is a web app running a 24 X 7
e-commerce site. The application runs on BEA weblogic servers. In
particular I'm concerned by the jnet drivers that do strange things like
open transactions that never commit unless they are terminated from the
server side. We have to do that often so that the tran log can be
dumped.

I'm running DBCC checkdb and newalloc on a warm backup of the database
now. They haven't completed yet but the checkdb has reported one
corrupt table so far with error 2506. Do you think it's likely that the
symptoms we are experiencing could be caused by table corruption?

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #3

P: n/a
T Dubya (ti*********@bigfoot.com) writes:
I don't see anything unusual going on in terms of locks within the
database when we experience the symptoms described. There are other
unusual symptoms such as the server reporting that the tran log is
nearly full one moment and then seconds later it registers as being
almost empty. I take this to be just another erroneous message created
as a result of the server panicking.
A server is not a human that start to things whimsly, just because under
pressure. Those messages very likely does indicate something, and there
is a clue further down in your message.

I guess that if the log goes from full to empty, is because the transaction
log is dumped regularly. I mean, you are not running with "truncate log
on checkpoint", are you?
Besides the "maximum number of users connected" message in the error
log, we do see multiple stack dumps created when we try to stop the SQL
Server. The dump reports an Exception Address at a specific address and
"Access Violation occurred" reading the same address. The stack dumps
go on and on until we force the server to reboot by powering down.
By then things have already gone out of control, I guess.
Since the last occurrence I have been running a SQLTrace on the
application that I most suspect. It is a web app running a 24 X 7
e-commerce site. The application runs on BEA weblogic servers. In
particular I'm concerned by the jnet drivers that do strange things like
open transactions that never commit unless they are terminated from the
server side. We have to do that often so that the tran log can be
dumped.
Ah, here is something. Those uncommitted transactions can very well be
part of the misery. There is a terrible gotcha: most client libraries
has a default timeout of 30 seconds. Say that a client call a stored
procedure that begins a transaction. Then this process gets blocked (or
just get stuck with a difficult query), and the timeout expires. The
transaction is *not* rolled back in this situation. SQL Server does not
know about the timeout, it only knows that the client cancelled the
operation. The client must in this situation either disconnect, or submit
a "IF @@trancount > 0 ROLLBACK TRANSACTION". (The latter is preferrable,
since disconnection could be delayed due to connection pooling.)

These open transactions are also likely to be the reason for the translog
filling up. The translog can never be truncated past the oldest open
transaction.
I'm running DBCC checkdb and newalloc on a warm backup of the database
now. They haven't completed yet but the checkdb has reported one
corrupt table so far with error 2506. Do you think it's likely that the
symptoms we are experiencing could be caused by table corruption?


It is not all implausible. Particularly, if this is something that
started to appear all of a sudden. I don't have access to the 6.5 at
work. In the SQL 2000 docs, 2506 is said to be "Could not find a table or
object name '%.*ls' in database '%.*ls'.", which does not look very
good to me. You can search for 2506 in Books Online, to see if there is
a topic that describes how to handle this error.

And, in the end, corruption can very well have its origin in hardware
problems.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #4

P: n/a
Erland Sommarskog (es****@sommarskog.se) writes:
It is not all implausible. Particularly, if this is something that
started to appear all of a sudden. I don't have access to the 6.5 at
work. In the SQL 2000 docs, 2506 is said to be "Could not find a table or
object name '%.*ls' in database '%.*ls'.", which does not look very
good to me. You can search for 2506 in Books Online, to see if there is
a topic that describes how to handle this error.


Now that I at home, I see that 2506 is a completey different error in
2506:

Table Corrupt: The values in adjust table should be in ascending order
starting from the end of the table (page#=%ld row#=%d); check adjust
table in this row

However, the explanation appears to indicate that is not a critical error.
There are instructions in Books Online on how to repair it.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #5

P: n/a
We're running checkdb on the database in question right now and so far
it has revealed only one corrupt table. The problem is that it is the
sysprocedures table and the error is 2503. BOL says that it may be
possible to fix the error by dropping and recreating the index if the
inconsistency in the page linkage is in fact pointing to an index. In
our case we're not so lucky. It's the table that the error is referring
to and not an index. BOL then goes to say that in most cases, you must
recover the database from a known clean backup. Our last clean backup
would be more than 48 hours ago by now and it would be very painful to
fall back that far in time.

Do you think there is any merit to the idea of trying to fix the
corruption in sysprocedures by recreating all of the objects referenced
by sysprocedures - in effect deleting and recreating the entire contents
of sysprocedures? BOL says that sysprocedures has entries for each
view, default, rule, trigger, CHECK constraint, DEFAULT constraint, and
stored procedure.

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #6

P: n/a
T Dubya (ti*********@bigfoot.com) writes:
We're running checkdb on the database in question right now and so far
it has revealed only one corrupt table. The problem is that it is the
sysprocedures table and the error is 2503. BOL says that it may be
possible to fix the error by dropping and recreating the index if the
inconsistency in the page linkage is in fact pointing to an index. In
our case we're not so lucky. It's the table that the error is referring
to and not an index. BOL then goes to say that in most cases, you must
recover the database from a known clean backup. Our last clean backup
would be more than 48 hours ago by now and it would be very painful to
fall back that far in time.
48 hours? You do take transaction log dumps as well, I hope?

Anyway, since this has been going on for a while, that backup is likely
to have the same problem.
Do you think there is any merit to the idea of trying to fix the
corruption in sysprocedures by recreating all of the objects referenced
by sysprocedures - in effect deleting and recreating the entire contents
of sysprocedures? BOL says that sysprocedures has entries for each
view, default, rule, trigger, CHECK constraint, DEFAULT constraint, and
stored procedure.


I would go further than that. I would rebuild the entire database. That is,
bulk out all data, build a new from scripts and load data back. While
dropping all objects would shrink sysprocedures, that may not be enough
to fix the bad page linkage. In fact, that could lead to even more
serious accidents.

The good thing with sysprocedures being the victim is that there is no
original data in that table, so you have not lost anything.

I would also try to see if you can find some other machine that you
can move into. Hardware problems may well be the underlying reason. As
I recall for the box you have had quite modest specs, and am I right to
assume that the plan is to retire it when you move to SQL 2000?

Of course, this operation is likely give you some downtime that in
undesireable for your type of operation. But the behaviour you have no,
appears unbearable as well.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #7

P: n/a
Much has changed since this thread began. We migrated from the original
server which was a Compaq DL380 to a much newer and more powerful
hardware, a Dell 6650. The migration appeared to go fine until a few
hours afterwards when run time 1109 errors began to appear. BOL does
indicate that 1109 is "almost always caused by hardware- or
operating-system-level problems." For that reason we are trying to get
off of the new hardware as fast as we can without falling back in time
if at all possible. We realized that we have a backup created
immediately after the migration to the new hardware was completed and
before we started receiving any errors. Today we were able to load that
backup onto a third server without any errors. We are in the process of
trying to roll forward with tran logs. If we can apply all tran logs
then the plan is to back up the database will all tran logs applied and
see if we can load that backup cleanly too. It that goes well we will
also run DBCCs on it to be sure that it is clean. At that point I think
we would have a copy of the database that we can say we feel comfortable
with using. If we can't get a clean and up-to-date copy of the database
to go forward with, then we will resort to the bcp method for migrating.
We've done it before when moving from an Intel processor to an Alpha box
and then back again to Intel. It takes much longer than the normal dump
and load procedure but at least it won't mean losing any data.

By the way, we did try fixing sysprocedures with the sp_fixindex
procedure as prescribed in BOL and the Microsoft Knowledge Base. It
didn't help since the corruption is in the data pages and not an index.
After that we also tried reverse engineering all of the objects
referenced in sysprocedures, dropping them, and then rebuilding them.
We hit a snag when we got to the last rule referenced in sysprocedures.
It wouldn't let us drop it. I guess it was assoicated with the corrupt
page allocation. I'm hoping that if we have to use the bcp method of
migrating to a new server that the new sysprocedures will be fine since
it will only be repopulated by the DDL that we run to create the data
structures. Thanks again for all your feedback. It's greatly
appreciated.


*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #8

P: n/a
T Dubya (ti*********@bigfoot.com) writes:
Much has changed since this thread began. We migrated from the original
server which was a Compaq DL380 to a much newer and more powerful
hardware, a Dell 6650. The migration appeared to go fine until a few
hours afterwards when run time 1109 errors began to appear. BOL does
indicate that 1109 is "almost always caused by hardware- or
operating-system-level problems." For that reason we are trying to get
off of the new hardware as fast as we can without falling back in time
if at all possible.
If you moved to the new machine by means of backups and trans logs,
I would not rule out that you carried with you corruption caused by
hardware problems on the Compaq machine.

Of course, another possibility is that you did not create the segments in
the right order. But I assume that you did use sp_rev_database to get the
script to create the database on the Dell server?
If we can't get a clean and up-to-date copy of the database
to go forward with, then we will resort to the bcp method for migrating.
We've done it before when moving from an Intel processor to an Alpha box
and then back again to Intel. It takes much longer than the normal dump
and load procedure but at least it won't mean losing any data.


It may indeed take longer time, but in the end it may prove to be
the fastest ways to get things in order anyway.

Oh, so much simpler life is with SQL 2000! No sp_rev_databases, and better
stability all over.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #9

P: n/a
We did you sp_rev_database to get the script to create the new database
on the Dell server. I would agree that it is possible that we carried
corruption to the new server since we did migrate with a dump and tran
logs. In fact, we do know that there was corruption in one user table
at the time of the migration. I opted to fix it on the new server since
it is a much faster box and I had already tested the script to do it.
It was easily fixed by creating a new version of the table and inserting
all of the data into it with a select from the corrupt table. We ran a
dbcc checktable immediately afterwards on the new table which came up
clean. All objects that referenced the corrupt were recompiled to point
to the new table. The corrupt table was renamed with an _corrupt suffix
on it but it was not dropped right away. It wasn't until several hours
later that we started to receive the 1109 error messages which were
caused by attempted updates to yet another user table. It's a table
that we know was not corrupt on the original server because DBCCs showed
that it was clean just before the migration. Another run of DBCCs at
this point revealed the corruption in sysprocedures. It's clear that
the corruption on the second user table and in sysprocedures did not
occur until serveral hours afte we migrated to the Dell server. This
Dell server, by the way, was one that we had many problems with right
out of the box. Dell technicians had to come in and replace much of the
hardware because the server was locking up. After the server was
practically rebuilt, we ran diagnostics on it for several days and then
several more days of test databse loads without any errors. Still, the
BOL comments no the error messages that we have been getting since the
migration do suggest the possibility of hardware problems as the
underlying cause. Getting off this machine with a healthy database is
the trick. By trying to migrate away from our problems on the original
server it seems as if we "jumped from the frying pan into the fire".

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #10

P: n/a
T Dubya (ti*********@bigfoot.com) writes:
We did you sp_rev_database to get the script to create the new database
on the Dell server. I would agree that it is possible that we carried
corruption to the new server since we did migrate with a dump and tran
logs. In fact, we do know that there was corruption in one user table
at the time of the migration. I opted to fix it on the new server since
it is a much faster box and I had already tested the script to do it.
It was easily fixed by creating a new version of the table and inserting
all of the data into it with a select from the corrupt table. We ran a
dbcc checktable immediately afterwards on the new table which came up
clean. All objects that referenced the corrupt were recompiled to point
to the new table. The corrupt table was renamed with an _corrupt suffix
on it but it was not dropped right away.
When you say recompile, you do mean that you reloaded them by dropping
and recreating? sp_recompile on 6.5 when you have renamed the bad
table but not dropped it, is not sufficient to clear out the confusion.
It wasn't until several hours later that we started to receive the 1109
error messages which were caused by attempted updates to yet another
user table. It's a table that we know was not corrupt on the original
server because DBCCs showed that it was clean just before the migration.
Another run of DBCCs at this point revealed the corruption in
sysprocedures. It's clear that the corruption on the second user table
and in sysprocedures did not occur until serveral hours afte we migrated
to the Dell server.


It appears that you have others reasons to look with suspicion at that
Dell server. However, dropping a corrupt object may not always be enough
to save the show. Some errors are not recoverable, for instance 605 and
some other of the errors you experienced. (Although, as if I understand
this correctly, your originally error was 2506, and this is definitely
a recoverable error.)

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #11

P: n/a
By recompile I do mean that we dropped and recreated the dependent
objects. The checktable DBCC run afterwards showed the newly created
and loaded table to be clear of the 2506 error. That error is not so
hard to get rid of.

The corruption in sysprocedures is the only error that we cannot seem to
escape at this time. If we dump the database as it exists now on the
Dell server it finishes fine. When we load that dump with stats into
another database on another server it progresses to the point where it
indicates 100% loaded but the load doesn't actually finish. We don't
see the ususal recovery messages at the end of the load. With every
attempt we have been able to force the load to stop by stopping and
starting the SQL Server. When the server comes back up we can set the
newly loaded database to status = 0 and mode = 0 in sysdatabases and
then stop and start the SQL Server again. Surprisingly enough, the
database then opens and recovers normally! We can access it and there
doesn't appear to be any problems. I admit that it's a really odd way
to load a database but it seems to be working for now.

I understand that the Microsoft documentation states that you must have
a clean database free of corruption before you can migrate to SQL 2000.
Do you think that there is any chance that a database with message 2503
pointing at sysprocedures could be successfully migrated to SQL 2000?
I'm wondering if we would have any better chance of getting the
corruption cleaned up on that version. At least on SQL 2000 we would be
able to get support from Microsoft. Bad idea?

*** Sent via Developersdex http://www.developersdex.com ***
Don't just participate in USENET...get rewarded for it!
Jul 23 '05 #12

P: n/a
T Dubya (ti*********@bigfoot.com) writes:
By recompile I do mean that we dropped and recreated the dependent
objects.
Good! I just wanted to make sure.
The corruption in sysprocedures is the only error that we cannot seem to
escape at this time. If we dump the database as it exists now on the
Dell server it finishes fine. When we load that dump with stats into
another database on another server it progresses to the point where it
indicates 100% loaded but the load doesn't actually finish. We don't
see the ususal recovery messages at the end of the load. With every
attempt we have been able to force the load to stop by stopping and
starting the SQL Server. When the server comes back up we can set the
newly loaded database to status = 0 and mode = 0 in sysdatabases and
then stop and start the SQL Server again. Surprisingly enough, the
database then opens and recovers normally! We can access it and there
doesn't appear to be any problems. I admit that it's a really odd way
to load a database but it seems to be working for now.
Certainly an unusual way. It may not really resolve the problem though.
I understand that the Microsoft documentation states that you must have
a clean database free of corruption before you can migrate to SQL 2000.
Do you think that there is any chance that a database with message 2503
pointing at sysprocedures could be successfully migrated to SQL 2000?
I'm wondering if we would have any better chance of getting the
corruption cleaned up on that version. At least on SQL 2000 we would be
able to get support from Microsoft. Bad idea?


The corruption will not carry over to SQL 2000, as the migration is
bulk out/bulk in-thing. The database structure is completely different
in SQL 2000. And there is no sysprocedures in SQL 2000.

As for how to migrate the database, there is no requirement to use
the tool provided by Microsoft. When we converted databases from 6.5
to SQL 2000, we rolled our own by running scripts and bulk-load. The
one gotcha we ran into was that my colleague insisted on using native
format, and this cause some trouble with column typed as sysname.
(sysname changes from varchar(30) to nvarchar(127).

I think that if you need to cling to 6.5 in production for another
while that you should reload the database from scripts. At least if
you see more problems. And recall that there is a risk that the
corruption spreads so that you suddenly have a user table infected
and lose data.

--
Erland Sommarskog, SQL Server MVP, es****@sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techinf...2000/books.asp
Jul 23 '05 #13

This discussion thread is closed

Replies have been disabled for this discussion.