SQL Server Instance Offline - Really? (v2.1)
PDinCA
I've received three Alerts over the past 48 hours telling me that my cluster instance is OFFLINE.
The first time it happened my colleague was thrown out of SSMS on the cluster itself.
The second time, 5+ hours later, I looked in the SQL Log and the Cluster file under c:\Windows\Cluster\Reports and found some failures to load DLLs, but they were about 7 minutes AFTER the actual alert time. The cluster didn't fail over when the instance went offline.
00001234.00001b74::2010/12/21-19:49:22.415 WARN [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqclus.dll
00000c5c.00002670::2010/12/21-19:49:22.415 INFO [RCM] rcm::RcmResType::LoadDll: Got error 126; will attempt to load mqclus.dll via Wow64.
00001250.00001290::2010/12/21-19:49:22.423 WARN [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqclus.dll
00000c5c.00002670::2010/12/21-19:49:22.423 WARN [RCM] Failed to load restype MSMQ: error 126.
00001234.00001b74::2010/12/21-19:49:22.481 WARN [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqtgclus.dll
00000c5c.00002670::2010/12/21-19:49:22.481 INFO [RCM] rcm::RcmResType::LoadDll: Got error 126; will attempt to load mqtgclus.dll via Wow64.
00001250.00001290::2010/12/21-19:49:22.483 WARN [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqtgclus.dll
00000c5c.00002670::2010/12/21-19:49:22.483 WARN [RCM] Failed to load restype MSMQTriggers: error 126.

The EVENTLOG for MSSQLSERVER shows only SSPI errors over the last 24 hours, during which there was one ALLEGED incident where the instance went offline.
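To check whether anything in the cluster log falls close to an alert, the entries can be filtered by timestamp. The following is only a rough Python sketch: the regular expression is inferred from the eight entries quoted above and may not cover every line in a real cluster log, and the function names and the alert time in the usage note are hypothetical. The alert time and the log timestamps need to be in the same time zone before they are compared.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the entries quoted in this post (an assumption):
# <pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [COMPONENT] message
ENTRY = re.compile(
    r"(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<component>[A-Za-z]+)\]\s+(?P<message>.*)"
)


def parse_entries(lines):
    """Yield (timestamp, level, component, message) for lines that match ENTRY."""
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            yield stamp, match.group("level"), match.group("component"), match.group("message")


def entries_near(lines, alert_time, window_minutes=10):
    """Return parsed entries whose timestamp is within the window around alert_time."""
    window = timedelta(minutes=window_minutes)
    return [entry for entry in parse_entries(lines) if abs(entry[0] - alert_time) <= window]
```

Calling `entries_near(open(path_to_cluster_log), datetime(2010, 12, 21, 19, 42), window_minutes=10)` with a hypothetical alert time would list everything logged within ten minutes either side of it. An empty result for every alert would support the view that the cluster itself saw nothing.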
Aside from the first occurrence, where a colleague was witness to "something", I can find no trace in the Event Log or the SQL Log within SSMS of any such "instance offline" actually occurring, which deeply concerns me.
Am I looking for "evidence" in the wrong places? Did the instance actually go offline? "Prove it!" is my position right now.
Base Monitor, Repository and Web Server are on a separate box from the cluster, because, as you know, we can't install SQL Monitor ON the cluster itself. The 2 boxes are on the same domain and use the domain DBAdmin account, with sysadmin privileges, for the SQL connection. Both are 64-bit WS2008 SP2. Both are on SS2005 Enterprise SP3.
Ideas/diagnosis/reassurance/"we have a bug"...? I need something to help me because there are no diagnostics that I can see in SQL Monitor and, as already stated, nothing in the obvious places.
Jesus Christ: Lunatic, liar or Lord?
Decide wisely...
Comments
The SQL Server offline alert is raised when SQL Server doesn't respond to a ping. We try, I think, 15 pings with a 1-second timeout, or something similar. If SQL Server doesn't respond, SQL Monitor raises the offline alert. So even if the SQL Server service is still running, you can get this alert if, for whatever reason, SQL Server stops responding.
So, as you noticed your colleague was disconnected from SSMS. I would assume that SQL Server was not responding to pings and hence this alert got raised.
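To make the retry idea concrete, here is a minimal Python sketch of that kind of check. It is not SQL Monitor's actual code: the function names are hypothetical, the thread does not say what kind of "ping" is used (a TCP connection to port 1433 stands in for it here), and treating the instance as offline only when every attempt fails is an assumption based on the description above.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int = 1433, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Port 1433 and the use of TCP are assumptions made for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def is_offline(probe: Callable[[], bool], attempts: int = 15, pause: float = 0.0) -> bool:
    """Report offline only if every one of `attempts` probes fails (assumed rule)."""
    for _ in range(attempts):
        if probe():
            return False  # a single reply is enough to count as online
        time.sleep(pause)
    return True
```

With 15 attempts and a 1-second timeout, roughly 15 seconds without any reply would be enough to trigger an alert under these assumptions, whether the cause is the SQL Server itself or the network path in between.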
Hope this makes sense.
Thanks,
Priya
Project Manager
Red Gate Software
Could you please look at the SQL Monitor log files on the Base Monitor machine, or send them to me at priya.sinha@red-gate.com? Around the time these alerts were raised, there should be some timeouts or connection errors in the log files.
The location is C:\ProgramData\Red Gate\Logs\SQL Monitor 2 or C:\Documents and Settings\All Users\Application Data\Red Gate\Logs\SQL Monitor 2.
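As a first pass over those files, a simple keyword scan can pull out candidate lines. This Python sketch is format-agnostic because the layout of the SQL Monitor log files is not described in this thread; the keyword list, the `*.log` file pattern and the function names are all assumptions chosen for illustration.

```python
from pathlib import Path

# Assumed keywords; adjust to whatever the real log files contain.
KEYWORDS = ("timeout", "timed out", "connection", "error")


def matching_lines(lines, keywords=KEYWORDS):
    """Yield (line_number, text) for lines containing any keyword, ignoring case."""
    lowered = tuple(keyword.lower() for keyword in keywords)
    for number, text in enumerate(lines, start=1):
        if any(keyword in text.lower() for keyword in lowered):
            yield number, text.rstrip("\n")


def scan_directory(directory, pattern="*.log"):
    """Yield (file_name, line_number, text) for matches in files under `directory`.

    The "*.log" pattern is an assumption about how the files are named.
    """
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, text in matching_lines(handle):
                yield path.name, number, text
```

For example, `for hit in scan_directory(r"C:\ProgramData\Red Gate\Logs\SQL Monitor 2"): print(hit)` would print each match, which can then be compared by eye against the alert times.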
Thanks,
Priya
Project Manager
Red Gate Software
Thanks.
Decide wisely...
This could have been caused by many things, e.g. a peak in network traffic increasing latency or causing packet loss, a switch or router dropping traffic, etc.
None of these would have left any trace in the SQL Server log files.
I have inspected the SQL Monitor log files and can't find anything which would indicate a false positive alert.
Regards,
--
Daniel
I'm still left with the fundamental issue of SQL Monitor issuing an "Instance Offline" when, as you pointed out, the cluster didn't fail over, therefore the instance was clearly not "offline" in any way that affected the Production website...
I'm tempted to simply DISABLE the Instance Offline Alert as it is very much "noise" for every one of the 11 times it has occurred over the last 54 hours.
The Base Monitor and Cluster being monitored are hosted 2,500 miles away. Do you suggest I have the Hosting Company look into why the 2 boxes appear to lose each other? Out of interest, I have a T-SQL Job running a Stored Procedure locally that pulls data every minute over a Linked Server from the Cluster DB to a Demo DB on the same box as the Base Monitor - it NEVER has a connectivity issue although it typically runs for 45 to 50 seconds of every minute. Care to comment?
Re the Log Files I sent:
I noticed there were a LOT of "ERROR" messages recorded. Please comment on them as an ERROR to me is significant and I'd like to be assured that the Monitor is not missing something I need to be aware of.
Decide wisely...