SQL Server Instance Offline - Really? (v2.1)

PDinCA Posts: 642 Silver 1
edited January 6, 2011 12:41PM in SQL Monitor Previous Versions
I've received three Alerts over the past 48 hours telling me that my cluster instance is OFFLINE.

The first time it happened my colleague was thrown out of SSMS on the cluster itself.

The 2nd time, 5+ hours later, I looked in the SQL Log and in the cluster log under c:\Windows\Cluster\Reports. There were some failures to load DLLs, but they were logged about 7 minutes AFTER the actual alert time. The cluster didn't fail over when the instance went offline.
00001234.00001b74::2010/12/21-19:49:22.415 WARN  [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqclus.dll
00000c5c.00002670::2010/12/21-19:49:22.415 INFO  [RCM] rcm::RcmResType::LoadDll: Got error 126; will attempt to load mqclus.dll via Wow64.
00001250.00001290::2010/12/21-19:49:22.423 WARN  [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqclus.dll
00000c5c.00002670::2010/12/21-19:49:22.423 WARN  [RCM] Failed to load restype MSMQ: error 126.
00001234.00001b74::2010/12/21-19:49:22.481 WARN  [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqtgclus.dll
00000c5c.00002670::2010/12/21-19:49:22.481 INFO  [RCM] rcm::RcmResType::LoadDll: Got error 126; will attempt to load mqtgclus.dll via Wow64.
00001250.00001290::2010/12/21-19:49:22.483 WARN  [RHS] ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqtgclus.dll
00000c5c.00002670::2010/12/21-19:49:22.483 WARN  [RCM] Failed to load restype MSMQTriggers: error 126.
The EVENTLOG for MSSQLSERVER shows only SSPI errors over the last 24 hours, during which there was one ALLEGED incident where the instance went offline.

Aside from the first occurrence, where a colleague was witness to "something", I can find no trace in the Event Log or in the SQL Log within SSMS of any such "instance offline" actually occurring, which deeply concerns me.

Am I looking for "evidence" in the wrong places? Did the instance actually go offline? "Prove it!" is my position right now.

Base Monitor, Repository and Web Server are on a separate box from the cluster, because, as you know, we can't install SQL Monitor ON the cluster itself. The 2 boxes are on the same domain and use the domain DBAdmin account, with sysadmin privileges, for the SQL connection. Both are 64-bit WS2008 SP2. Both are on SS2005 Enterprise SP3.

Ideas/diagnosis/reassurance/"we have a bug"...? I need something to help me because there are no diagnostics that I can see in SQL Monitor and, as already stated, nothing in the obvious places.
Jesus Christ: Lunatic, liar or Lord?
Decide wisely...
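
One way to check whether cluster log entries line up with an alert is to parse their timestamps and keep only those within a window around the alert time. The sketch below does this in Python for the line format shown in the excerpt above. The function names, the alert time, and the window sizes are illustrative assumptions, and it presumes the log and the alert use the same time zone.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the excerpt in the post, e.g.
# 00001234.00001b74::2010/12/21-19:49:22.415 WARN  [RHS] ERROR_MOD_NOT_FOUND(126), ...
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]{8})\.(?P<tid>[0-9a-f]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
    return ts, m.group("level"), m.group("msg")


def entries_near(lines, alert_time, window):
    """Return parsed entries whose timestamp is within `window` of `alert_time`."""
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and abs(parsed[0] - alert_time) <= window:
            found.append(parsed)
    return found


if __name__ == "__main__":
    sample = [
        "00001234.00001b74::2010/12/21-19:49:22.415 WARN  [RHS] "
        "ERROR_MOD_NOT_FOUND(126), unable to load resource DLL mqclus.dll",
        "00000c5c.00002670::2010/12/21-19:49:22.423 WARN  [RCM] "
        "Failed to load restype MSMQ: error 126.",
    ]
    # Hypothetical alert time, roughly 7 minutes before the logged warnings.
    alert = datetime(2010, 12, 21, 19, 42, 22)
    for ts, level, msg in entries_near(sample, alert, timedelta(minutes=10)):
        print(ts, level, msg)
```

With a narrow window (say 5 minutes) the MSMQ warnings fall outside the match, which is one way to confirm that they are probably unrelated to the alert itself.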

Comments

  • Hi,

    The SQL Server offline alert is raised if SQL Server doesn't respond to ping. We try, I think, 15 pings with a 1s timeout, or something similar. If SQL Server doesn't respond, SQL Monitor raises the offline alert. So even if the SQL Server service is still running, you can get this alert if SQL Server stops responding for some reason.

    So, as you noticed, your colleague was disconnected from SSMS. I would assume that SQL Server was not responding to pings and hence this alert was raised.

    Hope this makes sense.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • PDinCA Posts: 642 Silver 1
    The 1st instance made sense, but that doesn't help with the 2nd and 3rd, where SQL Server noted nothing and the only thing saying anything was offline was SQL Monitor. Where else can I look, please, for evidence that the 2nd and 3rd were real outages?
  • Hi,

    Could you please look at the SQL Monitor log files on the Base Monitor machine, or send them to me at priya.sinha@red-gate.com? Around the time when these alerts were raised, there should be some timeouts or connection errors in the log files.

    The location is C:\ProgramData\Red Gate\Logs\SQL Monitor 2 or C:\Documents and Settings\All Users\Application Data\Red Gate\Logs\SQL Monitor 2.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • PDinCA Posts: 642 Silver 1
    Email has been sent with a set of 5 Base Monitor log files. The Web Log files contained just informational lines and very few of them, so I omitted sending them.

    Thanks.
  • Thanks, we have received the log files and are looking into them. We will update you as soon as possible.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • The instance offline alert indicates that either the service stopped or that there was some sort of connectivity issue between the instance and the Base Monitor. Given that one of your users experienced a connectivity issue at the same time as one of the alerts was raised, and the instance didn't restart or fail over, I would suggest that some sort of connectivity issue was experienced on your network.

    This could have been caused by many things, e.g. a peak in network traffic increasing latency or causing packet loss, a switch or router dropping traffic, etc.

    None of these would have left any trace in the SQL Server log files.

    I have inspected the SQL Monitor log files and can't find anything which would indicate a false positive alert.

    Regards,
    --
    Daniel
  • PDinCA Posts: 642 Silver 1
    Thanks for examining the files.

    I'm still left with the fundamental issue of SQL Monitor issuing an "Instance Offline" alert when, as you pointed out, the cluster didn't fail over, so the instance was clearly not "offline" in any way that affected the Production website...

    I'm tempted to simply DISABLE the Instance Offline Alert as it is very much "noise" for every one of the 11 times it has occurred over the last 54 hours.

    The Base Monitor and Cluster being monitored are hosted 2,500 miles away. Do you suggest I have the Hosting Company look into why the 2 boxes appear to lose each other? Out of interest, I have a T-SQL Job running a Stored Procedure locally that pulls data every minute over a Linked Server from the Cluster DB to a Demo DB on the same box as the Base Monitor - it NEVER has a connectivity issue although it typically runs for 45 to 50 seconds of every minute. Care to comment?

    Re the Log Files I sent:
    I noticed there were a LOT of "ERROR" messages recorded. Please comment on them, as an ERROR is significant to me and I'd like to be assured that the Monitor is not missing something I need to be aware of.
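
The retry-based availability check described in the first reply (roughly 15 attempts with a 1s timeout) can be sketched as follows. This is not SQL Monitor's actual implementation, which the thread does not show. It swaps the "ping" for a plain TCP connection attempt and assumes the target is reported offline only when every attempt fails. The host name and port in the usage note are hypothetical.

```python
import socket


def tcp_probe(host, port, timeout_s):
    """Return True if a TCP connection to host:port opens within timeout_s seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


def is_offline(probe, attempts=15):
    """Assumed rule: report offline only if all `attempts` probes fail.

    `probe` is any zero-argument callable returning True on success.
    """
    for _ in range(attempts):
        if probe():
            return False
    return True
```

A call such as `is_offline(lambda: tcp_probe("sqlcluster.example.local", 1433, 1.0))` would then take about 15 seconds to declare the instance offline. Under a rule like this, a brief network interruption between the Base Monitor and the cluster that outlasts the retry window can raise the alert even though the SQL Server service never stopped.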