New Data Collection Monitoring Errors

I installed release 2.2 this morning. Since installing the update, I've received 67 host machine monitoring or SQL Server monitoring error alerts over a 6-hour period for one of the servers. When I look at the Manage Monitored Servers page, everything looks fine. These errors can apparently be pretty transient, and it would probably be good to have some sort of threshold before an alert is triggered, though I'm not sure exactly how that would work. It's still nice to have the alerts, and I'm wondering if you have any ideas about how to determine why I'm seeing so many from one server. It's in a remote location, but there are two other servers there as well and I haven't seen any from them. Thanks.
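
To illustrate the threshold idea, here is a minimal sketch in Python. It is not how SQL Monitor works; the class name, the parameters, and the default values are all hypothetical. The idea is to alert only when a given number of monitoring errors land within a sliding time window, so a single transient error stays quiet.

```python
from collections import deque


class AlertThreshold:
    """Hypothetical helper: signal an alert only after `min_failures`
    errors have been seen within the last `window_seconds`."""

    def __init__(self, min_failures=3, window_seconds=300):
        self.min_failures = min_failures
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps (seconds) of recent errors

    def record_failure(self, timestamp):
        """Record one error; return True if an alert should be raised."""
        self._failures.append(timestamp)
        # Forget errors that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._failures and self._failures[0] < cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.min_failures
```

With these example defaults, three errors inside five minutes would raise an alert, while isolated errors several minutes apart would not.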

cehottle 14 years ago

Comments

11 comments

Hi

It is possible to see the error that causes monitoring to stop by clicking the Show Log link for the relevant server on the Manage Monitored Servers configuration page. However, this only displays the last five or so minutes' worth of logging, so it wouldn't help here.

In most cases the base deployment log files for the time period in question would be the best place to look. These are located at "C:\ProgramData\Red Gate\Logs\SQL Monitor 2" or "C:\Documents and Settings\All Users\Application Data\Red Gate\Logs\SQL Monitor 2" depending on your operating system. If you send them to chris.spencer@red-gate.com I would gladly look through them and see if anything unusual is getting logged.

Regards
Chris

Chris Spencer 14 years ago

We're having the same thing here since the update to 2.2.

Red Gate wants to blame WMI, but there are no WMI issues. We use OpManage to monitor our host servers through WMI and there are no issues at all with it.

Last week I found three jobs that had failed on the servers, and there was no indication of failed jobs in SQL Monitor. I get a string of "data collection errors" about twice a day at no specific times. It's random, and I get it across our various sites on multiple servers, but not all of them.

Anyone else having this issue as well?

rmrussell1970 14 years ago

I ended up having to exclude the server from these two new alerts. I turned on some additional logging at the request of support, and all that showed was that a timeout was occurring. It was the only server out of 15 that was having this issue, but the number of alerts ended up being noise.

cehottle 14 years ago

Hi

We only use WMI to collect the following information.

• Cluster configuration and status
• Total amount of physical memory
• OS version and service pack
• Window process user
• Host name and DNS name of the machine

We mostly use perfmon, and the recent issues appear to be related to this. It is possible to see what the monitoring error is by going to the Monitored Servers page and clicking the Show Log link for the relevant server. This will only show the last 5 minutes of logging, however.

Regards
Chris

Chris Spencer 14 years ago

Hey Chris! How are you? I actually emailed you Friday and this morning on this.

So what would cause me to not get a warning about a failed job?

rmrussell1970 14 years ago

rmrussell1970 wrote:

I actually emailed you Friday and this morning on this.

That's strange - I don't appear to have received any emails. I've double checked and the email address linked above is definitely correct.

rmrussell1970 wrote:

So what would cause me to not get a warning about a failed job?

I'm not 100% sure. The job failed alert is relatively uncomplicated and triggers on seeing a job failure in the job history. We do collect this data and any failed jobs should be present in the SQL Monitor data repository.

This SQL should show any failures:
```
SELECT Utils.TicksToDateTime(CollectionDate) as [CollectionDateTime]
      ,[_Message]
      ,[_RunStatus]
  FROM [RedGateMonitor].[data].[Cluster_SqlServer_Agent_Job_History_Instances]
  WHERE _RunStatus = 0
```
It would be worth checking if a row exists at the specific point of time that your job failed. I could probably cobble together some more complicated SQL that displays the job name etc if that helps?
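
For lining up the query output with the time a job failed, here is a minimal Python sketch of the tick conversion. It assumes that CollectionDate holds .NET-style ticks (100-nanosecond intervals since 0001-01-01 00:00:00), which is an assumption rather than something confirmed in this thread, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: ticks are .NET-style, counted from 0001-01-01 00:00:00.
DOTNET_EPOCH = datetime(1, 1, 1)
TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds


def ticks_to_datetime(ticks):
    """Convert a tick count to a datetime (truncated to microseconds)."""
    return DOTNET_EPOCH + timedelta(microseconds=ticks // 10)


def datetime_to_ticks(dt):
    """Convert a naive datetime to a tick count."""
    delta = dt - DOTNET_EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10
```

If the assumption holds, converting the job's failure time to ticks gives a value to compare against CollectionDate directly. Whether the stored values are UTC or local time is not stated here and would need checking.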

Regards
Chris

Chris Spencer 14 years ago

Oh sorry, it was Chris Kelly. How many Chris's you guys got over there?! We have the same thing with "Daniels". Hahahaha.

Anyway, I'll check on the entry. I do have a lot of entries in the log about failed triggers on all the servers. That's one thing I have been wondering about and asking Chris about.

rmrussell1970 14 years ago

Chris,

Ran the query and I have entries in the tables for the failures. Nothing is showing in the website GUI though. Very strange.

Any ideas?

rmrussell1970 14 years ago

Even more info: in the query results I have entries for jobs that show as having succeeded on the server, and there are multiple entries for them as well, identical right down to the timestamp.

rmrussell1970 14 years ago

I will double check that the _RunStatus field is the one we use to trigger the job failed alerts.

In the meantime I've created some SQL to check the alerts tables:

```
SELECT  alert.AlertId ,
        alert.TargetObject ,
        RedGateMonitor.Utils.TicksToDateTime(severity.Date) AS [SeverityDate]
FROM    [RedGateMonitor].[alert].[Alert] alert
        JOIN [RedGateMonitor].[alert].[Alert_Severity] severity ON alert.AlertId = severity.AlertId
        JOIN [RedGateMonitor].[alert].[Alert_Type] type ON alert.AlertType = type.AlertType
WHERE   type.Name = 'Job failed'
ORDER BY AlertId DESC
```

The SeverityDate column should be the time that the alert is raised as there is usually only one possible severity for Job Failed alerts. It would be interesting to know if there are any records for the minutes after a job failure was reported on the server.

Regards
Chris

Chris Spencer 14 years ago

After I changed the DB name to RedGateMonitor, the query here returned 4 rows. None of these were for the server with the job failures discussed above.

rmrussell1970 14 years ago
