New Data Collection Monitoring Errors

cehottle Posts: 38
edited February 28, 2011 2:53PM in SQL Monitor Previous Versions
I installed release 2.2 this morning. Since then I've received 67 host machine monitoring or SQL Server monitoring error alerts over a six-hour period for one of the servers. When I look at the Manage Monitored Servers page, everything looks fine. These errors can apparently be quite transient, so it would probably be good to have some sort of threshold before an alert is triggered, though I'm not sure exactly how that would work. The alerts are still nice to have, and I'm wondering if you have any ideas about how to determine why I'm seeing so many from one server. It's in a remote location, but there are two other servers there as well and I haven't seen any alerts from them. Thanks.

Comments

  • Options
    Hi

    It is possible to see the error that causes monitoring to stop by clicking the Show Log link for the relevant server on the Manage Monitored Servers page. However, this only displays the last five or so minutes' worth of logging, so it wouldn't help here.

    In most cases the base deployment log files for the time period in question would be the best place to look. These are located at "C:\ProgramData\Red Gate\Logs\SQL Monitor 2" or "C:\Documents and Settings\All Users\Application Data\Red Gate\Logs\SQL Monitor 2" depending on your operating system. If you send them to chris.spencer@red-gate.com I would gladly look through them and see if anything unusual is getting logged.

    Regards
    Chris
    Chris Spencer
    Test Engineer
    Red Gate
  • Options
    We're having the same thing here since the update to 2.2.

    Red Gate wants to blame WMI but there are no WMI issues. We use OpManage to monitor our host servers through WMI and there are no issues at all with it.

    Last week I found three jobs that had failed on the servers, and there was no indication of the failed jobs in SQL Monitor. I also get a string of "data collection errors" about twice a day. There are no specific times; it's random, and it happens across our various sites on multiple servers, but not all of them.

    Anyone else having this issue as well?
  • Options
    I ended up having to exclude the server from these two new alerts. I turned on some additional logging at the request of support, and all that showed was that a timeout was occurring. It was the only server out of 15 that was having this issue, but the number of alerts ended up being noise.
  • Options
    Hi

    We only use WMI to collect the following information.

    • Cluster configuration and status
    • Total amount of physical memory
    • OS version and service pack
    • Windows process user
    • Host name and DNS name of the machine

    We mostly use perfmon, and the recent issues appear to be related to this. It is possible to see what the monitoring error is by going to the Monitored Servers page and clicking the Show Log link for the relevant server. However, this will only show the last five minutes of logging.

    Regards
    Chris
    Chris Spencer
    Test Engineer
    Red Gate
  • Options
    Hey Chris! How are you? I actually emailed you Friday and this morning on this.

    So what would cause me to not get a warning about a failed job?
  • Options
    "I actually emailed you Friday and this morning on this."
    That's strange - I don't appear to have received any emails. I've double checked and the email address linked above is definitely correct.

    "So what would cause me to not get a warning about a failed job?"

    I'm not 100% sure. The job failed alert is relatively uncomplicated and triggers on seeing a job failure in the job history. We do collect this data and any failed jobs should be present in the SQL Monitor data repository.

    This SQL should show any failures:
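    -- Job history rows with a run status of 0, which should be the failures.
    -- The function is qualified with the database name so the query runs from any database context.
    SELECT RedGateMonitor.Utils.TicksToDateTime(CollectionDate) AS [CollectionDateTime]
          ,[_Message]
          ,[_RunStatus]
      FROM [RedGateMonitor].[data].[Cluster_SqlServer_Agent_Job_History_Instances]
     WHERE _RunStatus = 0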
    

    It would be worth checking if a row exists at the specific point of time that your job failed. I could probably cobble together some more complicated SQL that displays the job name etc if that helps?
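
    As a rough sketch of that check, the query above can be narrowed to a window around the time the job failed. The @WindowStart and @WindowEnd variables and their values are placeholders, and this assumes that Utils.TicksToDateTime returns a datetime in the same time zone as the times you compare against, which would need confirming:

    ```sql
    -- Placeholder window: replace with the period in which the job failed.
    DECLARE @WindowStart datetime, @WindowEnd datetime;
    SET @WindowStart = '2011-02-25T01:00:00';
    SET @WindowEnd   = '2011-02-25T03:00:00';

    -- Failed job history rows collected inside the window.
    SELECT RedGateMonitor.Utils.TicksToDateTime(CollectionDate) AS [CollectionDateTime]
          ,[_Message]
          ,[_RunStatus]
      FROM [RedGateMonitor].[data].[Cluster_SqlServer_Agent_Job_History_Instances]
     WHERE _RunStatus = 0
       AND RedGateMonitor.Utils.TicksToDateTime(CollectionDate)
           BETWEEN @WindowStart AND @WindowEnd
     ORDER BY CollectionDate;
    ```

    If this returns no rows for a window in which the job definitely failed, the failure was probably never collected; if it does return rows, the data is there and the problem is more likely on the alerting side.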

    Regards
    Chris
    Chris Spencer
    Test Engineer
    Red Gate
  • Options
    Oh sorry, it was Chris Kelly. How many Chris's you guys got over there?! We have the same thing with "Daniels". Hahahaha.

    Anyway, I'll check on the entry. I do have a lot of entries in the log about failed triggers on all the servers. That's one thing I have been wondering about and asking Chris about.
  • Options
    Chris,

    I ran the query and I have entries in the table for the failures. Nothing is showing in the website GUI though. Very strange.

    Any ideas?
  • Options
    Even more info: the query results include entries for jobs that are shown as succeeded on the server, and there are multiple entries for them as well, identical right down to the timestamp.
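
    A grouped version of the query above would make the repeated rows easier to count. This is only a sketch: it uses just the three columns from the earlier query, treats rows as duplicates when all three match, and assumes _Message is a type that GROUP BY accepts (for example nvarchar rather than ntext):

    ```sql
    -- Job history rows that appear more than once with the same
    -- collection time, message and run status.
    SELECT RedGateMonitor.Utils.TicksToDateTime(CollectionDate) AS [CollectionDateTime]
          ,[_Message]
          ,[_RunStatus]
          ,COUNT(*) AS [Occurrences]
      FROM [RedGateMonitor].[data].[Cluster_SqlServer_Agent_Job_History_Instances]
     GROUP BY CollectionDate, [_Message], [_RunStatus]
    HAVING COUNT(*) > 1
     ORDER BY CollectionDate DESC;
    ```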
  • Options
    Chris Spencer Posts: 301
    edited March 1, 2011 11:42AM
    I will double check that the _RunStatus field is the one we use to trigger the job failed alerts.

    In the meantime I've created some SQL to check the alerts tables:
    -- "Job failed" alerts with the time each severity was recorded, newest first.
    SELECT  alert.AlertId ,
            alert.TargetObject ,
            RedGateMonitor.Utils.TicksToDateTime(severity.Date) AS [SeverityDate]
    FROM    [RedGateMonitor].[alert].[Alert] alert
            JOIN [RedGateMonitor].[alert].[Alert_Severity] severity ON alert.AlertId = severity.AlertId
            JOIN [RedGateMonitor].[alert].[Alert_Type] type ON alert.AlertType = type.AlertType
    WHERE   type.Name = 'Job failed'
    ORDER BY alert.AlertId DESC
    

    The SeverityDate column should be the time that the alert is raised as there is usually only one possible severity for Job Failed alerts. It would be interesting to know if there are any records for the minutes after a job failure was reported on the server.
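
    To look only at the minutes after a known failure, the same query can be restricted to a short window. The sketch below is illustrative: @JobFailedAt and the 15-minute window are placeholder values, and it assumes Utils.TicksToDateTime returns a datetime in the same time zone as the failure time you enter:

    ```sql
    -- Placeholder: the time the job failure was reported on the server.
    DECLARE @JobFailedAt datetime;
    SET @JobFailedAt = '2011-02-25T02:00:00';

    -- "Job failed" alerts raised within 15 minutes of that time.
    SELECT  alert.AlertId ,
            alert.TargetObject ,
            RedGateMonitor.Utils.TicksToDateTime(severity.Date) AS [SeverityDate]
    FROM    [RedGateMonitor].[alert].[Alert] alert
            JOIN [RedGateMonitor].[alert].[Alert_Severity] severity ON alert.AlertId = severity.AlertId
            JOIN [RedGateMonitor].[alert].[Alert_Type] type ON alert.AlertType = type.AlertType
    WHERE   type.Name = 'Job failed'
            AND RedGateMonitor.Utils.TicksToDateTime(severity.Date) >= @JobFailedAt
            AND RedGateMonitor.Utils.TicksToDateTime(severity.Date) < DATEADD(MINUTE, 15, @JobFailedAt)
    ORDER BY alert.AlertId DESC;
    ```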

    Regards
    Chris
    Chris Spencer
    Test Engineer
    Red Gate
  • Options
    After I changed the DB name to RedGateMonitor, the query here returned four rows. None of them were for the server with the job failures discussed above.