Intermittent Job Failure alert for stopped jobs

fifthracefifthrace Posts: 15
edited April 18, 2013 9:51AM in SQL Monitor Previous Versions
We have a series of SSIS SQL Agent jobs that are running. Every hour another job runs that recycles the SSIS jobs by running the following:
USE msdb ;
GO
EXEC dbo.sp_stop_job N'JOB_NAME' ;
GO

USE msdb ;
GO
EXEC dbo.sp_start_job N'JOB_NAME' ;
GO

Looking at the job history for the SSIS jobs, I see this in the step details:
Executed as user: DOMAIN\User. The step was cancelled (stopped) as the result of a stop job request.

I can look at the history and compare that to my "Job failed" alerts in SQL Monitor and there are numerous examples of the monitoring software not showing a failure. This seems to happen in waves. For instance, one day I'll get a bunch of alerts and the next I won't get any.

Is SQL Monitor supposed to log "stopped" jobs as failures? If so, is there a way to change the default behavior? If not, I'd be happy to supply more information as needed to help troubleshoot this further.

Thanks all,
Robbie

Comments

  • Brian DonahueBrian Donahue Posts: 6,590 Bronze 1
    SQL Monitor records the job_status from the job history. sp_stop_job should end the job gracefully and allow the current jobstep to complete, however, MS warns that if the job was doing some CmdExec or PowerShell command, the outcome could be unpredictable...
    http://msdn.microsoft.com/en-us/library/ms182793.aspx

    This may be the reason for the inconsistent behavior.
  • Thanks for the reply.

    I really don't see anything different between the runs that show up in sql monitor and those that don't from a job status perspective. Could you tell me about how the software queries for this data?

    I'm just looking at the sysjobhistory table and seeing a 3 for run_status. Is the software looking at something different than this for it's information.

    Thanks for all your help,

    Robbie
  • Brian,
    Are you guys taking a look at this any more?

    Thanks,
    Robbie
  • Hi Robbie,

    I apologize for delay in reply.

    SQL Monitor retrieves information from msdb.dbo.sysjobhistory. Then it looks at run_status value and raises an alert for all run_status other than 1. In other words, it alerts you for failed, retry and cancelled jobs.

    Is this the behaviour you are seeing in your environment?

    Regarding default behaviour, unfortunately there is no way to change this behaviour.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Priya,
    Thanks for the reply. If what you are saying is true, then SQL Monitor is misbehaving for me. As stated in my original post, we are stopping the jobs via another job. This should result in a run_status of 3 for the jobs that were stopped. When I use this query I get 188 rows returned:
    select step_name, run_status, run_date, run_time
    from sysjobhistory
    where run_status <> 1 and run_date = '20120822'
    	and step_name <> '(Job outcome)'
    
    SQL Monitor only registered 22 job failures yesterday (22nd of August). I can look at some of the jobs that SQL Monitor is showing as failed and see that the run_status was indeed 3....however, there are many jobs with a run_status = 3 that did not get alerts yesterday.

    When SQL Monitor does report on a run_status = 3, it tends to report on all the jobs that were just set to that status as a result of the last stop jobs run. Most of the time SQL Monitor only reports on jobs that have a run_status = 0 (not sure on 2, since we don't get a lot of those).

    Is there anything in the code that would explain this behavior? Could this be a timing issue of some sort?

    Thanks for your help,
    Robbie
  • Hi Robbie.

    Thanks for the reply.

    Could you please run the following query on Monitor database for the job which is causing problem?
    select count(*) from [SqlMonitorDatabase].[data].[Cluster_SqlServer_Agent_Job_History_Instances_View] where Cluster_Name = '[MachineName]' and Cluster_SqlServer_Name = '[SQLServerInstanceName]' and Cluster_SqlServer_Agent_Job_Name = '[JobName]' and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime > '2012-08-21'
    

    Then run your query (bit modified):
    select  count(*)
    from msdb.dbo.sysjobhistory
    where run_date > '20120821'
       and step_name <> '(Job outcome)' and job_id = 'jobid'
    

    Could you let us know how many rows each query return?

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • I changed your 2nd query a little bit. Changed the > to >=. If I ran your original code I wasn't getting any records from 8/21.
    select count(*)
    from msdb.dbo.sysjobhistory 
    where run_date >= '20120821' 
       and step_name <> '(Job outcome)' and job_id = 'jobid'
    

    Anyway, with the revised code, I got 38 for both queries. So, they seem to match at least.

    I dug into the data a bit. For this particular job there were 37 times that it had a run_status of 3 (stopped) and 1 time run_status of 0 (failed). Both msdb and the SQL Monitor database match on these values.

    I don't know how to dig into the alert.Alert table in the SQL Monitor database, but I did look through the UI. I only received 3 alerts for the job in question.

    Thanks for digging into this with me.

    Robbie
  • Hi Robbie,

    What is your purge setting?

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Basic machine data: 2 weeks
    Windows process data: 1 week
    Basic SQL Server data: 2 weeks
    SQL process data: 1 week
    Top 10 queries data: 1 week
    Database performance counter data: 2 weeks
    Custom metric data: 1 week
    Basic alert data: 1 month
    SQL Trace Data: 3 days
  • Hi Robbie,

    Apologies for delay in reply. Could you please run the following query in your environment?

    Counts the number of unsuccessful job runs in Monitor database since yesterday
    select COUNT(*) from [data].[Cluster_SqlServer_Agent_Job_History_Instances_View] where Cluster_Name = [MachineName] and Cluster_SqlServer_Name = [InstanceName] and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime > = '2012-09-04' and Cluster_SqlServer_Agent_Job_History_RunStatus <> 1
    

    Counts the number of unsuccessful runs in sysjobhistory since yesterday. This is to make sure that we have collected the job history correctly. You may have to verify whether SQL Server enters the date in this table in UTC or local date time. If it is local date time then you may have to change query to accommodate your time zone changes. Everything in Monitor database is in UTC .
    select count(*) from msdb.dbo.sysjobhistory
    where run_status <> 1 and run_date >= '20120904'
       and step_name <> '(Job outcome)'
    

    Will show the number of each type of alerts which has been raised since yesterday grouped by server.
    WITH   target_obj
                  AS ( SELECT  [c].Name AS [clustername] ,
                                             '7:Cluster,1,4:Name,s'
                                             + CAST(LEN([c].[Name]) AS VARCHAR(10)) + ':'
                                             + [c].[Name] + ',' AS to_Cluster ,
                                             ss.Name AS [servername] ,
                                             '9:SqlServer,1,4:Name,s'
                                             + CAST(LEN([ss].[Name]) AS VARCHAR(10)) + ':'
                                             + [ss].[Name] AS to_SQLServer
                           FROM        [settings].[Clusters] AS c
                                             INNER JOIN [settings].[SqlServers] AS ss ON [c].[Id] = [ss].[ParentId]
                         )
           SELECT [t2].[clustername] AS [Cluster] ,
                         t2.[servername] AS [SQLServer] ,
                         ISNULL(cad.[Name], at2.[Name]) AS [Alert Name] ,
                         COUNT(*) AS [Alert Count]
           FROM   [alert].[Alert_Current] AS a
                         INNER JOIN [alert].[Alert_Type] AS at2 ON [a].[AlertType] = [at2].[AlertType]
                         LEFT JOIN [settings].[CustomAlertDefinitions] AS cad ON [a].[SubType] = [cad].[Id]
                         CROSS  APPLY ( SELECT      [t].[clustername] ,
                                                                  [t].[to_Cluster] ,
                                                                  [t].[servername] ,
                                                                  [t].[to_SQLServer] ,
                                                                  [t].[to_Cluster] + [t].[to_SQLServer] AS to_cluster_sqlserver
                                               FROM        [target_obj] AS t
                                               WHERE       LEFT([a].[TargetObject],
                                                                         LEN([t].[to_Cluster]
                                                                               + [t].[to_SQLServer])) COLLATE database_default = [t].[to_Cluster]
                                                                  + [t].[to_SQLServer]
                                               
                                             ) AS t2
           WHERE a.Raised >= utils.DateTimeToTicks('2012/09/04')
           GROUP BY [t2].[clustername] ,
                         t2.[servername] ,
                         ISNULL(cad.[Name], at2.[Name])
           ORDER BY [t2].[clustername] ,
                         t2.[servername] ,
                         ISNULL(cad.[Name], at2.[Name])
    

    Could you please let us know the result of each query?

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Here are the results (I hope I got the UTC stuff figured out right, but can't guarantee it. Either way the number of alerts is quite a bit lower.)
    The number of unsuccessful job runs in Monitor database since 9/4
    390

    The number of unsuccessful runs in sysjobhistory since 9/4
    447

    The number of each type of alerts which has been raised since 9/4 grouped by server. SERVER3 is the one in question here:
    SERVER1 Long-running query 23
    SERVER1 Schema modified - ISDB01 34
    SERVER1 Instance1 Long-running query 18
    SERVER1 Instance1 Schema modified - ISDB01 3
    SERVER1 Instance2 Long-running query 19
    SERVER1 Instance2 Schema modified - ISDB01 13
    SERVER2 Long-running query 3
    SERVER3 Blocked process 27
    SERVER3 Job failed 22
    SERVER3 Long-running query 71
    SERVER3 SQL Server error log entry 2
    SERVER3 SQL Server instance unreachable 1

    SERVER3 Instance1 Long-running query 17
    SERVER4 Deadlock 2
    SERVER4 Long-running query 40
    SERVER4 Monitoring error (SQL Server data collection) 1
    SERVER5 Deadlock 31
    SERVER5 Instance1 Long-running query 3
    SERVER6 Blocked process 1
    SERVER6 Deadlock 1
    SERVER6 Log backup overdue 1
    SERVER6 Long-running query 1
    SERVER6 Instance1 Deadlock 1
    SERVER6 Instance1 Long-running query 15
    SERVER7 Backup overdue 46
    SERVER7 Deadlock 17
    SERVER7 Job failed 8
    SERVER7 Long-running query 3
  • Hello,

    Have you just got one job on this server? Also, is any 'job failure' alert disabled in Alert Settings?

    I notice that you have one 'SQL Server instance unreachable' alert on this server. How long this alert was live for?

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • There are more than 20 jobs running on this particular server.

    The "SQL Server instance unreachable" lasted for 14 seconds and has only occurred one other time for this server. That was on August 7th. I think this is probably a coincidence and not related to this issue at all. The number of job failures seems to be significantly higher than the number of alerts I'm seeing.

    None of the alert settings are disabled from what I can tell. There are a couple jobs that I manually changed the recipient of the email that gets sent out.

    Thanks for continuing to look into this.

    Robbie
  • Hi Robbie,

    I have emailed you directly.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Hi Robbie and Priya,

    We are seeing the same inconsistencies here. Did you happen to get the bottom of this?

    Cheers,
    Del
  • Hi Del,

    We are still investigating this issue. Are you seeing the inconsistencies for any particular type of jobs? Robbie is having issue only with SSIS jobs.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Agreed with SSIS jobs.
    The other jobs that do not use SSIS packages are reporting as failed correctly.
  • Thanks for confirming. I will update this thread as soon as we get to the bottom of the issue.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • Hi Priya,

    Have there been any updates on this yet?
    We are still getting the same issue.
  • Hi Adelenda,

    Apologies for not updating this forum. We have been investigating this with Robbie and he provided us with backup of his repository. We could see that there is an issue but we are not able to reproduce it internally and hence it is proving difficult to fix. We are currently tracking this as SRP-8159. Can I get some more information from you? I would drop you an email now.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
  • This issue is now fixed. The bug fix will be released as part of v 3.4 release. Thanks to Robbie and Achilleas for help.

    Thanks,
    Priya
    Priya Sinha
    Project Manager
    Red Gate Software
Sign In or Register to comment.