Intermittent Job Failure alert for stopped jobs

fifthrace · August 8, 2012 2:46PM

We have a series of SSIS SQL Agent jobs that are running. Every hour another job runs that recycles the SSIS jobs by running the following:

USE msdb ;
GO
EXEC dbo.sp_stop_job N'JOB_NAME' ;
GO

USE msdb ;
GO
EXEC dbo.sp_start_job N'JOB_NAME' ;
GO

Looking at the job history for the SSIS jobs, I see this in the step details:
Executed as user: DOMAIN\User. The step was cancelled (stopped) as the result of a stop job request.

I can look at the history and compare that to my "Job failed" alerts in SQL Monitor and there are numerous examples of the monitoring software not showing a failure. This seems to happen in waves. For instance, one day I'll get a bunch of alerts and the next I won't get any.

Is SQL Monitor supposed to log "stopped" jobs as failures? If so, is there a way to change the default behavior? If not, I'd be happy to supply more information as needed to help troubleshoot this further.

Thanks all,
Robbie

Brian Donahue · August 10, 2012 11:33AM

SQL Monitor records the job_status from the job history. sp_stop_job should end the job gracefully and allow the current jobstep to complete, however, MS warns that if the job was doing some CmdExec or PowerShell command, the outcome could be unpredictable...
http://msdn.microsoft.com/en-us/library/ms182793.aspx

This may be the reason for the inconsistent behavior.

fifthrace · August 10, 2012 12:55PM

Thanks for the reply.

I really don't see anything different between the runs that show up in sql monitor and those that don't from a job status perspective. Could you tell me about how the software queries for this data?

I'm just looking at the sysjobhistory table and seeing a 3 for run_status. Is the software looking at something different than this for it's information.

Thanks for all your help,

Robbie

fifthrace · August 17, 2012 9:01AM

Brian,
Are you guys taking a look at this any more?

Thanks,
Robbie

priyasinha · August 23, 2012 11:23AM

Hi Robbie,

I apologize for delay in reply.

SQL Monitor retrieves information from msdb.dbo.sysjobhistory. Then it looks at run_status value and raises an alert for all run_status other than 1. In other words, it alerts you for failed, retry and cancelled jobs.

Is this the behaviour you are seeing in your environment?

Regarding default behaviour, unfortunately there is no way to change this behaviour.

Thanks,
Priya

fifthrace · August 23, 2012 2:08PM

Priya,
Thanks for the reply. If what you are saying is true, then SQL Monitor is misbehaving for me. As stated in my original post, we are stopping the jobs via another job. This should result in a run_status of 3 for the jobs that were stopped. When I use this query I get 188 rows returned:

select step_name, run_status, run_date, run_time
from sysjobhistory
where run_status &lt;&gt; 1 and run_date = '20120822'
	and step_name &lt;&gt; '&#40;Job outcome&#41;'

SQL Monitor only registered 22 job failures yesterday (22nd of August). I can look at some of the jobs that SQL Monitor is showing as failed and see that the run_status was indeed 3....however, there are many jobs with a run_status = 3 that did not get alerts yesterday.

When SQL Monitor does report on a run_status = 3, it tends to report on all the jobs that were just set to that status as a result of the last stop jobs run. Most of the time SQL Monitor only reports on jobs that have a run_status = 0 (not sure on 2, since we don't get a lot of those).

Is there anything in the code that would explain this behavior? Could this be a timing issue of some sort?

Thanks for your help,
Robbie

priyasinha · August 24, 2012 9:35AM

Hi Robbie.

Thanks for the reply.

Could you please run the following query on Monitor database for the job which is causing problem?

select count&#40;*&#41; from &#91;SqlMonitorDatabase&#93;.&#91;data&#93;.&#91;Cluster_SqlServer_Agent_Job_History_Instances_View&#93; where Cluster_Name = '&#91;MachineName&#93;' and Cluster_SqlServer_Name = '&#91;SQLServerInstanceName&#93;' and Cluster_SqlServer_Agent_Job_Name = '&#91;JobName&#93;' and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime &gt; '2012-08-21'

Then run your query (bit modified):

select  count&#40;*&#41;
from msdb.dbo.sysjobhistory
where run_date &gt; '20120821'
   and step_name &lt;&gt; '&#40;Job outcome&#41;' and job_id = 'jobid'

Could you let us know how many rows each query return?

Thanks,
Priya

fifthrace · August 24, 2012 10:11AM

I changed your 2nd query a little bit. Changed the > to >=. If I ran your original code I wasn't getting any records from 8/21.

select count&#40;*&#41;
from msdb.dbo.sysjobhistory 
where run_date &gt;= '20120821' 
   and step_name &lt;&gt; '&#40;Job outcome&#41;' and job_id = 'jobid'

Anyway, with the revised code, I got 38 for both queries. So, they seem to match at least.

I dug into the data a bit. For this particular job there were 37 times that it had a run_status of 3 (stopped) and 1 time run_status of 0 (failed). Both msdb and the SQL Monitor database match on these values.

I don't know how to dig into the alert.Alert table in the SQL Monitor database, but I did look through the UI. I only received 3 alerts for the job in question.

Thanks for digging into this with me.

Robbie

priyasinha · August 24, 2012 10:39AM

Hi Robbie,

What is your purge setting?

Thanks,
Priya

fifthrace · August 24, 2012 10:42AM

Basic machine data: 2 weeks
Windows process data: 1 week
Basic SQL Server data: 2 weeks
SQL process data: 1 week
Top 10 queries data: 1 week
Database performance counter data: 2 weeks
Custom metric data: 1 week
Basic alert data: 1 month
SQL Trace Data: 3 days

priyasinha · September 5, 2012 10:44AM

Hi Robbie,

Apologies for delay in reply. Could you please run the following query in your environment?

Counts the number of unsuccessful job runs in Monitor database since yesterday

select COUNT&#40;*&#41; from &#91;data&#93;.&#91;Cluster_SqlServer_Agent_Job_History_Instances_View&#93; where Cluster_Name = &#91;MachineName&#93; and Cluster_SqlServer_Name = &#91;InstanceName&#93; and Cluster_SqlServer_Agent_Job_History_RunDate_DateTime &gt; = '2012-09-04' and Cluster_SqlServer_Agent_Job_History_RunStatus &lt;&gt; 1

Counts the number of unsuccessful runs in sysjobhistory since yesterday. This is to make sure that we have collected the job history correctly. You may have to verify whether SQL Server enters the date in this table in UTC or local date time. If it is local date time then you may have to change query to accommodate your time zone changes. Everything in Monitor database is in UTC .

select count&#40;*&#41; from msdb.dbo.sysjobhistory
where run_status &lt;&gt; 1 and run_date &gt;= '20120904'
   and step_name &lt;&gt; '&#40;Job outcome&#41;'

Will show the number of each type of alerts which has been raised since yesterday grouped by server.

WITH   target_obj
              AS &#40; SELECT  &#91;c&#93;.Name AS &#91;clustername&#93; ,
                                         '7:Cluster,1,4:Name,s'
                                         + CAST&#40;LEN&#40;&#91;c&#93;.&#91;Name&#93;&#41; AS VARCHAR&#40;10&#41;&#41; + ':'
                                         + &#91;c&#93;.&#91;Name&#93; + ',' AS to_Cluster ,
                                         ss.Name AS &#91;servername&#93; ,
                                         '9:SqlServer,1,4:Name,s'
                                         + CAST&#40;LEN&#40;&#91;ss&#93;.&#91;Name&#93;&#41; AS VARCHAR&#40;10&#41;&#41; + ':'
                                         + &#91;ss&#93;.&#91;Name&#93; AS to_SQLServer
                       FROM        &#91;settings&#93;.&#91;Clusters&#93; AS c
                                         INNER JOIN &#91;settings&#93;.&#91;SqlServers&#93; AS ss ON &#91;c&#93;.&#91;Id&#93; = &#91;ss&#93;.&#91;ParentId&#93;
                     &#41;
       SELECT &#91;t2&#93;.&#91;clustername&#93; AS &#91;Cluster&#93; ,
                     t2.&#91;servername&#93; AS &#91;SQLServer&#93; ,
                     ISNULL&#40;cad.&#91;Name&#93;, at2.&#91;Name&#93;&#41; AS &#91;Alert Name&#93; ,
                     COUNT&#40;*&#41; AS &#91;Alert Count&#93;
       FROM   &#91;alert&#93;.&#91;Alert_Current&#93; AS a
                     INNER JOIN &#91;alert&#93;.&#91;Alert_Type&#93; AS at2 ON &#91;a&#93;.&#91;AlertType&#93; = &#91;at2&#93;.&#91;AlertType&#93;
                     LEFT JOIN &#91;settings&#93;.&#91;CustomAlertDefinitions&#93; AS cad ON &#91;a&#93;.&#91;SubType&#93; = &#91;cad&#93;.&#91;Id&#93;
                     CROSS  APPLY &#40; SELECT      &#91;t&#93;.&#91;clustername&#93; ,
                                                              &#91;t&#93;.&#91;to_Cluster&#93; ,
                                                              &#91;t&#93;.&#91;servername&#93; ,
                                                              &#91;t&#93;.&#91;to_SQLServer&#93; ,
                                                              &#91;t&#93;.&#91;to_Cluster&#93; + &#91;t&#93;.&#91;to_SQLServer&#93; AS to_cluster_sqlserver
                                           FROM        &#91;target_obj&#93; AS t
                                           WHERE       LEFT&#40;&#91;a&#93;.&#91;TargetObject&#93;,
                                                                     LEN&#40;&#91;t&#93;.&#91;to_Cluster&#93;
                                                                           + &#91;t&#93;.&#91;to_SQLServer&#93;&#41;&#41; COLLATE database_default = &#91;t&#93;.&#91;to_Cluster&#93;
                                                              + &#91;t&#93;.&#91;to_SQLServer&#93;
                                           
                                         &#41; AS t2
       WHERE a.Raised &gt;= utils.DateTimeToTicks&#40;'2012/09/04'&#41;
       GROUP BY &#91;t2&#93;.&#91;clustername&#93; ,
                     t2.&#91;servername&#93; ,
                     ISNULL&#40;cad.&#91;Name&#93;, at2.&#91;Name&#93;&#41;
       ORDER BY &#91;t2&#93;.&#91;clustername&#93; ,
                     t2.&#91;servername&#93; ,
                     ISNULL&#40;cad.&#91;Name&#93;, at2.&#91;Name&#93;&#41;

Could you please let us know the result of each query?

Thanks,
Priya

fifthrace · September 6, 2012 9:48AM

Here are the results (I hope I got the UTC stuff figured out right, but can't guarantee it. Either way the number of alerts is quite a bit lower.)
The number of unsuccessful job runs in Monitor database since 9/4
390

The number of unsuccessful runs in sysjobhistory since 9/4
447

The number of each type of alerts which has been raised since 9/4 grouped by server. SERVER3 is the one in question here:
SERVER1 Long-running query 23
SERVER1 Schema modified - ISDB01 34
SERVER1 Instance1 Long-running query 18
SERVER1 Instance1 Schema modified - ISDB01 3
SERVER1 Instance2 Long-running query 19
SERVER1 Instance2 Schema modified - ISDB01 13
SERVER2 Long-running query 3
SERVER3 Blocked process 27
SERVER3 Job failed 22
SERVER3 Long-running query 71
SERVER3 SQL Server error log entry 2
SERVER3 SQL Server instance unreachable 1
SERVER3 Instance1 Long-running query 17
SERVER4 Deadlock 2
SERVER4 Long-running query 40
SERVER4 Monitoring error (SQL Server data collection) 1
SERVER5 Deadlock 31
SERVER5 Instance1 Long-running query 3
SERVER6 Blocked process 1
SERVER6 Deadlock 1
SERVER6 Log backup overdue 1
SERVER6 Long-running query 1
SERVER6 Instance1 Deadlock 1
SERVER6 Instance1 Long-running query 15
SERVER7 Backup overdue 46
SERVER7 Deadlock 17
SERVER7 Job failed 8
SERVER7 Long-running query 3

priyasinha · September 6, 2012 11:45AM

Hello,

Have you just got one job on this server? Also, is any 'job failure' alert disabled in Alert Settings?

I notice that you have one 'SQL Server instance unreachable' alert on this server. How long this alert was live for?

Thanks,
Priya

fifthrace · September 6, 2012 1:39PM

There are more than 20 jobs running on this particular server.

The "SQL Server instance unreachable" lasted for 14 seconds and has only occurred one other time for this server. That was on August 7th. I think this is probably a coincidence and not related to this issue at all. The number of job failures seems to be significantly higher than the number of alerts I'm seeing.

None of the alert settings are disabled from what I can tell. There are a couple jobs that I manually changed the recipient of the email that gets sent out.

Thanks for continuing to look into this.

Robbie

priyasinha · September 7, 2012 10:32AM

Hi Robbie,

I have emailed you directly.

Thanks,
Priya

adelenda · September 20, 2012 5:33AM

Hi Robbie and Priya,

We are seeing the same inconsistencies here. Did you happen to get the bottom of this?

Cheers,
Del

priyasinha · September 20, 2012 6:15AM

Hi Del,

We are still investigating this issue. Are you seeing the inconsistencies for any particular type of jobs? Robbie is having issue only with SSIS jobs.

Thanks,
Priya

adelenda · September 20, 2012 6:22AM

Agreed with SSIS jobs.
The other jobs that do not use SSIS packages are reporting as failed correctly.

priyasinha · September 20, 2012 7:47AM

Thanks for confirming. I will update this thread as soon as we get to the bottom of the issue.

Thanks,
Priya

adelenda · December 19, 2012 6:06AM

Hi Priya,

Have there been any updates on this yet?
We are still getting the same issue.

priyasinha · December 20, 2012 5:08AM

Hi Adelenda,

Apologies for not updating this forum. We have been investigating this with Robbie and he provided us with backup of his repository. We could see that there is an issue but we are not able to reproduce it internally and hence it is proving difficult to fix. We are currently tracking this as SRP-8159. Can I get some more information from you? I would drop you an email now.

Thanks,
Priya

priyasinha · April 18, 2013 9:51AM

This issue is now fixed. The bug fix will be released as part of v 3.4 release. Thanks to Robbie and Achilleas for help.

Thanks,
Priya

Intermittent Job Failure alert for stopped jobs

Comments

Product Learning

Community Forums

Events & Friends

Simple Talk