Intermittent Job Failure alert for stopped jobs
fifthrace
Posts: 15
We have a series of SSIS SQL Agent jobs that are running. Every hour another job runs that recycles the SSIS jobs by running the following:
Looking at the job history for the SSIS jobs, I see this in the step details:
Executed as user: DOMAIN\User. The step was cancelled (stopped) as the result of a stop job request.
I can look at the history and compare that to my "Job failed" alerts in SQL Monitor and there are numerous examples of the monitoring software not showing a failure. This seems to happen in waves. For instance, one day I'll get a bunch of alerts and the next I won't get any.
Is SQL Monitor supposed to log "stopped" jobs as failures? If so, is there a way to change the default behavior? If not, I'd be happy to supply more information as needed to help troubleshoot this further.
Thanks all,
Robbie
USE msdb ; GO EXEC dbo.sp_stop_job N'JOB_NAME' ; GO USE msdb ; GO EXEC dbo.sp_start_job N'JOB_NAME' ; GO
Looking at the job history for the SSIS jobs, I see this in the step details:
Executed as user: DOMAIN\User. The step was cancelled (stopped) as the result of a stop job request.
I can look at the history and compare that to my "Job failed" alerts in SQL Monitor and there are numerous examples of the monitoring software not showing a failure. This seems to happen in waves. For instance, one day I'll get a bunch of alerts and the next I won't get any.
Is SQL Monitor supposed to log "stopped" jobs as failures? If so, is there a way to change the default behavior? If not, I'd be happy to supply more information as needed to help troubleshoot this further.
Thanks all,
Robbie
Comments
http://msdn.microsoft.com/en-us/library/ms182793.aspx
This may be the reason for the inconsistent behavior.
I really don't see anything different between the runs that show up in sql monitor and those that don't from a job status perspective. Could you tell me about how the software queries for this data?
I'm just looking at the sysjobhistory table and seeing a 3 for run_status. Is the software looking at something different than this for it's information.
Thanks for all your help,
Robbie
Are you guys taking a look at this any more?
Thanks,
Robbie
I apologize for delay in reply.
SQL Monitor retrieves information from msdb.dbo.sysjobhistory. Then it looks at run_status value and raises an alert for all run_status other than 1. In other words, it alerts you for failed, retry and cancelled jobs.
Is this the behaviour you are seeing in your environment?
Regarding default behaviour, unfortunately there is no way to change this behaviour.
Thanks,
Priya
Project Manager
Red Gate Software
Thanks for the reply. If what you are saying is true, then SQL Monitor is misbehaving for me. As stated in my original post, we are stopping the jobs via another job. This should result in a run_status of 3 for the jobs that were stopped. When I use this query I get 188 rows returned: SQL Monitor only registered 22 job failures yesterday (22nd of August). I can look at some of the jobs that SQL Monitor is showing as failed and see that the run_status was indeed 3....however, there are many jobs with a run_status = 3 that did not get alerts yesterday.
When SQL Monitor does report on a run_status = 3, it tends to report on all the jobs that were just set to that status as a result of the last stop jobs run. Most of the time SQL Monitor only reports on jobs that have a run_status = 0 (not sure on 2, since we don't get a lot of those).
Is there anything in the code that would explain this behavior? Could this be a timing issue of some sort?
Thanks for your help,
Robbie
Thanks for the reply.
Could you please run the following query on Monitor database for the job which is causing problem?
Then run your query (bit modified):
Could you let us know how many rows each query return?
Thanks,
Priya
Project Manager
Red Gate Software
Anyway, with the revised code, I got 38 for both queries. So, they seem to match at least.
I dug into the data a bit. For this particular job there were 37 times that it had a run_status of 3 (stopped) and 1 time run_status of 0 (failed). Both msdb and the SQL Monitor database match on these values.
I don't know how to dig into the alert.Alert table in the SQL Monitor database, but I did look through the UI. I only received 3 alerts for the job in question.
Thanks for digging into this with me.
Robbie
What is your purge setting?
Thanks,
Priya
Project Manager
Red Gate Software
Windows process data: 1 week
Basic SQL Server data: 2 weeks
SQL process data: 1 week
Top 10 queries data: 1 week
Database performance counter data: 2 weeks
Custom metric data: 1 week
Basic alert data: 1 month
SQL Trace Data: 3 days
Apologies for delay in reply. Could you please run the following query in your environment?
Counts the number of unsuccessful job runs in Monitor database since yesterday
Counts the number of unsuccessful runs in sysjobhistory since yesterday. This is to make sure that we have collected the job history correctly. You may have to verify whether SQL Server enters the date in this table in UTC or local date time. If it is local date time then you may have to change query to accommodate your time zone changes. Everything in Monitor database is in UTC .
Will show the number of each type of alerts which has been raised since yesterday grouped by server.
Could you please let us know the result of each query?
Thanks,
Priya
Project Manager
Red Gate Software
The number of unsuccessful job runs in Monitor database since 9/4
390
The number of unsuccessful runs in sysjobhistory since 9/4
447
The number of each type of alerts which has been raised since 9/4 grouped by server. SERVER3 is the one in question here:
SERVER1 Long-running query 23
SERVER1 Schema modified - ISDB01 34
SERVER1 Instance1 Long-running query 18
SERVER1 Instance1 Schema modified - ISDB01 3
SERVER1 Instance2 Long-running query 19
SERVER1 Instance2 Schema modified - ISDB01 13
SERVER2 Long-running query 3
SERVER3 Blocked process 27
SERVER3 Job failed 22
SERVER3 Long-running query 71
SERVER3 SQL Server error log entry 2
SERVER3 SQL Server instance unreachable 1
SERVER3 Instance1 Long-running query 17
SERVER4 Deadlock 2
SERVER4 Long-running query 40
SERVER4 Monitoring error (SQL Server data collection) 1
SERVER5 Deadlock 31
SERVER5 Instance1 Long-running query 3
SERVER6 Blocked process 1
SERVER6 Deadlock 1
SERVER6 Log backup overdue 1
SERVER6 Long-running query 1
SERVER6 Instance1 Deadlock 1
SERVER6 Instance1 Long-running query 15
SERVER7 Backup overdue 46
SERVER7 Deadlock 17
SERVER7 Job failed 8
SERVER7 Long-running query 3
Have you just got one job on this server? Also, is any 'job failure' alert disabled in Alert Settings?
I notice that you have one 'SQL Server instance unreachable' alert on this server. How long this alert was live for?
Thanks,
Priya
Project Manager
Red Gate Software
The "SQL Server instance unreachable" lasted for 14 seconds and has only occurred one other time for this server. That was on August 7th. I think this is probably a coincidence and not related to this issue at all. The number of job failures seems to be significantly higher than the number of alerts I'm seeing.
None of the alert settings are disabled from what I can tell. There are a couple jobs that I manually changed the recipient of the email that gets sent out.
Thanks for continuing to look into this.
Robbie
I have emailed you directly.
Thanks,
Priya
Project Manager
Red Gate Software
We are seeing the same inconsistencies here. Did you happen to get the bottom of this?
Cheers,
Del
We are still investigating this issue. Are you seeing the inconsistencies for any particular type of jobs? Robbie is having issue only with SSIS jobs.
Thanks,
Priya
Project Manager
Red Gate Software
The other jobs that do not use SSIS packages are reporting as failed correctly.
Thanks,
Priya
Project Manager
Red Gate Software
Have there been any updates on this yet?
We are still getting the same issue.
Apologies for not updating this forum. We have been investigating this with Robbie and he provided us with backup of his repository. We could see that there is an issue but we are not able to reproduce it internally and hence it is proving difficult to fix. We are currently tracking this as SRP-8159. Can I get some more information from you? I would drop you an email now.
Thanks,
Priya
Project Manager
Red Gate Software
Thanks,
Priya
Project Manager
Red Gate Software