Rogue Job Duration alerts since upgrade from version 9 to 10
AlecRM
Posts: 7 Bronze 1
Hi,
I upgraded to Version 10 last week, and since then there have been a few SQL Job Duration alerts for jobs which are running fine when checking the Job history in SSMS.
For example:
Job name: Tableau MI
User: sa
Job started at: 24 May 2018 05:40
Job ended at: Unknown
Job outcome: In progress
Duration: 676.04:32:04
Baseline duration (median of last 10 runs): 00:15:21
Deviation from baseline: 6343301%
Job next scheduled to run at: 30 Mar 2020 10:10
However in SSMS, the job has been running fine.
Is there a way of clearing any rogue job data out of SQL Monitor without losing genuine historical data?
Thanks,
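The figures in the alert above can be checked by hand. Below is a minimal Python sketch, assuming the deviation is calculated as (duration - baseline) / baseline; that formula is an assumption inferred from the numbers in the alert, not something confirmed in this thread. The helper names `parse_duration` and `deviation_percent` are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_duration(text: str) -> timedelta:
    """Parse 'd.hh:mm:ss' or 'hh:mm:ss' as shown in the alert details."""
    days = 0
    if "." in text:
        day_part, text = text.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def deviation_percent(duration: timedelta, baseline: timedelta) -> float:
    """Assumed formula: how far the duration exceeds the baseline, in percent."""
    return (duration - baseline) / baseline * 100


# Values copied from the alert in the post above.
started = datetime(2018, 5, 24, 5, 40)
duration = parse_duration("676.04:32:04")
baseline = parse_duration("00:15:21")

print(round(deviation_percent(duration, baseline)))  # 6343301
print(started + duration)                            # 2020-03-30 10:12:04
```

Under that assumption the reported 6343301% is reproduced exactly, and the start time plus the duration lands within a couple of minutes of the next scheduled run on 30 Mar 2020. That would be consistent with the alert measuring elapsed time from a 2018 start that was never recorded as finished.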
Best Answer
Alex B Posts: 1,157 Diamond 4
Hi All,
The team have just released SQL Monitor 10.0.9 which should now fix the other portion of the issue.
You can download it here https://download.red-gate.com/checkforupdates/SQLMonitorWeb/SQLMonitorWeb_10.0.9.28110.exe
Please do let me know if any of you still have issues after updating!
Kind regards,
Alex
Oh and I almost forgot - @SeanPerkins the validation on the % duration should have been removed in the Job Duration Unusual alert configuration as well now in 10.0.9 so you should be able to set it above 100%
Answers
In version 10.0.4 we made a change to allow the Job Duration Unusual (JDU) alert to fire while the job was still running, but unfortunately there were various issues with this that caused JDU alerts to fire in incorrect circumstances. The team are currently working on a fix for this under internal reference SRP-12900, and we will update here when it is available.
Kind regards,
Alex
Have you visited our Help Center?
As I cannot view the internal ticket, perhaps you're already aware of these three issues and working on them, but I thought it best to mention them just in case:
1. Alerts are triggering when jobs are running more quickly than the average (a negative baseline deviation).
2. Alerts are triggering for jobs from many months earlier stating that they never completed.
3. There is no way to modify the "Job Duration Unusual" alert thresholds for baseline deviation above 100% anymore (even though it defaults to 300%) in the configuration settings.
The team have just released SQL Monitor version 10.0.8 that includes a fix for this issue, which you can download here:
https://download.red-gate.com/checkforupdates/SQLMonitorWeb/SQLMonitorWeb_10.0.8.27810.exe
Please do let us know if this resolves the incorrect alerts. The fix should cover all cases, but if any have escaped please tell us.
@SeanPerkins for not being able to set the alert threshold above 100%, this wasn't part of that investigation but I have asked the team if it is intentional or not. I believe the 300% threshold was the default in previous versions, but with versions installed later 100% is the default (or at least it was set to 100% for me and I don't remember changing it). I'll check on this and come back to you.
Kind regards,
Alex
I installed the update to version 10.0.8 and re-enabled the alert using its default values. We were immediately hit with around 100 "job duration unusual" alerts. I cleared those and lots of additional alerts are still coming in with various baseline deviations (-14%, -2%, 4%, 16%, 297%), all of which are under the threshold of 300% set in the configuration. Unfortunately, the fix doesn't appear to have worked for us.
As for the configuration settings I mentioned, the same problem still exists in this version. The baseline deviation cannot be set above 100% like it could before. This prevents us from setting it to 200% for instance -- indicating the job is running twice as long as usual. The multiple alerts likewise can only be set to trigger between 0-100%. The only way I found to set it higher (300%) was to use the button to restore the default settings. Otherwise, the setting only lets me know that something is running longer than a fraction of the average, when the majority of the time we want to know if a job is running longer than the average.
For the moment, we're going to have to disable the alert again.
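To make the expectation in this post concrete, here is a small Python sketch. It assumes the alert should be raised only when the deviation reaches the configured threshold, which is the behaviour the poster expects and not confirmed product behaviour. `should_alert` is a made-up name.

```python
def should_alert(deviation: float, threshold: float) -> bool:
    """Assumed rule: raise the alert only at or above the threshold (both in percent)."""
    return deviation >= threshold


# Deviations quoted in the post, checked against the configured 300% threshold.
reported = [-14, -2, 4, 16, 297]
threshold = 300

unexpected = [d for d in reported if not should_alert(d, threshold)]
print(unexpected)  # [-14, -2, 4, 16, 297]
```

Under that rule none of the five reported deviations would raise an alert, which is why they read as false positives.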
Just a point to note, we decommissioned a SQL server a while ago, and it has become "stuck" in SQL Monitor. Looking into this has not been a priority. Is there a secure location I can send the log export to?
Thanks for the update all, and my apologies that hasn't corrected the issue.
We're now collecting further information for the team but the logging for the JDU isn't enabled by default (which I suppose is good otherwise it may get a bit busy in the logs). If anyone is up for restarting the alert to get some further logging here's what you need to do first:
Please edit the logging config
C:\Program Files\Red Gate\SQL Monitor\BaseMonitor\RedGate.SqlMonitor.Engine.Alerting.Base.Service.exe.logging.config
to add this element to the bottom of the file, just above the closing </log4net> element:
Then restart the Base Monitor and wait for further false alerts to occur before getting the log files from Configuration > Retrieve all log files and sending them in to support.
Also, @SeanPerkins, it seems the default has been 50% for some time, but the validation to limit it to 100% was not intentional, so they have made a change for this which will be available in the next release.
Kind regards,
Alex
Just wanted to update here that the developers have identified another contributor to this issue, that being with old/orphaned entries in the sysjobactivity table, which we need to account for.
This is being worked on under internal reference SRP-12937 and I'll update here further when it's available in a release.
Kind regards,
Alex
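For anyone wondering what an old or orphaned entry might look like, here is a rough Python sketch over in-memory rows. It assumes "orphaned" means a row in msdb.dbo.sysjobactivity that has a start_execution_date but no stop_execution_date and belongs to an earlier SQL Server Agent session than the current one. That definition is an assumption and not necessarily what the developers' fix checks. The field names follow the sysjobactivity columns; `JobActivity`, `find_orphaned`, the session numbers and the second job are hypothetical sample data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class JobActivity:
    session_id: int
    job_id: str  # a uniqueidentifier in msdb; job names are used here for readability
    start_execution_date: Optional[datetime]
    stop_execution_date: Optional[datetime]


def find_orphaned(rows: List[JobActivity], current_session_id: int) -> List[JobActivity]:
    """Rows that started in an earlier Agent session and were never marked as stopped."""
    return [
        row for row in rows
        if row.start_execution_date is not None
        and row.stop_execution_date is None
        and row.session_id < current_session_id
    ]


# Made-up sample data; session 40 is assumed to be the current Agent session.
rows = [
    JobActivity(12, "Tableau MI", datetime(2018, 5, 24, 5, 40), None),
    JobActivity(40, "Tableau MI", datetime(2020, 3, 30, 10, 10), None),
    JobActivity(40, "Nightly backup", datetime(2020, 3, 30, 1, 0), datetime(2020, 3, 30, 1, 20)),
]

orphaned = find_orphaned(rows, current_session_id=40)
print([(row.job_id, row.session_id) for row in orphaned])  # [('Tableau MI', 12)]
```

The row from session 40 with no stop date is treated as a job that is genuinely still running, so only the 2018 row is flagged.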
fyi - just tried to install the new version, 3 times. Site no longer functioning at all.
Then, saw the Redgate SQL Monitor Base Monitor and SQL Monitor Web Service were not running (doesn't do that normally). I started these up and now get [404 - File or directory not found.] error
RedGate.SqlMonitor.Common.Utilities.ErrorReporting.RaygunErrorReporter - System.ArgumentNullException: String reference not set to an instance of a String.
Parameter name: s
at System.Text.Encoding.GetBytes(String s)
at RedGate.SqlMonitor.Common.Persistence.CredentialsStore.CredentialManager.AddOrUpdate(String key, CredentialManagerDetails detail)
at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.GetServiceAccountForDomain(String domain)
at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.<GetAllConfigs>b__10_0(ValueTuple`2 tuple)
at System.Linq.Enumerable.WhereSelectListIterator`2.MoveNext()
at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.GetAllConfigs()
at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectory.ActiveDirectoryService.get_FirstConfig()
at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectory.ActiveDirectoryConfigService.GetConfig()
System.ArgumentNullException: String reference not set to an instance of a String.
404 error detail - this Account folder is missing...
Module : IIS Web Core
Physical Path : C:\Program Files\Red Gate\SQL Monitor\Web\Website\Account\LogIn
It seems the default install path for the web portion may have changed for some people.
When I updated to 10.0.9 yesterday my path stayed the same (as /website) and I was able to access the page normally. Today I was checking the default web option and it was also /website, but when I swapped back to IIS it changed to /web and I got the same message about the page not found (since it had moved). I then updated the Physical path attribute in IIS (see below) and it worked normally again.
We are looking into why the default path might have changed for some since it's not likely people decided to swap install options and back several times.
In the meantime, please ensure that the path that the web portion is installed to is the one that your website is looking for (as shown in the error when you try to navigate to the page, or in the setting above), or change the physical path of the website to the new install path (again, as above).
Hopefully that will get things going and we can see if the original JDU alert issue has been corrected!
Kind regards,
Alex
Have you visited our Help Center?
I updated SQL Monitor this morning, and the Global Dashboard would not load afterwards, and the error logged was:
This did eventually auto correct. So far, the Job Duration alerts have calmed down.
@SeanPerkins - I believe they are also looking into this (possibly related to the one from AlecRM above). Has it cleared up for you or is it still unusable?
@SeanPerkins - Righto, it's not what I had thought initially. The 404 looks like it's looking for /Account sub directory, but that doesn't exist even for me where it does work, so something else is going on.
To all,
We have pulled 10.0.9 from the check for updates mechanism. To anyone who has updated to 10.0.9 and isn't able to access the website: unfortunately you will need to downgrade to 10.0.8 following the instructions below. It's very strange, as a portion of people have updated successfully with no issues while a portion are seeing the same as Scott and Sean.
If anyone is able to, it would be great if you could gather the following information and send it to support@red-gate.com, referencing this forum post:
A list of any credentials in credential manager that start with SQL_Monitor_AD_Service_Account
The full values stored in your settings.KeyValuePairs table by running the following:
And also return the results of this query as well please:
To downgrade to 10.0.8 please perform the following:
You may then want to disable the Job Duration Unusual alert to avoid erroneous alerts.
My apologies for the inconvenience caused here. We're getting information to the developers to have a look, but it will likely be next week before anything else is released.
Kind regards,
Alex
Uninstalling and reverting to 10.0.8.27810 as instructed gave us back a working version.
Note that the provided script dropped [settings].[ActiveDirectoryDomains], so we won't be able to send the results of that query to support as instructed (you might want to run these queries BEFORE running the 278-273.sql script).
@DonFerguson - Thank you for that, indeed that's what we're looking into with the (previously poorly placed) request for further information in my post above.
I'll update here again when I have more information!
Kind regards,
Alex
Many alert emails are still being generated (over 300), but they now say "Not enough data to calculate baseline". These all seem to have come from replication jobs that run continuously...
So I don't know whether a job now has to run 10 times before an average is created. I've turned the alert off again for now.
@ScottRG I'm following up with the team on that baseline message and will let you know. Also, when you say you had to install twice, did the first attempt fail on starting the base monitor or was it some other reason?
@DevendraSingh - the team are correcting this as we speak.