Rogue Job Duration alerts since upgrade from version 9 to 10

Hi, 
I upgraded to version 10 last week, and since then there have been a few SQL Job Duration alerts for jobs that are running fine according to the job history in SSMS.

For example:

Job name: Tableau MI
User: sa
Job started at: 24 May 2018 05:40
Job ended at: Unknown
Job outcome: In progress
Duration: 676.04:32:04
Baseline duration (median of last 10 runs): 00:15:21
Deviation from baseline: 6343301%
Job next scheduled to run at: 30 Mar 2020 10:10
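
As a rough check on the figures in this alert, the sketch below reproduces the arithmetic. It assumes the deviation is calculated as (duration - baseline) / baseline; that formula is not confirmed by Redgate, but it matches the numbers shown. The function name is made up for illustration.

```python
from datetime import datetime, timedelta


def deviation_percent(duration: timedelta, baseline: timedelta) -> float:
    """Percentage by which `duration` exceeds `baseline` (assumed formula)."""
    return (duration - baseline) / baseline * 100


# Values copied from the alert details above.
started = datetime(2018, 5, 24, 5, 40)
duration = timedelta(days=676, hours=4, minutes=32, seconds=4)
baseline = timedelta(minutes=15, seconds=21)

print(int(deviation_percent(duration, baseline)))  # 6343301
print(started + duration)  # 2020-03-30 10:12:04
```

The start time plus the reported duration lands on 30 Mar 2020, close to the next scheduled run at 10:10. That is consistent with the duration being measured from a stale 2018 start time rather than from a genuine run, although the alert alone does not prove it.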

However, in SSMS the job has been running fine.

Is there a way of clearing any rogue job data out of SQL Monitor without losing genuine historical data?

Thanks,

Answers

  • Alex B Posts: 1,158 Diamond 4
    Hi @AlecRM,

    In version 10.0.4 we made a change to allow the Job Duration Unusual (JDU) alert to fire while the job was still running, but unfortunately there were various issues with this that caused JDU alerts to fire in incorrect circumstances. The team are currently working on a fix for this under internal reference SRP-12900, and we will update here when it is available.

    Kind regards,
    Alex

    Product Support Engineer | Redgate Software

    Have you visited our Help Center?
  • Russell D Posts: 1,324 Diamond 5
    edited March 31, 2020 10:20AM
    Unfortunately, it will be nearly impossible to identify whether an alert was raised incorrectly. The alerts should end automatically when the job is next seen.
    Have you visited our Help Centre?
  • AlecRM Posts: 7 Bronze 1
    Thanks @Russell D and @Alex B. I have temporarily disabled the alert so we don't get spammed.

  • SeanPerkins Posts: 13 New member
    Alex B said:
    In version 10.0.4 we made a change to allow the Job Duration Unusual (JDU) alert to fire while the job was still running, but unfortunately there were various issues with this that caused JDU alerts to fire in incorrect circumstances. The team are currently working on a fix for this under internal reference SRP-12900, and we will update here when it is available.
    We definitely experienced this issue too when updating from version 10.0.3 to 10.0.7 yesterday and were forced to disable the alert until the problem is resolved. I'm glad to hear that a fix is forthcoming.

    As I cannot view the internal ticket, perhaps you're already aware of these three issues and working on them, but I thought it best to mention them just in case:
    1. Alerts are triggering when jobs are running more quickly than the average (a negative baseline deviation).
    2. Alerts are triggering for jobs from many months earlier stating that they never completed.
    3. There is no way to modify the "Job Duration Unusual" alert thresholds for baseline deviation above 100% anymore (even though it defaults to 300%) in the configuration settings.
  • Alex B Posts: 1,158 Diamond 4
    Hi @AlecRM and @SeanPerkins,

    The team have just released SQL Monitor version 10.0.8 that includes a fix for this issue, which you can download here:
    https://download.red-gate.com/checkforupdates/SQLMonitorWeb/SQLMonitorWeb_10.0.8.27810.exe

    Please let us know whether this resolves the incorrect alerts. The fix should cover all cases, but if any have escaped, please tell us.

    @SeanPerkins, as for not being able to set the alert threshold above 100%, this wasn't part of that investigation, but I have asked the team whether it is intentional. I believe the 300% threshold was the default in previous versions, but for versions installed later 100% is the default (or at least it was set to 100% for me and I don't remember changing it). I'll check on this and come back to you.

    Kind regards,
    Alex
    Product Support Engineer | Redgate Software

  • SeanPerkins Posts: 13 New member
    Thanks for providing that link.

    I installed the update to version 10.0.8 and re-enabled the alert using its default values. We were immediately hit with around 100 "job duration unusual" alerts. I cleared those and lots of additional alerts are still coming in with various baseline deviations (-14%, -2%, 4%, 16%, 297%), all of which are under the threshold of 300% set in the configuration. Unfortunately, the fix doesn't appear to have worked for us.

    As for the configuration settings I mentioned, the same problem still exists in this version. The baseline deviation cannot be set above 100% like it could before. This prevents us from setting it to 200% for instance -- indicating the job is running twice as long as usual. The multiple alerts likewise can only be set to trigger between 0-100%. The only way I found to set it higher (300%) was to use the button to restore the default settings. Otherwise, the setting only lets me know that something is running longer than a fraction of the average, when the majority of the time we want to know if a job is running longer than the average.

    For the moment, we're going to have to disable the alert again.
  • DevendraSingh Posts: 16 Bronze 1
    Same here :(
  • Russell D Posts: 1,324 Diamond 5
    Is anyone able to send some log files in please?
  • AlecRM Posts: 7 Bronze 1
    I have submitted a change request to upgrade our instance this morning. I will share my findings as soon as I can. 
  • AlecRM Posts: 7 Bronze 1
    edited April 2, 2020 10:18AM
    The upgrade was a success, but like the others, the Job Duration alerts persisted. 
    Just a point to note: we decommissioned a SQL Server a while ago, and it has become "stuck" in SQL Monitor. Looking into this has not been a priority. Is there a secure location I can send the log export to?
  • Alex B Posts: 1,158 Diamond 4
    Hi all,

    Thanks for the updates, all, and my apologies that the release hasn't corrected the issue.

    We're now collecting further information for the team but the logging for the JDU isn't enabled by default (which I suppose is good otherwise it may get a bit busy in the logs).  If anyone is up for restarting the alert to get some further logging here's what you need to do first:

    Please edit the logging config 
    C:\Program Files\Red Gate\SQL Monitor\BaseMonitor\RedGate.SqlMonitor.Engine.Alerting.Base.Service.exe.logging.config

    to add this element to the bottom of the file just above the </log4net> element:

    <logger name="RedGate.SqlMonitor.Engine.Alerting.Base.Core.Alerter.Sql.JobDurationUnusualAlerter">
      <level value="INFO" />
    </logger>

    Then restart the Base Monitor service and wait for further false alerts to occur. After that, retrieve the log files from Configuration > Retrieve all log files and send them in to support.
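
For anyone who would rather script the config edit than make it by hand, here is a minimal sketch. It assumes the file contains a single closing </log4net> tag; the function and constant names are made up for illustration, and the sample config is a tiny stand-in, not the real file contents.

```python
LOGGER_NAME = (
    "RedGate.SqlMonitor.Engine.Alerting.Base.Core.Alerter.Sql."
    "JobDurationUnusualAlerter"
)
LOGGER_XML = (
    f'<logger name="{LOGGER_NAME}">\n'
    '  <level value="INFO" />\n'
    "</logger>\n"
)


def add_jdu_logger(config_text: str) -> str:
    """Return config_text with the JDU logger added just above </log4net>."""
    if LOGGER_NAME in config_text:
        return config_text  # already present, leave the text unchanged
    closing = config_text.rfind("</log4net>")
    if closing == -1:
        raise ValueError("no </log4net> element found")
    return config_text[:closing] + LOGGER_XML + config_text[closing:]


# Demonstration on a stand-in config (hypothetical contents).
sample = '<log4net>\n<root>\n<level value="WARN" />\n</root>\n</log4net>\n'
updated = add_jdu_logger(sample)
print(updated)
```

To apply it for real, you would read the logging config at the path given above, pass its text through add_jdu_logger, and write the result back after taking a backup copy. Writing under Program Files normally needs an elevated prompt.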

    Also, @SeanPerkins, it seems the default has been 50% for some time, but the validation limiting it to 100% was not intentional, so the team have made a change for this which will be available in the next release.

    Kind regards,
    Alex
    Product Support Engineer | Redgate Software

  • ScottRG Posts: 10 New member
    Also experiencing the same issue after upgrading from 9 to 10.0.8.27810. Jobs are reporting a StartTime from years ago with infinite duration deviation, causing (Ended) alerts to fire thousands of emails. Also, many (High) alerts are being fired for jobs with a 1 second duration, ignoring the "Don't raise an alert if the job run time is less than" setting.

  • Alex B Posts: 1,158 Diamond 4
    Hi All,

    Just wanted to update here that the developers have identified another contributor to this issue, that being with old/orphaned entries in the sysjobactivity table, which we need to account for.

    This is being worked on under internal reference SRP-12937 and I'll update here further when it's available in a release.
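
To illustrate what an old/orphaned sysjobactivity entry can look like, here is a small sketch. It uses a common rule of thumb, not Redgate's actual fix: a row counts as orphaned if it has a start time, has no stop time, and belongs to an earlier SQL Server Agent session than the latest one. The class, the function and the sample rows are all hypothetical, and the highest session_id in the sample stands in for the latest row in msdb.dbo.syssessions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class JobActivity:
    """Subset of the msdb.dbo.sysjobactivity columns used here."""
    session_id: int
    job_name: str  # illustrative; the real table stores job_id
    start_execution_date: Optional[datetime]
    stop_execution_date: Optional[datetime]


def find_orphaned(rows: List[JobActivity]) -> List[JobActivity]:
    """Rows that started but never stopped in an older Agent session."""
    if not rows:
        return []
    latest_session = max(r.session_id for r in rows)
    return [
        r for r in rows
        if r.start_execution_date is not None
        and r.stop_execution_date is None
        and r.session_id < latest_session
    ]


# Made-up sample data, not taken from any real server.
rows = [
    JobActivity(41, "Example job", datetime(2018, 5, 24, 5, 40), None),
    JobActivity(57, "Example job", datetime(2020, 3, 30, 9, 55),
                datetime(2020, 3, 30, 10, 10)),
    JobActivity(57, "Other job", datetime(2020, 3, 30, 10, 0), None),
]
orphans = find_orphaned(rows)
print([(r.session_id, r.job_name) for r in orphans])  # [(41, 'Example job')]
```

The row from session 57 with no stop time is not flagged, because a job in the current Agent session may genuinely still be running. Only the leftover row from the older session is treated as orphaned.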

    Kind regards,
    Alex
    Product Support Engineer | Redgate Software

  • ScottRG Posts: 10 New member
    edited April 8, 2020 8:55PM

    fyi - just tried to install the new version, 3 times.  Site no longer functioning at all.

    Then, saw the Redgate SQL Monitor Base Monitor and SQL Monitor Web Service were not running (doesn't do that normally).  I started these up and now get [404 - File or directory not found.]  error

  • SeanPerkins Posts: 13 New member
    Yes, installing the latest update broke SQL Monitor for us too. The site is now offline and the logs are filled with errors such as this:
    2020-04-08 20:34:40,953 [             45] ERROR RedGate.SqlMonitor.Common.Utilities.ErrorReporting.RaygunErrorReporter - System.ArgumentNullException: String reference not set to an instance of a String.
    Parameter name: s
       at System.Text.Encoding.GetBytes(String s)
       at RedGate.SqlMonitor.Common.Persistence.CredentialsStore.CredentialManager.AddOrUpdate(String key, CredentialManagerDetails detail)
       at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.GetServiceAccountForDomain(String domain)
       at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.<GetAllConfigs>b__10_0(ValueTuple`2 tuple)
       at System.Linq.Enumerable.WhereSelectListIterator`2.MoveNext()
       at System.Linq.Buffer`1..ctor(IEnumerable`1 source)
       at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
       at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectoryConfigRepository.GetAllConfigs()
       at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectory.ActiveDirectoryService.get_FirstConfig()
       at RedGate.SqlMonitor.Engine.Monitoring.Core.Services.ActiveDirectory.ActiveDirectoryConfigService.GetConfig()
    System.ArgumentNullException: String reference not set to an instance of a String.
  • ScottRG Posts: 10 New member

    404 error detail - this Account folder is missing... 

    Module : IIS Web Core
    Physical Path : C:\Program Files\Red Gate\SQL Monitor\Web\Website\Account\LogIn

  • Alex B Posts: 1,158 Diamond 4
    Hi all,

    It seems the default install path for the web portion may have changed for some people.

    When I updated to 10.0.9 yesterday my path stayed the same (as /website) and I was able to access the page normally.


    Today I was checking the default web option and it was also /website, but when I swapped back to IIS it changed to /web and I got the same message about the page not found (since it had moved). I then updated the Physical path attribute in IIS (see below) and it worked normally again.



    We are looking into why the default path might have changed for some since it's not likely people decided to swap install options and back several times.

    In the meantime, please ensure that the path the web portion is installed to is the one your website is looking for (as shown in the error when you try to navigate to the page, or in the setting above), or change the physical path of the website to the new install path (again, as above).

    Hopefully that will get things going and we can see if the original JDU alert issue has been corrected!

    Kind regards,

    Alex

    Product Support Engineer | Redgate Software

  • AlecRM Posts: 7 Bronze 1
    Hi, 

    I updated SQL Monitor this morning, and the Global Dashboard would not load afterwards, and the error logged was:

    2020-04-09 09:53:53,080 [             22] ERROR RedGate.SqlMonitor.UI.Website.MvcApplication - Unhandled exception for request /GlobalDashboard
    Autofac.Core.DependencyResolutionException: An exception was thrown while activating RedGate.SqlMonitor.UI.Website.Controllers.GlobalDashboardController -> RedGate.SqlMonitor.RPC.RpcDataPresenter`1[[RedGate.SqlMonitor.Common.Services.Services.IDataPresenterService, RedGate.SqlMonitor.Common.Services, Version=10.0.9.28110, Culture=neutral, PublicKeyToken=7f465a1c156d4d57]] -> ?:RedGate.SqlMonitor.Channels.ChannelTree`1[[RedGate.SqlMonitor.Channels.PropertyMap`1[[RedGate.SqlMonitor.Channels.Property.PropertySchema, RedGate.SqlMonitor.Channels, Version=10.0.9.28110, Culture=neutral, PublicKeyToken=7f465a1c156d4d57]], RedGate.SqlMonitor.Channels, Version=10.0.9.28110, Culture=neutral, PublicKeyToken=7f465a1c156d4d57]] -> RedGate.SqlMonitor.Channels.Config.FlatChannelConfig -> ?:RedGate.SqlMonitor.Channels.ChannelTree`1[[RedGate.SqlMonitor.Channels.Config.ChannelConfig, RedGate.SqlMonitor.Channels, Version=10.0.9.28110, Culture=neutral, PublicKeyToken=7f465a1c156d4d57]]. ---> System.IO.FileNotFoundException: Could not load file or assembly 'RedGate.SqlMonitor.Default.Config, Version=10.0.9.28110, Culture=neutral, PublicKeyToken=7f465a1c156d4d57' or one of its dependencies. The system cannot find the file specified.

     This did eventually auto correct. So far, the Job Duration alerts have calmed down.
  • Alex B Posts: 1,158 Diamond 4
    @AlecRM - The team are looking into what may have caused that issue but I'm glad that it sorted out and things are looking good with the JDU alert!

    @SeanPerkins - I believe they are also looking into this (possibly related to the one from AlecRM above). Has it cleared up for you or is it still unusable?
    Product Support Engineer | Redgate Software

  • ScottRG Posts: 10 New member
    edited April 9, 2020 1:34PM
    But if not using IIS (using RG SQL Monitor web server instead), where do we change the web default path?
  • SeanPerkins Posts: 13 New member
    It's still unusable for us, even after an uninstall and reinstall. We don't use the default install location, but the physical path in IIS to the website folder is accurate. If it matters, the site is using "ApplicationPoolIdentity" credentials.
  • Alex B Posts: 1,158 Diamond 4
    edited April 10, 2020 7:03AM
    @ScottRG - If you're not using IIS then it should work automatically: you don't specify a location for it, as it installs and configures itself.

    @SeanPerkins - Righto, it's not what I had thought initially.  The 404 looks like it's looking for /Account sub directory, but that doesn't exist even for me where it does work, so something else is going on.

    To all,

    We have pulled 10.0.9 from the check for updates mechanism. If you have updated to 10.0.9 and aren't able to access the website, unfortunately you will need to downgrade to 10.0.8 by following the instructions below. It's very strange, as a portion of people have updated successfully with no issues and a portion are seeing the same as Scott and Sean.

    If anyone is able to, it would be great if you could gather the following information and send it to support@red-gate.com, referencing this forum post:

    A list of any credentials in credential manager that start with SQL_Monitor_AD_Service_Account 


    The full values stored in your settings.KeyValuePairs table by running the following:

    SELECT * FROM [settings].[KeyValuePairs] AS [kvp]


    And also return the results of this query as well please:

    SELECT * FROM [settings].[ActiveDirectoryDomains] AS [add]


    To downgrade to 10.0.8 please perform the following:

    1. Uninstall 10.0.9
    2. Run the attached script on the SQL Monitor data repository (RedGateMonitor) (renamed .txt for forum)
    3. Reinstall 10.0.8 which you can download here if needed: https://download.red-gate.com/checkforupdates/SQLMonitorWeb/SQLMonitorWeb_10.0.8.27810.exe


    You may then want to disable the Job Duration Unusual alert to avoid erroneous alerts.


    My apologies for the inconvenience caused here. We're getting information to the developers to have a look, but it will likely be next week before anything else is released.

    Kind regards,

    Alex

    Product Support Engineer | Redgate Software

  • ScottRG Posts: 10 New member

    Uninstalling and reverting to 10.0.8.27810 as instructed gave us back a working version.

    Note that the provided script dropped [settings].[ActiveDirectoryDomains] so won't be able to send the results of that query to support as instructed  (might want to run these queries BEFORE running the 278-273.sql script)

  • DonFerguson Posts: 202 Silver 5
    I think the issue is related to AD authentication.  I tried a brand new install of 10.0.9 and it worked when prompting me for Administrator password.  But switching the installation to AD authentication failed.  So look at AD authentication.
  • Alex B Posts: 1,158 Diamond 4
    @ScottRG - Yeah, I put the bits in the wrong order; I have edited that now!

    @DonFerguson - Thank you for that. Indeed, that's what we're looking into with the (previously poorly placed) request for further information in my post above.

    I'll update here again when I have more information!

    Kind regards,
    Alex
    Product Support Engineer | Redgate Software

  • ScottRG Posts: 10 New member
    edited April 16, 2020 8:27PM
    Installed SQL Monitor 10.0.10.28314 (released today).   This version did work after second install attempt. 

    Many alert emails still generated ( over 300 ), but now say "Not enough data to calculate baseline".    Though these seem to all have been from replication jobs that run Continuously...

    So, don't know if has to now occur 10 times before creating an average???  turned off alert again for now....
  • Getting frequent monitoring errors after upgrading to 10.0.10.
  • Alex B Posts: 1,158 Diamond 4
    Righto-
    @ScottRG I'm following up with the team on that baseline message and will let you know.  Also, when you say you had to install twice, did the first attempt fail on starting the base monitor or was it some other reason?

    @DevendraSingh - the team are correcting this as we speak.
    Product Support Engineer | Redgate Software
