
Phantom alerts when restarting secondary node of Availability Group

We are currently running SQL Monitor v.11.1.9 on our servers. One instance is an availability group, consisting of 3 servers, let's call them "A", "B", and "C" - with "A" being the primary replica, "B" a synchronous secondary, and "C" an asynchronous secondary.
We are doing some maintenance on "C" this morning, so we set up alert suppression on it - just this server, not the cluster/availability group. We therefore expected to receive the "replica not healthy" alerts, but not much more.

When we restarted server "C", however, we also received several "job failing" alerts - not from server "C" but from server "B". Even stranger, the "time raised" on the alerts shows "10 Jan 2022 19:30" on some and "12 Jan 2022 22:31" on another. None of them reflects today's date (20 January 2022).

Answers

  • Alex B Posts: 1,127 Diamond 4
    Hi @PieterL,

    Are the jobs related to taking backups or are they other types of jobs?  And was "C" normally the replica that the jobs were run on, but then that shifted to "B" when you did the maintenance on "C"?  And were those jobs last run on "B" at the times stated in the alert or had those jobs been failing on "C" since the time stated in the alert (10/12th Jan)?

    The behaviour of the job failing alert has changed since 11.2.13. Prior to that, a new job failing alert was raised each time a job failed, and at some point it was using the date of the original failure (which I don't think it was supposed to do). To address a few issues we were seeing with this, the alert was changed so that only one job failing alert is now raised. It ends when the job next runs successfully, and the alert details keep a record of each attempt made while the alert was active.
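The newer behaviour described above can be pictured with a small model. This is only an illustrative sketch, not SQL Monitor's actual implementation: the `JobFailingAlert` and `AlertTracker` names, their fields, and the `record_run` method are all hypothetical. It shows one alert being raised at the first failure, collecting every later attempt, and ending on the first successful run.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class JobFailingAlert:
    """Hypothetical model of a single consolidated 'job failing' alert."""
    job_name: str
    time_raised: datetime
    time_ended: Optional[datetime] = None
    # Each attempt is (run time, succeeded) recorded while the alert is open.
    attempts: List[Tuple[datetime, bool]] = field(default_factory=list)


class AlertTracker:
    """Hypothetical tracker: at most one open alert per job."""

    def __init__(self) -> None:
        self.open_alerts: Dict[str, JobFailingAlert] = {}
        self.ended_alerts: List[JobFailingAlert] = []

    def record_run(self, job_name: str, run_time: datetime,
                   succeeded: bool) -> Optional[JobFailingAlert]:
        alert = self.open_alerts.get(job_name)
        if alert is None:
            if succeeded:
                return None  # nothing failing, nothing to raise
            # First failure: raise the alert, dated at this failure.
            alert = JobFailingAlert(job_name, time_raised=run_time)
            self.open_alerts[job_name] = alert
        # Every run during the alert's lifetime is kept in its details.
        alert.attempts.append((run_time, succeeded))
        if succeeded:
            # A successful run ends the alert.
            alert.time_ended = run_time
            self.ended_alerts.append(self.open_alerts.pop(job_name))
        return alert
```

Under this model, two failures followed by a success produce a single ended alert with three recorded attempts, and its "time raised" stays at the first failure. That is consistent with an alert carrying an older raised date, though it does not explain why such an alert would surface when a different replica is restarted.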

    Kind regards,
    Alex


    Product Support Engineer | Redgate Software

    Have you visited our Help Center?
  • Pieter Posts: 2 Bronze 1
    Hi @Alex B,

    There are 2 specific jobs involved. Both are normal user jobs, including steps to execute SSIS packages, with no connection to backups. Both did legitimately fail on the dates and times stated in the alerts, but have subsequently run successfully.

    Neither has been run on server "C" in the past month; both were running on server "A" from 22 Dec to 4 Jan, then on "B" from 5 to 17 Jan (including the failures), then again on "A" from 18 to 20 Jan, and back to "B" from 21 Jan.

    Another anomaly is that the "view full alert details" link for one of the alerts returns a 404 error ("Sorry, this page cannot be found.
    No alert found for the given ID. It may have already been purged from the SQL Monitor database.").
    The other shows the alert, with the following details:
    Status: Ended  Time raised: 12 Jan 2022 22:31  Time ended: 20 Jan 2022 22:59
    and the history as follows:

  • Alex B Posts: 1,127 Diamond 4
    Hi @pieter,

    At this point we're not sure, as we've not been able to reproduce this behaviour ourselves, though we do have some similar open issues that we are still investigating. It would be good to update; if the issue then recurs, we can investigate further against the latest code (we would likely need to raise a ticket and get log files and some further information as well).

    Kind regards,
    Alex
    Product Support Engineer | Redgate Software
