Question

Solved · 1.19K views · 2nd June 2021 · Failover


Tiago Dias [SLC] [DevOps Member] · 431 · 31st May 2021 · 0 Comments

Hi,

In a Failover setup, when the system temporarily lost access to the corporate network, the alarm took around 5 minutes to be resolved in the Alarm Console.

Is this intended despite the fact that, on the Failover Status UI, there was no longer any indication of a sync issue?

Before the alarm was resolved in the Alarm Console, the Failover Configuration UI (first image) showed the Backup agent in a slightly red tone, which disappeared once the alarm was resolved. Is this intended? What is the trigger that resolved the alarm in the Alarm Console?

Tiago Dias [SLC] [DevOps Member] Selected answer as best 2nd June 2021

1 Answer

score 2 · Answer 1 · 2021-06-02T12:42:38+00:00

Hi Tiago,

The notice appears on the online agent when its heartbeat checks towards the offline agent report that the offline agent has one of the following problems (even though the heartbeat itself is succeeding):

Offline agent is not running
Offline agent has its sync connection to the online agent failing
Offline agent has problems with the database
Offline agent has open RTEs
Offline agent has a mismatch in cluster name with the online agent (since RN26683 = 10.1 / 10.0.11 / 10.0 CU6 / 9.6 CU18)

The notice disappears as soon as these problems are no longer present.

The results of the database and RTE checks are cached on the offline agent for 5 minutes, which would explain why it takes 5 minutes before the notice disappears on the online agent.

The Failover Status UI, however, bypasses this caching and shows the most recent status whenever it is opened.

I believe there is a possible improvement to be made here: invalidating the cached value when the Failover Status UI info request reloads the most current state.
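To make the timing easier to picture, here is a minimal Python sketch of the behaviour described above. It is not DataMiner code: the class `CachedHealthCheck`, its methods, and the `refresh_cache` flag are all hypothetical names, and the only figure taken from this answer is the 5-minute cache lifetime. The sketch stores the fresh result back into the cache, which has the same effect as invalidating the stale value.

```python
import time
from typing import Callable, Optional, Tuple

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute caching mentioned in the answer


class CachedHealthCheck:
    """Hypothetical model of the offline agent's cached database/RTE check."""

    def __init__(self, probe: Callable[[], bool],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe    # assumed callback: returns True while a problem exists
        self._ttl = ttl
        self._clock = clock
        self._cached: Optional[Tuple[float, bool]] = None  # (timestamp, result)

    def for_heartbeat(self) -> bool:
        """Heartbeat path: reuse the cached result until it expires."""
        now = self._clock()
        if self._cached is None or now - self._cached[0] >= self._ttl:
            self._cached = (now, self._probe())
        return self._cached[1]

    def for_status_ui(self, refresh_cache: bool = False) -> bool:
        """Status UI path: always probe. With refresh_cache=True (the suggested
        improvement), the fresh result also replaces the cached one."""
        result = self._probe()
        if refresh_cache:
            self._cached = (self._clock(), result)
        return result


# Simulated scenario with a fake clock so the example runs instantly.
state = {"problem": True}
now = [0.0]
check = CachedHealthCheck(lambda: state["problem"], clock=lambda: now[0])

at_outage = check.for_heartbeat()            # problem detected and cached
state["problem"] = False                     # network access is restored
now[0] = 60.0                                # one minute later
stale = check.for_heartbeat()                # heartbeat still sees the cached problem
ui_view = check.for_status_ui()              # Status UI bypasses the cache
after_refresh_ui = check.for_status_ui(refresh_cache=True)
after_refresh = check.for_heartbeat()        # cache now holds the fresh result
```

In this model the heartbeat keeps reporting the old problem until the cache entry expires, while the Status UI already shows the recovered state, which matches the mismatch described in the question. With the suggested improvement enabled, opening the Status UI would also clear the notice right away.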

Failover – Alarm Status Update
