Hi,
In a Failover setup, when the system temporarily lost access to the corporate network, the resulting alarm took around 5 minutes to clear on the Alarm Console after the connection was restored.
Is this intended despite the fact that, on the Failover Status UI, there was no longer any indication of a sync issue?
Before the alarm cleared on the Alarm Console, the Backup agent was shown in a slightly red tone on the Failover Configuration UI (first image). This also went away once the alarm cleared. Is this intended? What triggers the alarm to clear on the Alarm Console?
Hi Tiago,
The notice appears on the online agent when its heartbeat checks towards the offline agent report that the offline agent has one of the following problems (even though the heartbeat itself is succeeding):
- Offline agent is not running
- Offline agent has its sync connection to the online agent failing
- Offline agent has problems with the database
- Offline agent has open RTEs
- Offline agent has a mismatch in cluster name with the online agent (since RN26683 = 10.1 / 10.0.11 / 10.0 CU6 / 9.6 CU18)
The notice disappears as soon as these problems are no longer present.
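To make the rule above concrete, here is a minimal Python sketch of it. This is only an illustration and not the actual product code: the `OfflineAgentStatus` class, its field names, and the `problems` and `notice_active` functions are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OfflineAgentStatus:
    """Hypothetical summary of what a heartbeat reports about the offline agent."""
    running: bool = True
    sync_ok: bool = True
    database_ok: bool = True
    open_rtes: int = 0
    cluster_name_matches: bool = True


def problems(status: OfflineAgentStatus) -> List[str]:
    """Return the problems from the list above that currently apply."""
    found = []
    if not status.running:
        found.append("not running")
    if not status.sync_ok:
        found.append("sync connection failing")
    if not status.database_ok:
        found.append("database problems")
    if status.open_rtes > 0:
        found.append("open RTEs")
    if not status.cluster_name_matches:
        found.append("cluster name mismatch")
    return found


def notice_active(status: OfflineAgentStatus) -> bool:
    """The notice is shown while at least one problem is present."""
    return bool(problems(status))
```

In this sketch the notice depends only on the reported status, so it clears on the first heartbeat that reports no problems.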
The results of the database and RTE checks are cached on the offline agent for 5 minutes, which would explain why it takes up to 5 minutes before the notice disappears on the online agent.
The Failover Status UI, however, bypasses this cache and shows the most recent status whenever it is opened.
I believe there is a possible improvement to be made here: invalidating the cached value whenever the Failover Status UI info request loads the current state.
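As a rough illustration of the behavior described above, here is a small Python sketch of a check result cached for 5 minutes, with a separate path that bypasses the cache. It is not the actual implementation: the `CachedCheck` class, its method names, and the `update_cache` flag are assumptions made for this example. The flag models the possible improvement, where a fresh read also replaces the cached value.

```python
import time
from typing import Callable, Optional

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute caching period mentioned above


class CachedCheck:
    """Hypothetical wrapper around a health check whose result is cached."""

    def __init__(self, check: Callable[[], bool],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[bool] = None
        self._stamp: Optional[float] = None

    def get(self) -> bool:
        """Heartbeat path: reuse the cached result until it is ttl seconds old."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._check()
            self._stamp = now
        return self._value

    def get_fresh(self, update_cache: bool = False) -> bool:
        """Status UI path: always run the check.

        With update_cache=True the fresh result also replaces the cached one,
        so the next heartbeat sees it immediately (the suggested improvement).
        """
        value = self._check()
        if update_cache:
            self._value = value
            self._stamp = self._clock()
        return value
```

With `update_cache=False`, a call to `get_fresh()` can report a healthy state while `get()` keeps returning the older cached result until the 5 minutes have passed, which matches the difference seen between the Failover Status UI and the Alarm Console.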