Hi Dojo
We have a DataMiner Failover setup (DMA-1 and DMA-2) in which DMA-1 is the active Agent.
Today some SL* processes crashed on DMA-1 and restarted automatically.
The VIP was released by the online DMA-1, but neither DMA-1 nor DMA-2 went online.
DMA-2 detected that DMA-1 had failed and logged the following in SLFailover.txt:
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|ActionsOnFirstFail|ERR|0|18|First Failure
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|109|Ignored SetDrsState request by Redundancy on DMA-1 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|DoFail|INF|0|18|Reached 6 failures via x.x.x.x.1 => checking whether DataMiner Failover state needs to change
SLNet.exe|NotifyMaxFailuresReached|CRU|0|18|GOING ONLINE (AUTOMATIC DATAMINER FAILOVER)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Notifying buddy agent to go offline (DMA-2 (x.x.x.x.2) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Trying to notify other agent to go offline via one of x.x.x.x.1;x.x.x.x.1
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Failed setting state for 'x.x.x.x.1' to Offline: x.x.x.x.1 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|18|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|18|Local agent = Offline
DMA-1 started and tried to connect to the running DMA-2, but logged the following in its SLFailover.txt:
SLNet.exe|SendHeartBeat|CRU|0|77|Going online because partner x.x.x.x.2 is offline (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Notifying buddy agent to go offline (DMA-1 (x.x.x.x.1) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Trying to notify other agent to go offline via one of x.x.x.x.2 (agent appears to be missing)
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Failed setting state for 'x.x.x.x.2' to Offline: x.x.x.x.2 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|77|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|77|Local agent = Offline
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|21|Ignored SetDrsState request by Redundancy on DMA-2 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|SendHeartBeat|ERR|0|77|!!!WARNING!!! Agent 'x.x.x.x.2' is not able to correctly sync at the moment.
By the way, the log entry (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45) is confusing. 2024-02-12 was the date of the last update of the "failover config" on DMA-1.
2024/02/12 11:11:38.791|SLNet.exe|ActionsOnFirstNonFail|INF|0|173|First success after failures
2024/02/12 11:11:38.791|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshing Failover Config...
2024/02/12 11:11:38.792|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshed Failover Config
Neither DMA acquired the VIP, so both were offline until I stopped DMA-2 and restarted DMA-1.
How can I prevent the Failover system from ending up in such a deadlock?
Hi Joerg
I'll need to make some assumptions and guesses, as I don't know the full details of how Failover is configured (sync, heartbeat) and don't have all the log files (mainly SLNet and NATS).
Both agents are indicating that "all heartbeat paths are failing (prevent ping-pong)".
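If you want to confirm this on both agents, something like the following Python sketch can pull those entries out of SLFailover.txt. It assumes the pipe-separated layout seen in your excerpts (process, method, level, a numeric field, thread, message, with an optional leading timestamp); the names `FailoverLogEntry`, `parse_line` and `find_ping_pong_entries` are made up for this example and are not part of DataMiner.

```python
from typing import Iterable, List, NamedTuple, Optional


class FailoverLogEntry(NamedTuple):
    process: str
    method: str
    level: str
    thread: str
    message: str


def parse_line(line: str) -> Optional[FailoverLogEntry]:
    """Parse one pipe-separated log line; return None if the layout does not match."""
    parts = line.strip().split("|")
    # Some excerpts start with a timestamp field before the process name; skip it.
    if parts and not parts[0].endswith(".exe"):
        parts = parts[1:]
    if len(parts) < 6:
        return None
    process, method, level, _number, thread = parts[:5]
    # Re-join the rest in case the message itself contains a pipe character.
    message = "|".join(parts[5:])
    return FailoverLogEntry(process, method, level, thread, message)


def find_ping_pong_entries(lines: Iterable[str]) -> List[FailoverLogEntry]:
    """Return the entries whose message mentions the ping-pong protection."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and "ping-pong" in e.message.lower()]
```

Running `find_ping_pong_entries(open(path))` on each agent's SLFailover.txt (where `path` is wherever you copied the file) should show which method refused the state change and on which thread.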
In a normal scenario, when the offline agent detects a failing heartbeat on the online agent, it'll take over.
Here, however, all heartbeat paths are failing, which means both agents could end up constantly switching state (i.e. ping-pong).
=> To prevent this, each agent stays in its current state (online or offline), which is what happened here: both were offline and stayed offline.
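To illustrate why this ends in a deadlock, here is a very simplified Python model of the behaviour visible in your logs. This is not DataMiner's actual implementation: the `Agent` class and both functions are hypothetical, and the only rules modelled are the two the log lines show (an agent ignores a state change request while all of its heartbeat paths are failing, and an agent does not go online if it cannot force its buddy offline).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    name: str
    online: bool
    heartbeat_ok: List[bool]  # one entry per heartbeat path towards the partner


def accepts_state_request(agent: Agent) -> bool:
    # Modelled on: "Ignored SetDrsState request ... because all heartbeat
    # paths are failing (prevent ping-pong)".
    # Note: an agent with no heartbeat paths at all also refuses here.
    return any(agent.heartbeat_ok)


def try_go_online(local: Agent, partner: Agent) -> bool:
    """Return True if `local` switched to online, False if it stayed as it was."""
    if not accepts_state_request(partner):
        # Modelled on: "NOT switching state: failed to force buddy agent to offline".
        return False
    partner.online = False
    local.online = True
    return True


# The situation from the logs: both agents offline, all heartbeat paths failing.
dma1 = Agent("DMA-1", online=False, heartbeat_ok=[False])
dma2 = Agent("DMA-2", online=False, heartbeat_ok=[False])
print(try_go_online(dma2, dma1), try_go_online(dma1, dma2))  # False False
```

In this model neither agent can ever go online on its own once both are offline with failing heartbeats, which matches what you saw until you intervened manually.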
I would advise you to re-evaluate the Failover configuration. The following links should help with this:
Preferred configuration using virtual IP addresses (best practice) | DataMiner Docs
Advanced Failover options | DataMiner Docs
Note: depending on the DMA version, this might also be the following known issue:
Failover Agents remain offline after upgrade | DataMiner Docs