Hi Dojo,
We have a DataMiner Failover setup (DMA-1 and DMA-2), in which DMA-1 is the active agent.
Today some SL* processes crashed on DMA-1, and DMA-1 restarted itself.
The VIP was released by the online DMA-1, but afterwards neither DMA-1 nor DMA-2 went online.
DMA-2 detected that DMA-1 had failed and reported the following in SLFailover.txt:
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|ActionsOnFirstFail|ERR|0|18|First Failure
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|18|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|SendHeartBeat|ERR|0|137|!!!WARNING!!! Agent 'x.x.x.x.1' is not able to correctly sync at the moment.
SLNet.exe|SendHeartBeat|ERR|0|137|SLNet to 'x.x.x.x.1' failed: Remote Agent is Not Running (but still reachable)
SLNet.exe|Scheduler_OnTimer_Inner|INF|0|137|SLNet x.x.x.x.1 = NOK
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|137|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|109|Ignored SetDrsState request by Redundancy on DMA-1 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|18|Previous check for x.x.x.x.1 (thread 137) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|DoFail|INF|0|18|Reached 6 failures via x.x.x.x.1 => checking whether DataMiner Failover state needs to change
SLNet.exe|NotifyMaxFailuresReached|CRU|0|18|GOING ONLINE (AUTOMATIC DATAMINER FAILOVER)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Notifying buddy agent to go offline (DMA-2 (x.x.x.x.2) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Trying to notify other agent to go offline via one of x.x.x.x.1;x.x.x.x.1
SLNet.exe|ForceOtherAgentsToOffline|INF|0|18|Failed setting state for 'x.x.x.x.1' to Offline: x.x.x.x.1 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|18|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|18|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|18|Local agent = Offline
DMA-1 started and tried to connect to the running DMA-2, but reported the following in its SLFailover.txt:
SLNet.exe|SendHeartBeat|CRU|0|77|Going online because partner x.x.x.x.2 is offline (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45)
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Preparing to go online
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Notifying buddy agent to go offline (DMA-1 (x.x.x.x.1) wants to go online)...
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Trying to notify other agent to go offline via one of x.x.x.x.2 (agent appears to be missing)
SLNet.exe|ForceOtherAgentsToOffline|INF|0|77|Failed setting state for 'x.x.x.x.2' to Offline: x.x.x.x.2 ignored request: All heartbeat paths are failing. Preventing ping-pong.
SLNet.exe|UpdateFailoverSwitchStatus|CRU|0|77|Failover Status => Offline (couldn't force other agent to go offline)
SLNet.exe|DoSwitch|CRU|0|77|NOT switching state: failed to force buddy agent to offline
SLNet.exe|DoSwitch|CRU|0|77|Local agent = Offline
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshing Failover Config...
SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|77|Refreshed Failover Config
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|Scheduler_OnMissedInterval|ERR|0|98|Previous check for x.x.x.x.2 (thread 77) still in progress after 20000ms => ASSUME FAILURE
SLNet.exe|HandleSetDrsState|CRU|0|21|Ignored SetDrsState request by Redundancy on DMA-2 because all heartbeat paths are failing (prevent ping-pong)
SLNet.exe|SendHeartBeat|ERR|0|77|!!!WARNING!!! Agent 'x.x.x.x.2' is not able to correctly sync at the moment.
By the way, the log entry (partner lastonline: 2024-02-06 10:43:30 < local lastonline: 2024-02-12 09:35:45) is confusing. 2024-02-12 was the date of the last update of the "failover config" on DMA-1.
2024/02/12 11:11:38.791|SLNet.exe|ActionsOnFirstNonFail|INF|0|173|First success after failures
2024/02/12 11:11:38.791|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshing Failover Config...
2024/02/12 11:11:38.792|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshed Failover Config
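To make excerpts like these easier to scan, here is a minimal Python sketch that splits the pipe-delimited lines and keeps only the CRU-level entries, which in the excerpts above carry the actual state decisions. The field names (`process`, `method`, `level`, `thread`) are my own reading of the layout, not an official format description, and the meaning of the fourth field (always `0` here) is unknown to me, so it is ignored.

```python
from typing import List, NamedTuple, Optional


class FailoverLogLine(NamedTuple):
    timestamp: Optional[str]  # only present on some of the pasted lines
    process: str
    method: str
    level: str                # INF, ERR or CRU in the excerpts
    thread: str
    message: str


def parse_line(line: str) -> Optional[FailoverLogLine]:
    """Parse one SLFailover.txt line, with or without a leading timestamp."""
    parts = line.strip().split("|")
    # The process name (e.g. SLNet.exe) is the first or the second field.
    start = next((i for i, p in enumerate(parts[:2]) if p.endswith(".exe")), None)
    if start is None:
        return None
    fields = parts[start:]
    if len(fields) < 6:
        return None
    process, method, level, _unknown, thread = fields[:5]
    message = "|".join(fields[5:])  # keep any '|' inside the message text
    timestamp = parts[0] if start == 1 else None
    return FailoverLogLine(timestamp, process, method, level, thread, message)


def state_changes(lines: List[str]) -> List[FailoverLogLine]:
    """Return only the CRU-level entries (assumed to be the state decisions)."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry.level == "CRU"]


sample = [
    "SLNet.exe|DoFail|INF|0|18|Reached 6 failures via x.x.x.x.1 => checking whether DataMiner Failover state needs to change",
    "SLNet.exe|NotifyMaxFailuresReached|CRU|0|18|GOING ONLINE (AUTOMATIC DATAMINER FAILOVER)",
    "SLNet.exe|DoSwitch|CRU|0|18|NOT switching state: failed to force buddy agent to offline",
    "2024/02/12 11:11:38.792|SLNet.exe|NotifyFailoverInfoChange_Inner|INF|0|173|Refreshed Failover Config",
]

for entry in state_changes(sample):
    print(entry.method, "->", entry.message)
```

Run against the full file of each agent, this leaves a short timeline of the "going online" attempts and refusals on each side.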
Neither DMA acquired the VIP, so both stayed offline until I stopped DMA-2 and restarted DMA-1.
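My reading of the two logs, as a toy Python model: each agent that wants to go online first asks its buddy to go offline, the buddy ignores that request while all of its own heartbeat paths are failing, and the requester then refuses to switch. This is only a sketch of what the log messages suggest. The class, method, and attribute names are hypothetical, and it is not DataMiner's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    online: bool = False
    heartbeats_ok: bool = False  # False = all heartbeat paths to the buddy fail

    def handle_set_offline(self) -> bool:
        """The buddy asks this agent to go offline."""
        if not self.heartbeats_ok:
            # Log: "Ignored SetDrsState request ... because all heartbeat
            # paths are failing (prevent ping-pong)"
            return False
        self.online = False
        return True

    def try_go_online(self, buddy: "Agent") -> bool:
        """Go online only if the buddy confirmed that it went offline."""
        if not buddy.handle_set_offline():
            # Log: "NOT switching state: failed to force buddy agent to offline"
            return False
        self.online = True
        return True


# Both agents are offline, and each one sees all heartbeat paths failing.
dma1 = Agent("DMA-1")
dma2 = Agent("DMA-2")

dma2_result = dma2.try_go_online(dma1)  # refused by DMA-1
dma1_result = dma1.try_go_online(dma2)  # refused by DMA-2

print(dma2_result, dma1_result, dma1.online, dma2.online)
```

In this model neither agent can ever go online as long as both report failing heartbeats, which matches what I observed.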
How can I prevent the Failover system from ending up in such a deadlock?