Hi there!
Our customer has a redundancy group with a main DMA and a backup DMA. When the main DMA went down, it failed over to the backup DMA, but during the failover there was a service downtime of 5 minutes or more. P.S. There are no timeout elements. We tried adjusting the heartbeats, but there was no visible improvement.
Is there any method or solution to reduce the service downtime? The customer expects less than 1 minute of service downtime.
Hi Zheng,
I assume the downtime you are talking about is the time it takes to switch the agent from online to offline?
This does indeed take a while, because we properly close down all communication and processes on one side and then spin everything up on the other side. This is done to avoid polling conflicts and overloading the network.
There is, however, an option called "AlwaysBruteForceOffline", which is not enabled by default, that makes this switch happen in a more brute-force way. There are still some bugs in this option, but it should work fine in the latest versions. I would reach out to your technical contact person to see if this option would be a solution or an improvement for you.
More information can be found in one of the release notes (RNs) that fixes the issues:
Failover: Problem with AlwaysBruteForceOffline option [ID_33775]
General Feature Release 10.2.8 | DataMiner Docs