We have a self-hosted DMS with 10 main/failover pairs that was just set up, running DataMiner 10.4 CU1.
If we stop one main DMA we get an information event "DataMiner Agent lost" from every other main DMA.
So far so good. However, not all main DMAs throw the timeout alarm "Connection lost with DMA".
What could be wrong? What should we check?
The difference between the two is that the "DataMiner Agent lost..." event is generated when a DMA detects that it can no longer reach the SLNet process of the agent it is trying to contact. This triggers some procedures that try to re-establish contact with that agent.
One of those procedures is a ping to the IP/hostname of the lost agent. If the ping also fails, we get the "Connection lost with DMA" alarm, indicating there might be a network-related issue.
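
To make the two-stage logic described above easier to reason about, here is a minimal Python sketch. It is not DataMiner code: the function name `expected_notifications` and its two boolean inputs are hypothetical, and it only models the description given here (SLNet unreachable raises the information event, and a failed ping on top of that raises the alarm).

```python
from typing import List

# Notification texts as quoted in this thread.
EVENT_AGENT_LOST = "DataMiner Agent lost"
ALARM_CONNECTION_LOST = "Connection lost with DMA"


def expected_notifications(slnet_reachable: bool, ping_succeeded: bool) -> List[str]:
    """Hypothetical model of what one observing DMA would report about a remote agent.

    slnet_reachable: assumption, True if the observer can still reach the remote SLNet process.
    ping_succeeded:  assumption, True if a ping to the remote IP/hostname gets a reply.
    """
    if slnet_reachable:
        # The agent is still reachable, so nothing is reported.
        return []

    # Stage 1: SLNet cannot be reached, so the information event is generated.
    notifications = [EVENT_AGENT_LOST]

    # Stage 2: the follow-up ping decides whether the alarm is raised as well.
    if not ping_succeeded:
        notifications.append(ALARM_CONNECTION_LOST)

    return notifications


if __name__ == "__main__":
    for slnet, ping in [(True, True), (False, True), (False, False)]:
        print(slnet, ping, expected_notifications(slnet, ping))
```

Under this model, a main DMA that can still ping the stopped agent's server would show only the information event. So one thing worth checking is whether a manual ping to the stopped agent's IP/hostname succeeds from the DMAs that do not raise the alarm and fails from the ones that do.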