We have 3 DMA nodes in a cluster.
On one node, when the B01 server is active, it initiates a reconnect if a network issue causes a brief interruption.
When the M01 server is active on the same node, it does not initiate a reconnect.
We are running 9.5 and are upgrading the hardware in preparation for a software upgrade.
The issue occurs on both the old and the new M01 server, so it looks like a configuration issue.
A second, possibly related issue:
With the original M01 server active, it would only disconnect from one of the other nodes.
With the new M01 server active, it disconnects from both of the other nodes.
Hi Neil,
Apologies for the late reply. Are you still running version 9.5?
There are a few things I can add to this problem.
There are two different ways a DMA can disconnect from the cluster. The first is simply a network issue. Some SLNet settings can be tuned to make this more or less sensitive. Unfortunately, it is hard to say in advance which metric is the most sensitive one in your environment, but it should be logged in the "SLNet.txt" log file at the moment the disconnect happens.
You can find more troubleshooting tips here:
Troubleshooting - SLNet - DataMiner Dojo
A "callback timeout" is a typical one:
Debugging callback timeout errors | DataMiner Docs
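To find the relevant entries around the time of a disconnect, a small script that pulls matching lines (with a few lines of context) out of SLNet.txt can save some scrolling. This is a minimal sketch, assuming the log is plain text and that lines of interest contain words such as "callback timeout" or "disconnect". The exact wording SLNet writes is an assumption, so adjust `DEFAULT_PATTERNS` to what your log actually shows. The function names `find_matches` and `scan_file` are hypothetical helpers, not part of DataMiner.

```python
import re
from typing import Iterable, List, Tuple

# Hypothetical search terms: the exact wording in SLNet.txt is an assumption.
DEFAULT_PATTERNS: Tuple[str, ...] = ("callback timeout", "disconnect")


def find_matches(lines: Iterable[str],
                 patterns: Tuple[str, ...] = DEFAULT_PATTERNS,
                 context: int = 2) -> List[Tuple[int, List[str]]]:
    """Return (1-based line number, surrounding lines) for every matching line.

    Matching is case-insensitive; `context` lines are kept before and after.
    """
    all_lines = [line.rstrip("\n") for line in lines]
    regex = re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)
    hits: List[Tuple[int, List[str]]] = []
    for i, line in enumerate(all_lines):
        if regex.search(line):
            start = max(0, i - context)
            end = min(len(all_lines), i + context + 1)
            hits.append((i + 1, all_lines[start:end]))
    return hits


def scan_file(path: str, context: int = 2) -> None:
    """Print every match in the given log file with its surrounding lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, snippet in find_matches(f, context=context):
            print(f"--- match at line {lineno} ---")
            print("\n".join(snippet))
```

Call `scan_file(...)` with the path to the SLNet.txt of the DMA that dropped out, then compare the timestamps of the matches with the moment the disconnect was seen on the other nodes.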
The second possibility is self-protection: if for some reason there is too much delay in the communication between the DMAs, or we receive far too many events, we will also initiate a disconnect.
All these metrics can be increased. I believe we raised them by default in 9.6, and throughout the different 9.6 main release versions we made DMA-to-DMA communication more robust.
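To illustrate why raising these metrics makes a DMA less likely to drop out, here is a minimal sketch of a threshold-based self-protection check on one DMA-to-DMA link. It is not DataMiner's implementation: the class `LinkWatchdog`, its parameters (`max_latency_s`, `max_events_per_window`, `window_s`) and their default values are hypothetical, chosen only to show the idea that the link is dropped when a call is too slow or too many events arrive within a sliding time window.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class LinkWatchdog:
    """Hypothetical self-protection check for one DMA-to-DMA link.

    The thresholds and their names are illustrative assumptions only.
    """
    max_latency_s: float = 5.0          # trip if one call takes longer than this
    max_events_per_window: int = 1000   # trip if more events than this arrive...
    window_s: float = 1.0               # ...within this sliding window (seconds)
    _event_times: Deque[float] = field(default_factory=deque, init=False)
    _last_latency_s: float = field(default=0.0, init=False)

    def record_call(self, latency_s: float) -> None:
        """Remember how long the most recent call to the other DMA took."""
        self._last_latency_s = latency_s

    def record_event(self, timestamp_s: float) -> None:
        """Register an incoming event and drop events outside the window."""
        self._event_times.append(timestamp_s)
        while self._event_times and self._event_times[0] <= timestamp_s - self.window_s:
            self._event_times.popleft()

    def should_disconnect(self) -> bool:
        """True if the link is too slow or too busy to keep open safely."""
        too_slow = self._last_latency_s > self.max_latency_s
        too_busy = len(self._event_times) > self.max_events_per_window
        return too_slow or too_busy


# Same burst of 300 events in 0.3 s, two different thresholds.
strict = LinkWatchdog(max_events_per_window=100)
relaxed = LinkWatchdog(max_events_per_window=500)
for i in range(300):
    t = i * 0.001
    strict.record_event(t)
    relaxed.record_event(t)
print(strict.should_disconnect(), relaxed.should_disconnect())  # True False
```

With the stricter threshold, the same burst of events trips the check, while the raised threshold keeps the link up, which is the trade-off behind tuning these metrics.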