A protocol can retrieve data from a device using an SSH interface. I would like to understand, in as much detail as possible, how DataMiner manages SSH connections: when it opens or closes the connection, how it detects that the connection is down or inactive, when it decides to reopen the connection, whether it reuses the connection between polling cycles, etc.
I am also interested in how the algorithm for recovering SSH connections after unexpected situations affects the element state in terms of timeout alarms.
For security reasons, some devices are configured to automatically close an open connection after some time, e.g., 5 minutes. In this case, if an element sends commands via SSH to the device, say every 15 minutes, how does DataMiner handle the fact that the connection is no longer open? Does this generate a ‘Not responding’ alarm? For how long?
The questions above are just examples. My goal is to understand how the protocol works when it polls data via SSH, specifically when something unexpected happens in the connection or in the network, and which alarms we can expect in these situations. I would also like to know whether the ‘normal’ behavior can be changed (via configuration or via software changes).
Thanks in advance.
DataMiner opens the connection as soon as the element starts and keeps it open as long as possible. This avoids the overhead of setting up the connection over and over again. Especially with encrypted connections such as SSH, where ciphers and key-exchange algorithms keep getting more secure, the handshake comes at a certain cost in terms of CPU power. That’s why, by default, the connection remains open as long as possible.
When the element stops, DataMiner will close the connection.
When the connection is closed explicitly by the device, DataMiner remains in the disconnected state until the next execution of a group on that connection, at which point it establishes the connection again. While this shouldn’t result in a timeout alarm, I experimented with this and noticed it can actually happen if the configured command timeout is not long enough to both set up the connection with the device and wait for the response.
Imagine a command timeout of 500 ms. If the device has closed the connection, then upon the next execution of the group the first command fails, because the connection is down and DataMiner is not yet aware of this, so it reconnects and sends the command again. But if connecting takes 420 ms, that leaves only 80 ms to receive the response before it is considered a timeout. This is not in line with what happens at startup of the element: the connect should be separated from the sending of the command, and the full 500 ms should be available to wait for the response. I have created an internal ticket (ID: 150661) to rectify this.
It's also possible to take full control of the connection state from within the protocol. This can be achieved through a close action:
<Action id="1">
  <On id="0">protocol</On>
  <Type>close</Type>
</Action>
Here, 0 is the ID of the port that needs to be closed.
This action could then be executed after each group (or after a selection of the groups that are executed on the SSH connection) by means of a trigger:
<Trigger id="1">
  <On id="each">group</On>
  <Time>after</Time>
  <Type>action</Type>
  <Content>
    <Id>1</Id>
  </Content>
</Trigger>
With this trigger, the SSH connection is closed after every group execution.
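If closing the connection after every group is more than you need, the same trigger can presumably be narrowed to a single group. The following is only a sketch: group ID 10 and trigger ID 2 are hypothetical values chosen for illustration, and it reuses the close action with ID 1 shown earlier.

```xml
<!-- Hypothetical IDs: group 10 is assumed to be the group polled over SSH,
     and trigger ID 2 is assumed to be unused in the protocol. -->
<Trigger id="2">
  <On id="10">group</On>
  <Time>after</Time>
  <Type>action</Type>
  <Content>
    <!-- Runs the close action (ID 1) defined earlier -->
    <Id>1</Id>
  </Content>
</Trigger>
```

To cover a selection of groups, one such trigger per group would be needed under this approach.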
Similarly, there's also an open action, but implementing it is optional, as DataMiner automatically opens the connection before executing a group if it notices it is not connected to the remote endpoint.
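For completeness, an explicit open action would likely mirror the close action shown earlier. This is a sketch under two assumptions: action ID 2 is free in your protocol, and port 0 is the SSH connection, as in the close action. Please verify it against the protocol schema before relying on it.

```xml
<!-- Assumptions: action ID 2 is unused; 0 is the ID of the SSH port,
     matching the close action above. -->
<Action id="2">
  <On id="0">protocol</On>
  <Type>open</Type>
</Action>
```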