There is an issue with the Aurora Network CH3000 driver: very occasionally, mainly during the night (between 2 AM and 5 AM), the element goes into timeout and stays that way until it is restarted. Once restarted, the element works perfectly again. This problem has occurred on multiple elements.
All communication is done through SNMP, and I am having difficulty understanding why the element does not recover from the timeout by itself.
I have no information as to which tables go into timeout, since it has not been possible to reproduce the issue and, as I mentioned, it occurs only very occasionally.
I tried running a simulation for the driver, stopping the simulation until the element went into timeout, and then running it again to see whether the driver recovers. It recovers fine.
The ping group of this SNMP protocol is the group responsible for the general SNMP parameters (description, contact, name, location).
What could be the possible causes for this?
Hi,
Possible root causes could be:
- When a host name is used instead of an IP address, the host name could not be resolved.
- The request ID that is used goes above a certain value, resulting in a negative value.
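To illustrate the second point: the sketch below shows how a counter stored as a signed 32-bit integer turns negative once it passes its maximum. This is only an illustration of the arithmetic. The function names are hypothetical, and nothing in this thread confirms how the request ID is actually stored or incremented.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def as_int32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    return ((value + 2**31) % 2**32) - 2**31


def next_request_id(current: int) -> int:
    """Hypothetical counter: increment and wrap like a signed 32-bit field."""
    return as_int32(current + 1)


if __name__ == "__main__":
    print(next_request_id(41))         # normal increment
    print(next_request_id(INT32_MAX))  # wraps around to a negative value
```

A counter like this only reaches the wrap point after a long period of polling, which would fit a problem that shows up rarely.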
Looking at the StreamViewer and/or taking a Wireshark capture (when not SNMPv3) when the element is in timeout could provide some better insights.
Regards,
The elements in question only use IP.
It’s a bit hard to catch the element in this state. However, I don’t understand why the “request id” would cause this issue. If that were the case, shouldn’t the element go back into timeout after being restarted?
In this case, after restart, it won’t go into timeout again.
A Wireshark capture taken while the element is in timeout will provide more information about what is going wrong. It can then be compared with a capture taken while the device is responding, which makes it possible to confirm or rule out the negative request ID as the root cause.
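Once the request IDs have been exported from both directions of a capture, the comparison could be sketched as follows. This is a minimal sketch: the function names and the sample numbers are made up for illustration, and getting the IDs out of the capture is left to whatever export you prefer.

```python
def unanswered_request_ids(request_ids, response_ids):
    """Return the request IDs that never got a matching response."""
    answered = set(response_ids)
    return [rid for rid in request_ids if rid not in answered]


def summarize(request_ids, response_ids):
    """Count unanswered requests and how many of those had a negative ID."""
    missing = unanswered_request_ids(request_ids, response_ids)
    return {
        "sent": len(request_ids),
        "unanswered": len(missing),
        "unanswered_negative": sum(1 for rid in missing if rid < 0),
    }


if __name__ == "__main__":
    # Made-up sample values, not taken from a real capture.
    sent = [2147483646, 2147483647, -2147483648, -2147483647]
    received = [2147483646, 2147483647]
    print(summarize(sent, received))
```

If every unanswered request has a negative ID while the positive ones are answered, that points towards the request ID. If positive IDs go unanswered as well, the cause is more likely something else.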
Hi Tomas,
Regarding your question ‘why the request id would cause this issue’, you can find more information about it in a previous question in Dojo:
https://community.dataminer.services/question/snmp-negative-request-ids-result-in-timeout/
@tomas.
We were having a similar issue with elements in timeout.
As mentioned above, Stream Viewer is a great starting point for seeing what the element is doing, which parameters are not responding, and so on.
We implemented Slow Poll for those elements as, on some sites, the connection was causing dropouts. You have to make sure in the protocol that the first group is a “Poll” group, as these are used to ping the element until a connection can be restored. The best thing about Slow Poll is that you are not bombarding the IP device with hundreds of requests, which is what helped us solve our problem.
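The general idea can be sketched as a small state machine. To be clear, this is not DataMiner's implementation: the class name, the threshold, and the mode names are all assumptions, and it only shows the principle of backing off to a lightweight ping after repeated timeouts and resuming full polling once the device answers.

```python
class SlowPollState:
    """Hypothetical model of slow polling, for illustration only."""

    def __init__(self, timeouts_before_slow_poll: int = 3):
        self.threshold = timeouts_before_slow_poll
        self.consecutive_timeouts = 0
        self.slow_poll = False

    def record(self, responded: bool) -> str:
        """Register one poll result and return the mode for the next cycle."""
        if responded:
            # Any response restores normal polling.
            self.consecutive_timeouts = 0
            self.slow_poll = False
        else:
            self.consecutive_timeouts += 1
            if self.consecutive_timeouts >= self.threshold:
                # Stop sending the full request set; only ping from now on.
                self.slow_poll = True
        return "ping_only" if self.slow_poll else "full_poll"


if __name__ == "__main__":
    state = SlowPollState(timeouts_before_slow_poll=2)
    for responded in (False, False, False, True):
        print(responded, state.record(responded))
```

In this sketch the device only sees a single ping per cycle while it is unreachable, instead of the full set of requests.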
Have a read about it in the DataMiner docs for more information, as it has its advantages and disadvantages.
I see that this question has been inactive for some time. Have any of the suggestions below been helpful for you, or have you been able to find a solution in the meantime? If yes, could you select the answer to indicate that this question is resolved?