Hi,
We recently changed the polling frequency of our HTTP connection endpoint from once every second to once every 10 seconds. Since this change, the elements frequently time out and then quickly recover. The elements are set to time out after 5 seconds. The error seen in Stream Viewer is:
<- 10:41:45 - GET *https address*
-> 10:41:45 - Error : 12152. [ERROR_WINHTTP_INVALID_SERVER_RESPONSE]
-> 10:41:45 - Get for Http_Response_Content_Element_Status () had error : Error : 12152. [ERROR_WINHTTP_INVALID_SERVER_RESPONSE]
-> 10:41:45 - Continuing get for Http_Response_Content_Element_Status ()
<- 10:41:55 - GET *https address*
-> 10:41:57 - HTTP/1.1 200 OK
Is the change in polling frequency the cause of the behaviour we are seeing? I suspect the error predates the change, and that we are only picking up the alarms now because polling is less frequent.
Any suggestions on how to improve the situation would be greatly appreciated. Thanks!
Hi,
It seems that from time to time an error is being returned when trying to poll the HTTP endpoint.
As far as I can see, there are no retries configured.
This means that, with the element polling every 10 s and configured to go into timeout after 5 s, the element goes into timeout as soon as an error is returned.
In the past these errors were probably also returned, but because polling happened every second, a valid response was received before the element could go into timeout.
Suggestions to avoid the timeout toggle:
-Configure retries in the element configuration, so that an HTTP command that times out is tried again
-Or increase the element timeout to a value > 10s, so that a second poll cycle can happen before the element goes into timeout
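To make the timing argument concrete, here is a minimal Python sketch. It is only a simplified model I am assuming for illustration (a timer starts at the first failed poll and is cleared by the next valid response), not how DataMiner implements timeouts internally. The function name enters_timeout and its parameters are hypothetical.

```python
def enters_timeout(poll_interval_s, timeout_s, outcomes):
    """Return True if the element would go into timeout in this simplified model.

    outcomes holds one boolean per poll cycle (True = valid response).
    Assumption: the timeout timer starts at the first failed poll and is
    cleared by the next successful poll. No retries are modelled.
    """
    first_failure_at = None
    for i, ok in enumerate(outcomes):
        now = i * poll_interval_s
        if ok:
            first_failure_at = None  # a valid response clears the timer
            continue
        if first_failure_at is None:
            first_failure_at = now
        # The earliest possible recovery is the next poll cycle.
        next_poll_at = now + poll_interval_s
        if next_poll_at - first_failure_at >= timeout_s:
            return True  # the timer expires before the next poll can succeed
    return False


one_error = [True, False, True]
print(enters_timeout(1, 5, one_error))    # polling every 1 s, timeout 5 s
print(enters_timeout(10, 5, one_error))   # polling every 10 s, timeout 5 s
print(enters_timeout(10, 15, one_error))  # polling every 10 s, timeout 15 s
```

Under these assumptions, a single failed poll only leads to a timeout in the second case, which matches the behaviour seen after the polling change.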
The error indicates that a response was received from the HTTP server but could not be parsed. Either the response really had an invalid format, or something went wrong when the response was split over multiple packets and some were lost. As the connection is HTTPS, it will be difficult to inspect the incoming data, because it is encrypted. The error comes from the Windows level, before the data enters DataMiner, so logging it in DataMiner will not be possible either.
I don’t think there are downsides to ignoring this error, other than the data from that poll not entering DataMiner. It is simply saying “a response from the server has been received, but we’re not able to read/decrypt the content”. The next attempt returns a valid response, so the data is received then.
Thanks for the question, @Lisa.
@Lauren,
Thank you for the suggestions on avoiding the timeout toggle by configuring retries or increasing the timeout value. I believe these are viable solutions, but I wanted to share some concerns and suggestions regarding the configuration of retries with a view to finding the right balance between reliability and network performance.
Currently, with the polling frequency set at every 10 seconds and the timeout set to 5 seconds, I understand the retries would help mitigate the effects of transient errors. However, if we introduce retries, there’s a potential for increasing network overhead, especially if errors occur frequently.
To avoid this, could you review the adjustments I am considering and advise accordingly:
Limit the number of retries: Perhaps starting with a conservative retry value (e.g., 1 or 2 retries) to avoid overwhelming the network with unnecessary requests.
Exponential backoff: If the system allows, implementing an exponential backoff for retries would prevent rapid successive retries, which could add unnecessary load on the endpoint.
Monitoring performance: Before making aggressive changes, it may also be worth monitoring the endpoint’s performance over a period of time to ensure that these issues aren’t stemming from underlying network or server-side problems.
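To illustrate the first two points, here is a minimal Python sketch of a limited number of retries with exponential backoff. It is a generic illustration only. I do not know whether the element configuration supports backoff, and retry_with_backoff, fetch, max_retries and base_delay_s are hypothetical names, not DataMiner settings.

```python
import time


def retry_with_backoff(fetch, max_retries=2, base_delay_s=0.5, sleep=time.sleep):
    """Call fetch(), retrying on failure with exponentially growing delays.

    fetch is any callable that raises an exception on a failed request.
    The delay before retry n (starting at 0) is base_delay_s * 2**n.
    After max_retries failed retries, the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            # Broad catch for brevity; real code should only catch errors
            # that are known to be transient.
            if attempt >= max_retries:
                raise
            sleep(base_delay_s * (2 ** attempt))
            attempt += 1
```

With the defaults, a request that fails once is retried after 0.5 s, and a request that keeps failing is attempted three times in total (waiting 0.5 s, then 1 s) before the error is passed on, so the extra load on the endpoint stays bounded.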
May I know the main reason why an element keeps going into timeout again and again?
Thanks Laurens!
This has confirmed what I was thinking. With regard to the error itself, do you think this is anything to be concerned about? We can of course change the configuration so that no alarm is raised, as you have suggested, but I am concerned about whether there are any downsides to ignoring this error, as it looks to be an issue with the HTTP connection endpoint.