On a driver using a custom polling configuration table, frequent RTEs related to a timer thread were reported. After checking, it seems that the timer that starts the polling logic (which checks the polling table to determine whether tables need to be polled) remains stuck until all the tables have been polled. The explanation for this can be found in the way the group execution queue works (see Inner Workings: SLProtocol).
All groups can be split up into two types:
- Non-poll actions: When the group type does not contain ‘poll’, the timer thread itself immediately executes the logic defined in the group.
- Poll actions: When the group type contains ‘poll’, the group is put on the queue for later execution.
For all ‘poll’ logic, it is also important to know that, although there is a single group execution queue, a priority system is in place. The main reason is that custom actions defined via trigger>action are usually more important and should not have to wait too long.
- Groups defined in a timer have a lower priority by default and are executed after groups that are called from actions.
- Groups called from actions are put on the default queue, with options to put them on top or at the end of that queue (still before the timer groups). With the ‘add to execute’ option, it is even possible to give them a lower priority than the timer groups.
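To make the ordering described above a bit more concrete, here is a rough Python sketch of a single queue with priority bands. This is only a toy model based on the description in this post, not how SLProtocol is actually implemented. The band names and group names are made up for illustration, and the options to put an action group on top or at the end of the default queue are not modelled.

```python
import heapq
from itertools import count

# Hypothetical priority bands (lower value runs first). These names are
# illustrative only and are not DataMiner identifiers.
ACTION, TIMER, BELOW_TIMER = range(3)


class GroupQueue:
    """Toy model of one execution queue with priority bands."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: FIFO order within a band

    def put(self, name, band):
        heapq.heappush(self._heap, (band, next(self._order), name))

    def drain(self):
        """Return the group names in the order they would be executed."""
        executed = []
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            executed.append(name)
        return executed


queue = GroupQueue()
queue.put("timer: polling table check", TIMER)
queue.put("action: poll table A", ACTION)
queue.put("action: poll table B", ACTION)
queue.put("action: low priority refresh", BELOW_TIMER)

order = queue.drain()
print(order)
```

In this model the timer group only runs once both table groups are done, even though it was queued first.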
Getting back to the RTE issue, we found that the group that starts the polling table logic has the ‘poll action’ type, so it is put on the queue. When the device starts, there is nothing on the default queue, so the polling table logic that was put on the lower-priority timer queue is executed immediately. This causes all table groups to be added to the default queue. The timer then knows that all its logic is done and, since the polling table logic usually needs to run very frequently, in most cases it will run again almost immediately. This time, however, its group is stuck on the lower-priority queue until all the groups on the normal queue (the tables that are being polled) are done.
When the element is in timeout, each group takes quite long because of the timeout and retry mechanism. This can take even longer when redundancy connection switching happens. If too many tables are polled by the driver, the timer group has to wait longer than the RTE time of 15 minutes. Since the timer thread is waiting for feedback that its group has been executed, this thread will eventually trigger the RTE.
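As a back-of-the-envelope check of the reasoning above, the worst-case wait of the timer group can be estimated as below. This is a simplified sketch: it assumes every table group times out on every attempt and ignores redundancy connection switching. The table count, timeout and retry values are made-up example numbers, not values from the actual driver; only the 15-minute RTE time comes from this post.

```python
RTE_THRESHOLD_S = 15 * 60  # RTE time of 15 minutes mentioned above


def timer_group_wait_s(table_count, timeout_s, retries):
    """Worst-case wait (seconds) when every table group times out on each attempt."""
    attempts_per_group = 1 + retries
    return table_count * timeout_s * attempts_per_group


def exceeds_rte_time(table_count, timeout_s, retries):
    """True when the timer group would wait longer than the RTE time."""
    return timer_group_wait_s(table_count, timeout_s, retries) > RTE_THRESHOLD_S


# Hypothetical example values: 60 tables, 5 s timeout, 3 retries.
wait = timer_group_wait_s(table_count=60, timeout_s=5, retries=3)
print(wait, exceeds_rte_time(60, 5, 3))
```

With these example numbers the timer group would wait 1200 seconds, which is more than the 900 seconds of the RTE time.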
As for a solution, we could change the polling logic group that is called via the timer from ‘poll action’ to ‘action’. The logic would then be executed immediately instead of ending up on the lower-priority queue. However, the DIS validator clearly states that the last group of a timer should be a ‘poll’ type action. One clear reason for this is how a timer is notified that it can execute its groups again: the last group is used for this. When that group is finished, it is assumed that all previous groups in the timer (if any) are also done. However, if the timer contains ‘poll’ type groups (put on the queue) and ends with a non-poll type group (executed instantly), the timer will trigger again and again while the queued groups are nowhere near finished. The same groups are then put on the queue over and over, filling it with items that will eventually take too long to execute and cause RTEs. An example of this can be found here: Debugging connectors: RTE caused by non-poll group in timer
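The queue build-up described above can be illustrated with a small tick-based simulation. This is a simplified sketch and not the real SLProtocol behaviour: it assumes the timer is due on every tick, that one worker executes one group at a time, and that each queued group takes `group_duration` ticks. All names are hypothetical.

```python
from collections import deque


def simulate(last_group_is_poll, ticks, group_duration=3):
    """Return the queue length recorded at the end of each tick."""
    queue = deque()
    current = None         # group currently being executed
    remaining = 0          # ticks left for the current group
    timer_waiting = False  # True while the timer waits for its last (poll) group
    lengths = []
    for _ in range(ticks):
        # The timer fires unless it is still waiting for its last group.
        if not timer_waiting:
            queue.append("table poll")
            if last_group_is_poll:
                queue.append("last")
                timer_waiting = True
        # The worker picks up the next group when it is idle.
        if current is None and queue:
            current = queue.popleft()
            remaining = group_duration
        # One tick of work on the current group.
        if current is not None:
            remaining -= 1
            if remaining == 0:
                if current == "last":
                    timer_waiting = False  # timer may fire again
                current = None
        lengths.append(len(queue))
    return lengths


growing = simulate(last_group_is_poll=False, ticks=9)
bounded = simulate(last_group_is_poll=True, ticks=30)
print(growing)
print(max(bounded))
```

In this model the queue keeps growing when the timer ends with a non-poll group, while it stays bounded when the last group is a ‘poll’ type group.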
Here comes the question: would there be any harm in making the polling logic group of type ‘action’ (instant) if there are no other groups in the same timer? This should avoid any groups stacking up on the queue. We also tested various cases to confirm that the action does not stack itself on the queue (it doesn’t). I would like to know whether there are other known issues that could be encountered by doing it this way, or whether the DIS check only exists to prevent the use case I just described, with other groups being in the same timer.
The check stating that the last group of a timer should be a 'poll' type action exists for only one reason: to guide developers towards writing robust code. In 99% of all cases you don't want a timer to keep putting actions on the queue when the previous ones it added haven't finished.
You can use a non-poll type action in a timer if you know what you're doing and are aware that it might cause a growing queue if the content takes longer than the polling time.
A common use case in the past for timers with non-poll type actions has been actions that 'increment' a counter somewhere. For example, every 5 seconds you want to increment a counter by 5. You would not use a 'poll' type action here but a normal action. This way you know that even if something causes a delay of more than 5 s, your counter will eventually end up with the right number.
If you do use a timer without a 'poll' type action as its last group, make sure you clearly comment the reasoning in your connector source code, to avoid someone mistakenly 'fixing' it.
Hi Robin,
Considering that this issue appears to occur only when the element is in timeout, is it possible that the problem and its solution are related to the slow polling and ping group?
https://docs.dataminer.services/develop/devguide/Connector/ConnectionsPingGroup.html
Kind regards,
Hi Robin,
Thank you for this very interesting question!
After re-reading the following use case: Debugging connectors: RTE caused by non-poll group in timer | DataMiner Docs, it seems to make perfect sense that the reason for having the last group be of type poll is to prevent the timer from adding more and more groups to the queue while the same groups added by a previous iteration of that timer are not yet finished. This means that the rule of having the last group be of type poll is only needed if the timer contains some groups of type poll. Having a timer with one or multiple non-poll groups and no poll group at all is perfectly fine. As soon as the timer has a poll group, the last group of that timer should also be of type poll.
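The rule as I understand it can be written down as a small check. This is only a sketch of my interpretation above, not the actual DIS validator logic, and the function name is made up. A group counts as a poll group when its type contains "poll", as described in the question.

```python
def last_group_rule_ok(group_types):
    """Check one timer: group_types lists the types of its groups, in order."""
    has_poll_group = any("poll" in group_type for group_type in group_types)
    if not has_poll_group:
        # Only non-poll groups: nothing is queued, so nothing can stack up.
        return True
    # At least one poll group: the last group must be a poll group as well.
    return "poll" in group_types[-1]


print(last_group_rule_ok(["action"]))                 # single instant group
print(last_group_rule_ok(["poll", "action"]))         # queued group, instant last group
print(last_group_rule_ok(["action", "poll action"]))  # last group is queued
```

The first and third timers pass this check, the second one does not.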
Hi Flavio
When checking the pending calls, we saw all groups being executed one after another (pretty slowly because of the timeout, but overall just fine), while the timer thread was hanging (presumably because its logic was waiting to be executed, according to the logic explained above).