On a driver using a custom polling configuration table, frequent RTEs related to a timer thread were reported. After checking, it seems that the timer that starts the polling logic (which checks the polling table to determine whether tables need to be polled) remains stuck until all the tables have been polled. The explanation for this lies in the way the group execution queue works. (See Inner Workings: SLProtocol)
All groups can be split up into 2 types:
- Non-poll actions:
If the group type does not contain ‘poll’, the timer thread itself immediately executes the logic defined in the group.
- Poll actions:
If the group type contains ‘poll’, the group is put on the queue for later execution.
For all ‘poll’ logic, it’s also important to know that although there is a single group execution queue, a priority system is in place. The main reason is that custom actions defined via trigger>action are usually considered more important and shouldn’t have to wait too long.
- Groups defined in a timer will by default have a lower priority and will be executed after groups that are called from actions.
- Groups called from actions are put in the default queue, with options available to put them on top or at the end of that queue (still before the timer groups). With the ‘add to execute’ option, it is even possible to put them below the timer groups (lower priority) as well.
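To make the ordering above concrete, here is a minimal Python sketch of a single queue with priority levels. This is only an illustration of the behavior as described in this post, not how SLProtocol is actually implemented. The class name, the level constants, and the group names are all hypothetical, and the "on top" versus "at the end" options for action groups are not modeled.

```python
import heapq
from itertools import count

# Hypothetical priority levels for illustration; a lower number runs first.
ACTION, TIMER, AFTER_TIMER = 0, 1, 2


class GroupQueue:
    """Toy model of one execution queue with priority levels (assumption)."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # keeps insertion order within the same level

    def add(self, name, level):
        heapq.heappush(self._heap, (level, next(self._seq), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = GroupQueue()
queue.add("timer group", TIMER)
queue.add("action group A", ACTION)
queue.add("add-to-execute group", AFTER_TIMER)
queue.add("action group B", ACTION)

# Drain the queue in execution order.
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

In this model, the timer group only runs once no action-level groups are left, even though it was added first.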
Getting back to the RTE issue, we found that the group that starts the polling table logic has the ‘poll action’ type, so it is put on the queue. When the device starts, there is nothing on the default queue, so the polling table logic, which is placed on the lower-priority timer queue, is executed immediately. This adds all table groups to the default queue. The timer then knows all its logic is done, and since the polling table logic usually needs to run very frequently, in most cases it will run again almost immediately. However, this time its group is stuck on the lower-priority queue until all the groups on the normal queue (the tables that are being polled) are done. When the element is in timeout, each group takes quite long because of the timeout and retry mechanism, and even longer when redundancy connection switching happens. If the driver polls too many tables, the timer group ends up waiting longer than the RTE time of 15 minutes. Since the timer thread is waiting for feedback that its group has executed, this thread will eventually trigger the RTE.
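As a rough back-of-the-envelope check of the reasoning above: if the re-queued timer group has to wait behind every table group, its wait time is approximately the number of tables multiplied by the time each group takes while the element is in timeout. The sketch below assumes a hypothetical 30 seconds per group (timeouts plus retries); only the 15-minute RTE time comes from the description above, and the function names are made up for this example.

```python
RTE_THRESHOLD_S = 15 * 60  # RTE time of 15 minutes, as mentioned above


def timer_group_wait(table_count, seconds_per_group):
    """Approximate time the timer group waits behind all queued table groups."""
    return table_count * seconds_per_group


def triggers_rte(table_count, seconds_per_group):
    """True if the estimated wait exceeds the RTE threshold."""
    return timer_group_wait(table_count, seconds_per_group) > RTE_THRESHOLD_S


# Hypothetical value: 30 s per table group while the element is in timeout.
for tables in (10, 30, 40):
    wait = timer_group_wait(tables, 30)
    print(tables, "tables:", wait, "s, RTE:", triggers_rte(tables, 30))
```

With these assumed numbers, 40 tables would already keep the timer group waiting for 1200 seconds, which is beyond the 900-second threshold.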
As for a solution, we could change the polling logic group that is called via the timer from ‘poll action’ to ‘action’. This means that the logic would be executed immediately instead of ending up on the lower-priority queue. However, the DIS validator clearly states that the last group of a timer should be a ‘poll’ type action. One clear reason for this is how a timer is notified that it can execute its groups again: the last group is used for this. When that group has finished, it is assumed that all previous groups in the timer (if any) are also done. However, if there are ‘poll’ type groups (put on the queue) and the timer ends with a non-poll type group (executed instantly), the timer will trigger again and again while the groups that were put on the queue are nowhere near finished. As a result, the same groups are put on the queue again and again, filling it up with items that will eventually take too long to execute and cause RTEs. An example of this can be found here: Debugging connectors: RTE caused by non-poll group in timer
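The stacking effect described above can be illustrated with a small simulation. This is a simplified model under stated assumptions, not actual DataMiner behavior: time advances in 1-second steps, one group executes at a time, and the timer re-adds its ‘poll’ groups on every run without waiting for them. The function and parameter names are hypothetical.

```python
def queue_length_after(duration_s, timer_interval_s, poll_groups_per_run,
                       seconds_per_group):
    """Return how many poll groups are still waiting after duration_s seconds.

    Assumption: the timer ends with a non-poll group, so it fires again
    every timer_interval_s seconds regardless of what is still queued.
    """
    pending = 0
    busy_until = 0  # time at which the currently executing group finishes
    for now in range(duration_s):
        if now % timer_interval_s == 0:
            pending += poll_groups_per_run  # timer fires unconditionally
        if pending and now >= busy_until:
            pending -= 1  # start executing the next queued group
            busy_until = now + seconds_per_group
    return pending


# Hypothetical values: timer every 1 s, one poll group per run, 5 s per group.
print(queue_length_after(60, 1, 1, 5))
print(queue_length_after(120, 1, 1, 5))
```

In this model the backlog keeps growing for as long as groups are added faster than they can be executed, which matches the behavior the DIS check is meant to prevent.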
This brings me to my question: would there be any harm in having the polling logic be of group type ‘action’ (instant) if there are no other groups in the same timer? This should prevent any groups from stacking up on the queue. We also tested various cases to confirm that the action does not stack itself on the queue (it doesn’t). I would like to know if there are other known issues that could be encountered by doing it this way, or if the DIS check is only there to prevent the use case I just described, with other groups being in the same timer.
Hi Robin,
Considering that this issue appears to occur only when the element is in timeout, is it possible that the problem and its solution are related to the slow polling and ping group?
https://docs.dataminer.services/develop/devguide/Connector/ConnectionsPingGroup.html
Kind regards,
Hi Flavio
When checking the pending calls, we saw all groups being executed one after another (pretty slowly because of the timeout, but otherwise just fine), while the timer thread was hanging (presumably because its logic was waiting to be executed, as per the logic explained above).