Question

Solved · 2.40K views · 6th August 2020 · Tags: redundancy groups, software redundancy

3

Pieter-Jan Vuylsteke [SLC] [DevOps Member]332 27th July 2020 0 Comments

During a DataMiner training, I explained the concept of redundancy groups and software redundancy.
The expectation/question came up that the switch should start within, for example, 500 ms of something going wrong.

Is this behavior something that is common?
How is it done? E.g. is it related to polling parameters with a high frequency, so DataMiner can quickly react?
There was also a concern this could maybe put too much load on the device and/or DataMiner. Can this indeed be the case?

Thanks!

Marieke Goethals [SLC] [DevOps Catalyst] Changed status to publish 6th August 2020

2 Answers

3

Ben Vandenberghe [SLC] [DevOps Advocate]9.54K Posted 27th July 2020 0 Comments

There are a lot of layers in the overall timing between something happening with a service and everything being executed for the service to be restored again. And DataMiner definitely does not control all variables involved in that.

T0: when the condition actually occurs (e.g. an input stream is no longer present / a signal input cable is disconnected / …)

T1: when the involved product is aware of this, and can flag this towards the outside world via its remote control APIs

There will be a delay between T0 and the moment the product is fully aware of the situation and reflects it via its APIs, and this delay is entirely defined by the internal processes of the product. Note that sometimes there is even a deliberate delay built in (sometimes even user-definable).

T2: when DataMiner is aware of this

This entirely depends on the protocol used, and on whether it is push / unsolicited (e.g. SNMP trap) or pull technology (i.e. polling). For pull-type data collection, you have to be careful to design the driver with this in mind (i.e. the timing could depend on your poll timers and other triggers).
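To put rough numbers on the polling case, here is a minimal sketch of the T1 to T2 delay. It is not DataMiner code: the function name and the model are assumptions (polls start at a fixed interval, the device answers in a constant time, and the fault becomes visible at a random moment within the poll cycle).

```python
def polling_detection_delay_ms(poll_interval_ms: float, response_time_ms: float) -> dict:
    """Estimate the delay between T1 (device exposes the fault) and T2 (poller knows).

    Simplified model, all assumptions:
    - polls start every poll_interval_ms
    - each poll takes response_time_ms to return
    - the fault becomes visible at a uniformly random moment in the poll cycle
    """
    if poll_interval_ms <= 0 or response_time_ms < 0:
        raise ValueError("interval must be > 0 and response time >= 0")
    return {
        # Fault appears just before a poll starts: only the response time remains.
        "best": response_time_ms,
        # On average the fault waits half a cycle for the next poll.
        "average": poll_interval_ms / 2 + response_time_ms,
        # Fault appears just after a poll started: wait a full cycle, then the response.
        "worst": poll_interval_ms + response_time_ms,
    }


if __name__ == "__main__":
    print(polling_detection_delay_ms(poll_interval_ms=1000, response_time_ms=50))
```

With a push mechanism the interval term disappears, so only the transport and processing time remain in this stage.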

T3: when the DataMiner Redundancy Group is triggered

This could again be different depending on what the condition is for DataMiner to trigger the redundancy group. There are conditions on Alarm State and conditions on Parameter Value for example. If it is based on Alarm State, you have to make sure that you don’t have things like hysteresis that add to the timing.

NOTE: the redundancy group configuration also has options that could add to the overall timing when they are used (e.g. persistency, i.e. a specific condition has to be true for a user-defined time before it triggers).
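As an illustration of why persistency adds to the timing, here is a minimal sketch of a "condition must hold for N ms" filter. The class and its names are hypothetical and only model the idea; this is not how DataMiner implements it internally.

```python
from typing import Optional


class PersistencyFilter:
    """Report True only once a condition has been continuously true for hold_ms.

    Hypothetical model of a persistency option; names are illustrative.
    """

    def __init__(self, hold_ms: int) -> None:
        self.hold_ms = hold_ms
        self._true_since: Optional[int] = None  # timestamp when the condition became true

    def update(self, now_ms: int, condition: bool) -> bool:
        if not condition:
            # Any interruption resets the timer.
            self._true_since = None
            return False
        if self._true_since is None:
            self._true_since = now_ms
        return now_ms - self._true_since >= self.hold_ms


if __name__ == "__main__":
    f = PersistencyFilter(hold_ms=2000)
    for t, cond in [(0, True), (1000, True), (1500, False), (2000, True), (4000, True)]:
        print(t, cond, f.update(t, cond))
```

In this model, a persistency of 2000 ms means the trigger can never fire sooner than 2 seconds after T2, no matter how fast the data arrives.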

T4: when DataMiner has executed all actions to perform the fail-over on the third-party product

This can be an immediate single set of a control on a device, but it could also be done by an automation script (to facilitate more complex fail-over processes), which again impacts the overall timing. It can make a big difference whether the redundancy switch can be done with a single set on the product or involves a much more complex process where potentially several metrics need to be configured (and maybe some settings even need to be retrieved to be double-checked).

T5: when the third-party product has accepted and applied all the necessary actions instructed by DataMiner

Some products take some time to accept a new setting, and especially if you have a whole series of settings that need to be done, this can add up.

T6: when the third party product is active again with that new configuration

It is not uncommon at all that it still takes some time before a service becomes active, after all the settings have been done (e.g. encoders may take quite a bit of time before you have an active output after you have applied certain new settings).

And while I might have overlooked some other important details in the entire process, you can probably already see that timing is a very delicate thing. The first thing to do is always to establish which interval you are talking about (T0 to T6? T2 to T4?), because depending on that, you may not be in control of all the variables in the equation. There are a lot of things you can do to get a very fast response time, but you really have to engineer for that purpose and be conscious of all the cogwheels in the entire end-to-end process. If needed and done properly, though, it can be done very fast by DataMiner.
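To make the "which interval are we talking about" point concrete, here is a small sketch that adds up per-stage delays between any two of the checkpoints T0 to T6. The delay values are made-up placeholders, not measurements of any real system, and the helper name is an assumption.

```python
CHECKPOINTS = ["T0", "T1", "T2", "T3", "T4", "T5", "T6"]

# Placeholder delays in ms for each stage; purely illustrative numbers.
STAGE_DELAYS_MS = {
    ("T0", "T1"): 200,   # product detects and exposes the condition
    ("T1", "T2"): 500,   # DataMiner learns about it (poll or push)
    ("T2", "T3"): 50,    # redundancy group is triggered
    ("T3", "T4"): 100,   # fail-over actions are executed
    ("T4", "T5"): 300,   # product accepts and applies the settings
    ("T5", "T6"): 2000,  # service is active again
}


def elapsed_ms(start: str, end: str, delays: dict = STAGE_DELAYS_MS) -> int:
    """Sum the stage delays from checkpoint `start` up to checkpoint `end`."""
    i, j = CHECKPOINTS.index(start), CHECKPOINTS.index(end)
    if i > j:
        raise ValueError("start must not come after end")
    return sum(delays[(CHECKPOINTS[k], CHECKPOINTS[k + 1])] for k in range(i, j))


if __name__ == "__main__":
    print("T0 to T6:", elapsed_ms("T0", "T6"), "ms")
    print("T2 to T4:", elapsed_ms("T2", "T4"), "ms")
```

With these placeholder numbers, the part between T2 and T4 is only a small fraction of the end-to-end time, which is why agreeing on the interval first matters.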

PS: also be cautious about how this is validated. For example, after you perform a set, the reading on the screen in DataMiner typically only changes AFTER DataMiner has retrieved that reading again from the product (i.e. what you see on the screen is what DataMiner effectively retrieved from the target product). This is usually quite fast, because a set should automatically trigger a read of that same metric right afterwards. I mention this because I once had somebody claim that another software application was faster at performing settings than DataMiner, but that software simply showed the new setting on the screen as soon as you clicked the button to apply it (i.e. a copy of the new setting was instantly pushed into the reading field, even before it was sent out to the target product). Of course that is faster, and we could do the same in DataMiner, but it simply does not reflect reality.
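The difference between the two display strategies can be sketched as follows. Everything here is hypothetical (a fake device that silently rejects out-of-range values, and two helper functions); it only illustrates why a read-back shows reality while an optimistic copy may not.

```python
class FakeDevice:
    """Hypothetical device that silently ignores out-of-range gain values."""

    def __init__(self) -> None:
        self._gain = 0

    def set_gain(self, value: int) -> None:
        if 0 <= value <= 10:
            self._gain = value  # accepted
        # otherwise: rejected, the old value stays in place

    def get_gain(self) -> int:
        return self._gain


def optimistic_display(device: FakeDevice, value: int) -> int:
    """Show the requested value immediately, without checking the device."""
    device.set_gain(value)
    return value


def readback_display(device: FakeDevice, value: int) -> int:
    """Show what the device reports after the set (set followed by a read)."""
    device.set_gain(value)
    return device.get_gain()


if __name__ == "__main__":
    dev = FakeDevice()
    print("optimistic:", optimistic_display(dev, 99), "device has:", dev.get_gain())
    print("read-back: ", readback_display(dev, 99))
```

The optimistic variant always looks instantaneous, but it can show a value the device never applied.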

Pieter-Jan Vuylsteke [SLC] [DevOps Member] Selected answer as best 27th July 2020

score 4 · Answer 1 · 2020-07-27T13:56:34+00:00

The delay between the moment the issue occurs at the side of the equipment/data source, and the moment DataMiner receives and processes that data, will indeed determine how fast a software redundancy switch can be triggered.

While DataMiner is capable of quite fast polling (we have built systems that operate at polling intervals below 500 ms), it is not always best practice. As you indicate, the load on the DataMiner system and especially the load on the target equipment/data source are both valid concerns.
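A quick back-of-the-envelope sketch shows how the load grows with faster polling. It assumes one request per parameter per poll cycle (no bulk or grouped requests), which will not hold for every protocol or driver; the function name is illustrative.

```python
def poll_requests_per_second(parameters: int, poll_interval_ms: float, devices: int = 1) -> float:
    """Requests per second, assuming one request per parameter per poll cycle."""
    if poll_interval_ms <= 0:
        raise ValueError("poll interval must be > 0")
    return devices * parameters * 1000.0 / poll_interval_ms


if __name__ == "__main__":
    # Same 20 parameters, polled every 10 s versus every 500 ms.
    print(poll_requests_per_second(20, 10_000))              # per device, slow polling
    print(poll_requests_per_second(20, 500))                 # per device, fast polling
    print(poll_requests_per_second(20, 500, devices=100))    # total seen by the poller
```

Under this assumption, going from a 10 s to a 500 ms interval multiplies the request rate by 20, on the device and on the polling side alike.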

In case a switch needs to be executed with a very low reaction time (e.g. < 1 s), it is likely more advisable to have the data pushed to DataMiner rather than polling it. This could be done through an SNMP trap, by sending a message over a TCP socket to which DataMiner listens, or via any other supported push interface.

Note that there will also be a delay – usually rather small – between the moment the data is received, and the moment the switch is actually completed. This mostly depends on the load on the DataMiner system, as well as the speed at which the target equipment/data source can process and execute the switch request.

How quick can a software switch take place after something goes wrong?
