Hello,
Usually we have a lot of elements with a lot of parameters in their Alarm templates.
For various reasons, we have scenarios in which an alarm storm may occur. By "alarm storm" I mean a SERVER-SIDE event in which a large number of alarms are raised. Of course, in this situation DataMiner Cube should display the "Alarm storm" banner, but the DMA itself will still hold a lot of useless alarms for each parameter of each device.
My task is to reduce the number of alarms in this case. Assume that I can detect when the alarm storm starts and ends. For example, I can run an Automation script periodically to get the number of alarms; if it exceeds a defined threshold, we are entering an alarm storm. After this trigger has fired, I wait for a while, then turn the trigger off and leave the alarm storm state.
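The detection idea above can be sketched as a small count-based trigger. This is only an illustration in Python, not DataMiner Automation code: the class name, the threshold of 5000 alarms and the 300-second hold time are assumptions, and the alarm count is expected to come from whatever the periodic script retrieves. One further assumption is that the waiting period restarts each time the count exceeds the threshold again.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StormTrigger:
    """Count-based storm trigger with a hold time (all values are hypothetical)."""
    threshold: int = 5000         # alarm count above which a storm is assumed
    hold_seconds: float = 300.0   # how long to wait after the last breach
    in_storm: bool = False
    last_breach: Optional[float] = None

    def update(self, active_alarms: int, now: float) -> bool:
        """Feed one periodic sample; return True while in storm mode."""
        if active_alarms > self.threshold:
            # Threshold exceeded: enter (or stay in) storm mode, restart the wait.
            self.in_storm = True
            self.last_breach = now
        elif self.in_storm and now - self.last_breach >= self.hold_seconds:
            # Count stayed at or below the threshold for the whole hold time.
            self.in_storm = False
        return self.in_storm


trigger = StormTrigger(threshold=100, hold_seconds=60)
for timestamp, count in [(0, 50), (10, 150), (30, 50), (70, 50)]:
    print(timestamp, count, trigger.update(count, timestamp))
```

Each call to `update` corresponds to one run of the periodic script; the returned flag is what would drive the switch between normal and storm behaviour.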
For this situation, I suggest marking parameters as "important" or "not so important" in all alarm templates. Once I detect an alarm storm, I want to disable monitoring for the "not so important" parameters, and when the alarm storm is over, I want to enable them again.
To do this, I could have two alarm templates for each protocol, a "Default template" and a "Storm template", and switch between them for each element in an Automation script. But this operation takes a long time and is costly.
However, each alarm template has "Conditions" for each parameter. So, if it were possible to use some global variables in these conditions, that would help me solve this task.
Is this possible? What is your opinion about this feature? I think it would be useful for many customers.
Answering your questions:
"What does an alarm storm look like? How many alarms would be active in an alarm storm?" - about 10000 alarms simultaneously, some of them flapping during the event.
"Is there any correlation between the alarms triggering the alarm storm? (e.g. alarms from devices in the same location, same type of alarm like timeout or something, same parameter but on different rows/elements, ...)" - yes, physically the alarms depend on one another. If there is no incoming signal or the signal is broken, you get a number of alarms on other parameters.
"Does your alarm storm follow a typical pattern like for example a first period when a lot of alarms are coming up, but also clearing immediately and then coming up again (values are being measured around the alarm thresholds)." - yes, typically each alarm storm lasts about 1-2 minutes, but during that time there is a huge number of jumps between thresholds.
"What is your main goal with this alarm storm detection?" - to reduce the number of useless alarms stored in the DB (decreasing the load on the DMA itself) and to allow users to detect and handle these storms instead of a list of "waste" atomic alarms, which will appear in all searches and reports once they have been registered and stored. I know that DataMiner has RCA and alarm grouping by correlation, but they do not really solve my case.
Hi, Pieter.
1) Grouping and correlation work well only with a small number of alarms. With a huge number of alarms, they may impact the performance of the DMA. Also, I can’t create a grouping “by event”, that is, create my own alarm “Alarm storm detected” and group all “waste” alarms under this base alarm.
2) Yes, the first time we will store the alarms that caused the trigger, but we will prevent a huge number of alarms during the storm, when alarms appear and clear many times.
3) I can show you an alarm storm event in my real network. Once it occurs again, I will ask the TAM to get in touch with you.
ok, we’ll continue this topic offline together with the TAM
Hi Pieter,
I have about 200 remote sites with satellite receivers and some other devices at each of them. The receivers receive a TV signal from a satellite, convert it to IP and stream it locally. Each receiver has dozens of parameters to be monitored:
- Link margin
- C/N ratio
- Signal lock
- Modulation
- Frequency
- Symbol rate
- Current bitrate
- ...
When something is wrong with the satellite downstream (issues with the satellite transponder on the uplink side, solar interference, EMF noise, or anything else), I receive a huge number of alarms for each parameter of each device. Moreover, all other devices that should receive an IP stream from a receiver also generate alarms. So, I have 200 receivers (ten parameters per element) and ~500 other devices (15 parameters per element). In this case I receive 10*200+15*500=9500 alarms.
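As a quick check of that estimate, the worst case can be computed by summing devices times monitored parameters per device group. The function name is only illustrative; the figures are the ones given above.

```python
def worst_case_alarms(groups):
    """Sum devices * monitored parameters over (devices, params) pairs."""
    return sum(devices * params for devices, params in groups)


receivers = (200, 10)      # 200 receivers, ten parameters each
other_devices = (500, 15)  # ~500 other devices, 15 parameters each

total = worst_case_alarms([receivers, other_devices])
print(total)  # 9500
```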
For my users these alarms are useless, because they recognize this event as an "alarm storm" and never use these "atomic" alarms in their workflow. But when my network is not in "alarm storm" mode, every alarm should be taken into account.
With my approach, I can define the parameters "Signal lock" and "Link margin" as "important", and all others as "not so important".
Indeed, you don't need to monitor the other parameters if the signal is not locked, but you should monitor the link margin to "wait" while the network is "calming down".
So I have two events:
- Entering the alarm storm: similar alarms are active on more than 70% of the receivers.
- Leaving the alarm storm: "Signal lock" or "Link margin" alarms are active on less than 10% of the receivers.
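These two events amount to a two-threshold (hysteresis) rule. Below is a minimal sketch in Python, assuming the number of receivers currently in alarm is already known; the function name and its parameters are hypothetical and not part of any DataMiner API. Outside a storm the count would cover the similar alarms across receivers, and inside a storm it would cover only the "Signal lock" or "Link margin" alarms.

```python
def next_storm_state(in_storm: bool, receivers_in_alarm: int, total_receivers: int,
                     enter_percent: int = 70, leave_percent: int = 10) -> bool:
    """Return the new storm state for one evaluation cycle."""
    if total_receivers <= 0:
        raise ValueError("total_receivers must be positive")
    if not in_storm:
        # Enter only when strictly more than enter_percent of receivers are in alarm.
        return receivers_in_alarm * 100 > enter_percent * total_receivers
    # Stay in storm mode until strictly fewer than leave_percent are in alarm.
    return receivers_in_alarm * 100 >= leave_percent * total_receivers


state = False
for count in (30, 150, 60, 25, 15):
    state = next_storm_state(state, count, 200)
    print(count, state)
```

Because the entry and exit thresholds are far apart, a count that fluctuates around either one of them does not toggle the mode back and forth.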
I want to make a driver and create a special element that retrieves the active alarms (on a timer). Once it detects an alarm storm, it will generate its own alarm "Alarm storm detected in satellite downlink - Critical" and set the global variable, and this variable can then be used as a condition in each alarm template of each driver.
Interesting question.
My main concern about your suggestion is the fact that applying other alarm templates has an impact on the number of alarms and as such has an impact on the 'alarm storm' state itself, no matter how you define that 'alarm storm state' (I guess it will always be based on a number of alarms somehow). You can easily run into cycles of 'in' and 'out' alarm storm mode.
Still, the alarm storm case is an interesting one and looks a bit different in different situations.
May I ask, in your case:
- What does an alarm storm look like? How many alarms would be active in an alarm storm?
- Is there any correlation between the alarms triggering the alarm storm? (e.g. alarms from devices in the same location, same type of alarm like timeout or something, same parameter but on different rows/elements, ...)
- Does your alarm storm follow a typical pattern like for example a first period when a lot of alarms are coming up, but also clearing immediately and then coming up again (values are being measured around the alarm thresholds).
Then having a second period where there are a lot of active alarms, but the number of alarms coming up/being cleared is normal.
And then finally, when the alarm storm is over, a period where a lot of alarms are getting cleared, but coming up again shortly afterwards for a short time (again values around the alarm thresholds)?
- What is your main goal with this alarm storm detection? Do you believe DataMiner is suffering from the high number of alarms? Or is your main concern about the user experience, do you want to give the user a good view on the system and mask/consolidate everything related to the storm somehow?
ok great, let’s dive a bit deeper into the goal you want to reach.
We have some grouping & correlation features in place indeed. Do you believe these are sufficient to give a workable ‘realtime’ experience? What I mean is, do you believe these features allow the users to focus on the alarms that really matter and allow them to mask away the ‘waste’ as you call it?
Secondly, we indeed have the storage. The impact on the storage will depend on the number of alarms that you are able to ‘prevent’ by assigning those new templates. All alarms that caused your storm trigger will remain in the DB.
Last, but not least, the impact on searches, reports.
That’s an interesting one, I believe, because that might be something we can improve on. Feel free to further detail this case but also let me know if you want to contact me directly on that topic (in case this is more confidential for example).