Hello,
We typically have a lot of elements with many parameters in their alarm templates.
For various reasons we have a scenario in which an alarm storm may appear. By "alarm storm" I mean the server-side event where a very large number of alarms becomes active at once. Of course, in this situation DataMiner Cube should display the "Alarm storm" banner, but on the DMA itself we will still have a lot of useless alarms for every parameter of every device.
My task is to reduce the number of alarms in this case. Imagine that I can detect when the alarm storm starts and ends. For example, I can run an Automation script periodically and get the number of active alarms; if it exceeds a defined threshold, we are entering an "alarm storm" event. After this trigger has fired, I wait for a while, then turn the trigger off again and leave the "alarm storm" state.
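To make the idea concrete, here is a minimal sketch of the detection logic I have in mind. It is written in Python purely for illustration (a real DataMiner Automation script would be C#), and get_active_alarm_count(), enter_storm_mode() and leave_storm_mode() are hypothetical placeholders, not actual DataMiner calls:

```python
import time

# Hypothetical placeholders for illustration only; in a real setup these would
# be DataMiner Automation calls (querying the active alarm count, switching
# alarm templates, and so on).
def get_active_alarm_count() -> int:
    return 0  # placeholder

def enter_storm_mode() -> None:
    print("entering storm mode")  # placeholder

def leave_storm_mode() -> None:
    print("leaving storm mode")  # placeholder

STORM_THRESHOLD = 10_000   # active alarm count that marks the start of a storm
COOLDOWN_SECONDS = 300     # minimum time to stay in storm mode before leaving

def poll_once(in_storm: bool, storm_started_at: float) -> tuple[bool, float]:
    """One periodic check: enter storm mode when the active alarm count
    exceeds the threshold, leave it again once the cooldown has passed and
    the count has dropped back below the threshold."""
    count = get_active_alarm_count()
    now = time.time()
    if not in_storm and count > STORM_THRESHOLD:
        enter_storm_mode()
        return True, now
    if in_storm and (now - storm_started_at) > COOLDOWN_SECONDS and count <= STORM_THRESHOLD:
        leave_storm_mode()
        return False, 0.0
    return in_storm, storm_started_at
```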
For this situation, I suggest defining "important" and "not so important" parameters in all alarm templates. Once I detect an alarm storm event, I want to disable monitoring for the "not so important" parameters, and when the alarm storm is over, I re-enable them.
To do this, I could have two alarm templates for each protocol, a "Default template" and a "Storm template", and switch between them in an Automation script for every element. But this operation is very slow and costly.
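Just to illustrate why it is costly: the script has to reassign the template element by element, roughly like this (again a Python sketch for illustration; get_elements_for_protocol() and assign_alarm_template() are hypothetical stand-ins for the real Automation calls):

```python
# Hypothetical stand-ins for the real Automation calls, shown only to
# illustrate that the template switch is one operation per element.
def get_elements_for_protocol(protocol_name: str) -> list[str]:
    return []  # placeholder

def assign_alarm_template(element_name: str, template_name: str) -> None:
    pass  # placeholder

def switch_templates(protocol_name: str, storm_active: bool) -> None:
    """Assign the 'Storm template' while a storm is active,
    the 'Default template' otherwise."""
    template = "Storm template" if storm_active else "Default template"
    for element_name in get_elements_for_protocol(protocol_name):
        # One assignment per element: with thousands of elements
        # this loop is exactly the slow and costly part.
        assign_alarm_template(element_name, template)
```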
However, each alarm template supports "Conditions" per parameter. So if it were possible to use some kind of global variable in these conditions, that could help me solve this task.
Is this possible today? What is your opinion about such a feature? I think it would be useful for many customers.
Answering your questions:
"How does an alarm storm looks like? How many alarms would be active in an alarm storm?" - about 10000 alarms simultaneously, some of them being flapped during to event.
"Is there any correlation between the alarms triggering the alarm storm? (e.g. alarms from devices in the same location, same type of alarm like timeout or something, same parameter but on different rows/elements, ...)" - yes, physically alarms depends one-by-one. If you haven't incoming signals or signal is broken - you should get a number of alarms for another parameters.
"Does your alarm storm follow a typical pattern like for example a first period when a lot of alarms are coming up, but also clearing immediately and then coming up again (values are being measured around the alarm thresholds)." - yes, typically each alarm storm continues about 1-2 minutes, but at this time there are huge number of jumps between thresholds.
"What is your main goal with this alarm storm detection?" - reduce number of useless alarms to be stored in DB (decrease a load of DMA itself) and allow users to detect and handle this storms instead of a list of "waste" atomic alarms, which will be appeared in all searches, reports, once they has been registered and stored. I know that DM has a RCA and Alarm grouping by correlation features, but they are not solve my case indeed.
Hi, Pieter.
1) Grouping and correlation work OK only with a small number of alarms. With a huge number of alarms they may impact the performance of the DMA. Also, I can’t create a grouping “by event”, i.e. create my own alarm “Alarm storm detected” and group all the “waste” alarms under this base alarm.
2) Yes, initially we will still store the alarms that caused the trigger, but we will prevent the huge number of alarms generated during the storm itself, when alarms appear and clear many times.
3) I can show you an alarm storm event in my real network. Once it appears again, I will ask the TAM to get in touch with you.
ok, we’ll continue this topic offline together with the TAM
ok great, let’s dive a bit deeper into the goal you want to reach.
We have some grouping & correlation features in place indeed. Do you believe these are sufficient to give a workable ‘realtime’ experience? What I mean is, do you believe these features allow the users to focus on the alarms that really matter and allow them to mask away the ‘waste’ as you call it?
Secondly, there is indeed the storage aspect. The impact on storage will depend on the number of alarms that you are able to ‘prevent’ by assigning those new templates. All alarms that caused your storm trigger will remain in the DB.
Last but not least, the impact on searches and reports.
That’s an interesting one, I believe, because that might be something we can improve on. Feel free to further detail this case, but also let me know if you want to contact me directly on that topic (in case this is more confidential, for example).