With SRM, we get a lot of alarm updates because the view/service impact changes at the start and end of bookings. As a result, we get many notices that the alarm history exceeds 100 alarms. We have a Cassandra and Elastic cluster DB, so the alarms are now stored in Elastic, and we use the default threshold for alarms per parameter. With Elastic, is it now safe to set the 'recurring' attribute to false to suppress those notices, or is there still a problem with having such large alarm trees?
Hi Alberto,
Information events and alarms are very similar; the main difference is that information events mostly don’t have a tree and are just single events. There is a lot of value in having the updates in the alarm tree, because by retrieving the history of the alarm we can trace back which views/services were impacted at what time. The problem with large alarm trees goes back to when alarms were still stored in Cassandra, where large alarm trees had a negative impact on system performance. Now that alarms are stored in Elastic, I’m wondering whether large alarm trees are still a problem. If they are no longer a problem, we could disable the notifications that warn us about this.
Michiel,
I don't think it's safe to disable it, because it can still be a performance hit. As far as I know, the active alarms for the entire tree are returned at startup, and then only the last x alarms (100 by default) are sent into the system. The back end still processes the entire tree.
Basically, the notice is there to make you ask yourself: if you have an alarm tree that never closes, does it need to be an alarm? If you use the alarm for purposes other than an alarm, then it shouldn't be an alarm.
Hi Davy,
Thanks for answering this question for me.
In addition to your remarks, I also want to let you know that avoiding large alarm trees is not straightforward. Alarm templates are under the control of the user and can be changed from time to time. Minor alarms or warnings don’t always get high priority and might stay in the system for months. When using automation (e.g. SRM), you can have many alarm updates in a short period of time.
Would it not make more sense to dynamically clean up the oldest entries in the alarm tree (not the root alarm), e.g. remove the 100 oldest entries once we reach 200 entries?
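To make the suggested clean-up rule concrete, here is a rough sketch in Python. It is only an illustration of the idea, not DataMiner code: the `AlarmTree` class, the `MAX_ENTRIES` and `PRUNE_COUNT` constants, and the choice to count the root alarm as one of the entries are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds taken from the example in the post above.
MAX_ENTRIES = 200   # prune once the tree (root + updates) reaches this size
PRUNE_COUNT = 100   # number of oldest updates to drop when pruning


@dataclass
class AlarmTree:
    """Illustrative stand-in for an alarm tree: one root plus ordered updates."""
    root: str
    updates: List[str] = field(default_factory=list)  # oldest first

    def entry_count(self) -> int:
        # The root alarm is counted as one entry (an assumption of this sketch).
        return 1 + len(self.updates)

    def add_update(self, update: str) -> None:
        self.updates.append(update)
        if self.entry_count() >= MAX_ENTRIES:
            # Drop the oldest updates; the root alarm is never touched.
            del self.updates[:PRUNE_COUNT]


tree = AlarmTree(root="root alarm")
for i in range(1, 200):
    tree.add_update(f"update {i}")

print(tree.entry_count())   # size of the tree after the last update
print(tree.updates[0])      # oldest update that survived pruning
```

One trade-off of such a rule is that the pruned updates would no longer be available when tracing back which views/services were impacted at a given time.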
If this is normal within the standard cycle of the booking, could it work to log this as an information event rather than an alarm?