I observed some strange behavior on several elements of one protocol (Evertz 570J2K-HW-X19). The general parameter [Element alarm state] (ID 65008) was toggling between the Normal and Major states several hundred times per second. Because this parameter was monitored, this caused an alarm storm and, unfortunately, an unresponsive DMS.
As far as I could see, there were no alarms other than those related to the [Element alarm state]. What I would like to understand is what could cause such behavior.
The protocol concerned is an SNMP protocol with some fairly large tables, but its fastest timer is 1 minute, so it's strange to see the [Element alarm state] toggling so fast. Any clues on where to look further? Are there any detailed logs I should look into?
Hi Koen, it's strange that there were no other alarms besides the element state itself. As a first step, you could check in the alarm template which other parameters have a threshold defined for the "Major" severity. Hopefully that already points you in the right direction.
Thanks Tom. After further checking, the alarm storm seems to have started with an element timeout. In the alarm template, the [Element alarm state] parameter had “Timeout” configured as “Critical”, while “Critical” was defined as “Major”. I believe this strange configuration of the [Element alarm state] parameter in the alarm template somehow caused a loop.
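For anyone who runs into a similar storm: one way to confirm the toggling rate from exported alarm events is to count severity changes per second. The snippet below is only a minimal sketch. It assumes the events are already available as time-sorted (timestamp, severity) pairs; the input format, function names, and threshold are my own assumptions, not a DataMiner API.

```python
from collections import Counter
from datetime import datetime, timedelta


def transitions_per_second(events):
    """Count severity changes per wall-clock second.

    events: iterable of (datetime, severity) tuples, sorted by time
    (hypothetical format, e.g. parsed from an alarm export).
    """
    counts = Counter()
    previous = None
    for timestamp, severity in events:
        # Only count an event when the severity differs from the previous one.
        if previous is not None and severity != previous:
            counts[timestamp.replace(microsecond=0)] += 1
        previous = severity
    return counts


def flapping_seconds(events, threshold=100):
    """Return the seconds in which at least `threshold` changes occurred."""
    per_second = transitions_per_second(events)
    return {second: n for second, n in per_second.items() if n >= threshold}


if __name__ == "__main__":
    # Synthetic data: 300 alternating events spaced 3 ms apart.
    start = datetime(2024, 1, 1, 12, 0, 0)
    sample = [
        (start + timedelta(milliseconds=3 * i), "Major" if i % 2 else "Normal")
        for i in range(300)
    ]
    for second, n in flapping_seconds(sample).items():
        print(second, n)
```

Any second that shows up in the result is one where the parameter was flapping, which makes it easier to line the start of the storm up with other events such as the element timeout.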