Question

Solved · 1.82K views · 19th December 2020 · Alarm Console notice

4

Aston Galvin [DevOps Advocate] 220 · 18th December 2020 · 0 Comments

Hi All,

I have a two-part question, both parts related to the "Alarm history exceeded 100 alarms" notice. The first part concerns the notice itself. From my understanding, the notice is raised against a single parameter. What I can’t tell is over what period the alarm is evaluated.

The second part is an issue we encountered the other day. An "Alarm history exceeded 100 alarms" notice was raised for a parameter, and by all accounts it seems unwarranted: the notice was raised, yet no alarms have been recorded in the system for that parameter for at least 2 weeks. I have attached an image to show our findings. I spent a fair bit of time on this with our DataMiner Account Manager, and together we couldn’t find the source of the issue. This issue prompted the first question: without knowing how the notice is evaluated, we are unable to determine whether it is legitimate or not.

Ben Vandenberghe [SLC] [DevOps Enabler] Answered question 18th December 2020

1 Answer

score 9 · Answer 1 · 2020-12-18T20:26:31+00:00


Ben Vandenberghe [SLC] [DevOps Enabler] 9.24K · Posted 18th December 2020 · 5 Comments

As to the first part of your question: the notice is not related to a specific time span. It does indeed apply to a specific parameter, and more specifically it is related to the life cycle of an alarm on that parameter.

The life cycle of an alarm by definition starts when the alarm is triggered by DataMiner (it comes into existence), and it ends when the alarm is cleared again by DataMiner. Between those two events, the alarm is updated by other types of events (e.g. it can change severity (minor > major > critical > major), somebody can add a comment to the alarm, it can be masked/unmasked, other properties can change, etc.).

The specific notice you have refers to the fact that there are alarms that have over 100 events within a single life cycle (e.g. that would happen if you have an alarm constantly toggling between two severities without actually clearing, hence the recommendation to check the alarm threshold definitions).

Note that this means you can have an alarm that is active at this point in time, with a life cycle that contains a lot of events, and still find nothing about it when you query the history for the last 2 weeks. A history query returns only what happened in the requested time frame, so an empty result simply means that nothing happened with that alarm in the last 2 weeks.
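To make the difference between "events in a life cycle" and "events in a time window" concrete, here is a minimal Python sketch. It is only an illustrative model, not DataMiner code or a DataMiner API: the class names, the event kinds and the dates are all made up for the example, and the limit of 100 is taken from the notice text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class AlarmEvent:
    timestamp: datetime
    kind: str  # hypothetical kinds: "new", "severity_change", "cleared"


@dataclass
class AlarmLifeCycle:
    """Hypothetical model: all events that share one root alarm."""
    root_alarm_id: int
    events: list = field(default_factory=list)

    def add(self, timestamp: datetime, kind: str) -> None:
        self.events.append(AlarmEvent(timestamp, kind))

    @property
    def is_open(self) -> bool:
        # The life cycle stays open until a "cleared" event closes it.
        return not self.events or self.events[-1].kind != "cleared"

    def exceeds(self, limit: int = 100) -> bool:
        # Counted over the whole life cycle, not over a time window.
        return len(self.events) > limit

    def events_between(self, start: datetime, end: datetime) -> list:
        # What a time-bounded history query would return in this model.
        return [e for e in self.events if start <= e.timestamp <= end]


# An alarm that toggled severity 120 times in early June and never cleared.
first_seen = datetime(2020, 6, 1)
cycle = AlarmLifeCycle(root_alarm_id=1)
cycle.add(first_seen, "new")
for hour in range(1, 121):
    cycle.add(first_seen + timedelta(hours=hour), "severity_change")

# Query "the last 2 weeks" in December.
now = datetime(2020, 12, 18)
recent = cycle.events_between(now - timedelta(days=14), now)

print(len(cycle.events), cycle.is_open, cycle.exceeds(), len(recent))
# prints: 121 True True 0
```

In this model the life cycle is still open and holds more than 100 events, while the two-week query returns nothing, which is the situation described above.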

Hope this information already helps a bit.

Jeff Douglass Posted new comment 23rd December 2020

Aston Galvin commented 18th December 2020

Hi Ben,

Thank you very much for your prompt reply.
You certainly cleared up the logic behind the alarm.

In regards to the second issue, I’m still not sure what caused it to occur.
As you mentioned, once the alarm has cleared/returned to normal, the life cycle ends. The parameter in question was not in alarm when the notice was raised, and as the image shows, the last time an alarm was raised was 2 weeks prior. We looked for alarms that might still have been in their life cycle but found nothing in the past 6 months. I would imagine that we should expect to see 100+ alarms with the same root alarm ID in the alarm history?

Ben Vandenberghe [SLC] [DevOps Enabler] commented 18th December 2020

Indeed, there is still something off here. It’s a little difficult to judge from the screen cap, but it appears that the last event in your active alarm console is a clear event. That should conclude the life cycle and remove the alarm from the Active Alarm list. Long shot maybe, because this is a rather less popular feature, but was the AUTOCLEAR option by any chance activated? This can be done for the entire system or for one specific parameter via the alarm template. With that option, an alarm does not go to cleared when the condition returns to normal, but to clearable instead. Clearable is an intermediate state: an operator has to manually clear the alarm to close the life cycle (i.e. this is used for cases where one requires that an operator manually clears the alarm from the Alarm Console, to ensure that they cannot claim that they did not see it). Here you can see where this clearable option is configured in the alarm template: https://help.dataminer.services/dataminer/#t=DataMinerUserGuidepart_2protocolsConfiguring_alarm_templates.htmXREF_79465_Setting_the&rhsearch=clearable&rhsyns=.
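As a rough illustration of the clearable state described here, the sketch below models the two paths as a tiny state machine in Python. This is an assumption-laden model, not DataMiner code: the `State` enum, the function names and the `auto_clear` flag are hypothetical, with `auto_clear=False` standing for the behaviour where an operator has to clear the alarm manually.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    CLEARABLE = "clearable"  # condition is back to normal, waiting for an operator
    CLEARED = "cleared"      # life cycle is closed


def on_condition_normal(state: State, auto_clear: bool) -> State:
    """The monitored condition returns to normal."""
    if state is not State.ACTIVE:
        return state
    # Hypothetical flag: True clears automatically, False goes to clearable.
    return State.CLEARED if auto_clear else State.CLEARABLE


def on_operator_clear(state: State) -> State:
    """An operator manually clears the alarm in the console."""
    return State.CLEARED if state is State.CLEARABLE else state


def life_cycle_closed(state: State) -> bool:
    return state is State.CLEARED


# Automatic clearing: the life cycle closes as soon as the condition is normal.
print(on_condition_normal(State.ACTIVE, auto_clear=True))

# Manual clearing: the alarm stays in its life cycle until an operator acts.
pending = on_condition_normal(State.ACTIVE, auto_clear=False)
print(pending, life_cycle_closed(pending))
print(life_cycle_closed(on_operator_clear(pending)))
```

In this model, an alarm left in the clearable state keeps its life cycle open, so further events would still count towards the same life cycle until somebody clears it.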

Ben Vandenberghe [SLC] [DevOps Enabler] commented 22nd December 2020

Hi Aston,

Just wanted to check in to see if you have been able to sort this out.

Ben

Jeff Douglass commented 23rd December 2020

We have been trying to understand these alarms/this feature for a while now and still do not have a solid grasp on it. We have multiple systems where we have seen hundreds of thousands of these alarms reported in a 24-hour period, so we have now disabled this feature in the MaintenanceSettings.xml file on all our systems.

Aston Galvin commented 4th January 2021

Hi Ben,
I have had a quick look at our system, and we haven’t defined AutoClear in MaintenanceSettings.xml, so based on the documentation this would default to “True”. Looking at the alarm template for the protocol, AutoClear is set to System Default for all parameters. I think I am going to treat this notice as a bit of an anomaly, as our Account Manager and I were unable to find a cause for it and it hasn’t occurred since.
Thanks for all your help Ben, much appreciated.

Alarm history exceeded 100 alarms. It is advised to revise your alarm threshold definitions.
