Hi All,
We use the Alarm Report dashboard from the Catalogue (https://catalog.dataminer.services/details/8d26cb0b-ba75-4bfc-8f52-fc721fac4b1d).
All components in that dashboard successfully fetched and displayed the data with our previous DataMiner version, 10.5 CU7. After the upgrade to 10.6 CU1, the "Top 5 most alarm events" component times out after 5 minutes when a larger number of alarms is being fetched (e.g. for the week timespan).
This component shows the result of the "Alarm events" query. Adding logging to the script code shows that the time-consuming operation is _dms.SendMessages(GetReportAlarmCountDataMessage).
I have the dashboard at its latest version.
The interesting thing here is that the timeouts only started after the DataMiner upgrade.
Any idea what to tweak to optimize that?
Milos
Hi Alberto, thanks for replying!
We have a three-node Cassandra cluster and a three-node OpenSearch cluster. I believe the OpenSearch cluster, where the alarms are stored, performs well. Each OpenSearch node has 128 GB of RAM and 40 cores and is not hitting any hardware limits. The JVM heap for OpenSearch is set to 32 GB. When reviewing a large volume of historical alarms (e.g., using the week timespan in the Alarm Report dashboard), we observe no database stress: no thread pool queues, no CPU spikes, and no JVM heap exhaustion. Moreover, fetching 10k alarms using a REST client (Elasticvue) takes half a second.
Analyzing the network traffic during the heavy read load suggests that DataMiner in our setup can process up to about 50 Mbps of incoming data (i.e., the historical alarms as JSON text coming from OpenSearch). The network itself is faster (physically 1 Gbps). Knowing that 10k alarms is about 40 MB (which we measured by downloading them to a file), this gives a download time of roughly 6 seconds per 10k alarms, so about 100k alarms per minute. This is similar to what we also experience in Cube when listing historical alarms.
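For anyone who wants to check the arithmetic, here is a small sketch of the estimate. It only uses the figures from this post (50 Mbps, 40 MB per 10k alarms, the 5-minute timeout). It assumes decimal units with 8 bits per byte and a constant transfer rate. The variable names are just illustrative, and the last value is an extrapolation, not something we measured.

```python
# Back-of-the-envelope estimate based on the figures quoted in the post.
# Assumptions: decimal units, 8 bits per byte, constant transfer rate.
LINK_RATE_MBPS = 50       # observed incoming rate on the DataMiner side (megabits/s)
BATCH_ALARMS = 10_000     # number of alarms in the measured sample
BATCH_SIZE_MB = 40        # size of that sample when saved to a file (megabytes)
TIMEOUT_S = 5 * 60        # timeout seen on the dashboard component (seconds)

rate_mb_per_s = LINK_RATE_MBPS / 8                   # megabytes per second
seconds_per_batch = BATCH_SIZE_MB / rate_mb_per_s    # time to transfer 10k alarms
alarms_per_minute = BATCH_ALARMS * 60 / seconds_per_batch

# Extrapolation (not measured): alarms transferable before the timeout hits.
alarms_before_timeout = BATCH_ALARMS * TIMEOUT_S / seconds_per_batch

print(f"{rate_mb_per_s:.2f} MB/s")
print(f"{seconds_per_batch:.1f} s per {BATCH_ALARMS} alarms")
print(f"{alarms_per_minute:.0f} alarms per minute")
print(f"{alarms_before_timeout:.0f} alarms within {TIMEOUT_S} s")
```

If the report query moves data at the same rate, a week containing more than roughly 470k alarm events would not fit into the 5-minute window, which would match the timeouts we see.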
DataMiner is not overloaded during the massive read either: CPU and RAM usage stay low. Only the network traffic increases.
It seems the bottleneck isn't OpenSearch or the network, but rather DataMiner's processing of the alarm data in JSON format. If that is the case, can this be improved? Does DataMiner process the data from OpenSearch using a single thread?
Perhaps 100k alarms fetched in 1 minute is the expected performance, which would be understandable too.
Hi Milos,
Your logging seems to have already identified _dms.SendMessages(GetReportAlarmCountDataMessage) as the bottleneck. Are you using Cassandra & OpenSearch as the storage backend for alarm history?
I would check these to see if there are warnings, slow queries, or resource bottlenecks on the alarm history storage layer during execution.
Also, can you execute the query outside of the dashboard?
How long does it take?
It might help to compare the number of alarm events returned for a week timespan before and after the upgrade.