Hi all,

My DMS is running slowly, and I've noticed that information events are delayed by more than 2 hours behind real time. Rather than just restarting (the server was last restarted 6 weeks ago), I'd like to know what's causing the issue. I've noted the following from the Agents/BPA messages (Cassandra DB Size):

'Large partitions detected for table elementdata Partitions: [partitionkey: 31148:26102 - size: 131.817MiB] Large partitions detected for table infotrace Partitions: [partitionkey: Unknown (reported by Nodetool) - size: 765 MiB]'

This makes sense to me, as the Element does have very large tables and its data is changing constantly. I've since deleted the Element and re-run the Cassandra BPA tool in System Center, but the message is still there.
Is there anything I should be checking apart from this? Are there any tools in DM to clean or refresh old files to keep the agent in better condition? Are there any files I can manually delete to help? Where can I check how much alarm, trend and parameter data is stored? Maybe this could be reduced to help? Any other best practices?
Thanks!
Ross
Hello Ross,
It sounds like your system generates too many information events. I would try to identify where the flood of information events comes from by inspecting the information events tab of the Alarm Console in DataMiner Cube. That way you can potentially isolate the frequent culprits, such as elements, correlation rules, or actions, and then reduce how often they generate these events.
From the BPA you can see that infotrace is big. This table is basically the history of the information events.
If your alarm console is lagging behind it means the different processes handling them can't keep up with the flood of information events.
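If you want to keep track of which tables and partitions the BPA flags between runs, the message text can be parsed into a small sorted list. This is only a rough Python sketch: the message format is assumed from the single example quoted in the question, and `parse_large_partitions` is a hypothetical helper name, not something that ships with DataMiner or Cassandra.

```python
import re

# Assumed format, taken from the one BPA message quoted in this thread:
# "Large partitions detected for table <name> Partitions: [partitionkey: <key> - size: <n><unit>] ..."
TABLE_RE = re.compile(
    r"Large partitions detected for table (\w+)\s+Partitions:((?:\s*\[[^\]]*\])+)"
)
PART_RE = re.compile(
    r"\[partitionkey:\s*(.*?)\s*-\s*size:\s*([\d.]+)\s*(KiB|MiB|GiB)\]"
)
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_large_partitions(message):
    """Return (table, partition_key, size_in_MiB) tuples, largest first."""
    rows = []
    for table, blocks in TABLE_RE.findall(message):
        for key, size, unit in PART_RE.findall(blocks):
            rows.append((table, key, float(size) * UNIT_TO_MIB[unit]))
    return sorted(rows, key=lambda row: row[2], reverse=True)


message = (
    "Large partitions detected for table elementdata Partitions: "
    "[partitionkey: 31148:26102 - size: 131.817MiB] "
    "Large partitions detected for table infotrace Partitions: "
    "[partitionkey: Unknown (reported by Nodetool) - size: 765 MiB]"
)

for table, key, size_mib in parse_large_partitions(message):
    print(f"{table:12s} {size_mib:9.3f} MiB  key={key}")
```

Running this against the BPA output after each cleanup attempt makes it easier to tell whether the infotrace partition is actually shrinking or still growing.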
Hi, thanks for the response, it’s very much appreciated. I’m still getting the infotrace alert in the BPA, but my system isn’t generating many alarms or information events at all, usually fewer than 10 per hour. Is there any way I can manually delete these large files without affecting operation? Also, it would be good if I could see what is causing it, but there’s no link to a process, Element ID, etc., so I’m not sure what the issue is. I had to restart the agent for another reason last week, but it’s already running 15 minutes behind for information events.