Does anyone have good steps for diagnosing SLDataGateway.exe consuming all virtual memory and crashing the DMA?
Here are some relevant log entries from just before the DMA crashed:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: SLDataGateway.exe (5944) consumed 46393978880 bytes, prunsrv.exe (1296) consumed 7146168320 bytes, and SLElement.exe (6652) consumed 1616957440 bytes.
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: SLDataGateway.exe (5792) consumed 40352841728 bytes, prunsrv.exe (1224) consumed 7361466368 bytes, and SLNet.exe (2652) consumed 2814517248 bytes.
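For readability, the byte counts in these messages can be converted to GiB with a few lines of Python. This is only a minimal sketch: the function name `parse_consumers` and the regular expression are my own, written against the wording of the two entries above, and may need adjusting if your messages are phrased differently.

```python
import re

# Matches "<name>.exe (<pid>) consumed <n> bytes" as worded in the entries above.
ENTRY = re.compile(r"(\S+\.exe) \((\d+)\) consumed (\d+) bytes")

def parse_consumers(message):
    """Return a list of (process, pid, size_in_GiB) tuples from one message."""
    return [
        (name, int(pid), int(size) / 2**30)
        for name, pid, size in ENTRY.findall(message)
    ]

# First entry quoted above, copied verbatim.
message = (
    "Windows successfully diagnosed a low virtual memory condition. "
    "The following programs consumed the most virtual memory: "
    "SLDataGateway.exe (5944) consumed 46393978880 bytes, "
    "prunsrv.exe (1296) consumed 7146168320 bytes, "
    "and SLElement.exe (6652) consumed 1616957440 bytes."
)

for name, pid, gib in parse_consumers(message):
    print(f"{name} (PID {pid}): {gib:.1f} GiB")
```

For the first entry this puts SLDataGateway.exe at roughly 43 GiB, far ahead of the other two processes listed.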
Thank you for looking.
I stumbled upon this question a year later and would like to update it with our later findings, which may still be relevant to some users.
Two causes of high memory usage were found:
- Setting large volumes of trending data via History Sets (iDirect Platform in History Polling mode).
History sets and related calculations are more resource intensive than regular trending. That causes build-up of queues in SLDataGateway.
This has been resolved by the feature NewAverageTrending, available in DataMiner 10.2.0 and newer.
- Replaying large volumes of data from the Offload file.
When a database is unavailable for some time, new data is written to a temporary file in C:\Skyline DataMiner\Offload\. When the connection to the database is restored, the data is pushed from the offload file to the database (a so-called replay). This operation can be quite resource intensive and can cause queues to build up in SLDataGateway.
The best practice would be to avoid database outages and monitor the health of the database, e.g., with the protocol Apache Cassandra Cluster Monitor.
Hi Alasdair, here are some checks you can perform to narrow down an SLDataGateway memory issue:
- Check VM Size trending of SLDataGateway in a Microsoft Platform element that monitors your agent. Does SLDataGateway memory usage change over time? Are there sudden spikes or leak patterns? Do they recur over a month-to-date timespan?
- Compare SLDataGateway VM Size trending with Commit Charge Total or Free Virtual Memory trending on the same agent. There may be other processes consuming the memory, and SLDataGateway may not be the culprit.
- Note the starting points of leaks. Check the SLWatchdog2.txt log around the start time of the issue. Pay attention to lines like "Process * stopped" or "Not signaled 1 (since *)".
- If possible, check if any configuration changes were made in DataMiner around the time the issue occurred for the first time.
- If available, check the SLDataGateway.txt and SLDBConnection.txt logs around the time the issue started. Pay attention to lines like "Queue for * exceeded *000 items".
These checks may help you collect some initial facts required for the investigation.
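If the logs are large, the patterns mentioned in the checks above can be searched for with a short script. The following is a rough sketch only: the helper names (`to_regex`, `find_matches`, `scan_file`) are hypothetical, the file encoding is a guess, and the `*` in each pattern is treated as "any text".

```python
import re
from pathlib import Path

# Wildcard patterns quoted in the checks above; "*" stands for any text.
PATTERNS = [
    "Process * stopped",
    "Not signaled 1 (since *)",
    "Queue for * exceeded *000 items",
]

def to_regex(pattern):
    """Turn a '*' wildcard pattern into a compiled regex, escaping everything else."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))

REGEXES = [to_regex(p) for p in PATTERNS]

def find_matches(lines):
    """Yield (line_number, line) for every line that contains one of the patterns."""
    for number, line in enumerate(lines, start=1):
        if any(rx.search(line) for rx in REGEXES):
            yield number, line.rstrip("\n")

def scan_file(path):
    """Scan one log file. The encoding is an assumption, so bad bytes are replaced."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return list(find_matches(handle))
```

You would call `scan_file` with the full path of SLWatchdog2.txt, SLDataGateway.txt or SLDBConnection.txt on your agent and then compare the timestamps of the hits with the start of the memory increase in trending.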
I would check the SLDBConnection log file to get an idea of what is happening in the SLDataGateway process around that time.
Also note that you can raise or lower the log level in System Center - Logging - DataMiner if you need more or less detail in the logs.