Question

Solved · 3.52K views · 11th July 2023 · adl2099 · DataMiner, memoryleak, SLDataGateway

5

Alberto De Luca [DevOps Enabler]5.68K 25th November 2022 1 Comment

Hi Dojo – question for all the SL-processes experts – 50 points for a good answer (;

We spotted a memory leak on SLDataGateway in 10.2.4.0-11608: the memory consumption can run up to 20GB (sometimes even beyond).

Our squad has already confirmed the issue is addressed in 10.2.10 – plans for the upgrade are already shared – but regardless of the fix, I’m curious about how this started – what could be the root cause of this behaviour?

It started on a few servers on the same date – could there be any possible relation with NIC teaming settings on the servers (e.g. increased MTU size)? Any relation to be checked with SLElement (or SLProtocol)?

If not, what else could we check to mitigate the impact until the upgrade is performed?

The behaviour was first observed last month – the process was using less than 1GB before:

Thanks

Marieke Goethals [SLC] [DevOps Catalyst] Selected answer as best 11th July 2023

Alberto De Luca [DevOps Enabler] commented 28th November 2022

Not sure if this still applies – but any difference in the process between 9.0 & 10 might help:

https://community.dataminer.services/troubleshooting-sldatagateway/?hilite=cpu+average+load

2 Answers

1

Simon Declerck [SLC] [DevOps Advocate]1.13K Posted 30th November 2022 2 Comments

Hi Alberto,

Between the 10.2.4.0 and the 10.2.10.0 feature release there have been quite a few improvements to memory handling in SLDataGateway.
A few of those resolved situations in which memory would leak; they all stemmed from race conditions that would be nearly impossible to mitigate.
This could also be a build-up of memory on the queues. That is where the MTU size on the NIC teaming could have an effect, if it’s throttling access to the DB, but that seems like a long shot. Was the NIC teaming changed around the time that this started occurring?
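To illustrate the queue build-up idea only: the sketch below is a toy model, not SLDataGateway's actual queueing. The function name, rates and item size are all hypothetical. It shows that as soon as the DB drains items more slowly than they are produced, the backlog (and the memory holding it) grows steadily instead of levelling off.

```python
def backlog_over_time(produce_per_s, drain_per_s, seconds, item_bytes):
    """Return the queue backlog in bytes after each one-second tick.

    Toy model with hypothetical numbers; not SLDataGateway internals.
    """
    backlog_items = 0
    history = []
    for _ in range(seconds):
        backlog_items += produce_per_s
        # The consumer (the DB writer) can never drain more than is queued.
        backlog_items -= min(backlog_items, drain_per_s)
        history.append(backlog_items * item_bytes)
    return history


# DB keeps up: the backlog stays at zero.
healthy = backlog_over_time(produce_per_s=1000, drain_per_s=1200, seconds=60, item_bytes=512)
# DB access throttled: the backlog grows by 200 items every second.
throttled = backlog_over_time(produce_per_s=1000, drain_per_s=800, seconds=60, item_bytes=512)

print(healthy[-1], throttled[-1])
```

In this model a leak and a throttled queue look the same from the outside (memory going up), which is why the timing of any NIC or DB change matters.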

I would recommend checking the information events from around the time this issue started occurring, as those may indicate a change that could be the root cause (new version? protocol upgrade? more elements? …).

Marieke Goethals [SLC] [DevOps Catalyst] Selected answer as best 11th July 2023

Alberto De Luca [DevOps Enabler] commented 1st December 2022

Thanks for your feedback, Simon

Indeed we’ve already checked the info events at OS layer – the same operations were performed in different environments – but only one has shown this effect.

I was hoping that a memory dump could help in better analysing the cause within the application itself, as this seems to manifest only where the element load is significant. However, if the excessive memory consumption has been addressed in multiple parts of SLDataGateway, it would be an excessive effort to debug now. We will upgrade and observe the effects in 10.2.10.

NIC-wise, an increase in MTU size (jumbo frames) was identified as a prerequisite to process real-time data with specific protocols, but this isn’t something configured by default – normally a DMA operates with the standard Windows settings (standard MTU size), so I’m keen to find out more about the additional checks (if any) when switching from 1500 to 9000 bytes at NIC level.

Actually, this might be a separate specific question – I’ll raise it now
and will report back here on SLDataGateway after the upgrade.
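One generic check after moving from 1500 to 9000 bytes is whether a full-size frame actually passes end to end without fragmentation. The sketch below is not a DataMiner-specific procedure. It computes the largest ICMP payload that fits a given MTU (MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header) and builds the matching Windows `ping` command with the Don't Fragment flag set. The host name is a placeholder, and the helper names are made up for this example.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def windows_ping_command(host, mtu, count=4):
    """Build a Windows ping command: -f sets Don't Fragment, -l sets payload size."""
    return ["ping", "-f", "-l", str(max_ping_payload(mtu)), "-n", str(count), host]


# "db-host.example" is a placeholder for the remote agent or DB node.
print(windows_ping_command("db-host.example", 9000))
```

If the 8972-byte ping fails while the 1472-byte one (standard 1500 MTU) succeeds, some hop on the path is not passing jumbo frames.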

Alberto De Luca [DevOps Enabler] commented 9th December 2022

Simon, quickly adding a screenshot below – can confirm the slope is much better with 10.2.10, yet it’s still present – any chance some sys-dev could take a look into this if the trend keeps showing an increased memory usage over time?
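To put a number on "the slope" rather than judging it from a screenshot, a least-squares fit over periodic memory samples is one option. This is a minimal sketch with made-up sample values and hypothetical function names; how the samples are collected (performance counters, trending on the process memory, etc.) is left open.

```python
def growth_per_day(samples):
    """Least-squares slope of memory usage in MB per day.

    samples: list of (day, memory_mb) tuples, e.g. one reading per day.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator


def days_until(samples, limit_mb):
    """Days from the last sample until limit_mb is reached, or None if not growing."""
    slope = growth_per_day(samples)
    if slope <= 0:
        return None
    _, last_memory = samples[-1]
    return (limit_mb - last_memory) / slope


# Made-up readings: (day, MB) growing by 100 MB per day.
readings = [(0, 1000), (1, 1100), (2, 1200), (3, 1300)]
print(growth_per_day(readings), days_until(readings, 2000))
```

Comparing the slope before and after the upgrade gives a concrete figure to share if the trend needs to be escalated.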


score 0 · Answer 1 · 2022-12-09T12:06:46+00:00

0

Alberto De Luca [DevOps Enabler]5.68K Posted 9th December 2022 2 Comments

Sharing the memory trend for SLDataGateway after the upgrade to 10.2.10
The slope seems more tolerable, yet it’s still present:


Jens Vandewalle [SLC] [DevOps Advocate] commented 9th December 2022

Hi Alberto, is it possible that you are using CassandraCluster (see DB.Xml for the Cassandra configuration)? In that case, the sawtooth-pattern memory leak will be fixed from 10.3.1 onwards.

Alberto De Luca [DevOps Enabler] commented 9th December 2022

Thanks for the hint Jens – no Cassandra cluster in this environment.
We run a local Cassandra DB on each agent – any additional check we could perform?

SLDataGateway – High Memory Consumption