Hello all! I'm hoping to get some information about an issue I recently had with NATS. The short version is that after we installed some Windows patches and rebooted the servers, the NATS services suddenly started using a lot more memory, and on a few of the agents the service leaked memory until it was consuming 40 GB of RAM. We were able to fix the problem by recreating the config files with the NATSConfigTool. Sending a NATS reset request through the client test tool didn't work. My question is: what causes the config files to become incorrectly configured when they were all working fine before the Windows update? If the files weren't configured correctly from the beginning, why didn't the problem show up earlier? Lastly, can we expect this scenario to come back? If so, is there a more elegant solution than recreating the config files and restarting the whole cluster?
Thank you in advance!
Hi Luis,
While an incorrectly configured NATS has a significant impact on DataMiner functionality, we do not currently know exactly why the memory usage increases, and we are investigating this.
We are aware that the way we handle NATS at the moment is not ideal, and we are actively working on improving this. A release version for these improvements is not known yet.
Hi Jack, there are a few internal documents that describe how to investigate NATS issues and go into detail on how the clustering algorithm works. Unfortunately, these haven’t been included in the official docs yet (https://docs.dataminer.services/index.html). Since this information is too extensive to include in a comment or answer here, I’ll try to convert it into the format that the website expects and commit it so that it is available online.
Just created the pull request: https://github.com/SkylineCommunications/dataminer-docs/pull/106/commits/a5e9e6fda2d90bf195c256a3a936716c37f2c7fd
Once it has been reviewed, it should become available on docs.dataminer.services.
Hi Laurens,
We have also had a couple of instances of this issue this week. Can you provide a suggested NATS configuration? I imagine that we, like many DM users, have had Windows patching similar to Luis's, and that this caused the issue. I understand that case-by-case advice may be required, but any general advice that could help would be appreciated.
Fortunately, this has only happened on our staging environment so far, but our production environment is due to be patched this coming week, and we would like to have a plan in place as soon as possible.
Can you also confirm whether this will affect all DM versions/NATS versions?