Hi There,
We are seeing a strange issue where the DMA (Version 10.1.0.0 - 11229CU10) throws the error "NATS Has Stopped, Restarting...". However, when we look for a NATS service (or the corresponding nats-streaming-server.exe), neither is present.
We have tried the suggestion: “On a DMA with NAS and NATS installed: stop the services (don’t delete them) and delete the C:\Skyline DataMiner\NATS folder. Restart your DMA, assert the services are reinstalled and no lingering error alarms indicating restarts of the services are present. If the alarms are present initially, they should be automatically cleared after some time.”
After that, the files are recreated, but no NATS service comes up and the errors persist.
Any help is much appreciated!
Hello,
It looks like the service could not be created for some reason.
You can find logging on the attempted installation of the NATS service in the SLCloudEndpointManager.txt log file.
Additionally, you can try reinstalling NATS manually using SLEndpointTool_Console.exe. It will report any errors it runs into during the process, which should help you move forward.
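If the log file is large, a small script can help narrow it down. The following is only a minimal sketch: the log path and the keywords are assumptions on my side (neither is confirmed in this thread), so adjust them to what you actually see on your DMA.

```python
from pathlib import Path

# Assumed location of the log file; adjust to where your DMA writes its logs.
LOG_PATH = Path(r"C:\Skyline DataMiner\Logging\SLCloudEndpointManager.txt")

# Hypothetical keywords; the actual wording used in the log may differ.
KEYWORDS = ("error", "exception", "fail")


def find_suspect_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.rstrip()))
    return hits


if __name__ == "__main__":
    if LOG_PATH.exists():
        with LOG_PATH.open(encoding="utf-8", errors="replace") as handle:
            for number, text in find_suspect_lines(handle):
                print(f"{number}: {text}")
    else:
        print(f"Log file not found: {LOG_PATH}")
```

The match is case-insensitive and deliberately broad, so expect some false positives.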
Hi Rodrigo,
I had this recently as well. On some DMAs NATS was not correctly installed, i.e. only nats-account-server.exe was running but not nats-streaming-server.exe.
What worked for me was this:
- Uninstall the NAS/NATS services via C:\Skyline DataMiner\Files\SLEndpointTool_Console.exe:
  - Run SLEndpointTool_Console.exe. Select "u" for uninstall and NAS as the endpoint type, and leave the other parameters at their defaults.
  - In my case SLEndpointTool_Console.exe terminated with an error, so I had to copy a newer version from a feature release (the one that worked for me was dated 04 Nov 2021).
- Reinstall the NAS/NATS services:
  - I wasn't able to reinstall the service immediately with the EndpointTool, as it threw an error during installation. Restarting the DMA solved this: at startup, NATS was reinstalled automatically.
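After the restart, it is worth confirming that both executables are actually running. Here is a rough sketch of how that could be scripted. It relies on the standard Windows `tasklist` command; the function name `missing_processes` is just something I made up for illustration.

```python
import shutil
import subprocess

# Executable names as mentioned in this thread.
EXPECTED = ("nats-streaming-server.exe", "nats-account-server.exe")


def missing_processes(tasklist_output, expected=EXPECTED):
    """Return the expected executables that do not appear in the tasklist output."""
    running = tasklist_output.lower()
    return [name for name in expected if name.lower() not in running]


if __name__ == "__main__":
    if shutil.which("tasklist") is None:
        print("tasklist not found; this check only works on Windows.")
    else:
        # CSV output without a header avoids truncated image names.
        result = subprocess.run(
            ["tasklist", "/FO", "CSV", "/NH"],
            capture_output=True,
            text=True,
            check=False,
        )
        missing = missing_processes(result.stdout)
        if missing:
            print("Not running:", ", ".join(missing))
        else:
            print("Both NATS executables are running.")
```

If only nats-streaming-server.exe is reported as not running, you are in the same situation I described above.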
Please note that this procedure reinstalls NATS as a standalone node that is not connected to the other nodes in the cluster. If you want it to join the cluster, you'll probably need to run NatsCustodianResetNatsRequest via ClientTestTool. I haven't done this yet, so my nodes are still running isolated, and I'm not sure how much of a problem that is.
Hi Alexander,
If the NATS nodes are not clustered, then functionality such as the SPI data offload to the cloud will only work for the DMAs in the cluster that are directly connected to the cloud. All other agents rely on NATS to publish their data through that connection.
As such, it is recommended to ensure that the NATS nodes are clustered and that the cluster currently mimics the DMS configuration.
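As a rough illustration of what "mimicking the DMS configuration" means today, the sketch below compares the agents in the DMS with the routes a NATS node reports. Please treat it as an assumption-heavy example rather than an official check: it assumes the NATS HTTP monitoring endpoint is enabled on port 8222 on your node and that its /routez response lists each route with an "ip" field. The function names and IP addresses are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_routez(host="localhost", port=8222):
    """Fetch route info from the NATS monitoring endpoint (assumed to be enabled)."""
    with urlopen(f"http://{host}:{port}/routez", timeout=5) as response:
        return json.load(response)


def missing_routes(local_ip, agent_ips, routez):
    """Return agent IPs (other than this node) that have no NATS route."""
    connected = {route.get("ip") for route in routez.get("routes", [])}
    expected = set(agent_ips) - {local_ip}
    return sorted(expected - connected)


if __name__ == "__main__":
    # Hypothetical example data; replace with the IPs of your own agents,
    # or pass the result of fetch_routez() instead of the sample dictionary.
    sample_routez = {"routes": [{"ip": "10.0.0.2"}]}
    agents = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    print(missing_routes("10.0.0.1", agents, sample_routez))
```

An empty list means every other agent has a route from this node; an isolated standalone node would list all other agents.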
We are actively working on a task to improve the self-healing capability of our NATS cluster, so that it can automatically detect when it is incomplete and run the reset by itself. (https://collaboration.skyline.be/squads/18/board/task/174977)
In the future there’ll be more critical data paths that will rely on NATS as the message broker between our various services, such as parameter changes.
Note that in the future we’ll also look into reducing the total number of NATS nodes that have to run in the cluster, for performance reasons. So the 1-to-1 match between agents and NATS nodes will not always be a correct indicator of a healthy cluster. Keep an eye on the NATS expert channel and Dojo for more info on that.