Random “NATS has stopped, restarting…” error

Solved · 1.21K views · 8th July 2022 · Tags: error, NAS, NATS
4
Jamie Stutz [SLC] [DevOps Member] · 1.18K · 6th July 2022 · 3 Comments

Our client has a cluster running 10.2.0.0-11774-CU3. It was upgraded a few weeks ago and, other than a hardware issue, has been stable. Last night we saw two errors indicating that both NATS and NAS had restarted. Both errors cleared after about two minutes, which I assume was how long the services took to restart.

After the restart everything appears normal. Looking at the trending for that server, I don't see any evidence of a memory leak or that the CPU is/was stressed. I did see this in the SLNATSCustodian log:

2022/07/06 06:13:14.728|SLNet.exe|HandleCustomMessage|ERR|0|266|(Code: 0x800402B8) Skyline.DataMiner.Net.Exceptions.DataMinerCommunicationException: Failed to connect to 100.x.x.88: Timed out (10s)
at Skyline.DataMiner.Net.DataMinerConnection.GetConnection()
at Skyline.DataMiner.Net.DataMinerConnection.HandleMessage(DMSMessage msg)
at Skyline.DataMiner.Net.Apps.NATSCustodian.NatsRoutesArbiterHelpers.GetCredentialsBytes(NatsNode seed)
at Skyline.DataMiner.Net.Apps.NATSCustodian.NATSCustodianMessageHandler.InnerHandle(IConnectionInfo info, NatsCustodianForwardCredentialsRequest message)
at System.Dynamic.UpdateDelegates.UpdateAndExecute3[T0,T1,T2,TRet](CallSite site, T0 arg0, T1 arg1, T2 arg2)
at Skyline.DataMiner.Net.Apps.NATSCustodian.NATSCustodianMessageHandler.HandleMessage(OperationMeta meta, IManagerStoreCustomRequest request)
at Skyline.DataMiner.Net.ManagerStore.CustomComponent.HandleMessage(IConnectionInfo connInfo, IManagerStoreCustomRequest request)
at Skyline.DataMiner.Net.ManagerStore.BaseManager.HandleCustomMessage(IConnectionInfo connInfo, IManagerStoreCustomRequest request)
HandleCustomMessage
2022/07/06 06:16:42.089|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|22|Reconfiguring local Nats because: SLCloud.xml contains unreachable NatsServer(s) (5 mins) and nats-server.config is misconfigured
2022/07/06 06:17:42.131|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|269|Reconfiguring local Nats because: Nats Service is not running
2022/07/06 06:18:41.867|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|14|Reconfiguring local Nats because: SLCloud.xml is missing new NatsServer(s) and nats-server.config is misconfigured
2022/07/06 07:00:42.345|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|101|Reconfiguring local Nats because: SLCloud.xml contains unreachable NatsServer(s) (5 mins) and nats-server.config is misconfigured
2022/07/06 07:11:42.354|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|23|Reconfiguring local Nats because: SLCloud.xml is missing new NatsServer(s) and nats-server.config is misconfigured

I hid part of the IP address, but the IP in the message "Failed to connect to 100.x.x.88" is that of the server itself.

So my question is this... given that everything looks OK now, is there any further action that needs to be taken here to either clean up or prevent a re-occurrence? Or, after the restart of the services, has the DMA self-healed?

Thanks!

Jamie Stutz [SLC] [DevOps Member] Selected answer as best 8th July 2022
Gellynck Jens [SLC] commented 7th July 2022

How many agents are in this cluster? Does it have failover agents?

Jamie Stutz [SLC] [DevOps Member] commented 7th July 2022

Hi Jens… there are 4 failover pairs in the cluster, but only one of them had the issue. Come to think of it, this is the same system you helped us with a few weeks ago, so maybe my initial statement about there being no previous issues was inaccurate. I had sort of blocked that issue out! 🙂 IIRC, that issue was also related to NATS, but in that case it was a memory leak. In this case there was no evidence of strain on the memory or CPU.

Jamie Stutz [SLC] [DevOps Member] commented 7th July 2022

Oh, and regarding the previous issue we experienced: that was on a different agent in the cluster.

1 Answer

2
Gellynck Jens [SLC] · 2.71K · Posted 8th July 2022 · 1 Comment

Hi Jamie,

It's very possible this was a one-time issue and will not re-occur, but it's also possible the root cause is a deeper issue (e.g. connection or network issues in the DataMiner System). When SLWatchDog detects that NATS has stopped, it automatically restarts the services in an effort to self-heal. This also generates the alarms you showed.

That being said, NATS has been a source of issues lately and a lot of bug fixing has been going on there. It's very possible this issue is already fixed in a later version.

I'd suggest keeping a close eye on these error alarms to see if it keeps happening. If it does, we will need to perform more troubleshooting to identify the exact root cause. If you see more NATS issues, please refer to Investigating_NATS_Issues.
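One possible way to keep an eye on recurrence is to tally the "Reconfiguring local Nats because: ..." entries from the SLNATSCustodian log per day and per reason. The sketch below is only an illustration: the pipe-separated field layout is inferred from the log excerpt in the question, not from any documented format, and the function name `tally_reconfigurations` is made up for this example.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt in the question (not documented):
# date time|process|method|level|field|field|message
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\|"
    r"(?P<process>[^|]*)\|(?P<method>[^|]*)\|(?P<level>[^|]*)\|"
    r"[^|]*\|[^|]*\|(?P<message>.*)$"
)
MARKER = "Reconfiguring local Nats because: "


def tally_reconfigurations(lines):
    """Count NATS reconfiguration entries per (date, reason)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # stack-trace lines and anything else without a timestamp
        message = match.group("message")
        if message.startswith(MARKER):
            reason = message[len(MARKER):]
            counts[(match.group("date"), reason)] += 1
    return counts


# Sample lines copied from the log excerpt in the question.
sample = [
    "at Skyline.DataMiner.Net.DataMinerConnection.GetConnection()",
    "2022/07/06 06:16:42.089|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|22|"
    "Reconfiguring local Nats because: SLCloud.xml contains unreachable "
    "NatsServer(s) (5 mins) and nats-server.config is misconfigured",
    "2022/07/06 06:17:42.131|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|269|"
    "Reconfiguring local Nats because: Nats Service is not running",
    "2022/07/06 07:00:42.345|SLNet.exe|ResetNATSClusterIfChangesAreFound|INF|0|101|"
    "Reconfiguring local Nats because: SLCloud.xml contains unreachable "
    "NatsServer(s) (5 mins) and nats-server.config is misconfigured",
]

for (date, reason), n in sorted(tally_reconfigurations(sample).items()):
    print(f"{date}  {n}x  {reason}")
```

To run it against a real log, pass the lines of the file instead of `sample`. If the same reason keeps showing up on later days, that would point to a recurring problem rather than a one-off.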

I hope this somewhat helps.

Best regards,

Jens

Jamie Stutz [SLC] [DevOps Member] commented 8th July 2022

Thanks Jens! We’ll keep an eye on it. I will also look over the NATS investigation link you posted and see what I find on the DMA. In any event, it’s good to have it as a reference for future occurrences. I figured it could be a random occurrence, but thought it wise to put the question out to Dojo anyway.
