I have a DMS with 2 agents. Each is on a different physical server.
It has no elements, scripts, views, ... it's empty.
Upgraded
from 10.2.0.0-11897-20220611-release
to 10.3.7.0-13107-20230614-release (latest feature as of this post)
I first attempted the upgrade by selecting 'cluster'. It indicated it had successfully upgraded both agents in the cluster (a single warning about longpath support).
However, both Cube and Upgrades/VersionHistory.txt showed that everything was still on 10.2.0.0-11897-20220611-release.
To combat this, I stopped both agents and performed the upgrade locally on each agent.
When starting the agents, only 1 agent in the cluster started correctly. When I attempted to open Cube on the broken agent, it got stuck on "retrieving initial data". Checking the logs, I found issues with NATS needing a restart and connection issues to Cassandra.
To combat this, I restarted the server holding the agent.
Despite several agent restarts after this, the agent still does not start.
Any ideas on how to get this fixed? Any ideas on how to prevent this from happening to other users?
Details:
I'm seeing the following errors in the logging:
SLAnalytics: 2023/07/10 15:26:57.542|SLAnalytics|SLNetConnection.cpp(169): Skyline::DataMiner::Analytics::SLNetConnection::openConnection)|ERR|0|Exception while opening SLNetConnection:
SLErrors: 2023/07/10 15:26:57.542|SLAnalytics.txt|SLAnalytics|SLNetConnection.cpp(169): Skyline::DataMiner::Analytics::SLNetConnection::openConnection)|ERR|0|Exception while opening SLNetConnection:
SLDataMiner is stuck on: 2023/07/10 15:26:41.915|SLDataMiner.exe 10.3.2321.1738|4108|3608|CRequest::Init|DBG|0|** Initializing SLNetCom
**********
SLNET indicated:
2023-07-10 15:28:48.212|26|ExecutionContext.RunInternal|Destroying connection e58fae35-39e9-448f-82a8-660f9dc22476 (DataMiner Cloud Platform): Authentication took too long.
2023-07-10 15:28:48.228|26|ExecutionContext.RunInternal|Destroying connection 12d50385-fc4b-4362-8999-d801630501c4 (SLNet on qa-dma-test-10): Authentication took too long.
2023-07-10 15:28:48.243|70|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform
2023-07-10 15:28:48.243|71|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-10 Application: SLNet on qa-dma-test-10
2023-07-10 15:28:48.243|70|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform ]
followed by constantly failing:
2023-07-10 15:49:39.626|91|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics ]
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64505 [SLAnalytics removed because of error: Authentication took too long.]
2023-07-10 15:49:43.349|62|AuthenticationStep|
SLWatchdog2
2023-07-10 15:47:27 2196|Failed to generate alarm for "NATS has stopped, restarting...": There's no connection available with this dataminer. (0x800402cdh)
2023-07-10 15:48:27 2196|NATS has stopped, restarting...
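To see at a glance which clients keep failing to authenticate, SLNet lines like the ones above can be tallied per computer and application. This is only a minimal sketch: the regular expression is derived from the few lines quoted in this post, and the function name count_auth_failures is made up for illustration, so other SLNet log variants may need a different pattern.

```python
import re
from collections import Counter

# Pattern derived from the "Destroy" lines quoted in this post (an assumption,
# not an official log format description).
AUTH_FAILURE = re.compile(
    r"\|Destroy\|Connection did not authenticate\. "
    r"Computer: (?P<computer>\S+) Application: (?P<application>.+?)\s*$"
)


def count_auth_failures(lines):
    """Count failed authentications per (computer, application) pair."""
    counts = Counter()
    for line in lines:
        match = AUTH_FAILURE.search(line)
        if match:
            counts[(match["computer"], match["application"])] += 1
    return counts


# Lines copied from the SLNet log above; the GenerateInformationAlarm lines
# repeat the same message and are deliberately not counted twice.
SAMPLE = [
    "2023-07-10 15:28:48.243|70|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform",
    "2023-07-10 15:28:48.243|71|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-10 Application: SLNet on qa-dma-test-10",
    "2023-07-10 15:28:48.243|70|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform ]",
    "2023-07-10 15:49:39.626|91|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics",
    "2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics ]",
]

if __name__ == "__main__":
    for (computer, application), n in count_auth_failures(SAMPLE).most_common():
        print(f"{n:3d}  {computer}  {application}")
```

To run it on a real log, pass the lines of the SLNet log file instead of SAMPLE.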
There is no upgrade.log file on either of the agents, at least not in that location, and not under DataMiner Logs either. I’m seeing backups of the logging from back in 2022, when I installed 10.2. Is it possible the log file no longer exists or was moved to a different location?
Fixed after some more detective sleuthing.
I focused on NATS as the likely root cause, as it has often been remarked that it can fail to play nice with DataMiner.
To help with this, I found the following page linked in a previous Dojo post:
https://docs.dataminer.services/user-guide/Troubleshooting/Procedures/Investigating_NATS_Issues.html
Eventually I discovered that the NATS service was not running, but NAS was. Firewall settings were OK. However, the nats-server.config located in C:\Skyline DataMiner\NATS\nats-streaming-server was not correctly configured.
Both servers need the same IP configured in the resolver setting. However, agent 1 of the cluster was set up with 0.0.0.0 and agent 2 was set up with the HTTPS URL of agent 1.
I changed both to point to the HTTPS URL of agent 1 and rebooted both servers.
This allowed everything to start correctly.
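For anyone wanting to check this on their own cluster, the comparison can be sketched as below. It assumes the resolver is set on a single line of the form "resolver: <value>" in nats-server.config. The two sample configs and their addresses are hypothetical placeholders, not the contents of my actual files, and the function names are made up for illustration.

```python
import re

# Assumption: the setting sits on one line, written as "resolver: <value>"
# (or "resolver = <value>").
RESOLVER_LINE = re.compile(
    r"^\s*resolver\s*[:=]\s*(?P<value>.+?)\s*$", re.MULTILINE
)


def resolver_value(config_text):
    """Return the resolver value found in the config text, or None."""
    match = RESOLVER_LINE.search(config_text)
    return match["value"] if match else None


def resolvers_match(config_a, config_b):
    """True only if both configs define a resolver and the values are equal."""
    a, b = resolver_value(config_a), resolver_value(config_b)
    return a is not None and a == b


# Hypothetical placeholder configs that mimic the mismatch described above.
AGENT_1 = "port: 4222\nresolver: URL(http://0.0.0.0:9090/jwt/v1/accounts/)\n"
AGENT_2 = "port: 4222\nresolver: URL(http://10.0.0.1:9090/jwt/v1/accounts/)\n"

if __name__ == "__main__":
    print("agent 1:", resolver_value(AGENT_1))
    print("agent 2:", resolver_value(AGENT_2))
    print("same resolver:", resolvers_match(AGENT_1, AGENT_2))
```

On real agents, the two strings would be the text of nats-server.config read from C:\Skyline DataMiner\NATS\nats-streaming-server on each server.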
Hi Jan
Not here with a solution, just wondering a couple of things…
1. Does the upgrade.log indicate an issue? An exception, an error, …
This is found in C:/Skyline DataMiner/Upgrades/Packages//upgrade.log
2. Were there any .NET packages installed during the upgrade? (Search for “InstallDotNet” during the step “ExecuteUpgradeActions”.)
-> If 1 of the packages got installed, did the server reboot?
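If the upgrade.log is large, a small script can pull out the lines worth reading first. This is a rough sketch only: the keyword list (InstallDotNet, exception, error) simply mirrors the two questions above, the function name find_suspect_lines is made up, and the sample text is invented placeholder content that does not reflect the real upgrade.log format.

```python
import re

# Keywords taken from the two questions above; matching is case-insensitive.
KEYWORDS = re.compile(r"InstallDotNet|exception|error", re.IGNORECASE)


def find_suspect_lines(log_text):
    """Return (line_number, line) pairs for lines that mention a keyword."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if KEYWORDS.search(line)
    ]


# Invented placeholder text, NOT real upgrade.log output.
SAMPLE_TEXT = (
    "placeholder: step started\n"
    "placeholder: InstallDotNet ran\n"
    "placeholder: an Exception was thrown\n"
    "placeholder: step finished\n"
)

if __name__ == "__main__":
    for number, line in find_suspect_lines(SAMPLE_TEXT):
        print(f"{number}: {line}")
```

A plain substring match like this will also flag harmless lines that merely contain the word "error", so treat the hits as a starting point rather than a verdict.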