I have a DMS with 2 agents. Each is on a different physical server.
It is empty: no elements, scripts, views, etc.
I tried to upgrade it from 10.2.0.0-11897-20220611-release to 10.3.7.0-13107-20230614-release (the latest feature release as of this post).
I first attempted the upgrade by selecting 'cluster'. It indicated it had successfully upgraded both agents in the cluster (a single warning about longpath support).
However, both the Cube client and Upgrades/VersionHistory.txt showed that everything was still on 10.2.0.0-11897-20220611-release.
To combat this, I stopped both agents and performed the upgrade locally on each agent.
When I started the agents again, only one agent in the cluster started correctly. When I tried to open the Cube client on the broken agent, it got stuck on "retrieving initial data". Checking the logs, I found messages about NATS needing a restart and connection issues to Cassandra.
To combat this, I restarted the server holding the agent.
Despite several agent restarts after this, the agent still does not start.
Any ideas on how to get this fixed? Any ideas on how to prevent this from happening to other users?
I'm seeing the following errors in the logging:
SLAnalytics: 2023/07/10 15:26:57.542|SLAnalytics|SLNetConnection.cpp(169): Skyline::DataMiner::Analytics::SLNetConnection::openConnection)|ERR|0|Exception while opening SLNetConnection:
SLErrors: 2023/07/10 15:26:57.542|SLAnalytics.txt|SLAnalytics|SLNetConnection.cpp(169): Skyline::DataMiner::Analytics::SLNetConnection::openConnection)|ERR|0|Exception while opening SLNetConnection:
SLDataMiner is stuck on: 2023/07/10 15:26:41.915|SLDataMiner.exe 10.3.2321.1738|4108|3608|CRequest::Init|DBG|0|** Initializing SLNetCom
2023-07-10 15:28:48.212|26|ExecutionContext.RunInternal|Destroying connection e58fae35-39e9-448f-82a8-660f9dc22476 (DataMiner Cloud Platform): Authentication took too long.
2023-07-10 15:28:48.228|26|ExecutionContext.RunInternal|Destroying connection 12d50385-fc4b-4362-8999-d801630501c4 (SLNet on qa-dma-test-10): Authentication took too long.
2023-07-10 15:28:48.243|70|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform
2023-07-10 15:28:48.243|71|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-10 Application: SLNet on qa-dma-test-10
2023-07-10 15:28:48.243|70|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform ]
followed by constantly failing:
2023-07-10 15:49:39.626|91|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics ]
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64505 [SLAnalytics removed because of error: Authentication took too long.]
2023-07-10 15:49:43.349|62|AuthenticationStep|
2023-07-10 15:47:27 2196|Failed to generate alarm for "NATS has stopped, restarting...": There's no connection available with this dataminer. (0x800402cdh)
2023-07-10 15:48:27 2196|NATS has stopped, restarting...
Fixed after some more detective sleuthing.
I focused on NATS as the likely root cause, since it is often mentioned as something that can fail to play nice with DataMiner.
To assist me, I found this linked in a previous Dojo post:
Eventually I discovered that the NATS service was not running, although NAS was. Firewall settings were OK. However, the nats-server.config file located in C:\Skyline DataMiner\NATS\nats-streaming-server was not configured correctly.
Both servers need the same IP configured in the resolver setting. However, Agent 1 of the cluster was set up with 0.0.0.0 and Agent 2 was set up with the HTTPS URL of Agent 1.
I changed both to point to the HTTPS URL of Agent 1 and rebooted both servers.
This allowed everything to start correctly.
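For anyone who wants to check this quickly on their own cluster, here is a rough Python sketch of the comparison I did by hand. It is not an official tool. It assumes the resolver is written as a single `resolver: ...` (or `resolver = ...`) line in nats-server.config, and the function names and sample values are made up for illustration. Copy the config text from each agent and pass it in:

```python
import re

# Assumption: the resolver is a single "resolver: <value>" or "resolver = <value>" line.
RESOLVER_RE = re.compile(r"^\s*resolver\s*[:=]\s*(.+?)\s*$", re.MULTILINE)


def extract_resolver(config_text):
    """Return the raw value of the first resolver setting, or None if absent."""
    match = RESOLVER_RE.search(config_text)
    return match.group(1) if match else None


def check_resolvers(configs):
    """configs maps an agent name to the text of its nats-server.config.

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    values = {name: extract_resolver(text) for name, text in configs.items()}
    for name, value in values.items():
        if value is None:
            problems.append(f"{name}: no resolver setting found")
        elif "0.0.0.0" in value:
            problems.append(f"{name}: resolver points to 0.0.0.0 ({value})")
    # All agents should carry the same resolver value.
    if len(set(values.values())) > 1:
        problems.append("resolver settings differ between agents")
    return problems


if __name__ == "__main__":
    # Hypothetical sample values, not real DataMiner defaults.
    sample = {
        "agent1": "resolver: URL(https://0.0.0.0/accounts/)\n",
        "agent2": "resolver: URL(https://agent-1.example/accounts/)\n",
    }
    for problem in check_resolvers(sample):
        print(problem)
```

In practice you would read the file from each server instead of using the hard-coded sample. An empty result only means the resolver lines match and none of them uses 0.0.0.0; it does not prove that the rest of the NATS setup is healthy.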