Question

Solved · 944 views · 11th July 2023 · Tags: cluster, dma, upgrade

Jan Staelens [SLC] [DevOps Advocate] · 899 · 10th July 2023 · 2 Comments

I have a DMS with 2 agents. Each is on a different physical server.

It has no elements, scripts, or views; it is empty.

Upgraded

from 10.2.0.0-11897-20220611-release
to 10.3.7.0-13107-20230614-release (latest feature as of this post)

I first attempted the upgrade by selecting ‘cluster’. It reported that both agents in the cluster had been upgraded successfully (with a single warning about long path support).

However, when I checked with the Cube client and in Upgrades/VersionHistory.txt, everything was still on 10.2.0.0-11897-20220611-release.

To combat this, I stopped both agents and performed the upgrade locally on each agent.

When I started the agents, only one agent in the cluster started correctly. When I tried to open the Cube client on the broken agent, it got stuck on “retrieving initial data”. In the logs I found messages about NATS needing a restart and connection issues with Cassandra.

To combat this, I restarted the server hosting the agent.

Despite several agent restarts after this, the agent still does not start.

Any ideas on how to get this fixed? Any ideas on how to prevent this from happening to other users?

Details:

I’m seeing the following errors in the logging:

SLDataMiner is stuck on:

2023/07/10 15:26:41.915|SLDataMiner.exe 10.3.2321.1738|4108|3608|CRequest::Init|DBG|0|** Initializing SLNetCom **********

SLNet indicated:

2023-07-10 15:28:48.212|26|ExecutionContext.RunInternal|Destroying connection e58fae35-39e9-448f-82a8-660f9dc22476 (DataMiner Cloud Platform): Authentication took too long.
2023-07-10 15:28:48.228|26|ExecutionContext.RunInternal|Destroying connection 12d50385-fc4b-4362-8999-d801630501c4 (SLNet on qa-dma-test-10): Authentication took too long.
2023-07-10 15:28:48.243|70|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform
2023-07-10 15:28:48.243|71|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-10 Application: SLNet on qa-dma-test-10
2023-07-10 15:28:48.243|70|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: DataMiner Cloud Platform ]

followed by constantly failing:

2023-07-10 15:49:39.626|91|Destroy|Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64637 [Connection did not authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics ]
2023-07-10 15:49:39.626|91|GenerateInformationAlarm|Not generating information alarm (no agent up and running): 56/2100000000/64505 [SLAnalytics removed because of error: Authentication took too long.]
2023-07-10 15:49:43.349|62|AuthenticationStep|

SLWatchdog2

2023-07-10 15:47:27 2196|Failed to generate alarm for “NATS has stopped, restarting…”: There’s no connection available with this dataminer. (0x800402cdh)
2023-07-10 15:48:27 2196|NATS has stopped, restarting…
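
To see at a glance which connections keep failing, the SLNet `Destroy` lines above can be tallied per computer and application. This is only a rough Python sketch, assuming the pipe-separated layout visible in these excerpts (timestamp, thread, method, message). The function name `count_failed_auth` is made up for illustration and is not part of DataMiner.

```python
import re
from collections import Counter

# Matches the message part of the SLNet "Destroy" lines quoted above.
FAILED_AUTH = re.compile(
    r"Connection did not authenticate\. Computer: (\S+) Application: (.+)"
)


def count_failed_auth(log_text: str) -> Counter:
    """Count failed authentications per (computer, application) pair.

    Assumes each log line looks like: timestamp|thread|method|message.
    Only lines whose method is "Destroy" are counted, so the matching
    GenerateInformationAlarm lines do not cause double counting.
    """
    counts: Counter = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 3)
        if len(parts) != 4 or parts[2] != "Destroy":
            continue
        match = FAILED_AUTH.search(parts[3])
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "2023-07-10 15:49:39.626|91|Destroy|Connection did not "
        "authenticate. Computer: QA-DMA-TEST-14 Application: SLAnalytics"
    )
    for (computer, application), n in count_failed_auth(sample).items():
        print(computer, application, n)
```

A pair that keeps climbing over time (here, the local SLAnalytics and the other agent's SLNet) suggests the problem sits in the messaging or authentication layer on the broken agent rather than in one client application.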

Robin Devos [SLC] [DevOps Advocate] Edited comment 11th July 2023

Robin Devos [SLC] [DevOps Advocate] commented 10th July 2023

Hi Jan
Not here with a solution, just wondering a couple of things…
1. Does the upgrade.log indicate an issue, such as an exception or an error?
This is found in C:/Skyline DataMiner/Upgrades/Packages//upgrade.log
2. Were there any .NET packages installed during the upgrade? (Search for “InstallDotNet” during the step “ExecuteUpgradeActions”.)
-> If one of the packages was installed, did the server reboot?
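
Both checks come down to a text search through upgrade.log. A rough Python sketch of that search follows, assuming the log is plain text. The helper name `scan_upgrade_log` and the keyword list are illustrative choices, not anything DataMiner provides.

```python
from typing import Dict, List

# Keywords taken from the two questions above; matching is case-insensitive.
KEYWORDS = ("InstallDotNet", "exception", "error")


def scan_upgrade_log(log_text: str) -> Dict[str, List[str]]:
    """Return the log lines that mention each keyword."""
    hits: Dict[str, List[str]] = {keyword: [] for keyword in KEYWORDS}
    for line in log_text.splitlines():
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword.lower() in lowered:
                hits[keyword].append(line.strip())
    return hits
```

To use it, read the file contents (for example with `pathlib.Path(...).read_text()`) and pass them in. An empty list for every keyword means neither an error nor a .NET installation was logged.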

Jan Staelens [SLC] [DevOps Advocate] commented 10th July 2023

There is no upgrade.log file on either of the agents, at least not in that location and not under DataMiner Logs either. I’m seeing backups of the logging from back in 2022, when I installed 10.2. Is it possible that the log file no longer exists or was moved to a different location?

1 Answer

score 2 · Answer 1 · 2023-07-10T15:16:17+00:00

Fixed after some more detective work.

I focused on NATS as the likely root cause, since it is often reported as failing to play nicely with DataMiner.

To help with the investigation, I used this guide, which was linked in a previous Dojo post:

https://docs.dataminer.services/user-guide/Troubleshooting/Procedures/Investigating_NATS_Issues.html

Eventually I discovered that the NATS service was not running, although the NAS service was. The firewall settings were OK, but the nats-server.config file located in C:\Skyline DataMiner\NATS\nats-streaming-server was not configured correctly.

Both servers need the same IP configured in the resolver setting. However, Agent 1 of the cluster was set up with 0.0.0.0, while Agent 2 was set up with the HTTPS URL of Agent 1.

I changed both to point to the HTTPS URL of Agent 1 and rebooted both servers.

This allowed everything to start correctly.
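
For anyone wanting to check this before or after an upgrade, the comparison can be scripted. This is a minimal Python sketch, assuming the setting appears in nats-server.config as a single line starting with `resolver`. The function names and the URLs in the usage example are placeholders, not values from my setup.

```python
import re
from typing import Dict, Optional

# Matches a line such as "resolver: <value>" or "resolver = <value>".
# The \b keeps other keys that merely start with "resolver" from matching.
RESOLVER_RE = re.compile(r"^\s*resolver\b\s*[:=]?\s*(.+?)\s*$", re.MULTILINE)


def resolver_value(config_text: str) -> Optional[str]:
    """Return the value of the first resolver line, or None if absent."""
    match = RESOLVER_RE.search(config_text)
    return match.group(1) if match else None


def resolvers_match(configs: Dict[str, str]) -> bool:
    """True when every agent's config has the same, non-missing resolver.

    `configs` maps an agent name to the text of its nats-server.config.
    """
    values = {resolver_value(text) for text in configs.values()}
    return len(values) == 1 and None not in values


if __name__ == "__main__":
    # Hypothetical config contents; real files hold more settings.
    configs = {
        "agent1": "resolver: URL(https://agent1.example/)\n",
        "agent2": "resolver: URL(https://agent1.example/)\n",
    }
    print(resolvers_match(configs))
```

The sketch only compares the values as text, so two different spellings of the same address would be reported as a mismatch. That seemed the safer default here.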

Not all cluster agents starting after upgrade to 10.3.7.0
