Hi Dojo,
We have a cluster of 4 agents (without Failover) that was recently upgraded to feature release 10.1.3-9963. Ever since, we have seen the 'NATS has stopped, restarting...' error, but only on agents 2, 3 and 4, not on agent 1.
We have the firewall rules enabled on all 4 agents (example of one agent below):
On agent 1, the NATS service runs continuously without stopping. On the other 3 agents, it keeps stopping and starting, and I cannot stop the service manually on those agents.
I tried to end the process tree of the corresponding process, nats-streaming-server.exe, but the process appears and disappears so quickly that by the time I click End Process Tree, it is already gone.
The NAS service is continuously running.
How can I remove these errors from the Alarm Console? Thank you in advance.
Update - May 7: We now have the error only on Agent-2 (.131), after the firewall was updated on all agents. I saw these logs on Agent-1:
[13296] 2021/05/07 16:27:52.737252 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection created
[13296] 2021/05/07 16:27:52.741110 [DBG] Account [ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL] fetch took 1.9951ms
[13296] 2021/05/07 16:27:52.741110 [WRN] Account fetch failed: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Account JWT lookup error: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [ERR] 172.30.144.131:62934 - cid:3708 - authentication error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection closed
[13296] 2021/05/07 16:27:52.780121 [INF] 172.30.144.131:62926 - rid:3706 - Router connection closed
[13296] 2021/05/07 16:27:53.356184 [ERR] Error trying to connect to route (attempt 225): dial tcp 172.30.144.131:6222: i/o timeout
[13296] 2021/05/07 16:27:54.360141 [DBG] Trying to connect to route on 172.30.144.131:6222
[13296] 2021/05/07 16:27:55.362065 [ERR] Error trying to connect to route (attempt 226): dial tcp 172.30.144.131:6222: i/o timeout
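For anyone sifting through a similar log, here is a minimal Python sketch that keeps only the warning and error lines. It assumes the line layout seen in the excerpt above ([pid] date time [LEVEL] message); the function name `problems` and the regular expression are illustrative and not part of NATS or DataMiner.

```python
import re

# Assumed layout, taken from the excerpt above: [pid] date time [LEVEL] message
LOG_LINE = re.compile(
    r"^\[(?P<pid>\d+)\] (?P<ts>\S+ \S+) \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)


def problems(lines):
    """Return (timestamp, level, message) tuples for WRN and ERR lines."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in ("WRN", "ERR"):
            found.append((match.group("ts"), match.group("level"), match.group("msg")))
    return found


# Three lines copied from the log excerpt above.
sample = [
    "[13296] 2021/05/07 16:27:52.737252 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection created",
    "[13296] 2021/05/07 16:27:52.741110 [ERR] 172.30.144.131:62934 - cid:3708 - authentication error",
    "[13296] 2021/05/07 16:27:53.356184 [ERR] Error trying to connect to route (attempt 225): dial tcp 172.30.144.131:6222: i/o timeout",
]

for ts, level, msg in problems(sample):
    print(ts, level, msg)
```

Filtering this way makes it easier to see the two separate symptoms: the account fetch failing against port 9090, and the route connection to port 6222 timing out.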
On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state, yet the person responsible says the ports are open. What am I missing? TIA
For future reference, the final solution was to:
- Adjust the firewall rules from "Domain" to "All"
- Use the Client Test Tool to connect to any agent and send a NatsCustodianResetNatsRequest with default values (IsDistributed = false)
This will cause both NAS and NATS to be reconfigured entirely, and then restarted on all agents in the cluster.
This is a firewall issue. NATS isn't starting because NAS (the account server) can't connect to the primary, which is the agent whose IP address comes first in lexicographical order (you can find it in nas.config). NAS can start without a connection to the primary, but it won't load any JWTs, which is why you're getting the 500 error when NATS tries to verify its account claims.
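Lexicographical order is not the same as numeric order, which can make the primary a different agent than you might expect. The Python sketch below illustrates the difference, assuming "lexicographical" means plain string comparison of the dotted addresses; the function names and the sample addresses are hypothetical, and nas.config remains the authoritative place to look.

```python
import ipaddress


def lexicographic_first(addresses):
    """Pick the address that sorts first when compared as plain strings."""
    return min(addresses)


def numeric_first(addresses):
    """Pick the address that sorts first when compared as IP numbers."""
    return min(addresses, key=ipaddress.ip_address)


# Hypothetical addresses, chosen to show where the two orderings disagree.
agents = ["10.0.0.9", "10.0.0.10", "10.0.0.11"]

print(lexicographic_first(agents))  # "10.0.0.10", because "1" sorts before "9"
print(numeric_first(agents))        # "10.0.0.9"
```

When all agents share the same prefix and the last octet has the same number of digits, both orderings give the same result.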
"On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state"
This is normal. 4222, 6222, and 8222 are the ports used by NATS, which isn't starting. This issue isn't related to those ports, but to port 9090 (the one used by NAS).
Make sure port 9090 is also open between all DMAs. You may also want to try changing the profile of the Windows firewall rules from Domain to All, and you may have to restart the NAS service on all 4 agents after adjusting the firewall.
As mentioned in a comment above, note that this specific alarm cannot be cleared from the Alarm Console.
In 10.1.3.0, NATS is only required when cloud-connecting your DMA. If you do not need this functionality, you can try disabling the NATS and NAS services in the Windows service manager; DataMiner will then no longer try to start those services. This requires a DataMiner restart.
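To verify from one agent that port 9090 on the others is actually reachable, and not just allowed in a firewall rule, a small TCP connect test can help. This is a minimal Python sketch using only the standard library; the function names and the default timeout are assumptions, and it only proves that something accepts connections on the port, not that NAS is answering correctly.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_agents(agents, ports=(9090, 4222, 6222, 8222)):
    """Map every (host, port) pair to whether a TCP connection succeeded."""
    return {
        (host, port): port_reachable(host, port)
        for host in agents
        for port in ports
    }
```

For example, calling `check_agents(["172.30.144.131"], ports=(9090,))` from Agent-1 would test the NAS port on Agent-2. Keep in mind that 4222, 6222 and 8222 will fail this test on any agent where NATS is not running, even if the firewall is configured correctly.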
Hi Arunkrishna,
If you want to remove the error alarms from your Alarm Console, you can right-click them and choose "Clear alarm...". After selecting this option, you can optionally add a comment and then clear the alarm, which removes it from the Alarm Console. However, if your NATS service is continuously stopping and restarting, these errors will come back in your Alarm Console.
Trying to clear the alarm will not work because this specific alarm is unclearable.
Hi Mattias, thanks for your input. We changed the firewall rules from Domain to All and restarted the NAS service too. I can also see that port 9090 is open on all DMAs, but the above-mentioned issue still persists. Please let me know if there is anything else we can try. Right now, the error is present only on agent 2.