Hi Dojo,
We have a cluster of 4 agents (without failovers) that was recently upgraded to feature release 10.1.3-9963. Ever since then we have seen the 'NATS has stopped, restarting...' error, but only on agents 2, 3 and 4, not on agent 1.
We have the firewall rules enabled on all 4 agents (example of one agent below):
I also see that on agent 1 the NATS service runs continuously without stopping, but on the other 3 agents it keeps stopping and starting. I cannot stop the service on those 3 agents; the attempt has no effect.
I tried to end the process tree of the corresponding process, nats-streaming-server.exe, but the process appears and disappears so quickly that by the time I click End Process Tree it has already gone.
The NAS service is continuously running.
How can I remove these errors from the Alarm Console? Thank you in advance.
Update - May 7: We now have the error only on Agent-2 (.131), after the firewall was updated on all agents. I saw these logs on Agent-1:
[13296] 2021/05/07 16:27:52.737252 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection created
[13296] 2021/05/07 16:27:52.741110 [DBG] Account [ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL] fetch took 1.9951ms
[13296] 2021/05/07 16:27:52.741110 [WRN] Account fetch failed: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Account JWT lookup error: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [ERR] 172.30.144.131:62934 - cid:3708 - authentication error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection closed
[13296] 2021/05/07 16:27:52.780121 [INF] 172.30.144.131:62926 - rid:3706 - Router connection closed
[13296] 2021/05/07 16:27:53.356184 [ERR] Error trying to connect to route (attempt 225): dial tcp 172.30.144.131:6222: i/o timeout
[13296] 2021/05/07 16:27:54.360141 [DBG] Trying to connect to route on 172.30.144.131:6222
[13296] 2021/05/07 16:27:55.362065 [ERR] Error trying to connect to route (attempt 226): dial tcp 172.30.144.131:6222: i/o timeout
On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state, yet the person responsible says the ports are open. What am I missing? TIA
This is a firewall issue. NATS isn't starting because NAS (the account server) can't connect to the primary, which is the agent whose IP address comes first in lexicographical order (you can find this in nas.config). NAS can start without a connection to the primary, but it won't load any JWTs, which is why you're getting the 500 error when NATS tries to verify its account claims.
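To illustrate what "lowest lexicographical IP address" means in practice, here is a minimal Python sketch. It assumes the comparison is a plain string comparison, which is my reading of the description above; the function names and the sample addresses are hypothetical, and nas.config remains the place to confirm which agent is actually the primary.

```python
import ipaddress


def lowest_lexicographical(agent_ips):
    """Return the IP that sorts first when the addresses are compared as strings."""
    return min(agent_ips)


def lowest_numeric(agent_ips):
    """Return the numerically lowest IP, shown only for comparison."""
    return str(min(ipaddress.ip_address(ip) for ip in agent_ips))


# Hypothetical addresses, not taken from the cluster in this thread.
sample = ["10.0.0.9", "10.0.0.10"]
print(lowest_lexicographical(sample))  # string order: "10.0.0.10" sorts before "10.0.0.9"
print(lowest_numeric(sample))          # numeric order: 10.0.0.9 is lower
```

The two orderings can disagree, so the primary is not necessarily the agent you would pick by looking at the numbers.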
"On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state"
This is normal. 4222, 6222, and 8222 are the ports used by NATS, which isn't starting. This issue isn't related to those ports, but to port 9090 (the one used by NAS).
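If you want to see what NAS itself answers, a small Python sketch like the one below requests the same kind of URL that appears in the NATS log (/jwt/v1/accounts/<account id> on port 9090). This is only a diagnostic sketch: the helper names are hypothetical, and using 127.0.0.1 as the host is an assumption (the log shows 0.0.0.0).

```python
import urllib.error
import urllib.request


def account_url(account_id, host="127.0.0.1", port=9090):
    """Build the account JWT URL in the form seen in the NATS log."""
    return f"http://{host}:{port}/jwt/v1/accounts/{account_id}"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET on url, or None if nothing answers."""
    # Bypass any system proxy so the request goes straight to the agent.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 500 when NAS is up but cannot serve the JWT
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timed out, or blocked
```

For example, fetch_status(account_url("<account id from the log>")) returning 500 would match the log above (NAS is running but has no JWTs), while None would suggest that nothing is answering on port 9090 at all.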
Make sure port 9090 is also open between all DMAs. You may also want to try changing the profile of the Windows firewall rules from Domain to All. After adjusting the firewall, you may have to restart the NAS service on all 4 agents.
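A quick way to verify this from each agent is to attempt a TCP connection to the other agents on port 9090. The Python sketch below does that with the standard socket module; the function names are hypothetical and you have to fill in the host list with your own agent addresses. Note that it only shows whether a connection can be established from the machine where you run it, so it should be run on every agent, and in particular from agent 2 towards the primary.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Return a dict mapping (host, port) to True/False for every combination."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

For example, check_ports(["<IP of the primary>"], [9090]) run on agent 2 tells you whether NAS on agent 2 can reach NAS on the primary. A False result for 4222, 6222 or 8222 on an agent where NATS is not running is expected, because nothing is listening there yet.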
Hi Mattias, thanks for your input. We changed the firewall rules from Domain to All and restarted the NAS service too. I can also see that port 9090 is open on all DMAs, but the issue mentioned above still persists. Please let me know if there is anything else we can try. Right now the error is present only on agent 2.