Hi Dojo,
We have a cluster of 4 agents (without failovers) that was recently upgraded to feature release 10.1.3-9963. Ever since, we have been seeing the 'NATS has stopped, restarting...' error, but only on agents 2, 3 and 4, not on agent 1.
We have the firewall rules enabled on all 4 agents (example of one agent below):
I also see that on agent 1 the NATS service runs continuously without stopping, while on the other 3 agents it keeps stopping and starting. I cannot stop the service on those 3 agents; stopping it does not work.
I tried to end the process tree of the corresponding process, nats-streaming-server.exe, but the process also appears and disappears continuously, so by the time I click End Process Tree it has already disappeared.
The NAS service is continuously running.
How can I remove these errors from the Alarm Console? Thank you in advance.
Update - May 7: We now have the error only on Agent-2 (.131), after the firewall was updated on all agents. I saw these logs on Agent-1:
[13296] 2021/05/07 16:27:52.737252 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection created
[13296] 2021/05/07 16:27:52.741110 [DBG] Account [ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL] fetch took 1.9951ms
[13296] 2021/05/07 16:27:52.741110 [WRN] Account fetch failed: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Account JWT lookup error: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [ERR] 172.30.144.131:62934 - cid:3708 - authentication error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection closed
[13296] 2021/05/07 16:27:52.780121 [INF] 172.30.144.131:62926 - rid:3706 - Router connection closed
[13296] 2021/05/07 16:27:53.356184 [ERR] Error trying to connect to route (attempt 225): dial tcp 172.30.144.131:6222: i/o timeout
[13296] 2021/05/07 16:27:54.360141 [DBG] Trying to connect to route on 172.30.144.131:6222
[13296] 2021/05/07 16:27:55.362065 [ERR] Error trying to connect to route (attempt 226): dial tcp 172.30.144.131:6222: i/o timeout
On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state, yet the person responsible says the ports are open. What am I missing? TIA
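A port that is "open" in the firewall and a port that a process is actually listening on are two different things, so it can help to test the ports from another agent. Below is a minimal Python sketch, not part of DataMiner, that tries a TCP connection to each of the three ports mentioned above. The function name check_ports and the 2-second timeout are my own choices for illustration. As a rough rule of thumb, "refused" usually means the host is reachable but nothing is listening on that port, while "timeout" (as in the i/o timeout lines in the log above) usually points to traffic being dropped along the way, for example by a firewall.

```python
import socket

# Ports mentioned in this thread: 4222, 6222 and 8222.
NATS_PORTS = (4222, 6222, 8222)


def check_ports(host, ports=NATS_PORTS, timeout=2.0):
    """Try a TCP connection to each port and return {port: status}.

    status is one of: "open", "refused", "timeout", "error".
    """
    results = {}
    for port in ports:
        try:
            # create_connection performs a full TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = "open"
        except ConnectionRefusedError:
            # Host answered, but no process accepts connections on this port.
            results[port] = "refused"
        except socket.timeout:
            # No answer at all within the timeout.
            results[port] = "timeout"
        except OSError:
            # Anything else, e.g. host unreachable or name resolution failure.
            results[port] = "error"
    return results


if __name__ == "__main__":
    # Agent-2 address taken from the log lines above; run this from another agent.
    for port, status in check_ports("172.30.144.131").items():
        print(f"{port}: {status}")
```

Run it from agent 1 against agent 2, and again on agent 2 itself against 127.0.0.1. If the local check also reports "refused", the NATS process is not staying up long enough to listen, and the firewall is not the only problem.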
For future reference, the final solution was to:
- Adjust the firewall rules from "Domain" to "All"
- Using Client Test Tool, connect to any agent and send a NatsCustodianResetNatsRequest with default values (IsDistributed = false)
This will cause both NAS and NATS to be reconfigured entirely, and then restarted on all agents in the cluster.