Hi Dojo,
We have a cluster of 4 agents (without failovers) that was recently upgraded to feature release 10.1.3-9963. Ever since then we have seen the 'NATS has stopped, restarting...' error, but only on agents 2, 3 and 4, not on agent 1.
We have the firewall rules enabled on all 4 agents (example of one agent below):
I also see that on agent 1 the NATS service runs continuously without stopping, but on the other 3 agents it keeps stopping and starting. I cannot stop the service on those 3 agents; the stop attempt does not work.
I tried to end the process tree of the corresponding process, nats-streaming-server.exe, but the process also appears and disappears continuously, so by the time I click End Process Tree it has already disappeared.
The NAS service is continuously running.
How can I remove these errors from the alarm console? Thank you in advance.
Update (May 7): We now have the error only on Agent-2 (.131), after the firewall was updated on all agents. I saw these logs on Agent-1:
[13296] 2021/05/07 16:27:52.737252 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection created
[13296] 2021/05/07 16:27:52.741110 [DBG] Account [ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL] fetch took 1.9951ms
[13296] 2021/05/07 16:27:52.741110 [WRN] Account fetch failed: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Account JWT lookup error: could not fetch <"http://0.0.0.0:9090/jwt/v1/accounts/ABRBZXY4MLTM2LGLUXSGVA3W6WP7AFSF5ZMIJBG4F2V2HI6NBTX2VMKL">: 500 Internal Server Error
[13296] 2021/05/07 16:27:52.741110 [ERR] 172.30.144.131:62934 - cid:3708 - authentication error
[13296] 2021/05/07 16:27:52.741110 [DBG] 172.30.144.131:62934 - cid:3708 - Client connection closed
[13296] 2021/05/07 16:27:52.780121 [INF] 172.30.144.131:62926 - rid:3706 - Router connection closed
[13296] 2021/05/07 16:27:53.356184 [ERR] Error trying to connect to route (attempt 225): dial tcp 172.30.144.131:6222: i/o timeout
[13296] 2021/05/07 16:27:54.360141 [DBG] Trying to connect to route on 172.30.144.131:6222
[13296] 2021/05/07 16:27:55.362065 [ERR] Error trying to connect to route (attempt 226): dial tcp 172.30.144.131:6222: i/o timeout
On agent-2 I see that ports 4222, 6222 and 8222 are not in a listening state, yet the person responsible says the ports are open. What am I missing? TIA
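For anyone checking the same thing: a port being "open" in the firewall and a port being "listening" are different conditions, and a TCP connect attempt from another agent can help tell them apart. Below is a minimal Python sketch, not an official DataMiner tool. The function names probe_port and probe_ports are made up for illustration, and the reading of the results is a general rule of thumb: "refused" usually means the host was reached but no process is bound to the port, while "timeout" (as in the "i/o timeout" lines above) usually means the packets are being dropped somewhere or the host is unreachable.

```python
import socket

# Ports mentioned in the question (assumed to be the NATS client, route and monitoring ports).
NATS_PORTS = (4222, 6222, 8222)


def probe_port(host, port, timeout=3.0):
    """Attempt one TCP connection and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"   # a process accepted the connection
    except ConnectionRefusedError:
        return "refused"         # host reachable, nothing bound to the port
    except socket.timeout:
        return "timeout"         # no answer at all within the timeout
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host, name resolution failure


def probe_ports(host, ports=NATS_PORTS, timeout=3.0):
    """Probe several ports on one host and return {port: outcome}."""
    return {port: probe_port(host, port, timeout) for port in ports}
```

Run from Agent-1, a call such as probe_ports("172.30.144.131") would report one outcome per port for Agent-2. If all three come back as "refused", the firewall is probably not the problem and the NATS process on Agent-2 is simply not staying up long enough to bind its ports.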
Hi Arunkrishna,
If you want to remove the error alarms from your alarm console, you can right-click them and choose "Clear alarm...". After selecting this option, you can optionally add a comment and then clear the alarm, which removes it from the alarm console. However, if your NATS service is continuously stopping and restarting, these errors will come back in your alarm console.
Trying to clear the alarm will not work because this specific alarm is unclearable.