Recently we had a situation where all SRM resources suddenly disappeared from the agent, though Elastic and Cassandra databases were ready and available.
We finally managed to get them back by restarting the agent with all elements stopped. Still, the collected logs didn't give us a clear indication of the culprit, so I wonder if there is a recommended procedure for troubleshooting such situations. How can I tell where the resources are stored? And how can I check that source?
These are my go-to steps when troubleshooting any kind of SRM issue.
- Check the following log files:
  - SLResourceManager (after a restart, the relevant entries may be in its _BAK log file)
  - SLDBConnection
  - SLSearch
  - SLError
  - SLNet
- Check whether the Resources.xml file still contains everything (the resources are still stored in the Resources.xml file).
- Try to fetch the SRM objects using the client test tool (Advanced --> Apps --> ResourceManager).
- Follow Cube while it retrieves the resources, and check whether the follow shows a request message being sent and a response message coming back with the resources, or possibly an exception.
- Check whether the SLNet or SLDataGateway process has disappeared, and whether it created a crash dump when it did.
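The log and Resources.xml checks above could be scripted roughly as follows. This is only a sketch in Python, not an official tool: the function names, the search keywords, and the idea of matching files by name prefix (so that _BAK copies are included) are assumptions for illustration. The XML check makes no assumption about the schema; it only verifies that the file parses and counts the elements per tag name.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Log names taken from the list above; files are matched by prefix,
# so variants such as a _BAK copy are picked up as well.
LOG_NAMES = ("SLResourceManager", "SLDBConnection", "SLSearch", "SLError", "SLNet")

# Assumed search terms; extend these with whatever is relevant to your case.
KEYWORDS = ("exception", "error", "failed")


def scan_logs(log_dir, names=LOG_NAMES, keywords=KEYWORDS):
    """Return (file name, line number, text) for every log line containing a keyword."""
    hits = []
    for name in names:
        for path in sorted(Path(log_dir).glob(f"{name}*")):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                lowered = line.lower()
                if any(keyword in lowered for keyword in keywords):
                    hits.append((path.name, number, line.strip()))
    return hits


def count_xml_elements(xml_path):
    """Parse the file (raises ET.ParseError if it is malformed) and count elements per tag."""
    root = ET.parse(xml_path).getroot()
    # Drop a possible "{namespace}" prefix so the counts are readable.
    return Counter(element.tag.split("}")[-1] for element in root.iter())
```

For example, `scan_logs(r"C:\Skyline DataMiner\Logging")` and `count_xml_elements(r"C:\Skyline DataMiner\ResourceManager\Resources.xml")`, where both paths are assumed default locations that you should verify on your own agent. Comparing the element counts with those of a backup copy of the file gives a quick indication of whether resources are missing.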
In this particular case, I suspect that the SLNet process disappeared, after which the cache was cleared and not updated again from the list.
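To quickly rule out a missing process, the output of the Windows `tasklist` command can be checked for the expected processes. The sketch below is in Python; the image names `SLNet.exe` and `SLDataGateway.exe` are assumptions derived from the process names mentioned above, and `missing_processes` is a hypothetical helper name.

```python
import csv
import io

# Assumed image names, based on the processes named above.
EXPECTED = ("SLNet.exe", "SLDataGateway.exe")


def missing_processes(tasklist_csv, expected=EXPECTED):
    """Return the expected image names that do not appear in `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    # The first CSV column is the image name; compare case-insensitively.
    running = {row[0].lower() for row in rows if row}
    return [name for name in expected if name.lower() not in running]
```

On the agent itself, the input can be obtained with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`. Keep in mind that this only shows the current state: a process that crashed and was started again will be listed as running, so the crash dump check remains necessary.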
Thanks Thomas. I’ll take these steps into account in case of any other similar situation.