I recently had to restart one of my agents for an unrelated reason,
and what I describe below happened.
Once the agent had started, I noticed that 13 of the 310 elements, across various protocols, started with the same error message:
Initializing the protocol for ELEMENT_NAME failed. General database failure. (hr = 0x80040226)
When I check the logs of the elements I do not see any errors; they even indicate that the elements initialized normally, for example:
CElement :: NotifySNMPManagers | DBG | 5 | Notified SNMPManagerV2 of active 26416/540. The operation completed successfully. (hr = 0x00000000)
**********
When I try to start the element, the option to start is not available, so I choose the only viable option, "restart", and the element does indeed start normally.
I remember that this had happened before, but I considered it an isolated case until today.
I would like to know if there is a reference for the error codes so I can move forward with a more in-depth investigation of the root cause.
I hope you can guide me in this regard.
Greetings
Hey Christhiam,
A comprehensive list of the error messages and their meanings can be found here.
"Initializing the protocol for ELEMENT_NAME failed. General database failure. (hr = 0x80040226)" is probably the culprit here and points towards issues with fetching the correct data from the database.
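As a side note on reading the hr values themselves: they follow the standard Windows HRESULT layout, so they can be split into a failure bit, a facility and a code. The sketch below only does that generic split; the function name is made up for illustration, and what a specific code means for DataMiner still has to come from the product's own list.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its standard fields (generic layout only)."""
    hr &= 0xFFFFFFFF  # also accept values given as negative signed integers
    return {
        "failure": bool(hr >> 31),       # bit 31: 1 = failure, 0 = success
        "facility": (hr >> 16) & 0x7FF,  # bits 16-26: which subsystem defined the code
        "code": hr & 0xFFFF,             # bits 0-15: facility-specific error code
    }


if __name__ == "__main__":
    # The three hr values that appear in this thread.
    for value in (0x80040226, 0x80131500, 0x00000000):
        print(f"0x{value:08X} -> {decode_hresult(value)}")
```

For 0x80040226 this gives a failure with facility 4 and code 0x0226. Facility 4 is FACILITY_ITF, which means the code is defined by the component that returned it rather than by Windows, so a generic Windows lookup will not explain it.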
A good starting point might be to check the SLDataGateway and SLDBConnection log files for issues.
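If the log files are large, a quick tally of which exception types occur, and how often, can help to spot the dominant problem. The following is a minimal sketch that assumes the logs are plain text; the function name is hypothetical and nothing here is a DataMiner tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches class names ending in "Exception", with an optional namespace prefix.
# The bare word "Exception" on its own is deliberately not counted.
EXCEPTION_NAME = re.compile(r"\b(?:\w+\.)*(\w+Exception)\b")


def count_exceptions(lines: Iterable[str]) -> Counter:
    """Count, per exception type, the number of lines that mention it."""
    counts: Counter = Counter()
    for line in lines:
        # set() so that a type repeated within one line is counted once
        for name in set(EXCEPTION_NAME.findall(line)):
            counts[name] += 1
    return counts
```

To use it, pass an open file handle for the log you are checking, for example `count_exceptions(open(path, errors="replace"))`, where `path` is wherever that log file is located on your agent.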
Hi Brent,
I had to restart the agent again but this time I took the opportunity to restart the DB.
On this occasion, the number of elements that started in an error state increased to 21.
I looked in the logs you recommended and found the following message, which is repeated many times in SLDBConnection:
2021/08/11 16:52:30.902|SLDBConnection|SLDBConnection|INF|0|76|CassandraConnection.ExecuteAsync (INSERT INTO datapoints("d","e","p","w","i")VALUES (?,?,?,?,?)USING TTL ?;) - Exception: DBGatewayException(SLCassandraClassLibrary.DBGateway.Cassandra.StorageManagers.SingleNode.CassandraConnection,,UNKNOWN) (Code: 0x80131500) SLDataGateway.Types.DBGatewayException: CassandraConnection.ExecuteAsync (INSERT INTO datapoints("d","e","p","w","i")VALUES (?,?,?,?,?)USING TTL ?;) - Exception: System.AggregateException: One or more errors occurred. ---> Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 127.0.0.1:9042: BusyPoolException 'All connections to host 127.0.0.1:9042 are busy, 2048 requests are in-flight on each 2 connection(s)')
at Cassandra.Requests.RequestHandler.GetNextValidHost(Dictionary`2 triedHosts)
at Cassandra.Requests.RequestExecution.d__13.MoveNext()
--- End of inner exception stack trace ---
---> (Inner Exception #0) Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 127.0.0.1:9042: BusyPoolException 'All connections to host 127.0.0.1:9042 are busy, 2048 requests are in-flight on each 2 connection(s)')
at Cassandra.Requests.RequestHandler.GetNextValidHost(Dictionary`2 triedHosts)
at Cassandra.Requests.RequestExecution.d__13.MoveNext() Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 127.0.0.1:9042: BusyPoolException 'All connections to host 127.0.0.1:9042 are busy, 2048 requests are in-flight on each 2 connection(s)')
at Cassandra.Requests.RequestHandler.GetNextValidHost(Dictionary`2 triedHosts)
at Cassandra.Requests.RequestExecution.d__13.MoveNext()
--- End of inner exception stack trace ---
at SLCassandraClassLibrary.DBGateway.ExceptionHandlers.ExceptionHandler.handle(DBGatewayException exception)
at SLCassandraClassLibrary.DBGateway.Cassandra.StorageManagers.SingleNode.CassandraConnection.c__DisplayClass175_0.b__0(Task`1 c)
Cleaned Stack !!!
**********
Can you give me some guidance on the BusyPoolException?
As always, I appreciate your help.
Regards
The BusyPoolException is an unexpected one. Essentially, the driver only allows so many concurrent requests per connection to a Cassandra node (2048 here). When the driver notices that a host has reached this amount, it will try the next host according to the load-balancing policy. If all the nodes in the setup are at the max, the query fails with the BusyPoolException seen in your log.
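To make the numbers from the log message concrete (2 connections, 2048 in-flight requests each, one host at 127.0.0.1:9042), here is a toy model of that behaviour. It is not the driver's actual implementation; the class and function names are invented for illustration only.

```python
from dataclasses import dataclass
from typing import List


class AllHostsBusy(Exception):
    """Toy stand-in for the failure the driver reports when every host is busy."""


@dataclass
class Host:
    address: str
    connections: int = 2                      # value taken from the log message
    max_in_flight_per_connection: int = 2048  # value taken from the log message
    in_flight: int = 0

    @property
    def capacity(self) -> int:
        # Total requests that may be outstanding towards this host.
        return self.connections * self.max_in_flight_per_connection

    def is_busy(self) -> bool:
        return self.in_flight >= self.capacity


def send_request(hosts: List[Host]) -> Host:
    """Hand the request to the first host with free capacity, else fail."""
    for host in hosts:
        if not host.is_busy():
            host.in_flight += 1  # the request is now outstanding
            return host
    raise AllHostsBusy(f"all {len(hosts)} host(s) are at capacity")
```

With the single node from the log, the capacity is 2 x 2048 = 4096 outstanding requests. In this model the 4097th request that arrives before any earlier one completes has nowhere to go, which is the situation the log message describes.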
So this leads us to believe that some element or protocol is currently spamming the database with too many read/write requests. One way to investigate this is to stop the DMA and use "C:\Skyline DataMiner\Tools\Change Element States Offline.exe" to stop all the elements. Then start the DMA and, once it is running, start the elements in small batches (10 or so). By doing this you can start narrowing down the elements that could be causing the issue.
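To keep track of which elements belong to which batch while doing this, a small helper like the one below could be used to plan the batches in advance. It is only a sketch: the element names are placeholders, and it does not call any DataMiner API, so the elements themselves still need to be started by hand.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    # Placeholder names; replace with the real element names of the agent.
    elements = [f"Element {n}" for n in range(1, 311)]
    for number, batch in enumerate(batches(elements), start=1):
        print(f"Batch {number}: {batch[0]} .. {batch[-1]}")
```

After starting each batch, check the SLDBConnection log before moving on to the next one. If the BusyPoolException starts to appear after a particular batch, the suspect is most likely among those 10 elements.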
Hi Brent,
Thank you very much for the list of error codes.
I’ll see what I find in the logs you mentioned.
Regards