Hello community,
We have deployed a separate 3-node Cassandra cluster (and a 3-node OpenSearch cluster) as well as one LAB DMA. We connected the LAB DMA to these clusters.
We added some elements to generate data (alarms, trends). Then we tested the stability of the system by bringing down database nodes (both Cassandra and OpenSearch nodes). All tests went well, even better than expected: for example, historical alarms were still shown when all 3 Cassandra nodes were down (I guess the DMA uses the offloaded data in this case).
But then we shut down all servers to move them. Once they were booted back up, we noticed one element in error.
The error says there is a general database failure and therefore the element cannot start. So we repeated the "test" and shut everything down once again. After that, the element that was initially in error was OK, but two other elements showed that error.
After some manipulation in Cube (stopping, activating, ... multiple times), one of these two became OK. The other one is still in error.
Other interesting data:
Element log:
2024/07/17 17:05:22.045|SLProtocol - 22284 - TNS4200-1 - copy|19560|ElementDataPagedQuerier::Query|ERR|-1|Could not start reading the element data for 38603/7: 0x80131500:
2024/07/17 17:05:22.047|SLProtocol - 22284 - TNS4200-1 - copy|19560|CProtocol::InitializeParameters|ERR|0|Failed to query elementdata for 38603/7: General database failure.
2024/07/17 17:05:22.047|SLProtocol - 22284 - TNS4200-1 - copy|19560|CProtocol::Init|ERR|0|InitializeParameters failed General database failure. (hr = 0x80040226)
2024/07/17 17:05:22.048|SLDataMiner.exe - TNS4200-1 - copy|15024|CElement::Start|ERR|-1|InitializeProtocol for Element TNS4200-1 - copy failed with General database failure.. (hr = 0x80040226)
2024/07/17 17:05:22.049|SLDataMiner.exe - TNS4200-1 - copy|15024|CElement::Activate|ERR|-1|Start failed. General database failure. (hr = 0x80040226)
2024/07/17 17:05:22.049|SLDataMiner.exe - TNS4200-1 - copy|15024|CElement::SetState|DBG|0|** Setting state from 1 to 10 (RealState = 4)
2024/07/17 17:05:22.049|SLDataMiner.exe - TNS4200-1 - copy|15024|CElement::SetState|DBG|0|** Setting state finished.
Database connection log (we get this error whenever the erroneous element is activated):
2024/07/17 17:44:05.872|SLDBConnection|StartPagedRead|INF|0|73|System.AggregateException: One or more errors occurred. ---> Cassandra.ReadFailureException: Server failure during read query at consistency Quorum (2 response(s) were required but only 0 replica(s) responded, 2 failed) ... ...
Logs from the Cassandra node (obviously failing to get data for the erroneous element 38603/7):
ERROR [ReadStage-2] 2024-07-17 15:44:05,896 NoSpamLogger.java:111 - Scanned over 100001 tombstones during query 'SELECT v, vu FROM dmsdemo_elementdata.elementdata WHERE d = 38603 AND e = 7 AND p > 6004 AND i > '290602.520.5102' LIMIT 5000 ALLOW FILTERING' (last scanned row token was -4197853928192400396 and partion key was ((38603, 7), 17010, 131985986)); query aborted
Issuing the query from the log above manually via cqlsh gives me this:
Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed - received 0 responses and 2 failures: READ_TOO_MANY_TOMBSTONES from /10.185.116.211:7000, READ_TOO_MANY_TOMBSTONES from /10.185.116.213:7000" info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0, 'failures': 2, 'error_code_map': {'10.185.116.211': '0x0001', '10.185.116.213': '0x0001'}}
So I assume the data related to the element got corrupted in the Cassandra cluster and can't be read by the DMA. I also ran nodetool repair --full on one Cassandra node, but without effect.
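In case it is useful, tombstone statistics for that table can also be inspected with nodetool tablestats (just a sketch, using the keyspace/table names from the log above; the annotations marked with # are mine, showing the relevant output lines):

nodetool tablestats dmsdemo_elementdata.elementdata
# relevant lines in the output:
#   Average tombstones per slice (last five minutes): ...
#   Maximum tombstones per slice (last five minutes): ...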
My questions could be condensed to these:
- Why did this situation happen? The cluster is there to provide robustness and avoid loss of data.
- Why did some elements revert back to a normal state if the assumption is that the data was lost?
- Isn't there a way to force the erroneous element to rewrite the essential data that is blocking it from starting normally?
Regards,
Milos
Hi Milos,
It's great to hear how you are testing the system; thanks for sharing that. When using an OpenSearch and Cassandra Cluster setup, alarms are stored within OpenSearch. This is likely why alarms were still accessible when the Cassandra DB was stopped.
The elementdata (saved parameters) is stored within Cassandra. At startup, the element will first try to retrieve the saved parameters from the Cassandra DB. This failed, and that is why you see the error on the element. The SLDBConnection logging also confirms that these requests are failing.
The cqlsh query result tells us that the query for that element failed because too many tombstones were encountered. This is most likely because the parameters that are saved in the connector/element are updated too frequently, or because non-volatile tables have too many rows being added and removed (keys are saved by default to keep the alarm root time).
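To illustrate the mechanism, here is a minimal cqlsh sketch (using a hypothetical demo keyspace and table, not the DataMiner schema) of how rows that are added and then removed leave tombstones behind that later reads still have to scan:

CREATE KEYSPACE IF NOT EXISTS tombstone_demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS tombstone_demo.rows (pk int, ck int, v text, PRIMARY KEY (pk, ck));
INSERT INTO tombstone_demo.rows (pk, ck, v) VALUES (1, 1, 'a');
DELETE FROM tombstone_demo.rows WHERE pk = 1 AND ck = 1;  -- the delete is kept as a tombstone until compaction after gc_grace_seconds
TRACING ON;
SELECT * FROM tombstone_demo.rows WHERE pk = 1;           -- the trace reports how many tombstones this read had to scan

When this add/remove pattern repeats many thousands of times within one partition, a read over that partition eventually hits the tombstone failure threshold and is aborted, which is what the NoSpamLogger message above shows.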
This question will tell you what can cause tombstones in Cassandra in general:
Hi Michiel,
Thank you for the answer. It showed us the right way to investigate. We indeed had an improper configuration of tombstones in Cassandra (the tombstone thresholds in cassandra.yaml).
The element in error spontaneously went OK again, and now we know that happened because the tombstone count decreased below the error threshold. Anyway, we will likely avoid such a situation in the future, as the threshold was increased per the suggestion described in the SL documentation.
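For anyone who runs into this later, these are the two cassandra.yaml settings involved (default values shown below; the exact values to use are whatever the SL documentation recommends). The "Scanned over 100001 tombstones ... query aborted" message from our log corresponds to the default failure threshold of 100000:

# cassandra.yaml
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000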
Still, we want to understand how to manage tombstones and will therefore need to get more familiar with this. Thank you again for your help!