We have a 6-node Cassandra cluster with a replication factor of 3 and a consistency level of 2. We need to bring down 2 nodes for maintenance, and they cannot be taken down separately. The maximum downtime will likely be 15 minutes, so as I understand it, any data that cannot be written will be cached on the DataMiner nodes. Are there any risks here, or can this be done safely?
Thanks.
Hi Petro, I'm glad you found your way to the Dojo questions! 🙂
I'll post roughly the same answer here as I gave via mail, so it can help others.
The DM Docs are a good place to start:
- Data replication (RF) defines how many copies of the data need to be present in the Cassandra cluster.
- Consistency level (C) defines how many replicas need to respond before the response is seen as acceptable.
The DM Docs also provide a nice and fitting example:
- If the replication factor is 3 and the consistency level is 2, Cassandra queries will require an answer from 2 out of 3 replicas to be considered successful. This means that if one node is down, queries will still succeed, but if another node is down, it is possible that queries will no longer succeed.
A more concrete example for your situation:
2 of your nodes will be down for a short time. If you request data (e.g. trend data or element data) that happens to be stored on these 2 nodes, the read request will fail. (Even if there is a 3rd copy on another node because RF = 3, the response won't be accepted because C = 2.)
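To get a feel for how much data is affected, here is a rough Python sketch. It assumes a simplified ring where each node owns one range and the replicas sit on that node plus the next RF - 1 nodes; a real cluster with vnodes or rack-aware placement will distribute replicas differently, so treat the numbers as illustrative only. The function names are made up for this example.

```python
def replica_sets(num_nodes, rf):
    """Replica set per range on a simplified ring: node i plus the next rf - 1 nodes."""
    return [
        frozenset((i + k) % num_nodes for k in range(rf))
        for i in range(num_nodes)
    ]


def failing_ranges(num_nodes, rf, cl, down):
    """Return the replica sets that have fewer than `cl` live replicas."""
    down = set(down)
    return [rs for rs in replica_sets(num_nodes, rf) if len(rs - down) < cl]


if __name__ == "__main__":
    # 6 nodes, RF = 3, C = 2, with two neighbouring nodes (0 and 1) down.
    affected = failing_ranges(6, 3, 2, {0, 1})
    print(f"{len(affected)} of 6 ranges cannot reach C = 2")

    # Same outage with C = 1: every range still has a live replica.
    print(len(failing_ranges(6, 3, 1, {0, 1})), "ranges fail with C = 1")
```

Under these assumptions, only the ranges that have both of their other replicas on the two stopped nodes are affected; if the two nodes are not neighbours on the ring, fewer (or no) ranges lose their second replica.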
Writing shouldn't be an issue, as there are still 4 other nodes available.
However, if there is an issue with pushing the data to the DB, you are correct: the data will be kept locally on disk until the Cassandra cluster is available again.
As mentioned via mail, all went well and no anomalies were seen during your maintenance window.
In addition, you could temporarily set the consistency level to one to prevent reads/writes from failing. Keep in mind that to ensure strong consistency (always reading the latest data), you should have R + W > RF. Since the maintenance only takes about 15 minutes, changing the CL (which requires a DMA restart) might mean longer downtime than simply getting your Cassandra nodes back up.
More info:
https://docs.dataminer.services/user-guide/Advanced_Functionality/Databases/Supported_System_Data_Storage_Architectures/Migrating_the_general_database_to_a_DMS_Cassandra_cluster.html?tabs=tabid-1#customizing-the-consistency-level-of-the-cassandra-cluster
R => Read Consistency
W => Write Consistency
RF => Replication Factor
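
As a quick illustration of the R + W > RF rule, here is a small Python check. The function name is just for this example and is not part of DataMiner or Cassandra.

```python
def is_strongly_consistent(read_cl, write_cl, rf):
    """True when every read overlaps at least one replica that acknowledged the latest write."""
    return read_cl + write_cl > rf


if __name__ == "__main__":
    # Current setup: R = 2, W = 2, RF = 3 -> 4 > 3, reads see the latest write.
    print(is_strongly_consistent(2, 2, 3))
    # Temporarily lowering both to 1: 2 > 3 is false, so stale reads are possible.
    print(is_strongly_consistent(1, 1, 3))
```

So dropping the consistency level to one keeps requests succeeding during the maintenance, at the cost of possibly reading older data until the replicas catch up.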