We have a 9.6 DMA deployment using Cassandra, where the database on the inactive node is larger than the one on the active node. Logs indicate that the syncing and purging are happening regularly as configured; however, the database on the inactive node continues to be larger.
Is there a mechanism that we can apply to clean up the entire database on the inactive node and let it resync from the active node?
Thanks, Michiel, for correcting me; it should read active and inactive nodes rather than active and inactive databases.
Hi Srikanth,
There could be a couple of reasons why you see this behavior. In most cases it is related to a lack of resources (e.g. disk, CPU, memory, network, ...). Can you check whether one of your nodes does not have enough resources? In the case of a virtualized environment it is best to run benchmarks.
For example, we have already seen a couple of setups where large disks are shared between multiple virtualized servers. When the compaction or repair actions kick in, the servers sharing that disk are not able to successfully deal with the load of DataMiner together with the added load of the compaction/repair.
To know whether the repairs have finished successfully, you can check through DevCenter (C:\Program Files\Cassandra\DevCenter\Run DevCenter.lnk). For example, for the data table:
select * from system_distributed.repair_history where keyspace_name='SLDMADB' and columnfamily_name='data' order by id desc;
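If you are mainly interested in the most recent runs and whether they succeeded, a variation such as the one below can be used (this assumes the standard columns of the system_distributed.repair_history table, such as status, started_at and finished_at):
select id, status, started_at, finished_at from system_distributed.repair_history where keyspace_name='SLDMADB' and columnfamily_name='data' order by id desc limit 20;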
To be able to perform your compaction action, you need at least double the space on your data disk (to build the new SSTables). If you are no longer able to do this (because your tables are getting too large), there are a couple of options:
- Perform compactions (both nodes)/repair (one node)/compactions (both nodes) on a per-table basis: start with the smaller tables that still have enough space to be compacted and work your way up to the bigger tables, so you gradually clear space for the larger tables (see the example commands after this list).
- Truncate the large table(s) => loss of data:
- Timetrace table: alarm history
- Data table: trend history
- Not advised for other tables
- Use the cloning tool to copy over the table without tombstones (Cassandra Clone - DataMiner Dojo). It is best to test this on a staging system before performing these actions on production, to get the steps right.
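As an illustration of the per-table approach, the lines below are a sketch of what this could look like with the standard nodetool and CQL tooling that ships with Cassandra; the SLDMADB keyspace and the data table name are taken from this thread and may differ in your setup:
nodetool compact SLDMADB data (major compaction of a single table, to be run on both nodes)
nodetool repair -full SLDMADB data (full repair of that table, to be run on one node only)
TRUNCATE SLDMADB.data; (CQL, e.g. via DevCenter; removes all trend history => loss of data)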
In addition: in the case of Failover setups with the default Cassandra setup (2 Cassandra nodes) you will want to schedule the compaction action for the main and the Failover at different moments. This way, if the compaction is running on one of the two nodes and leading to some slow behavior, DataMiner will still have the other node. As for the repair, it should only be scheduled on one of the two servers (RF 2 and 2 nodes will lead to all data being repaired with the -full option). Make sure to also schedule this one at a different time than the two compaction actions. You can normally find these actions in the Task Scheduler under DBGatewayMaintenance.
Hope this helps.
Michiel
Thanks for the comprehensive response on this question. We have now verified the resources, the table history, and also the compaction/repair patterns.
As you mentioned at the end, the main and the failover run the compaction, and also the repair, at different times of the week. The DB size difference is showing up because of the different compaction times.
Queries on the tables on the main and the failover consistently return the same results. Also, the logs show that the compaction and repair are successful.
The only question I have is about truncating the large tables you mentioned. My understanding is that the TTL in the settings automatically purges data older than the number of days specified; does it not take care of the older data automatically?
Thanks
Hi Srikanth,
Data that has expired (based on the TTL) will first be converted to a tombstone, and only after gc_grace_seconds will it actually be removed from your disk. So you need to run multiple compaction actions to reclaim disk space.
Note: If you are sure you will only have one Cassandra node (not the case for you), you could set your gc_grace_seconds to 0 for improved performance. If you set your gc_grace_seconds to 0 when you have multiple nodes, you break hinted handoff (see https://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html).
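For reference, here is a sketch of how you could check and, only on a single-node setup, change this setting through DevCenter (the SLDMADB keyspace and data table names are taken from earlier in this thread):
select gc_grace_seconds from system_schema.tables where keyspace_name='SLDMADB' and table_name='data';
alter table SLDMADB.data with gc_grace_seconds = 0; (only safe when you run a single Cassandra node)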
During the compaction, your SSTables/partitions will be checked and potentially recreated. This is why we state that you need at least double the space (worst case). If you don’t have enough space to let your compaction happen, then you can look at the options that were provided (like truncating tables).
In addition, related to trend data: we store the different types of data (with different TTLs, e.g. real-time, 5-minute, hourly, ...) in the same table. This means that Cassandra needs much more effort to get rid of the tombstones throughout the different SSTables with the automatically triggered compaction actions (minor compactions). This is why we trigger major compactions, to ensure your disk space is reclaimed in a timely manner.
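If you want an indication of how much tombstone pressure a table is under before and after such a major compaction, the standard nodetool can help; for example, nodetool tablestats reports, among other things, the SSTable count, the space used and the average tombstones per slice (the table name below is taken from this thread):
nodetool tablestats SLDMADB.data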
For more information on these topics:
https://cassandra.apache.org/doc/3.11.3/operating/compaction.html
FYI: The ‘Cassandra Cluster’ option in DataMiner is now released, which means you have one Cassandra cluster for your complete DataMiner System (DMS) instead of a Cassandra cluster for every DMA/FO pair. With this option we also reviewed some of our data models, like the data table (split up into multiple tables with different TTLs), to get better performance. The ‘Cassandra Cluster’ option will also make it possible to use features like swarming that are on our roadmap. Keep in mind that for ‘Cassandra Cluster’, Failover is currently not supported.
Thanks Michiel, that’s really useful information.
FYI: For Cassandra there is never an ‘active’ or an ‘inactive’ DB. You have a Cassandra cluster that has multiple nodes. Data (rows) are divided between these nodes based on the partition columns (partition key). As we use an RF (replication factor) of 2, we have all data (rows) available on both nodes. Every node can be a coordinator for storing/retrieving data.
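If you want to verify that replication factor on your own system, a query like the one below through DevCenter should show it (the SLDMADB keyspace name is taken from earlier in this thread):
select keyspace_name, replication from system_schema.keyspaces where keyspace_name='SLDMADB';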