How necessary is Cassandra compaction, given that it requires a lot of resources and time?
During the compaction process, which usually takes around 10 to 12 hours for a 150GB database, the system gets slow.
What could happen in the near future if the Cassandra compaction is disabled?
Short answer: without compaction, Cassandra's read performance will degrade, because each read has to check more and more SSTables, and you'll eventually run out of hard drive space.
Long answer: When a write comes in, it’s written to the commit log, and to the active Memtable for the table. Memtables are later flushed to disk, and that file is called an SSTable. SSTables are immutable - the data contained is never updated or deleted in place. Instead of updating the data in place, we write our changes to a new Memtable, and then a new SSTable. The compaction process merges the SSTables together. If there was an update or a delete, the newest value for the field is kept by compaction and is written to the new SSTable, and the older versions are discarded.
Also, entries for which the TTL has expired are only deleted on compaction.
So without compaction, your disk will fill up with immutable SSTable files containing data that may already have been overwritten or should be deleted.
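To make the merge step concrete, here is a minimal sketch in Python. It is a toy model, not Cassandra's actual implementation: the `Cell` type, the `compact` function and the dict-based "SSTables" are all made up for illustration. It keeps the newest version of each key and drops deleted and TTL-expired entries. One deliberate simplification: real Cassandra keeps delete markers (tombstones) around for a grace period (`gc_grace_seconds`) before purging them, and this sketch drops them immediately.

```python
from typing import Dict, List, NamedTuple, Optional


class Cell(NamedTuple):
    """One stored value. Hypothetical type, for illustration only."""
    value: Optional[str]        # None marks a delete (a tombstone)
    written_at: int             # write timestamp in seconds
    ttl: Optional[int] = None   # seconds to live; None means no expiry


def compact(sstables: List[Dict[str, Cell]], now: int) -> Dict[str, Cell]:
    """Merge several immutable 'SSTables' into one new table."""
    # Pass 1: for every key, keep only the most recently written cell.
    newest: Dict[str, Cell] = {}
    for table in sstables:
        for key, cell in table.items():
            if key not in newest or cell.written_at > newest[key].written_at:
                newest[key] = cell

    # Pass 2: drop deletes and expired TTL entries instead of copying them.
    merged: Dict[str, Cell] = {}
    for key, cell in newest.items():
        deleted = cell.value is None
        expired = cell.ttl is not None and cell.written_at + cell.ttl <= now
        if not deleted and not expired:
            merged[key] = cell
    return merged


# Two flushes: the second one updates "a" and deletes "b".
old = {"a": Cell("1", 100), "b": Cell("2", 100), "c": Cell("3", 100, ttl=50)}
new = {"a": Cell("9", 200), "b": Cell(None, 200)}

result = compact([old, new], now=300)
print(result)
```

Before the merge, the two tables hold five cells on disk. Afterwards only the updated "a" survives: "b" was deleted and "c" has outlived its TTL. Without compaction all five cells would stay on disk indefinitely.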
Note that efficient compaction requires sufficient hard drive space. It is recommended to have at least half the size of the biggest Cassandra table available as free hard drive space to allow the compaction to be done in an efficient way.
Also note that compaction can be rescheduled as a Windows scheduled job, so perhaps your customer can schedule the compaction to run after peak hours, rather than at 1am as is standard.
From experience:
The amount of free disk space required for compaction is _the same_ as the size of the biggest table, not half the size. If less is available, “nodetool.bat compact” will terminate with an error. So it is very important to run compaction before the database size reaches one half of the disk size (assuming you have a dedicated disk for Cassandra).
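If it helps, the free-space rule can be checked before compaction is started. The following is a rough sketch in Python, not an official tool. It assumes the data directory is laid out as `<data_dir>/<keyspace>/<table>/...`, and the function names and the `headroom` parameter are made up for illustration. Anything stored under a table directory (snapshots, for example) is counted too, so treat the result as an estimate.

```python
import os
import shutil


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_table_bytes(data_dir: str) -> int:
    """Size of the biggest table directory.

    Assumes the layout <data_dir>/<keyspace>/<table>/ (an assumption,
    adjust to your installation).
    """
    largest = 0
    for keyspace in os.scandir(data_dir):
        if not keyspace.is_dir():
            continue
        for table in os.scandir(keyspace.path):
            if table.is_dir():
                largest = max(largest, dir_size(table.path))
    return largest


def has_headroom(free_bytes: int, largest_table: int, headroom: float = 1.0) -> bool:
    """True if free space is at least headroom times the biggest table.

    headroom=1.0 follows the 'same size as the biggest table' rule;
    headroom=0.5 would follow the 'half the size' recommendation.
    """
    return free_bytes >= headroom * largest_table


def safe_to_compact(data_dir: str, headroom: float = 1.0) -> bool:
    """Compare free space on the data disk with the biggest table."""
    free_bytes = shutil.disk_usage(data_dir).free
    return has_headroom(free_bytes, largest_table_bytes(data_dir), headroom)


# Example call (the path is a placeholder, not a known default):
# print(safe_to_compact(r"D:\cassandra\data"))
```

A scheduled job could call `safe_to_compact` first and raise an alert instead of starting a compaction that is likely to fail.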