Hello Dojo,
We had an issue where a failover agent suffered a complete hardware failure and needed to be reinstalled on a new machine. The backup agent was reinstalled on new hardware, but there was no backup to restore, so we did a fresh 10.2 installation and upgraded to 10.2 CU11. After joining this new agent to failover, we noticed a schema mismatch; it was reported that elements were no longer working because the previous schema had been lost and a new one created.
I have attempted to resolve this mismatch with a nodetool drain and rolling restarts on both nodes. When that did not work, we broke failover from the primary agent's failover status window in Cube, reinstalled the backup agent again, set the primary node back to localhost, and from the primary node executed a nodetool removenode of the backup node, as it still appeared in nodetool status after breaking failover.
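For reference, the commands we ran were roughly as follows (the host ID is a placeholder for the backup node's ID as shown by nodetool status):

    nodetool drain                          # flush memtables on each node before the rolling restart
    nodetool status                         # from the primary, check cluster membership after breaking failover
    nodetool removenode <backup-host-id>    # remove the lingering backup node by its host ID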
After rejoining the backup agent in failover, we have the same issue again. It seems the schema conflict resides somewhere on the primary node, but I am unsure where that schema mismatch could be stored on the primary node, how to resolve it, or where to go from here.
Thank you in advance for any insight and info!
Hey Ryan. You mentioned that the backup agent was reinstalled from scratch with the 10.2 installer. Was the main agent installed with the same installer? Can you check that the Cassandra versions are the same? A version mismatch could prevent the Cassandra nodes from agreeing on the same schema.
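For example, you can compare the versions and check whether the nodes agree on the schema with the following (run on both agents; output will differ per setup):

    nodetool version          # reports the Cassandra version of the local node
    nodetool describecluster  # lists the schema version(s); more than one entry means the nodes disagree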
Hi Ryan,
When DataMiner starts, it will create the tables if they don't exist yet. When a new Cassandra node is added to an existing cluster, you should point it to the seeds of the existing cluster to ensure the schema is known to the new node. When the node itself is a seed, it can start without contacting other seeds. Most likely, the failover agent added a blank Cassandra node that was configured as its own seed. When DataMiner communicates with it right after startup, the node will not yet have the schema, so DataMiner will create new tables with the same names, overwriting the existing ones.
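As an illustration only (the IP is a placeholder), the cassandra.yaml of the new node would list the existing node as its seed, without listing itself:

    # cassandra.yaml on the new (blank) node
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # existing node that already holds the schema; do not add the new node's own IP
              - seeds: "10.0.0.1"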
If you end up in a situation where you have two folders for your tables with different IDs and you want to restore the data, you first need to figure out which one is now being used by the cluster (you can see the ID used in the tables table of the system_schema keyspace). If you still have one node using the correct tables, you could stop the Cassandra nodes that are using the incorrect ID, clean them (SSTables, commit logs, hint files, and caches), set their seeds to the nodes with the correct schema without listing themselves as a seed, and then add them back again.
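To see which table ID the cluster is currently using, a query along these lines can help (the keyspace name is an assumption; adjust it to your setup, e.g. SLDMADB). The id column should match the suffix of the table folder names on disk:

    cqlsh -e "SELECT keyspace_name, table_name, id FROM system_schema.tables WHERE keyspace_name = 'sldmadb';"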
If the new tables are already accepted by all nodes, then you can restore the old data by copying the SSTables from the old folder to the new one and restarting Cassandra.
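A rough sketch of that copy on a Windows DMA, with placeholder paths and a placeholder service name (verify these against your installation first):

    net stop cassandra                                                            # stop the Cassandra service
    xcopy "<data_dir>\sldmadb\<table>-<old-id>" "<data_dir>\sldmadb\<table>-<new-id>" /E /I
    net start cassandra                                                           # start Cassandra so it picks up the copied SSTables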
Hope this helps.
Running a nodetool status on the different nodes during the process can also help you to identify what is going on.
Regarding “When a new Cassandra node is being added to an existing cluster, you should point it to the seeds of the existing cluster to ensure the schema is known to the new node.”: when you configure a failover agent from Cube, it automatically adds the IPs of both failover agents as seeds in the cassandra.yaml files of the primary and the backup. Is this a potential software issue, or is it handled differently when the node first joins, with the software only appending both IPs to the yaml after the nodes are joined?
It depends a bit on how the software handles adding a failover agent (it could, for example, wait until the node is fully added to the Cassandra cluster before connecting to it). Ideally, you do not configure a new node as a seed, precisely to prevent this situation, so you could see this as a bug.
Hey Ryan,
As a last resort, I would wipe both Cassandra nodes, start fresh, and restore your SLDMADB table files on the primary. After that you can attempt to configure failover again.
- Ensure failover is completely broken
- Stop all DMAs and Cassandra nodes
- Uninstall the Cassandra services (sc delete) and remove all files/folders (see the command sketch after this list)
- Reinstall Cassandra on both DMAs
- Connect DMAs to their respective node
- Start DMAs and ensure they function as standalone DMAs
- Stop DMA and Cassandra on primary
- Restore SLDMADB table files on primary
- Start primary and ensure functionality
- Reconfigure failover
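For the uninstall and cleanup steps, a rough sketch of the commands to run on each DMA (the service name and paths are assumptions; double-check them before deleting anything):

    net stop cassandra                          # stop the Cassandra service
    sc delete cassandra                         # remove the Windows service
    rmdir /S /Q "<cassandra_dir>\data"          # delete the SSTables
    rmdir /S /Q "<cassandra_dir>\commitlog"     # delete the commit logs
    rmdir /S /Q "<cassandra_dir>\saved_caches"  # delete the saved caches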
We weren’t sure if there was an easier way, such as simply deleting or adjusting a schema table, instead of a full deletion, reinstall, and restore of the data tables (a huge effort).
This is correct: the primary was running Cassandra 3.7, while the new ISO installs 3.11. After downgrading the secondary node to 3.7, the schema conflict was gone and everything is operational again. Thank you!