Our team is responsible for running multiple Cassandra clusters reliably at scale on Azure PaaS.
As a result, I have come to appreciate the technology and resonate with its steep learning curve. The following quiz was made to help ease the onboarding pain for new folks. Most of the questions are not a simple “How does concept A work ?” (that can be binged easily), but “How does A impact B, C and D if E is disabled ?”. Also, this is geared more towards operating the clusters, and less towards data modelling, which already has good online literature.
I hope you find this useful (the target Cassandra version here is 2.1.13). The pre-requisite is basic knowledge of Cassandra.
Many thanks to the Datastax support, Cassandra Users / Developers List, and other online forums for the guidance !
1. I have one data center in my setup, and I have set a replication factor of 5 for my keyspace. If I query data using LOCAL_QUORUM consistency, how many nodes can go down before I start seeing failures ?
Will this change if I change replication factor to 6 ? If yes, to what value ?
2. Suppose I have setup replication=3 for my setup, and I read and write using LOCAL_QUORUM. Initially all nodes had a value of “10” for a piece of data. When I wrote a new value of “20”, I got an error back suggesting “Could not achieve LOCAL_QUORUM”. Now, if I read the value using LOCAL_QUORUM would I get an error, “10”, “20”, or something else ?
What if hinted handoff is enabled ?
3. Suppose I have setup replication=3 for my setup, and I read and write using LOCAL_QUORUM. Let us assume that all 3 nodes holding my data have different values (because repairs did not run / data went missing / whatever other reason). Will I get a LOCAL_QUORUM failure on read ?
4. We specify contact points when establishing a Cassandra connection. If those nodes go down afterwards, does the connection keep working ?
What if the nodes are seed nodes ?
5. How can I tell if my keyspace is evenly distributed across the datacenter ? How can I tell what nodes hold a particular piece of data ?
6. I have one keyspace set to replicate in two data centers. I changed it to not replicate in one data center. Now, I want to remove a node in this data center. Should I use “nodetool removenode” or “nodetool decommission” ? Which one would be faster and why ?
7. Is it necessary to run a “nodetool repair” after doing a “nodetool removenode”, what about “nodetool decommission” ?
8. Suppose a node in a data center is down. Now, I want to remove another node. Will I be able to remove successfully if I use consistency of LOCAL_QUORUM for my read / write queries ?
9. I am running a “nodetool rebuild” to build my node from scratch. Unfortunately, it gives up in the middle. If I run it again, will the node try to fetch the same data again from other nodes ?
10. I am replacing a node in the cluster using -Dcassandra.replace_address=X in cassandra-env.sh. How can I tell the time it will take to completely join the ring ?
11. I am replacing a node in the cluster using -Dcassandra.replace_address=X in cassandra-env.sh. Until the node joins the ring completely, will it serve any read / write requests ?
12. I am replacing a node in the cluster using -Dcassandra.replace_address=X in cassandra-env.sh. Once the node joins the ring, do I need to run a “nodetool repair” ? If yes, why ? If no, why not ?
13. I am replacing a node in the cluster using -Dcassandra.replace_address=X in cassandra-env.sh. Is it correct this entry can only be removed after the node joins the ring (meaning, if the node reboots in the middle and the entry does not exist, it won’t be able to proceed) ?
14. “nodetool” commands run on the node where they are started and don’t have any effect on other nodes. What is the exception to this behavior ?
15. Which of the following commands are safe to run multiple times without unexpected side-effects / performance implications ?
A) nodetool repair
B) nodetool cleanup
C) nodetool snapshot
D) nodetool removenode
16. Suppose we add a new node to an existing data center. Once the node has completely bootstrapped, some data from the old nodes must have moved over to it to balance the ring. Thus, we should see disk space on the old nodes being freed. Is this correct ?
17. Does everything in Cassandra run at same priority ?
18. How can I tell how many threads are allocated for handling write requests, read requests and compactions ? What about repairs and streaming ?
19. What is the best way to reduce the number of SS Tables present on a node ?
A) nodetool compact B) increase concurrent_compactors in yaml C) nodetool cleanup D) nodetool repair
20. Explain the difference between /var/log/cassandra/output.log and /var/log/cassandra/system.log.
21. Suppose OpsCenter is down, and you need to find out approximately how much data Cassandra node is receiving per minute. What can you do ?
22. When OpsCenter shows that a node is down (“grey”), does this mean DSE is down or DSE is up but CQLSH is down ?
23. What is a “schema mismatch” ? How do you detect it, and get out of it ?
24. Suppose a node goes down and moves to another rack. When it comes back up, we update cassandra-rackdc.properties to tell Cassandra about it. However, Cassandra won’t start up if you change racks in a running cluster. Why ?
25. Suppose we have a 2 data-center setup with same number of nodes in each DC and we are using random partitioner with vnodes. Is it safe to assume there must exist a node in both DCs that have same tokens (therefore same data if such a replication is set) ?
26. You’re seeing high number of SS Tables for a table on a node. What could be possible reasons for this ?
A] Low concurrent_compactors in yaml B] The table received a lot of data only on this node C] Poor Compaction properties defined on the table while creation D] All of the above
How can you confirm / deny [B] ?
Can you think of any other reasons ?
27. Suppose we are running DSE Spark and Cassandra on a node (like how we do on all of our nodes). At any point, how many Java processes would be running on the box ? Why ?
28. You want to tell CQLSH to execute a query on a given Cassandra node. How would you do that ? Will setting CONSISTENCY to ONE on the node from where you started CQLSH do the job ?
29. In what situations do the tokens assigned to nodes change ?
A] When a new node joins the ring
B] When a node leaves the ring
C] When “nodetool repair” is run
D] When a node goes down for long period of time
30. Suppose a table uses Size Tiered Compaction Strategy. What is a typical file layout you will see on disk ?
31. When running “nodetool cleanup”, you notice that the disk usage is going high at times. Why would this happen ?
32. How will you determine if a node has “wide partitions” for a table ? How will you fix such a “wide partition” ?
33. Will “Truncate Table” work if any nodes in the cluster are down ?
34. Why is re-creating a table (drop followed by create) problematic ?
35. Which of the following statements are true ?
A] Seed nodes act as coordinators for Cassandra clients
B] Ops Center uses seed nodes to talk to the Cluster
C] Once setup, seed nodes cannot be changed
D] All of the above
36. Is it possible to disable CQLSH Auth on a node temporarily ? If so, how ?
37. If a few nodes are down in the cluster, and you attempt to add a new node you may get an error like below. Can you concretely explain what this is about ?
java.lang.RuntimeException: A node required to move the data consistently is down (/xx.xx.xx.xx). If you wish to move the data from a potentially inconsistent replica, restart the node with -Dcassandra.consistent.rangemovement=false
38. What ports does Cassandra use ?
39. Will Cassandra move data based on dynamic conditions such as load on nodes or nodes going up / down ? (Meaning, if a node goes down for a while, will Cassandra move its data to other nodes to increase availability ?)
40. Can you rename a data center on a cluster ? What about renaming a cluster ?
41. As you know, SSTableLoader and CQLSH import are the two ways to load data into Cassandra. Besides the fact that CQLSH does not work for large amounts of data, is there something else that makes it a non-viable choice for a dataset ?
42. Suppose a requirement came up, where you need to export all the data written from now onward somewhere else. There are no timestamp columns on your tables, so you can’t query data that way and use CQLSH export.
Can you think of any other way to achieve this ?
43. Suppose your data pattern is such that you never update an existing row. In such a situation, does compaction add any value at all (since there are no “duplicates” so to speak) ?
44. You’re seeing a large number of Dropped Mutations on a node in Ops Center. Which of the following statements are true about this key metric ?
A] The number reported is per node.
B] The number will be persisted even if you restart DSE.
C] The number represents the number of reads that failed
D] All of above
45. Which of the following may fix the high number of Dropped Mutations ?
A] Increase write request timeout setting in yaml
B] nodetool setstreamthroughput 0
C] nodetool repair
D] None of above
46. You deleted some row in a table. That deleted data magically started re-appearing after some time in your read queries.
What can be the possible causes for this ?
A] Some nodes went down and did not come up until after the gc_grace_seconds setting on the table
B] You deleted using consistency LOCAL_QUORUM but read the row later with consistency ONE
C] The data did not get deleted in the first place. Cassandra dropped the mutation, but never returned an error back to the client.
D] Any of above
47. Secondary indexes are usually frowned upon. Which of the following are the reasons for it ?
A] They are “local” to the node. So, Cassandra has to talk to all nodes to figure out the column you’re asking for.
B] They need to be rebuilt every time the column value changes, which is costly
C] Compactions don’t run on secondary index tables as often as other tables, thus they may take disk space
D] All of above
48. Suppose you’re using vnodes and random partitioner. If you now read data in a table in a paged fashion, will you get the data in same order every time ? Why or why not ?
49. Which of the following statements are true ?
A] If Cassandra gets a read and write request at the same time, the write request gets higher priority.
B] The only way to generate new SS Tables on disk is through memtable flush.
C] Repairs run continuously in the background
D] All of above
E] None of above
50. If a node goes down, which of the following will try to bring it up ?
A] Datastax Ops Center Agent
B] Linux Kernel
C] Some other nodes through gossip
D] None of above
51. How can you tell how much CPU compaction is consuming ?
52. Does “nodetool removenode” result in any data streaming ?
Answers

Three nodes need to go down before you will start seeing issues : with RF=5, LOCAL_QUORUM needs 3 replica responses, so the cluster tolerates only 2 replicas of a given row being down.
Here is why: https://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html
Same as 1 : with RF=6, LOCAL_QUORUM needs 4 replica responses, so the third node going down still causes failures.
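The quorum arithmetic can be sketched in a few lines of Python (a sketch of the formula, not Cassandra code):

```python
# A sketch of the LOCAL_QUORUM availability math (not Cassandra code).
def quorum(rf):
    # LOCAL_QUORUM needs floor(RF / 2) + 1 replica responses.
    return rf // 2 + 1

def tolerated(rf):
    # Number of replicas that can be down with LOCAL_QUORUM still succeeding.
    return rf - quorum(rf)

print(quorum(5), tolerated(5))  # 3 2 -> the 3rd node down starts failing queries
print(quorum(6), tolerated(6))  # 4 2 -> same tolerance as RF=5
```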
Let’s assume the nodes are called N1, N2 and N3. When the write request failed, let’s assume it wrote data to N1, but could not write to N2 and N3 (thus failing QUORUM). When the read request comes, if it falls on N1, you will get “20” back and N2,N3 will get fixed as part of Read Repair. However, if it falls on N2 or N3 and N1 is down, Cassandra has no way of knowing that N2 and N3 have stale data, thus you will get “10” back.
If Cassandra was able to replicate the data to N2 or N3 before you issued a Read, you will get “20” back.
No — you will get the value with the latest timestamp. A LOCAL_QUORUM read fails only if enough replicas do not respond within the allotted time. Cassandra has no way of determining whether data is “correct”, given its distributed nature; it simply resolves divergent replicas with last-write-wins.
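Last-write-wins can be modelled with a toy quorum read (the function and data here are hypothetical, not Cassandra internals):

```python
# Toy model of a LOCAL_QUORUM read: the coordinator asks a quorum of
# replicas and resolves conflicts by last-write-wins on the timestamp.
# Replicas are (timestamp, value) pairs; this is not Cassandra internals.
def quorum_read(live_replicas, rf):
    quorum = rf // 2 + 1
    responses = live_replicas[:quorum]            # any quorum of live replicas
    return max(responses, key=lambda r: r[0])[1]  # newest timestamp wins

# N1 took the new value "20" (timestamp 2); N2 and N3 still hold "10".
replicas = [(2, "20"), (1, "10"), (1, "10")]
print(quorum_read(replicas, rf=3))       # "20" when N1 is part of the quorum
print(quorum_read(replicas[1:], rf=3))   # "10" when N1 is down
```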
Yes. The client gets updates from the cluster as nodes go up and down. However, if you restart the client and try establishing a new connection using only the nodes that are now down, it will not work.
Seed nodes only play a role for gossip. So, it doesn’t matter if they go down – everything will continue to work as is assuming the cluster is healthy in general.
nodetool status <keyspace> — make sure the ownership percentage on every node is roughly equal.
To find what nodes hold a particular piece of data, use “nodetool getendpoints <keyspace> <table> <key>”. Note this tells you what nodes should hold the data, not whether they actually hold it — the command simply runs the token computation.
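The token computation behind this lookup can be sketched as follows (a toy single-token ring with a made-up 0..255 token space; real clusters use vnodes and a 128-bit hash):

```python
import hashlib

# A toy token ring with one token per node and a 0..255 token space.
RING = sorted([(25, "N1"), (75, "N2"), (175, "N3"), (230, "N4")])

def token(key):
    # Hash the partition key into the token space (MD5, RandomPartitioner-style).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 256

def replicas(key, rf):
    # The first node walking clockwise whose token >= token(key) owns the
    # key; the next rf - 1 nodes on the ring hold the other replicas.
    t = token(key)
    start = next((i for i, (tok, _) in enumerate(RING) if tok >= t), 0)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(replicas("user:42", rf=3))  # pure math on the key: no cluster state involved
```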
Either. Since the node no longer owns the keyspace, there is nothing to stream to other nodes when you run “nodetool decommission”. Normally decommission takes a long time because it streams all the data the node owns; here both commands will take about the same time. So run “nodetool decommission” if the DSE process is up, else “nodetool removenode”.
“nodetool decommission” streams both the tokens and the data owned by the node to other nodes, so the cluster stays consistent. “nodetool removenode” achieves the same end state, but the streaming is done by nodes other than the one being removed.
The answer has nothing to do with your consistency level. It depends on whether the node that is already down needs to take over some of the token ranges owned by the leaving node. Cassandra is smart about assigning token ranges to other nodes, so in a reasonably big cluster you should not see any problems.
Run “nodetool status” and check the size on other nodes. The node that is being replaced needs to get to that size. By running “nodetool status” a few times, you will get a feel for the rate at which the node is pulling data, thus helping you compute the total time.
No. The node simply replaced an existing node, so the token arrangement hasn’t changed. Running a repair is necessary only when the data corresponding to tokens needs to move to appropriate nodes.
No. Once the node figures out it needs to replace itself, this is written to one of the system tables. Thus, you can remove the entry as soon as you see “JOINING THE RING” in system.log
“nodetool repair”. Read more here: https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsRepair.html
Note any commands that show information about other nodes (e.g. status, ring, gossipinfo etc) are all really just showing cached information on the node where you’re running nodetool.
“nodetool removenode” is the only one that is safe to re-run; all the others have varying levels of impact.
“repair”, if run multiple times, causes a lot of redundant work (if not actual data movement, at least repeated merkle tree computation).
“cleanup” is fine to run multiple times, but it also causes unnecessary work on the node.
“snapshot” creates a new snapshot on every run; the hardlinks are cheap at first, but hold on to disk space as the underlying SS Tables get compacted away.
No. Unless a “nodetool cleanup” is run, the data does not get deleted even though the node doesn’t own it. This is by design (you can feel free to call it questionable design if you like).
Everything runs at the NORM_PRIORITY except compactions and repairs (they run at LOW_PRIORITY)
For reads and writes you can check the “concurrent_reads” and “concurrent_writes” settings in cassandra.yaml, and for compactions “concurrent_compactors”. For repairs and streaming you generally can’t tell — those thread counts are hardcoded for the various stages.
B is the safest way to do so. A is strongly advised against, as it will create one giant SS Table that then causes problems for future compactions. C and D will most likely not help unless the node is indeed imbalanced.
When DSE starts, it spews output to output.log for the first minute or so. Then everything goes to system.log (which also includes what went to output.log). For practical purposes, output.log matters when the process can’t start after a “sudo service dse start”, as it will tell you why.
Look at the size of files in /mnt/cassandra/commitlog. Here is a good refresher on what commitlog is: http://stackoverflow.com/questions/34592948/what-is-the-purpose-of-cassandras-commit-log
It means DSE is up but CQLSH is down. That makes it a great way of telling whether the node is actually functional.
This happens when the schema changes don’t propagate to all nodes. Every node generates a hash based on the schema it knows, and if all nodes agree this must match. When nodes are down and schema changes are made, they may not make it to all nodes. Similarly, if you re-create tables it will lead to unexpected mismatches.
This can be detected using “nodetool describecluster”, and fixed using “nodetool resetlocalschema” or restarting the node or a combination of both.
Cassandra uses rack information to distribute replicas across racks. Changing a node’s rack messes up this placement logic, so Cassandra safeguards against it by refusing to start.
You can tell Cassandra to ignore the rack change using -Dcassandra.ignore_rack. However, you must run a “nodetool repair” after this so that Cassandra correctly places replicas.
It is a good mental exercise to think why a repair is needed here.
No. As unintuitive as it sounds, it is not implemented that way — with vnodes each node picks its tokens randomly, so two nodes sharing the same tokens is essentially impossible.
Any / all of the above. You will need to eliminate one thing at a time.
You can add a graph in Ops Center that shows “Write Requests” and compare the numbers with other nodes.
No TTL on the table. If using DTCS, extremely low max_sstable_age.
1 (Cassandra started as sudo service dse start) + Number of spark executors + Datastax Agent + Datastax Agent Monitor
There is no way to achieve this reliably. While you can experiment with CONSISTENCY ONE, Cassandra may still pick a different coordinator every time.
A and B. Repair only makes sure that the nodes expected to hold data for a token range actually hold it; it never changes the token assignment.
A couple of really big files, a few medium-sized ones, and three or four really small ones.
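Why that layout emerges can be seen from a toy simulation of size-tiered compaction (assuming equal-sized flushes and the default threshold of 4 similar-sized tables):

```python
# Toy size-tiered compaction: every memtable flush writes a size-1 sstable,
# and whenever 4 tables of the same size exist they are merged into one
# table 4x the size. (Assumed: equal flushes, default threshold of 4.)
def simulate_flushes(n_flushes, threshold=4):
    sstables = []
    for _ in range(n_flushes):
        sstables.append(1)                       # one memtable flush
        merged = True
        while merged:
            merged = False
            for size in set(sstables):
                if sstables.count(size) >= threshold:
                    for _ in range(threshold):   # compact the full tier
                        sstables.remove(size)
                    sstables.append(size * threshold)
                    merged = True
    return sorted(sstables, reverse=True)

print(simulate_flushes(55))  # [16, 16, 16, 4, 1, 1, 1]: a few big, some medium, some small
```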
As part of cleanup, Cassandra reads the existing SS Tables and writes new ones with the no-longer-owned data removed. The old and new files coexist until the rewrite completes, so you may notice a temporary bump in disk usage.
“nodetool cfstats” will show you “Compacted partition maximum bytes”. If this number is greater than ~100 MB, the table would be considered to have a “wide partition”.
Fixing it requires changing the schema of your table: add more columns to the partition key so that the data spreads across more partitions. There is no way to fix it in place once the schema exists — you have to create a new table and migrate the data.
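A common shape of such a schema fix is time-bucketing: fold a coarse time bucket into the partition key so one logical entity spans many partitions. A hypothetical sketch (schema and names are made up):

```python
import datetime

# Hypothetical fix for a wide partition: instead of keying all events by
# sensor_id alone, fold a day bucket into the partition key so one sensor's
# history is spread across many partitions. (Schema and names are made up.)
def partition_key(sensor_id, event_time):
    day_bucket = event_time.strftime("%Y-%m-%d")
    return (sensor_id, day_bucket)

k1 = partition_key("sensor-7", datetime.datetime(2016, 3, 1, 10, 0))
k2 = partition_key("sensor-7", datetime.datetime(2016, 3, 2, 10, 0))
print(k1)        # ('sensor-7', '2016-03-01')
print(k1 == k2)  # False: different days land in different partitions
```

Note the new key only applies to newly written data; existing rows still have to be migrated into the new table.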
No. It requires CONSISTENCY ALL.
This is really a bug in Cassandra. If a node is down or misses the schema update when DROP TABLE is run, it will continue to use the old ID for the table after the new table is created. When it then receives reads / writes for the table, the ID mismatch causes “couldn’t find cfid” exceptions in the system logs and eventually system instability.
The correct answer is “None of the above”. The only special role seed nodes play is in gossip: all nodes send them gossip updates more frequently.
Yes. Adjust the auth-related settings in cassandra.yaml (e.g. set authenticator to AllowAllAuthenticator) and restart the node.
When you add a node, it takes over part of the range of an existing node, and thus needs to stream data from that node to maintain consistency. If that existing node is unavailable, the new node may fetch the data from a different replica, which may be missing some of the data held by the node whose range is being taken over — and that can break consistency.
For example, imagine a ring with nodes A, B and C, RF=3. The row X=1 maps to node A and is replicated in nodes B and C, so the initial arrangement will be:
A(X=1), B(X=1) and C(X=1)
Now node B goes down and you write X=2, which reaches only A and C (B is down and hinted handoff is disabled). The write succeeds at QUORUM. The new arrangement becomes:
A(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will fetch the correct value of X=2)
Now imagine you add a new node D between A and B. If D streams data from A, the new replica group will become:
A, D(X=2), B(X=1), C(X=2) (if any of the nodes is down, any read at QUORUM will fetch the correct value of X=2)
But if A is down when you bootstrap D and you have -Dcassandra.consistent.rangemovement=false, D may stream data from B, so the new replica group will be:
A, D(X=1), B(X=1), C(X=2)
Now, if C becomes down, reads at QUORUM will succeed but return the stale value of X=1, so consistency is broken.
See CASSANDRA-2434 for more background.
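The scenario above can be condensed into a toy Python model (timestamps and values are illustrative; this is not Cassandra's streaming logic):

```python
# Toy model of the CASSANDRA-2434 scenario. Replicas hold (timestamp, value)
# pairs and QUORUM reads (2 of 3) resolve conflicts by last-write-wins.
def quorum_read(live):
    two = sorted(live.values(), reverse=True)[:2]   # any 2 live replicas
    return max(two)[1]                              # newest timestamp wins

# After writing X=2 at QUORUM while B was down (no hinted handoff):
ring = {"A": (2, 2), "B": (1, 1), "C": (2, 2)}

# Consistent range movement: D must stream from A, the node it relieves.
good = {"D": ring["A"], "B": ring["B"], "C": ring["C"]}
# With -Dcassandra.consistent.rangemovement=false and A down, D may
# stream from another replica (B), inheriting the stale value.
bad = {"D": ring["B"], "B": ring["B"], "C": ring["C"]}

# Now C goes down, so the quorum must be {D, B}:
print(quorum_read({n: v for n, v in good.items() if n != "C"}))  # 2
print(quorum_read({n: v for n, v in bad.items() if n != "C"}))   # 1 -- stale read
```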
By default, Cassandra uses 7000 for cluster communication (7001 if SSL is enabled), 9160 for Thrift clients, 9042 for native protocol clients, and 7199 for JMX. The internode communication, Thrift, and native protocol ports are configurable in cassandra.yaml. The JMX port is configurable in cassandra-env.sh (through JVM options). All ports are TCP.
You cannot rename a data center. You can however rename the cluster.
CQLSH export / import does not support preserving TTLs.
You can take a snapshot of the data using “nodetool snapshot” and then enable “incremental_backups” in the yaml. This makes sure every newly flushed SS Table is hardlinked into a backups folder (alongside snapshots) that you can then easily copy and load somewhere else. It is a hack, but a good one !
You would still need some process to expire data based on TTL. Since compaction is that process today, you would need compactions.
A] is true.
A] assuming you don’t have any other cluster issue
A] is the most likely cause for this. Here is a good reading: https://wiki.apache.org/cassandra/DistributedDeletes
B] will also cause this, but the hope is you don’t do this normally
A] and B]
Yes. Data is returned in token order (the partitioner’s hash of the partition key), then sorted by the clustering key within each partition. The order looks random, but it is deterministic and therefore the same on every read.
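A quick way to convince yourself the order is deterministic (MD5 here stands in for the partitioner's hash; the keys are made up):

```python
import hashlib

# Paged reads return partitions in token order. A token is just a hash of
# the partition key, so the order looks random but never changes.
# (MD5 stands in for the partitioner's hash; the keys are made up.)
def token(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = ["alice", "bob", "carol", "dave"]
first_scan = sorted(keys, key=token)
second_scan = sorted(keys, key=token)
print(first_scan == second_scan)  # True: the same order on every read
```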
Correct answer is E]
Cassandra does not have a notion of request priority, so read and write requests are handled the same way (A is false).
Memtable flush is not the only source of new SS Tables — compaction, streaming (bootstrap, rebuild, repair) and sstableloader also write them (B is false).
Repairs don’t run by default; you have to schedule them yourself (C is false).
There is no built-in mechanism to re-start a node that went down.
Run “sudo iostat” and look at the %nice column
Interestingly, yes. All the nodes that are now responsible for the removed node’s tokens fetch the necessary data from other nodes in the cluster. This can lead to an increase in CPU usage and SS Table counts on those nodes.