This repository has been archived by the owner on Oct 20, 2022. It is now read-only.

[RFE] Make executing operations topology aware #79

Open
jsanda opened this issue Jul 24, 2019 · 4 comments

@jsanda
Contributor

jsanda commented Jul 24, 2019

During a recent demo, there was some discussion of cstar and whether casskop could use it. The purpose of this ticket is threefold:

  • Provide a brief introduction to cstar
  • Explain the topology awareness feature of cstar
  • Explain why cstar is not a good fit for casskop

cstar is a standalone Cassandra orchestration tool that runs from a control machine. It is similar to Ansible in a couple of ways. First, it has a simple architecture with no server. Second, it uses ssh to push commands to the machines on which Cassandra runs. cstar can run arbitrary commands on the host machines; it is not limited to nodetool commands.

The distinguishing feature of cstar is its topology awareness. It discovers the cluster topology and tunes concurrency based on replication settings.

For a detailed overview of cstar, see this TLP blog post.

The best way to explain it is with an example.

$ cstar run --command="nodetool cleanup" --seed-hosts=1.2.3.4 --strategy=topology

The command option is self-explanatory.

The seed-hosts option tells cstar to connect to that node and discover the cluster topology from it. Multiple, comma-delimited values can be specified.

The strategy option tells cstar to run the command on one replica at a time but against multiple nodes in parallel. This shortens the time needed to perform the operation against the cluster as a whole while minimizing the impact on latencies. Let's consider an example to make this clear.

Suppose we have a nine-node cluster spread across three racks with RF = 3 for our application keyspaces. cstar can perform maintenance operations on a node in each rack in parallel and minimize latency impact because it only runs against one replica at a time.
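To make "one replica at a time, but several nodes in parallel" concrete, here is a minimal sketch. It is not cstar's actual algorithm. It assumes a toy ring with one token per node and no rack awareness, where each range is replicated on its owner plus the next RF-1 nodes clockwise. The function names `conflicts_from_ring` and `topology_batches` are made up for this illustration.

```python
from itertools import combinations


def conflicts_from_ring(nodes, rf):
    """Map each node to the nodes it shares at least one replicated range with.

    Assumed toy model: one token per node, replicas of a range are the
    owner plus the next rf-1 nodes clockwise on the ring.
    """
    n = len(nodes)
    conflicts = {node: set() for node in nodes}
    for i in range(n):
        replicas = [nodes[(i + k) % n] for k in range(min(rf, n))]
        for a, b in combinations(replicas, 2):
            conflicts[a].add(b)
            conflicts[b].add(a)
    return conflicts


def topology_batches(nodes, conflicts):
    """Greedily place each node in the first batch that holds none of its co-replicas."""
    batches = []
    for node in nodes:
        for batch in batches:
            if conflicts[node].isdisjoint(batch):
                batch.append(node)
                break
        else:
            batches.append([node])
    return batches


nodes = [f"node{i}" for i in range(9)]
batches = topology_batches(nodes, conflicts_from_ring(nodes, rf=3))
for number, batch in enumerate(batches, 1):
    print(f"batch {number}: {batch}")
```

In this toy model the nine nodes fall into three batches of three. Nodes within a batch can run the command at the same time, and the batches run one after another, so no range ever has more than one replica busy.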

By default cstar executes commands across data centers in parallel. You have a lot of control over how it runs, though: you can instead execute commands serially, one data center and one node at a time.

This is just a glimpse of cstar. Check out the blog post mentioned earlier for a more thorough introduction.

casskop should absolutely incorporate features from cstar, namely the topology-aware execution of commands against nodes. I do not think, however, that it makes sense to incorporate cstar directly into casskop, for example by running it as a sidecar container.

casskop could delegate some operations like cleanup to cstar. It would not however delegate other operations, like restarting a node, to cstar. This is because operations like restarting a node effectively go through the k8s api server. It would still be really nice to have the topology awareness for doing restarts. The topology awareness is the key feature of cstar in my opinion. Once the operator has the topology awareness capability, I do not see a compelling reason to have the operator delegate some tasks to cstar.

In addition to the points about the topology awareness feature, I think we would need to make changes to cstar in order for it to work well with casskop. It is designed to run as a standalone command-line application. If we were to introduce some sort of cstar sidecar container, we would need a long-running process that listens for requests from casskop and then executes cstar. I think it makes a lot more sense to invest that time and effort into casskop itself.

I will add some follow-up comments with more details on the topology awareness. For now, I just want to get the discussion started.

@cscetbon
Contributor

Thanks @jsanda for the introduction to cstar. I looked into it in the past and used it a bit. The good thing about the operator is that it already knows about the cluster, since the CRD was used to create it. And like you said, we would have to update cstar to avoid ssh access and add a REST API, or some other way to communicate with it, in order to trigger operations, check their status and read their results. I agree that it would make more sense to port the logic into the operator.

We'll need to discuss how to implement the multiple/parallel operations and handling of statuses in the cluster.

@jsanda
Contributor Author

jsanda commented Jul 30, 2019

@cscetbon operations are already performed in parallel. For example, I have a 3-node cluster in one rack. I added data to the cluster using tlp-stress. I then added a fourth node. After the fourth node finished bootstrapping, I scheduled a cleanup operation with:

kubectl label --overwrite pods \
           operation-name=cleanup \
           operation-status=ToDo  \
           -l app=cassandracluster

Cleanup ran in parallel across all nodes. In this particular case, with a 4-node cluster using the default of 256 tokens (which is really bad by the way), every node is a replica for every other node. It would therefore be much better to run the operation serially, one node at a time, in order to minimize the impact of things like GC pauses.

Serial execution of operations will be slower, but it is safer and on that basis maybe should be the default mode for executing operations.
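The claim that every node is a replica for every other node can be checked with a small simulation. This is only a sketch under assumptions: tokens are random floats rather than real Cassandra tokens, replica placement is the simple "walk the ring clockwise until RF distinct nodes are found" rule, and `co_replicas` is a name made up for this example.

```python
import random


def co_replicas(nodes, num_tokens, rf, seed=0):
    """Return, for each node, the set of other nodes it shares a range with.

    Assumed model: each node gets num_tokens random positions on the ring,
    and a range is replicated on the first rf distinct nodes found when
    walking clockwise from its token.
    """
    rng = random.Random(seed)
    ring = sorted((rng.random(), node) for node in nodes for _ in range(num_tokens))
    owners = [node for _, node in ring]
    wanted = min(rf, len(nodes))
    peers = {node: set() for node in nodes}
    for start in range(len(owners)):
        replicas = []
        i = start
        while len(replicas) < wanted:
            owner = owners[i % len(owners)]
            if owner not in replicas:
                replicas.append(owner)
            i += 1
        for node in replicas:
            peers[node].update(r for r in replicas if r != node)
    return peers


four_nodes = ["n1", "n2", "n3", "n4"]
vnode_peers = co_replicas(four_nodes, num_tokens=256, rf=3)
print({node: len(p) for node, p in vnode_peers.items()})

nine_nodes = [f"n{i}" for i in range(1, 10)]
single_token_peers = co_replicas(nine_nodes, num_tokens=1, rf=3)
print({node: len(p) for node, p in single_token_peers.items()})
```

With four nodes and 256 tokens each, every node ends up with all three other nodes as co-replicas, so no two nodes can safely run together and topology-aware execution degenerates to serial. With nine single-token nodes, each node only shares ranges with its four nearest ring neighbours, which is what leaves room for parallelism.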

@cscetbon
Contributor

Yeah, you’re right. Pod operations are different, as they’re not seen as cluster-wide. They’re not rack-aware, etc. It’s up to us, though, to set the labels at the right spots with our plugin. There would probably be some benefit in making the plugin more intelligent, so that in some cases it can decide by itself. That’s what I did last time, for instance, by adding checks around triggering new operations, as labels are not protected and can be overwritten at any time.

We don’t really want to limit it, as we could decide via the plugin which nodes to trigger. We could potentially add some preferences for the plugin and provide a switch to exceed the soft limit. The plugin could also check the state of the cluster before agreeing to trigger the operation. I don’t know what it should check, though.

@jsanda
Contributor Author

jsanda commented Jul 30, 2019

it’s up to us though to set the labels at the right spots with our plugin

I think that the logic needs to be in the operator. What if the user is setting the labels directly with kubectl or with some automation like Ansible rather than using the plugin? Or how about a scenario in which another k8s operator is setting the labels?

We don’t really want to limit it as we could decide via the plugin which nodes to trigger

I partially agree, insofar as which nodes to trigger the operation on, and when, should be computed. It still makes sense to provide ways for the client to control the mode of operation. Let's reconsider my 4-node cluster. I previously explained that it probably makes sense to do the cleanup one node at a time to minimize the impact on performance. But suppose we are running out of disk space and running cleanup will alleviate the problem. In that case we want to run it as quickly as possible, on all nodes at once.
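As a rough illustration of keeping the decision in the operator while still letting the client pick a mode, here is a sketch of the admission step. It works on plain dictionaries rather than the real Kubernetes API. The `operation-status=ToDo` label comes from the kubectl example earlier in this thread; the `Ongoing` status value, the `operation-mode` label and the `pods_to_start` function are assumptions made up for this example, not existing casskop behaviour.

```python
from typing import Dict, List

STATUS_LABEL = "operation-status"
MODE_LABEL = "operation-mode"  # hypothetical label, not part of casskop


def pods_to_start(pods: Dict[str, Dict[str, str]], default_mode: str = "serial") -> List[str]:
    """Decide which ToDo pods may start their operation right now.

    pods maps a pod name to its labels. In "serial" mode at most one pod
    runs at a time; "parallel" mode is used only when every pending pod
    asks for it, and then all pending pods start at once.
    """
    pending = sorted(n for n, labels in pods.items() if labels.get(STATUS_LABEL) == "ToDo")
    # "Ongoing" is an assumed status value for a running operation.
    running = [n for n, labels in pods.items() if labels.get(STATUS_LABEL) == "Ongoing"]
    if not pending:
        return []
    if all(pods[n].get(MODE_LABEL, default_mode) == "parallel" for n in pending):
        return pending
    return [] if running else pending[:1]


# Default: cleanup proceeds one node at a time.
cluster = {f"cassandra-{i}": {STATUS_LABEL: "ToDo"} for i in range(4)}
print(pods_to_start(cluster))

# Running out of disk space: the client asks for all nodes at once.
urgent = {name: {**labels, MODE_LABEL: "parallel"} for name, labels in cluster.items()}
print(pods_to_start(urgent))
```

Because the check runs inside the operator, it applies no matter who set the labels: the plugin, kubectl, Ansible or another operator.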
