Remove ingress and node-services during reconcile #674
Conversation
Any takers for a review? It works when testing locally :)
This looks good to me.
The tests already pass without the PR (we delete all services between tests), so I would hope they would pass with the PR. The way to actually test this would be to make a cloud with node services, switch it to use just the headless service, and make sure that the node services no longer exist. Not sure that's required, but it likely would be good to test. I think we saw issues in a previous release when trying to delete resources automatically.
I might take a stab at such a test one day. Extending one of the existing tests: after verifying that the node-services are there, call an "update" on the CRD, then wait for the reconcile to happen, and check that they are gone. Is there some test method that waits for reconcile, or do we need to do polling with a timeout?
If you look at the existing unit tests, that's exactly how all of the
Btw this is going to be impacted by the work in #692. The two approaches should not necessarily conflict with each other, though. We might not want to prune old node services too aggressively, in case the user is using autoscaling. At that point it's likely better to keep the old ones around (which will result in fewer rolling restarts, since the hostAliases get updated whenever new node services are created).
# Conflicts:
#	controllers/solrcloud_controller.go
#	helm/solr-operator/Chart.yaml
Trying to follow up on some of these older PRs. AFAICT from the discussion above, the main thing holding up this PR is a test like the one Houston described in his earlier comment? Is that right @janhoy ?
Suppose so. I think I got stuck on the test part. Anyone should feel free to grab this; I won't be picking it up any time soon.
Alright, rather than introducing a net-new test, I was able to add some assertions into one of the existing tests. In the meantime, we should decide how we want this behavior to interact with the work in #692. One idea that Houston suggested offline is that rather than deleting orphaned services right away, we could instead give them an annotation indicating that they should be deleted after X time (say, 30 minutes?). What do we all think about that?
Ahhh @gerlowskija, I think we were misunderstanding each other, and also misunderstanding with regard to the test. This only deletes services/ingresses if the entire externalAddressability feature is changed, not when the solrcloud is scaled up/down. This testing can probably be done in the unit tests, and we have no concerns around autoscaling.
So shall we try to conclude on this before 0.9? Do I understand correctly that the feature itself looks good, with no concerns regarding autoscaling here, and that the test in this PR is testing the wrong thing (scale-down) and should instead be a unit test disabling ingress?
Yes, we should definitely include this. I'll try to get it merged early this week.
Houston and I spent some time on this today. I've removed the e2e test code and replaced it with a comparable (and now working!) unit-style test covering the ingress-cleanup. When we got to thinking through the service-cleanup a bit more though, Houston raised some concerns about this possibly deleting services that might still be used by one or more pods. One approach for handling this would be to give each Solr pod an annotation recording the service it uses, and then defer cleanup of each service until it is no longer used by any pods. This approach is safer but IMO adds enough complexity that we probably shouldn't try to squeeze it into 0.9.0.
Let's not overthink this either. This cleanup is in response to an explicit reconfiguration of a cluster. Can we simply assume that this also implies there is no longer a need for the ingress or these services? Can you give a concrete example of a legitimate use of these services that warrants this added complexity? In my head, we WANT such requests to fail after this reconfiguration.
So the biggest concern here is the use of node-services instead of the headless service. If we auto-remove the node-services immediately when the user switches away from the ingress option in their SolrCloud resource, the pods will not have had a chance to update themselves to say that they have a different hostname. Therefore inter-node traffic will be halted, which will probably lead to a never-ending rolling restart (because nodes won't become healthy again). If the idea is that all the data should go away, then the user should delete and recreate. I'm going to push a commit that does this a bit more cautiously, only deleting the services when we know they aren't in use. Ingresses I'm ok with just deleting.
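The cautious approach described above boils down to: only treat a node service as deletable once no pod still addresses itself through it. A minimal sketch of that decision, assuming the "in use" set would be derived from per-pod state (e.g. an annotation or the pod's configured hostname) rather than the hypothetical plain map used here:

```go
package main

import "fmt"

// orphanedServices returns the per-node services that no pod reports using.
// Deleting only these, rather than all node services at once, avoids cutting
// off inter-node traffic for pods that have not yet rolled over to the
// headless service (which would stall the rolling restart indefinitely).
func orphanedServices(existing []string, inUse map[string]bool) []string {
	var orphans []string
	for _, svc := range existing {
		if !inUse[svc] {
			orphans = append(orphans, svc)
		}
	}
	return orphans
}

func main() {
	// Illustrative names only: three node services, two still referenced
	// by pods that have not restarted onto the headless service yet.
	existing := []string{
		"example-solrcloud-0",
		"example-solrcloud-1",
		"example-solrcloud-2",
	}
	inUse := map[string]bool{
		"example-solrcloud-0": true,
		"example-solrcloud-1": true,
	}

	fmt.Println(orphanedServices(existing, inUse)) // [example-solrcloud-2]
}
```

Each subsequent reconcile would shrink the in-use set as pods restart with the new hostname, so the orphan list grows until all node services are gone.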
I think this should be good to go.
Ok, I was not aware that inter-pod communication preferred the node-services and needed a rolling restart to start using the headless service instead. Thanks for the diligent review and fix. I have nothing to add; feel free to merge.
Fixes #673
DRAFT, code largely generated by GitHub Copilot, just throwing the idea out there.