Remove ingress and node-services during reconcile #674
Conversation
Any takers for a review? It works when testing locally :)
This looks good to me.
The tests already pass without the PR (we delete all services between tests), so I would hope they would pass with the PR. The way to actually test this would be to make a cloud with node services, switch it to use just the headless service, and make sure that the node services no longer exist. Not sure that's required, but it likely would be good to test. I think we saw issues in a previous release when trying to delete resources automatically.
I might take a stab at such a test one day. Extending one of the existing tests: after verifying that the node-services are there, call an "update" on the CRD, then wait for the reconcile to happen, and check that they are gone. Is there some test method that waits for reconcile, or do we need to do polling with a timeout?
If you look at the existing unit tests, that's exactly how all of the
Btw this is going to be impacted by the work in #692. The two approaches should not necessarily conflict with each other, though. We might not want to prune old node services too aggressively, in case the user is using autoscaling. At that point it's likely better to keep the old ones around (which will result in fewer rolling restarts, since the hostAliases get updated whenever new node services are created).
# Conflicts:
#	controllers/solrcloud_controller.go
#	helm/solr-operator/Chart.yaml
Trying to follow up on some of these older PRs. AFAICT from the discussion above, the main thing holding up this PR is a test like the one Houston described in his earlier comment? Is that right @janhoy ?
Suppose so. I think I got stuck on the test part. Anyone should feel free to grab this; I won't be picking it up any time soon.
Alright, rather than introducing a net-new test, I was able to add some assertions into one of the existing tests. In the meantime, we should decide how we want this behavior to interact with the work in #692. One idea that Houston suggested offline is that rather than deleting orphaned services right away, we could instead give them an annotation indicating that they should be deleted after X time (say, 30 minutes?). What do we all think about that?
Ahhh @gerlowskija, I think we were misunderstanding each other, and also misunderstanding with regard to the test. This only deletes services/ingresses if the entire externalAddressability feature is changed, not when the solrcloud is scaled up/down. This testing can probably be done in the unit tests, and we have no concerns around autoscaling.
So shall we try to conclude on this before 0.9? Do I understand correctly that the feature itself looks good, with no concerns regarding autoscaling here, and that the test in this PR is testing the wrong thing (scale-down) and should instead be a unit test disabling ingress?
Yes, we should definitely include this. I'll try to get it merged early this week.
Houston and I spent some time on this today. I've removed the e2e test code and replaced it with a comparable (and now working!) unit-style test covering the ingress-cleanup. When we got to thinking through the service-cleanup a bit more though, Houston raised some concerns about this possibly deleting services that might still be used by one or more pods. One approach for handling this would be to give each Solr pod an annotation recording the service it uses, and then defer cleanup of each service until it is no longer used by any pods. This approach is safer but IMO adds enough complexity that we probably shouldn't try to squeeze it into 0.9.0.
Let's not overthink this either. This cleanup is in response to an explicit reconfiguration of a cluster. Can we simply assume that this also implies there is no longer a need for the ingress or these services? Can you give a concrete example of a legitimate use of these services that warrants this added complexity? In my head, we WANT such requests to fail after this reconfiguration.
So the biggest concern here is the use of node-services instead of the headless service. If we auto-remove the node-services immediately when the user switches away from the ingress option in their SolrCloud resource, the pods will not have had a chance to update themselves to say that they have a different hostname. Therefore inter-node traffic will be halted, which will probably lead to a never-ending rolling restart (because nodes won't become healthy again). If the idea is that all the data should go away, then the user should delete and recreate. I'm going to push a commit that does this a bit more cautiously, only deleting the services when we know they aren't in use. Ingresses I'm ok with just deleting.
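The cautious approach described above boils down to: only treat a node service as deletable once no pod still addresses itself through it. A minimal sketch of that decision, assuming the "in use" set would be derived from per-pod state (e.g. an annotation or the pod's configured hostname) rather than the hypothetical plain map used here:

```go
package main

import "fmt"

// orphanedServices returns the per-node services that no pod reports using.
// Deleting only these, rather than all node services at once, avoids cutting
// off inter-node traffic for pods that have not yet rolled over to the
// headless service (which would stall the rolling restart indefinitely).
func orphanedServices(existing []string, inUse map[string]bool) []string {
	var orphans []string
	for _, svc := range existing {
		if !inUse[svc] {
			orphans = append(orphans, svc)
		}
	}
	return orphans
}

func main() {
	// Illustrative names only: three node services, two still referenced
	// by pods that have not restarted onto the headless service yet.
	existing := []string{
		"example-solrcloud-0",
		"example-solrcloud-1",
		"example-solrcloud-2",
	}
	inUse := map[string]bool{
		"example-solrcloud-0": true,
		"example-solrcloud-1": true,
	}

	fmt.Println(orphanedServices(existing, inUse)) // [example-solrcloud-2]
}
```

Each subsequent reconcile would shrink the in-use set as pods restart with the new hostname, so the orphan list grows until all node services are gone.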
I think this should be good to go.
Ok, I was not aware that inter-pod communication preferred the node-services and needed a rolling restart to start using the headless service instead. Thanks for the diligent review and fix. I have nothing to add; feel free to merge.
Fixes #673
DRAFT, code largely generated by GitHub Copilot, just throwing the idea out there.