Incompatible changes introduced with Diego release 2.109.0 #983
Comments
There are two aspects of this issue: buildpack-based apps and Docker apps.
Is it an option to downgrade envoy to unblock diego updates and have a consistent cf-deployment until we come up with a permanent solution for this case?
Hi Plamen,
Hi @jochenehret, I meant: is it an option to downgrade envoy in a future diego-release? In that case we would not hit this limitation in cf-deployment, because we'd have a new diego-release with a downgraded envoy.
Yes, that would of course be possible.
If I understand this issue correctly, it means that people running CF with multiple stacks can't upgrade their platform. Is there a timeline for a fix or a workaround?
Hi @PlamenDoychev, any updates on a fix?
Hi @oliver-heinrich, nothing yet to discuss.
Hello, here is a proposal. The PRs contain a change to the way the FSes are stored, from a map to an array, so that the order can be preserved. They should be listed in ascending order.
cflinuxfs3 has been deprecated for nearly 2 years. Why are you proposing this change when the stack is not receiving any updates? Additionally, this proposal turns the rootfses spec property into an ordered list, which I don't think was the intention of that property.
Hello @winkingturtle-vmw, yes, FS3 is indeed deprecated, but we are still using it in our production environments because we have customer workloads that have not yet migrated to FS4. According to our corporate product standards we are not allowed to stop or migrate FS3 containers to FS4 on behalf of our customers; that must be a decision they take themselves. Yes, we are turning it into an ordered list, and that is the point. We support multiple stacks, and we always want to use the latest available FS for the sidecar processes such as healthcheck and envoy. By using an ordered list we ensure that we get the latest one consistently, which does not happen at the moment. Currently the sidecar processes pick an FS effectively at random, which causes issues such as the latest version of envoy being strictly coupled to FS4.
@Ivo1116 Is it random or is it the first one? It looks like these PRs will reverse that and pick the last one instead.
@winkingturtle-vmw The idea is that the first one is expected to be taken, but in reality, because the map structure does not preserve order, a random one is taken. Yes, the PRs will pick the last one, since it is known to be the latest.
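To make the ordering problem concrete, here is a minimal standalone sketch (not the actual diego-release or rep code; the stack names and paths are purely illustrative) showing that ranging over a Go map yields an unspecified order, while an ordered slice lets the last entry, i.e. the newest stack, be chosen deterministically:

```go
package main

import "fmt"

func main() {
	// Sketch only: these entries stand in for a StackPathMap-like structure;
	// they are not the real diego-release values.
	stackPathMap := map[string]string{
		"cflinuxfs3": "/var/vcap/packages/cflinuxfs3/rootfs.tar",
		"cflinuxfs4": "/var/vcap/packages/cflinuxfs4/rootfs.tar",
	}

	// Go map iteration order is unspecified, so whichever entry comes out
	// first can differ between runs -- the "random" rootfs selection
	// described above.
	var fromMap string
	for _, path := range stackPathMap {
		fromMap = path
		break
	}
	fmt.Println("picked from map (non-deterministic):", fromMap)

	// An ordered slice preserves the manifest order; listing stacks in
	// ascending order makes the last element always the newest stack.
	orderedPaths := []string{
		"/var/vcap/packages/cflinuxfs3/rootfs.tar",
		"/var/vcap/packages/cflinuxfs4/rootfs.tar",
	}
	fromSlice := orderedPaths[len(orderedPaths)-1]
	fmt.Println("picked from ordered slice (deterministic):", fromSlice)
}
```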
@Ivo1116 I think there are two different problems here that we are trying to solve:
For the first problem: I think we can fix your deployment order by swapping the order of the rootfses defined in your manifest, without introducing any changes. For the second problem: I think we can propose a different solution that makes it more explicit which garden-health-check container to use. Looking at this code, can you help me understand how this is not populating the list in the same non-deterministic fashion?
For the first point, swapping the order of the rootfses in the manifest will not fix the issue, since the problem is that they are not being selected consistently because of the map structure; moving one to the front will not make a difference. For your second point, can you give me more details? As for the code, we are comparing Paths and StackPathMap, since they are used here. The difference is that StackPathMap stores the values in a map, which does not preserve order, while the Paths function stores the FSes in an array, which maintains the order. So when they reach the assignment for the healthcheck we can be sure that we are using the latest one.
@Ivo1116 I understand your intention for creating an array, but I am concerned that the array you are creating is itself non-deterministic, because you are ranging over the map to create the array in the first place and there is no guarantee for the initial range order. I am not sure how this solves the issue you are having. My proposal would be to instead create a new property for rep (e.g.
Hello @winkingturtle-vmw, I don't think that we range over a map in our implementation. The rootFSes are passed into the program from either a JSON list or a YAML list, both of which preserve the order. They are then stored in an array, and on that array we add the method Paths, which iterates over it and returns a new array with just the paths. Then we pass that array to the initialize func. As for your proposal, adding the new property does not fix the existing problem but goes around it; the default behaviour will remain inconsistent. The issue has existed since the move from FS2 to FS3, and it will come back when moving from FS4 to FS5 if we do not fix the default behaviour.
```go
paths := make([]string, len(rootFSes))
for i, rootFS := range rootFSes {
	paths[i] = rootFS.Path
}
return paths
```

This is ranging over the rootFSes to create the array.
How is this inconsistent when the proposal is adding an explicit property to be consistent?
Hello @winkingturtle-vmw, as for the inconsistency: it occurs when the explicit property is empty, since the old random behaviour that assigns a rootFS to the sidecar processes will still be used.
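As an illustration of that concern, here is a hypothetical sketch of the explicit-property approach with a fallback (the function and variable names are made up for this example and are not actual rep code): when the property is set the choice is deterministic, but when it is left empty the fallback still ranges over the map and inherits its unspecified order.

```go
package main

import "fmt"

// selectSidecarRootFS is a hypothetical helper, not actual rep code.
// If the explicit property is set, the choice is deterministic; if it is
// empty, the fallback ranges over the map and stays non-deterministic.
func selectSidecarRootFS(explicit string, stackPathMap map[string]string) string {
	if explicit != "" {
		return explicit
	}
	for _, path := range stackPathMap {
		return path // old behaviour: effectively random across runs
	}
	return ""
}

func main() {
	stacks := map[string]string{
		"cflinuxfs3": "/var/vcap/packages/cflinuxfs3/rootfs.tar",
		"cflinuxfs4": "/var/vcap/packages/cflinuxfs4/rootfs.tar",
	}
	// Explicit property empty: falls back to the old non-deterministic pick.
	fmt.Println(selectSidecarRootFS("", stacks))
	// Explicit property set: consistent result.
	fmt.Println(selectSidecarRootFS("/var/vcap/packages/cflinuxfs4/rootfs.tar", stacks))
}
```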
That's my point. There is no guarantee what order you'd receive for the map, so the original array could be in any order.
What we want is backward compatibility, and we will achieve it by defaulting to the old behavior. It's worth noting that the case you are running into is absolutely unsupported: diego-release was backward compatible as long as cflinuxfs3 was receiving security fixes. We simply can't support every stack that has existed since the creation of this project.
I think that by default, the paid-for PCF/Tanzu for VMs/Tanzu platform for CF has cflinuxfs3 and cflinuxfs4, as they pay for Ubuntu Expanded Security Maintenance, so technically it's supported until 2028. So although open source cf-deployment users who don't have multiple stacks won't have an issue with this and really shouldn't be using ancient unsupported stacks, I think any Broadcom/TP4CF customers who use both stacks may not be comfortable with this approach, as they'll need to stay on older versions of diego-release. I'm also assuming that when Noble becomes the new standard, we'll need to start planning cflinuxfs5 migrations and have two stacks for a period. Won't this be an issue then too?
Hello @winkingturtle-vmw, diego-release: #991
Current behavior
When upgrading a Cloud Foundry deployment from version v45.0.0 to v45.1.0+ or deploying CF v45.1.0 fresh, all applications crash because of the breaking changes introduced in Diego release version 2.109. This regression only occurs under specific conditions, which is why it slipped through the normal CF validation pipeline: for instance, it shows up if your foundation offers two stacks (e.g. cflinuxfs3 and cflinuxfs4).
The affected Diego versions are 2.109.0 to 2.112.0.
Steps to reproduce
You will observe that all the applications start failing with the following error:
When was envoy bumped?
With Diego release v2.107.0 and v2.108.0, envoy was still at version 1.28.7 and everything worked well:
However, with Diego release v2.109, envoy was bumped to version 1.32.2 and we observed the following behavior:
How can the problem be fixed?
Diego Fidalgo has provided a detailed explanation of why this behavior is observed: https://cloudfoundry.slack.com/archives/C02FM2BPE/p1735293118448499?thread_ts=1734709809.366679&cid=C02FM2BPE.
As a follow-up, we can confirm that after creating a Diego dev release with the suggested code changes, the issue is gone and CATS are passing despite using the latest envoy version. The following code was patched here.
Desired behavior
After the upgrade or after deploying CF v45.1.0+, the applications should not crash on a foundation that offers more than one stack.
Quick follow-up: a special release note should be added to all the affected CF releases.
Affected Version
2.109.0, 2.110.0, 2.111.0, 2.112.0