Incompatible changes introduced with Diego release 2.109.0 #983

Open
iaftab-alam opened this issue Jan 9, 2025 · 22 comments

@iaftab-alam

Current behavior

When upgrading a Cloud Foundry deployment from v45.0.0 to v45.1.0+ or deploying CF v45.1.0 directly, all applications crash because of breaking changes introduced in Diego release 2.109.0. The regression only occurs under specific conditions, which is why it made it through the normal cf validation pipeline; it shows up, for instance, when a foundation offers two stacks (e.g. cflinuxfs3 and cflinuxfs4).

Note: The affected Diego versions are 2.109.0 to 2.112.0.

Steps to reproduce

  1. Deploy CF v45.1.0+ with two stacks (cflinuxfs3 and cflinuxfs4).
  2. Run CATS against it.
  3. CATS fail and all apps crash.

You will observe that all applications start failing with the following error:

  2024-12-26T22:17:15.47+0000 [PROXY/0] ERR /etc/cf-assets/envoy/envoy: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by /etc/cf-assets/envoy/envoy)
   2024-12-26T22:17:15.47+0000 [PROXY/0] ERR /etc/cf-assets/envoy/envoy: /lib/x86_64-linux-gnu/libpthread.so.0: version `GLIBC_2.30' not found (required by /etc/cf-assets/envoy/envoy)
   2024-12-26T22:17:15.47+0000 [PROXY/0] ERR /etc/cf-assets/envoy/envoy: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.28' not found (required by /etc/cf-assets/envoy/envoy)
   2024-12-26T22:17:15.56+0000 [PROXY/0] OUT Exit status 1
   2024-12-26T22:17:15.57+0000 [CELL/SSHD/0] OUT Exit status 0
   2024-12-26T22:17:15.57+0000 [APP/PROC/WEB/0] OUT Exit status 143
   2024-12-26T22:17:18.23+0000 [CELL/SSHD/0] OUT Exit status 0
   2024-12-26T22:17:18.24+0000 [APP/PROC/WEB/0] OUT Exit status 143 

When was envoy bumped?

Diego releases v2.107.0 and v2.108.0 still used envoy version 1.28.7, which works well:

https://github.com/cloudfoundry/diego-release/blob/v2.107.0/config/blobs.yml
proxy/envoy-4848d6d548438b50d6d78187a7cfbe61e02d8b85-1.28.7.tgz

However, in Diego release v2.109.0, envoy was bumped to version 1.32.2, and we observed the behaviour above:

https://github.com/cloudfoundry/diego-release/blob/v2.109.0/config/blobs.yml
proxy/envoy-a0504e87c5a246cb097b37049b1e4dc7706c2a90-1.32.2.tgz:

How can the problem be fixed?

A detailed explanation of why this behaviour is observed has been provided by Diego Fidalgo: https://cloudfoundry.slack.com/archives/C02FM2BPE/p1735293118448499?thread_ts=1734709809.366679&cid=C02FM2BPE.

As a follow-up, we can confirm that after creating a Diego dev release with the suggested code changes, the issue is gone and CATS pass despite using the latest envoy version. The following code was patched here:

	// Collect the configured stack names, sort them, and pick the
	// lexicographically last (most recent) stack for the garden
	// healthcheck rootfs so the choice is deterministic.
	stacks := make([]string, 0, len(rootFSes))
	for stack := range rootFSes {
		stacks = append(stacks, stack)
	}

	sort.Strings(stacks)

	if len(stacks) > 0 {
		mostRecentStack := stacks[len(stacks)-1]
		gardenHealthcheckRootFS = rootFSes[mostRecentStack]
	}
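
For background, Go's map iteration order is unspecified (and randomized by the runtime), which is why the healthcheck rootfs could differ from run to run before this patch. Below is a minimal, self-contained sketch (hypothetical stack names and paths, not the actual rep code) showing the non-deterministic iteration and the sort-based selection of the most recent stack:

	package main

	import (
		"fmt"
		"sort"
	)

	func main() {
		// Hypothetical stack-to-rootfs mapping; in rep this comes from the
		// configured rootfses.
		rootFSes := map[string]string{
			"cflinuxfs3": "/path/to/cflinuxfs3/rootfs",
			"cflinuxfs4": "/path/to/cflinuxfs4/rootfs",
		}

		// Ranging directly over the map can yield the stacks in a different
		// order on each run, so whichever stack happens to come first wins.
		for stack := range rootFSes {
			fmt.Println("unordered iteration saw:", stack)
		}

		// The patched approach: collect the keys, sort them, and pick the
		// lexicographically last stack so the choice is deterministic.
		stacks := make([]string, 0, len(rootFSes))
		for stack := range rootFSes {
			stacks = append(stacks, stack)
		}
		sort.Strings(stacks)

		if len(stacks) > 0 {
			mostRecentStack := stacks[len(stacks)-1]
			fmt.Println("healthcheck rootfs:", rootFSes[mostRecentStack]) // always the cflinuxfs4 path
		}
	}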

Desired behavior

After upgrading to or deploying CF v45.1.0+, applications should not crash on a foundation that offers more than one stack.

Quick follow-up: a special release note should be added to all affected CF releases.

Affected Version

2.109.0, 2.110.0, 2.111.0, 2.112.0

@PlamenDoychev

PlamenDoychev commented Jan 10, 2025

There are two aspects to this issue: buildpack-based apps and Docker apps.

  1. Buildpack apps
    In this case the application developer can assign a stack via the stack property in the app manifest, i.e.

    ...
    stack: cflinuxfs4

    It seems that although a given stack is selected, this is not entirely honored: sidecar processes like envoy and the healthcheck may start using another stack available on the platform (i.e. fs3). From my point of view this is misleading behaviour, given that I have explicitly selected a stack.

  2. Docker apps
    In this case the stack property is ignored for the main process, but it seems that even if the platform operator has specified fs4 as the default, sidecar processes like envoy and the healthcheck may pick up fs3 and run into issues. This is also incorrect behaviour from my point of view, given that the platform operator has set a default stack.

@PlamenDoychev

Is it an option to downgrade envoy to unblock Diego updates and keep cf-deployment consistent until we come up with a permanent solution for this case?

@jochenehret

Hi Plamen,
downgrading a release in cf-deployment is not an option. The update and validation processes and pipelines are designed to integrate new versions only. Pinning an existing version can be done if you know that the next version could have issues (we did this for the uaa-release some time ago).

@PlamenDoychev

Hi @jochenehret,

I meant: is it an option to downgrade envoy in a future diego-release? In that case we would not hit this limitation in cf-deployment, because we would have a new diego-release with a downgraded envoy.

@jochenehret

Yes, that would of course be possible.

@jossy

jossy commented Jan 24, 2025

If I understand this issue correctly, it means that people running CF with multiple stacks can't upgrade their platform. Is there a timeline for a fix or a workaround?

@oliver-heinrich

Hi @PlamenDoychev, any updates on a fix?

@PlamenDoychev

Hi @oliver-heinrich, nothing to discuss yet.
We have started work on the issue, but let's see when something like a PoC appears.

@Ivo1116
Contributor

Ivo1116 commented Jan 31, 2025

Hello, here is a proposal: the PRs change the way the rootFSes are stored, from a map to an array, so that the order can be preserved. They should be listed in ascending order.

@winkingturtle-vmw
Contributor

cflinuxfs3 has been deprecated for nearly two years. Why are you proposing this change when the stack is not receiving any updates?

Additionally, this proposal turns the rootfses spec property into an ordered list, which I don't think was its intention.

@Ivo1116
Contributor

Ivo1116 commented Feb 3, 2025

Hello @winkingturtle-vmw,

Yes, FS3 is indeed deprecated, but we are still using it in our production environments because we have customer workloads that have not yet migrated to FS4. According to our corporate product standards we are not allowed to stop or migrate FS3 containers to FS4 on behalf of our customers; this must be a decision they take themselves.

Yes, we are turning it into an ordered list, and that is the point. First, we support multiple stacks and we want to always use the latest available FS for sidecar processes like the healthcheck and envoy. By using an ordered list we ensure that we get the latest one consistently, which does not happen at the moment: the sidecar processes currently pick an FS at random, which causes issues, such as the latest version of envoy being strictly coupled to FS4.

@winkingturtle-vmw
Contributor

At the moment the sidecar processes pick a FS on a random principle

@Ivo1116 Is it random or is it the first one? It looks like these PRs will reverse that and pick the last one instead.

@Ivo1116
Contributor

Ivo1116 commented Feb 3, 2025

@winkingturtle-vmw The idea is that the first one is expected to be taken, but in reality, because the map structure does not preserve order, a random one is taken.

Yes, the PRs will pick the last one, since it is known to be the latest.

@winkingturtle-vmw
Contributor

@Ivo1116 I think there are two different problems here that we are trying to solve:

  1. Your deployment of CF can no longer upgrade Diego because we are now using a version of envoy that doesn't run on an unsupported rootfs.
  2. The garden healthcheck container picks the first rootfs to run on, and that choice is not deterministic.

For the first problem: I think we can fix your deployment by swapping the order of the rootfses defined in your manifest, without introducing any code changes.

For the second problem: I think we can propose a different solution that makes it more explicit which garden-healthcheck container to use.

Looking at this code, can you help me understand how this is not populating the list in the same non-deterministic fashion?

@Ivo1116
Contributor

Ivo1116 commented Feb 3, 2025

@winkingturtle-vmw

For the first point, swapping the order of the rootfses in the manifest will not fix the issue, since the problem is that they are not being selected consistently because of the map structure. Moving one to the front will not make a difference.

For your second point, can you give me more details?

As for the code, we are comparing Paths and StackPathMap, since they are both used here. The difference is that StackPathMap stores the values in a map, which does not preserve order, while the Paths function stores the rootFSes in an array, which maintains the order. When we reach the assignment for the healthcheck, we can then be sure that we are using the latest one.
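
To make the contrast concrete, here is a minimal sketch (simplified stand-in types and paths, not the actual executor/rep definitions) of a map-backed StackPathMap versus a slice-backed RootFSes whose Paths method preserves the configured order:

	package main

	import "fmt"

	// Simplified stand-ins for the types being discussed.
	type RootFS struct {
		Name string
		Path string
	}

	// Slice-backed: preserves the order in which the rootfses were configured.
	type RootFSes []RootFS

	func (r RootFSes) Paths() []string {
		paths := make([]string, len(r))
		for i, rootFS := range r {
			paths[i] = rootFS.Path // same index, same order as the input list
		}
		return paths
	}

	// Map-backed: iteration order is unspecified, so "the first stack" is effectively random.
	type StackPathMap map[string]string

	func main() {
		rootFSes := RootFSes{
			{Name: "cflinuxfs3", Path: "/path/to/cflinuxfs3"},
			{Name: "cflinuxfs4", Path: "/path/to/cflinuxfs4"},
		}
		fmt.Println(rootFSes.Paths()) // always [/path/to/cflinuxfs3 /path/to/cflinuxfs4]

		stackPathMap := StackPathMap{
			"cflinuxfs3": "/path/to/cflinuxfs3",
			"cflinuxfs4": "/path/to/cflinuxfs4",
		}
		for stack := range stackPathMap {
			fmt.Println("map iteration order is unspecified:", stack)
		}
	}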

@winkingturtle-vmw
Contributor

@Ivo1116 I understand your intention in creating an array, but I am concerned that the array you are creating is itself non-deterministic, because you are ranging over the map to create it in the first place and there is no guarantee about that initial range order. I am not sure how this solves the issue you are having.

My proposal would instead be to create a new property for rep (e.g. garden-healthcheck-rootfs) that can be set to one of the rootfses keys (e.g. cflinuxfs4) via diego-release. This property would default to empty to keep the current behavior as is. That way you can explicitly set this property in your deployment to cflinuxfs4 to avoid the issue you are seeing now.
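
A rough sketch of how the proposed property could be resolved inside rep (the helper name and signature below are hypothetical, for illustration only): if the operator sets it, that stack's rootfs is used; otherwise the current behavior remains unchanged.

	package main

	import (
		"fmt"
		"log"
	)

	// resolveHealthcheckRootFS is a hypothetical helper: if the proposed
	// garden-healthcheck-rootfs property is set, use that stack's rootfs;
	// otherwise fall back to whatever rootfs the current logic would pick.
	func resolveHealthcheckRootFS(requestedStack string, rootFSes map[string]string, fallback string) (string, error) {
		if requestedStack == "" {
			// Property defaults to empty, keeping today's behavior unchanged.
			return fallback, nil
		}
		path, ok := rootFSes[requestedStack]
		if !ok {
			return "", fmt.Errorf("garden-healthcheck-rootfs %q is not a configured rootfs", requestedStack)
		}
		return path, nil
	}

	func main() {
		rootFSes := map[string]string{
			"cflinuxfs3": "/path/to/cflinuxfs3",
			"cflinuxfs4": "/path/to/cflinuxfs4",
		}

		// Operator explicitly pins the healthcheck rootfs to cflinuxfs4.
		path, err := resolveHealthcheckRootFS("cflinuxfs4", rootFSes, "/path/to/cflinuxfs3")
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(path) // /path/to/cflinuxfs4
	}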

@Ivo1116
Contributor

Ivo1116 commented Feb 4, 2025

Hello @winkingturtle-vmw, I don't think we range over a map in our implementation. The rootFSes are passed into the program from either a JSON list or a YAML list, both of which preserve order. They are then stored in an array, and on that array we add the method Paths, which iterates over it and returns a new array with just the paths. We then pass that array to the initialize func.

As for your proposal, adding the new property does not fix the existing problem, it only works around it. The default behaviour will remain inconsistent. The issue has existed since the move from FS2 to FS3, and it will come back when moving from FS4 to FS5 if we do not fix the default behaviour.

@winkingturtle-vmw
Contributor

i don't think that we range over a map in our implementation.

	paths := make([]string, len(rootFSes))
	for i, rootFS := range rootFSes {
		paths[i] = rootFS.Path
	}
	return paths

This is ranging over the rootFSes to create the array.

As for your proposal, adding the new property does not fix the existing problem, but goes around it. The default behaviour will remain inconsistent.

How is this inconsistent when the proposal is adding an explicit property to be consistent?

@Ivo1116
Contributor

Ivo1116 commented Feb 5, 2025

Hello @winkingturtle-vmw,
Yes, the Paths func ranges over the rootFSes, but they are stored in an array (type RootFSes []RootFS), here as well. And when we range over that array, we keep the same order of items (paths[i] = rootFS.Path).

As for the inconsistency, it occurs when the explicit property is empty, since it will still use the old random behaviour that assigns a rootFS to the sidecar processes.

This property would default to empty to keep the current behavior as is

@winkingturtle-vmw
Contributor

Yes, the Paths func ranges over the rootFSes

That's my point. There is no guarantee of the order in which you receive the map's keys, so the original array could be in any order.

As for the inconsistency, it occurs when the explicit property is empty, since it will still use the old random behaviour that assigns a rootFS for the sidecar processes.

What we want is backward compatibility and we will achieve it by defaulting to the old behavior.

It's worth noting that the case you are running into is absolutely unsupported. diego-release was backward compatible as long as cflinuxfs3 was receiving security fixes. We simply can't support every stack that has existed since the creation of this project.

@thelangley

I think that by default, the paid-for PCF/Tanzu for VMs/Tanzu Platform for CF ships with both cflinuxfs3 and cflinuxfs4, since those customers pay for Ubuntu Expanded Security Maintenance, so technically cflinuxfs3 is supported until 2028.

So although open-source cf-deployment users who don't have multiple stacks won't have an issue with this, and really shouldn't be using ancient unsupported stacks, I think any Broadcom/TP4CF customers who use both stacks may not be comfortable with this approach, as they'll need to stay on older versions of diego-release.

I'm also assuming that when Noble becomes the new standard, we'll need to start planning cflinuxfs5 migrations and run two stacks for a period. Won't this be an issue then too?

@Ivo1116
Contributor

Ivo1116 commented Feb 7, 2025

Hello @winkingturtle-vmw,
Here is a proposal for the new property SidecarRootFSPath, which will contain the rootfs to be used for the sidecar processes.

diego-release: #991
executor: cloudfoundry/executor#114
rep: cloudfoundry/rep#65
inigo: cloudfoundry/inigo#49
