
Test Che6 to Che7 upgrade using che-operator #14200

Closed
skabashnyuk opened this issue Aug 12, 2019 · 22 comments

@skabashnyuk
Contributor

Is your task related to a problem? Please describe.

Install Che 6.19 and then attempt to update it to 7.0 using an operator.
Need to check

  • Make sure the user DB & workspace definitions in Postgres are preserved and the Che server successfully upgrades to 7.0
  • see if anything breaks... variables that got renamed? routes that are different? assumptions/APIs in Che 7 that changed?
@skabashnyuk skabashnyuk added kind/task Internal things, technical debt, and to-do tasks to be performed. status/open-for-dev An issue has had its specification reviewed and confirmed. Waiting for an engineer to take it. team/platform labels Aug 12, 2019
@skabashnyuk skabashnyuk added this to the 7.1.0 milestone Aug 12, 2019
@nickboldt
Contributor

Downstream: https://issues.jboss.org/browse/CRW-303

@metlos metlos self-assigned this Aug 23, 2019
@metlos metlos added status/in-progress This issue has been taken by an engineer and is under active development. and removed status/open-for-dev An issue has had its specification reviewed and confirmed. Waiting for an engineer to take it. labels Aug 23, 2019
@skabashnyuk skabashnyuk added the severity/P1 Has a major impact to usage or development of the system. label Sep 2, 2019
@skabashnyuk skabashnyuk modified the milestones: 7.1.0, 7.2.0 Sep 5, 2019
@metlos
Contributor

metlos commented Sep 16, 2019

I tested the upgrade on both K8s (using minikube 1.3.1) and Openshift 3.11 (using minishift).

I've found the following minor problems:

  1. Keycloak doesn't upgrade from che-keycloak:6.19.0 to che-keycloak:7.1.0 on operator upgrade. This seems to be caused by the Keycloak image version being explicitly set in the CR config (see the sketch below).
  2. When upgrading from che operator 0.5.0 (commit eclipse-che/che-operator@32b8e15), which is the last operator with a non-CRW version of Che6, to operator 7.1.0, the Keycloak configuration loses the original DB password and tries to use a newly generated one, leading to a failed upgrade because Keycloak cannot start.
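
For context, problem 1 boils down to the Keycloak image being pinned in the CheCluster custom resource. A minimal, illustrative sketch of such a CR fragment (field names follow the org.eclipse.che/v1 CheCluster spec pasted later in this thread; the tag value is only an example):

# Illustrative only: a pinned Keycloak image prevents the operator
# from rolling Keycloak forward when the operator itself is upgraded.
apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  auth:
    identityProviderImage: "eclipse/che-keycloak:6.19.0"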

@davidfestal do you think those are problems worth creating issues for?

@metlos metlos removed the status/in-progress This issue has been taken by an engineer and is under active development. label Sep 16, 2019
@skabashnyuk
Contributor Author

those are problems worth creating issues for

CC @l0rd @davidfestal wdyt, do we need to upgrade the Keycloak version for already existing CheCluster installations?

@metlos
Contributor

metlos commented Sep 17, 2019

those are problems worth creating issues for

CC @l0rd @davidfestal wdyt, do we need to upgrade the Keycloak version for already existing CheCluster installations?

I've created #14575 for that.

@davidfestal
Contributor

Sorry for the delay in answering, @metlos.

I would answer differently for these 2 cases:

Let me start with the second one:

2. When upgrading from che operator 0.5.0 (commit eclipse-che/che-operator@32b8e15), which is the last operator with a non-CRW version of Che6, to operator 7.1.0, the Keycloak configuration loses the original DB password and tries to use a newly generated one, leading to a failed upgrade because Keycloak cannot start.

That seems like a backward-compatibility bug to me. If there were a real possibility (and a compelling reason) that a user would still be using this old che operator version, then we should look deeper into the root cause of this and possibly fix it.

  1. Keycloak doesn't upgrade from che-keycloak:6.19.0 to che-keycloak:7.1.0 on operator upgrade. This seems to be caused by the Keycloak image version being explicitly set in the CR config.

This is an already-known issue that was still there in previous versions of Che 7. It has been fixed in Che 7.1.0 for all upcoming upgrades (7.1.0 to subsequent releases) and also specifically for the upgrade from 7.0.0 to 7.1.0, since we knew that 7.0.0 was the last version with this bug (see PR eclipse-che/che-operator#72 (comment)).

It's not possible to fix this for upgrades from any older version to 7.1.0, since users might also want to keep pinned versions in their CheCluster resources.

2 cases here:

  • if we plan to use a script for the migration of the CRW 1.2.x operator to 2.0, then we should be able to clean the frozen docker image versions in the CheCluster (set them back to empty strings for Keycloak, Postgres, etc.) just after upgrading the operator to 7.1.0 (a sketch of such a patch follows this list).
  • if we plan to let OperatorHub + marketplace do the upgrade of the operator, then the user would have to manually clean the frozen docker image versions in the CheCluster (set them back to empty strings for Keycloak, Postgres, etc.) just after upgrading the operator to 7.1.0.
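
A minimal sketch of that cleanup, assuming the CheCluster is named eclipse-che in the che namespace (as in the CR pasted later in this thread); it can be applied with kubectl edit, or as a merge patch via kubectl patch / oc patch:

# Hypothetical merge patch: reset the pinned images so that the operator
# falls back to its built-in defaults after the upgrade to 7.1.0.
spec:
  auth:
    identityProviderImage: ""
  database:
    postgresImage: ""
  storage:
    pvcJobsImage: ""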

@nickboldt @l0rd Does that seem like an acceptable solution?

If it is not acceptable, then we would need to implement the same type of automatic fix as we did for the 7.0.0 to 7.1.0 release:

  • identify a single CRW 1.2.x release that most CRW users would have (probably the latest release) => treat it the same as the default one
  • automatically clear the docker images in the CheCluster as soon as they correspond to the docker images of this single CRW release.

@metlos
Contributor

metlos commented Sep 23, 2019

Discovered another issue - #14630. This is not strictly an issue with the upgrade of the operator though. The upgrade of the operator to 7.1.0, followed by the redeployment of the Che server, seems to work. The issue only happens when the user makes changes to the CR of the che cluster.

@metlos
Contributor

metlos commented Sep 23, 2019

After all, this only affects one upgrade scenario:

  1. Using a Che6-based operator, install a che cluster CR that explicitly specifies the image tag of Che to install (see the CR sketch below).
  2. Upgrade the operator to a Che7-based one.
  3. The Che server deployment doesn't change because the image tag is explicitly configured.
  4. Now you need to manually change the tag to some 7.x version, which triggers #14630 (Operator regenerates DB password without redeploying Postgres).
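
To make step 1 concrete, a sketch of the relevant part of such a CR (field names as in the CheCluster spec; the image and tag values are purely illustrative):

# Illustrative CR fragment for step 1: the Che server image tag is pinned,
# so the operator upgrade alone does not roll the server forward (step 3).
spec:
  server:
    cheImage: "eclipse/che-server"
    cheImageTag: "6.19.0"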

@davidfestal
Contributor

@nickboldt @l0rd could you please provide your thoughts about the questions in comment #14200 (comment)?

It would be great to conclude on this and decide whether any action item is required before closing this analysis issue.

@davidfestal
Contributor

davidfestal commented Sep 24, 2019

After all, this only affects one upgrade scenario:

1. Using a Che6-based operator, install a che cluster CR that explicitly specifies the image tag of Che to install

2. Upgrade the operator to a Che7-based one.

3. The Che server deployment doesn't change because the image tag is explicitly configured.

4. Now you need to manually change the tag to some 7.x version, which triggers #14630.

@metlos Do you confirm that if you manually change the image tags in the custom resource by patching the existing CR in K8s, instead of applying it again, the upgrade works correctly?

The main reason for my question is that, when you apply the changed CR again from the yaml file you have locally, you override the Postgres and Keycloak passwords to empty strings. This triggers the generation of new passwords by the operator and leads to bug #14630.

So to summarize, bug #14630 would only impact the upgrade from 6 to 7 if the full CR is re-applied, instead of only patching the docker image and tag fields.
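
In other words, the safe path is a targeted patch that only touches the image fields and leaves the Postgres and Keycloak password fields of the live CR untouched, so the operator has no reason to regenerate them. A rough sketch (the tag value is illustrative):

# Applied with kubectl edit, or as a merge patch via kubectl patch / oc patch,
# against the live CheCluster resource; no password fields are included.
spec:
  server:
    cheImageTag: "7.1.0"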

@metlos
Contributor

metlos commented Sep 24, 2019

@metlos Do you confirm that if you manually change the image tags in the custom resource by patching the existing CR in K8s, instead of applying it again, the upgrade works correctly?

Yes, patching the CR seems to be working well.

@l0rd
Contributor

l0rd commented Oct 1, 2019

@metlos @davidfestal the keycloak image version problem can definitely be handled as a doc issue (hence we need to add the workaround to the docs to close this issue).

@nickboldt
Contributor

if we plan to use a script for the migration of the CRW 1.2.x operator to 2.0, then we should be able to clean the frozen docker image versions in the CheCluster (set them back to empty strings for Keycloak, Postgres, etc.) just after upgrading the operator to 7.1.0.

I'm hoping not to have to script an update process, but if we have to, we can ship it as documentation so it's not tightly tied to a release.

if we plan to let OperatorHub + marketplace do the upgrade of the operator, then the user would have to manually clean the frozen docker image versions in the CheCluster (set them back to empty strings for Keycloak, Postgres, etc.) just after upgrading the operator to 7.1.0.

This sounds preferable, except that we still have to support OCP 3.11 users. So would this mean telling them to install OLM in order to be able to use operator-based updates?

@skabashnyuk skabashnyuk modified the milestones: 7.2.0, 7.3.0 Oct 2, 2019
@davidfestal
Contributor

  1. Now you need to manually change the tag to some 7.x version

@metlos Instead of patching the custom resource with some Che 7.x version, you should just reset the docker image to an empty string, so that it takes the defaults.

Did you do this for all the following components:

  • Keycloak (I saw you did)
  • Postgres
  • the PVC jobs docker image?

I'm asking because it would be important to have tested the upgrade of the Postgres image as well (if it needs an upgrade).

@nickboldt
Contributor

+1, odds are good that with every CRW update we'll have new Keycloak, UBI8-minimal (PVC) and Postgres images because the hamster wheel of CVE fixes never stops. :)

@metlos
Contributor

metlos commented Oct 7, 2019

I have definitely not tested Postgres because it hasn't changed between the versions for either Che or CRW. Not sure about PVC.

@skabashnyuk
Contributor Author

The scope of this task as defined in the description is complete.
Next steps

I would like to ask not to extend the scope of this issue; instead, please create a new one.

@sleshchenko
Member

I've tested the following scenario:
I deployed Che with chectl and the operator on OpenShift 4.2. The Che operator image is quay.io/eclipse/che-operator:nightly.
It created a CR where some fields, like the postgres image, are empty by default.
Then I made sure that all images have empty values (che server, postgres, keycloak, plugin-registry, devfile-registry, pvcJobsImage):

apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  creationTimestamp: 2019-10-08T13:32:47Z
  finalizers:
  - oauthclients.finalizers.che.eclipse.org
  generation: 19
  name: eclipse-che
  namespace: che
  resourceVersion: "21659"
  selfLink: /apis/org.eclipse.che/v1/namespaces/che/checlusters/eclipse-che
  uid: 1daebd0e-e9d0-11e9-803c-12a6fdf53914
spec:
  auth:
    # ...
    identityProviderImage: ""
    # ...
  database:
    # ...
    postgresImage: ""
    # ...
  server:
    # ...
    cheImage: ""
    cheImagePullPolicy: ""
    cheImageTag: ""
    # ...
    devfileRegistryImage: ""
    # ...
    pluginRegistryImage: ""
    # ...
  storage:
    # ...
    pvcJobsImage: ""
    # ...

After the che-operator reports that Che is available again, we have the following state:

  • Plugin Registry: quay.io/eclipse/che-plugin-registry:7.2.0
  • Devfile Registry: quay.io/eclipse/che-devfile-registry:7.2.0
  • Keycloak: eclipse/che-keycloak:7.2.0
  • Postgres: centos/postgresql-96-centos7:9.6
  • Che: eclipse/che-server:7.2.0
  • CHE_INFRA_KUBERNETES_PVC_JOBS_IMAGE: 'registry.access.redhat.com/ubi8-minimal:8.0-213'

@metlos
Contributor

metlos commented Oct 8, 2019

After the che-operator reports that Che is available again, we have the following state:

I think we should test this:

  1. Add custom tags for all the above mentioned images
  2. Configure the CR to use those custom tags
  3. After Che is redeployed with these new images
  4. Set the image properties back to empty strings
  5. The operator should use the default values for the images again.

It is basically what you have done with the difference that we're not creating new deployments but rolling over existing ones.

@davidfestal does the above make sense or do you think it is superfluous?

@nickboldt
Contributor

nickboldt commented Oct 9, 2019

+1 to retesting with rolling updates.

Re: postgres updates...

I have definitely not tested Postgres because it hasn't changed between the versions for either Che or CRW. Not sure about PVC.

Two problems with that statement:

a) CRW 1.2.2 is using registry.redhat.io/rhscl/postgresql-96-rhel7:1-46 and 2.0 will be using registry.redhat.io/rhscl/postgresql-96-rhel7:1-47 (or newer) so you CAN test a simple version change.

b) even with the upstream Che 6 -> 7 upgrade, centos/postgresql-96-centos7:9.6 is a moving tag which represents the latest set of applied CVE fixes. So :9.6 from 2 months ago is a different version from today's. You can see this from the tags in the repo, which are 5 or 13 days old:

https://hub.docker.com/r/centos/postgresql-96-centos7/tags?page=1&ordering=last_updated

So... the flaw here is that you're always using a moving tag instead of a static one.

Re: "not sure about PVC", we can again test an update there:

  • 1.2.2 / 6.19.x: registry.access.redhat.com/ubi8-minimal:8.0-159
    vs.
  • 2.0 / 7.3 / master: registry.access.redhat.com/ubi8-minimal:8.0-213

So... reopening. Task not complete. @metlos can you investigate the two missing update path scenarios?

@nickboldt nickboldt reopened this Oct 9, 2019
@davidfestal
Contributor

I think we should test this:

1. Add custom tags for all the above mentioned images

What do you mean precisely by this step? You could simply pin old tags for each image.

2. Configure the CR to use those custom tags

I would simply create the CR with these custom (old) tags from the start.

3. After Che is redeployed with these new images

s/redeployed/deployed

4. Set the image properties back to empty strings

5. The operator should use the default values for the images again.

And it should have triggered the rolling update of all the components (apart from Postgres, which is restarted instead of rolling-updated).

It is basically what you have done with the difference that we're not creating new deployments but rolling over existing ones.

@davidfestal does the above make sense or do you think it is superfluous?

I think it can be interesting, especially if you use it with running workspaces.

Another point: you might be able to set the cheFlavor to codeready when creating the initial custom resource, so that it uses the CRW images. Then you would be able to really test the CRW update from 1.2 to 2.0 (just look at the default.go file in the che-operator 1.x branch at the 1.2.0 tag to know which images are used by default in CRW 1.2).
Of course, to download the CRW docker images you will need the subscription secret to access the private registry. @rhopp and @nickboldt might help with that.
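
A sketch of that starting point (assuming cheFlavor sits under spec.server like the other server fields; the resource name is illustrative):

# Illustrative: an initial CheCluster CR that makes the operator pick the CRW
# default images (see default.go on the che-operator 1.x branch at the 1.2.0 tag
# for the images CRW 1.2 uses by default).
apiVersion: org.eclipse.che/v1
kind: CheCluster
metadata:
  name: codeready-workspaces
spec:
  server:
    cheFlavor: codeready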

@metlos
Contributor

metlos commented Oct 10, 2019

Let me reply to all of the questions raised and swiftly close this issue again.

CRW has nada to do with this issue, which is about the upgrade of Che6 to Che7. If you need to test upgrades of CRW, you're of course free to create another issue, which will go through the normal loops.

Sorry if I sound harsh, but let's just not constantly extend the scope of this issue; let's return to "defined scope -> work -> done -> problems found -> new issue" so that we can actually make progress.

CRW 1.2.2 is using registry.redhat.io/rhscl/postgresql-96-rhel7:1-46 and 2.0 will be using registry.redhat.io/rhscl/postgresql-96-rhel7:1-47 (or newer) so you CAN test a simple version change.

How on earth is the platform team supposed to do that? At the time we tested this, the operator on master still used the old image. If that's not gonna be the case, then let's open a new issue.

b) even with the upstream Che 6 -> 7 upgrade, centos/postgresql-96-centos7:9.6 is a moving tag which represents the latest set of applied CVE fixes. So :9.6 from 2 months ago is a different version from today's. You can see this from the tags in the repo, which are 5 or 13 days old:

which still amounts to a no-op with regard to testing the upgrade itself. Are we supposed to test Kubernetes now?

So... the flaw here is that you're always using a moving tag instead of a static one.

Sure, let's create a new issue for that. Nothing to do with testing the upgrade.

Re: "not sure about PVC", we can again test an update there:

1.2.2 / 6.19.x: registry.access.redhat.com/ubi8-minimal:8.0-159
vs.
2.0 / 7.3 / master: registry.access.redhat.com/ubi8-minimal:8.0-213

So... reopening. Task not complete. @metlos can you investigate the two missing update path scenarios?

I think you can guess my sentiment on this. No, I will not do that in the scope of this issue.

Another point, you might be able to set the cheFlavor to codeready when creating the initial custom resource. So it should use the CRW images. Then you would be able to really test CRW update from 1.2 to 2.0 (just look at the default.go file in the che-operator 1.x branch at the 1.2.0 tag to know what images will be used by default in CRW 1.2).

Let's create another issue for this. This issue is about Che, not CRW. I would also argue the platform team of Che is not responsible for testing CRW.

@nickboldt
Contributor

As requested, follow-up work has been moved to 2 new issues:
