Talk to Fedora kernel team about FCOS stream design #80

Closed
bgilbert opened this issue Nov 9, 2018 · 18 comments

@bgilbert
Contributor

bgilbert commented Nov 9, 2018

The FCOS stream design (#22, #72) has two elements that affect the kernel:

  • When there's a kernel fix that requires an out-of-cycle FCOS release, it may be desirable to backport the fix to kernels already in the affected streams rather than pushing a new kernel release to those streams. Container Linux currently does this for its stable channel to reduce the impact of user-facing regressions in stable kernels. See #79 (Short-lived package branches for backported fixes) for associated branch tooling.
  • The next stream will track the upcoming kernel from rc6 until the kernel reaches Bodhi updates. The idea is that, while every other major package bump in FCOS bakes in next for a substantial time before being promoted, new kernel versions would normally only bake for two weeks in testing because they flow directly into Fedora stable releases. The additional kernel coverage in next will provide a few weeks of additional baking time for catching regressions. This will require tooling support as part of #77 (Package pinning mechanism for release streams).

Ask the Fedora kernel team what they think of all this.

@bgilbert
Contributor Author

In particular, we'll need to use official processes to build kernel images signed for Secure Boot.

@mskarbek

Does tracking RC releases mean that we will get kernels with debugging options enabled in next?

@bgilbert
Contributor Author

I guess we could produce our own builds with debug options disabled. Which way is preferable?

@mskarbek

For me, it would be debug off.

@dustymabe
Member

Reached out to the kernel team about scheduling some time to discuss this. Will try to meet with them next week.

@labbott

labbott commented Dec 14, 2018

I'm not fully familiar with FCOS streams, it sounds like the proposal is to backport individual critical fixes instead of just doing a full stable update, is that correct? (Want to make sure I understand before giving a detailed response)

@dustymabe
Member

I'm not fully familiar with FCOS streams, it sounds like the proposal is to backport individual critical fixes instead of just doing a full stable update, is that correct?

TL;DR - yes. In some cases (we hope this would be infrequent) we'd like to be able to backport fixes directly rather than pick up the latest update from Fedora. The idea is that if we are a kernel version behind in our stable branch (worst case 4 weeks behind) and we need to rush out a fix then we can elect to apply the small patch rather than jump to the next kernel version. Since we are trying to get people to set updates to automatic, jumping to the next kernel version without some soak time could be problematic.

The longer version is in the design doc we've created

To be clear: we aren't necessarily asking the Fedora kernel team to perform the work here. We are asking that our Fedora tooling/infrastructure is set up so that this type of build/update could occur and someone can do that work. I believe the FCOS community would be comfortable applying/testing these patches and collaborating with the Fedora kernel team to get them available.

@jmflinuxtx

While I get the theory behind it, I might recommend that this be followed in the case of a rebase, but perhaps less so on stable updates. So if a critical CVE happens and FCOS is on 4.19.5, but Fedora pushed the fix in 4.19.8, FCOS would just update to 4.19.8. If Fedora were to push the fix in 4.20.4, however, it would be backported to the last 4.19.x that Fedora shipped. Stable updates don't often have major regressions, and in the cases where they do, it is typically because of rushed CVE fixes anyway, so you would still get those.
Essentially, what I am proposing is that we continue to support a stable kernel for up to 4 weeks after Fedora has rebased, if it is being used in FCOS streams. It also covers the cases of the less critical CVEs which get fixed with most updates, adds the collective test efforts of the Fedora community, lowers the burden on QA, and in general everyone wins.
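
The decision rule in this proposal can be sketched in a few lines of Python. This is only an illustration of the comment above, not existing FCOS or Fedora tooling: the names `KernelVersion` and `plan_fix` are hypothetical, and the sketch assumes plain `major.minor.patch` version strings such as `4.19.5`.

```python
from typing import NamedTuple, Tuple


class KernelVersion(NamedTuple):
    """Hypothetical helper: a kernel version such as 4.19.5."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "KernelVersion":
        # Assumes exactly three dot-separated integers, e.g. "4.19.5".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    @property
    def series(self) -> Tuple[int, int]:
        # The stable series, e.g. (4, 19) for 4.19.5.
        return (self.major, self.minor)


def plan_fix(fcos: str, fedora_fix: str, last_shipped_in_series: str) -> str:
    """Return the action for a critical fix under the proposed rule.

    fcos: kernel currently in the FCOS stream.
    fedora_fix: Fedora kernel that first carries the fix.
    last_shipped_in_series: last kernel Fedora shipped in the FCOS series.
    """
    current = KernelVersion.parse(fcos)
    fix = KernelVersion.parse(fedora_fix)
    if fix <= current:
        return "already fixed"
    if fix.series == current.series:
        # Fix arrived in a stable update of the same series: take the update.
        return f"update to {fedora_fix}"
    # Fedora has rebased: backport to the last release of the old series.
    return f"backport fix to {last_shipped_in_series}"


print(plan_fix("4.19.5", "4.19.8", "4.19.12"))
print(plan_fix("4.19.5", "4.20.4", "4.19.12"))
```

The two calls mirror the 4.19.5 examples in the comment; the value `4.19.12` for the last shipped 4.19.x kernel is made up for the example.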

@bgilbert
Contributor Author

@jmflinuxtx Our experience in Container Linux is that stable updates often do have significant regressions. We used to push stable updates directly to the CL stable channel, but switched to a more conservative policy (similar to the one proposed here for FCOS) as a result of repeated breakage. (Those regressions were in LTS kernels, though, which might well be more regression-prone than current ones.)

Based on our experience with CL, we think that each user-impacting regression encourages users to stop trusting us and disable automatic updates. To reduce that risk, we'll encourage users to run a few percent of their FCOS nodes on the testing stream to help us catch regressions (similar to the CL alpha and beta channels), so we want to ensure risky changes have a chance to bake in testing before reaching stable. FCOS will have CI gating but will not have manual QA, so hopefully this model shouldn't cause additional maintenance burden outside the FCOS community.

The support gap after a kernel rebase is an issue I had missed. We'd greatly appreciate the extra 4 weeks of support for the previous kernel if the kernel team is willing to provide it, though I hesitate to further increase your workload.

@labbott

labbott commented Dec 14, 2018

You've probably heard this before but just for the record: Trying to selectively pick up commits is not recommended by upstream. It's difficult to know if any commit will become a security issue or not so best practice is to pick them all up. It also may be difficult to do backporting of individual fixes if there's a diff between versions.

That said, that message is usually aimed at people who never want to pick up stable updates. Choosing to only pick up known critical fixes vs. entire stable updates for a fixed time period could be an option as long as you realize the tradeoffs. If you're already holding back on stable updates (i.e. FCOS is a few stable versions behind Fedora) I don't think you've lost too much by choosing to give just the critical fixes. I'd only suggest doing this for the most critical of issues though, and would advocate for "just take the full stable update" as the default option for most issues.

I like the idea Justin proposed as well and I think it would be a good supplement. Long term the goal is to increase the confidence in stable updates to make everyone's life easier.

@bgilbert
Contributor Author

Okay, so it sounds like there's rough consensus around:

  • A 2-4 week delay from a stable kernel update landing in updates to it landing in FCOS stable
  • An occasional backport of a critical fix to the existing kernels in FCOS testing and stable
  • The kernel team handling backports of critical CVE fixes in the case that Fedora has rebased but FCOS stable hasn't picked up the rebase yet (@jmflinuxtx's proposal)

Does that all sound good?
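
As a rough illustration of the first point, the delay could be expressed as a simple date check. This is a sketch under assumptions rather than real FCOS tooling: the function name `eligible_for_stable` is hypothetical, and the two-week minimum is taken from the lower end of the 2-4 week range above.

```python
from datetime import date, timedelta

# Assumption: use the lower bound of the 2-4 week range as the minimum bake time.
MIN_BAKE = timedelta(weeks=2)


def eligible_for_stable(landed_in_updates: date, today: date,
                        min_bake: timedelta = MIN_BAKE) -> bool:
    """True once a kernel has spent at least min_bake in Fedora updates."""
    return today - landed_in_updates >= min_bake


landed = date(2018, 12, 1)
print(eligible_for_stable(landed, date(2018, 12, 10)))  # still baking
print(eligible_for_stable(landed, date(2018, 12, 15)))  # two weeks have passed
```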


Other pieces we haven't explicitly discussed:

  • Secure Boot signing for builds of backported fixes (presumably via the koji ACLs)
  • The FCOS next stream (which previews the next Fedora release between Bodhi enablement and release, and mirrors testing otherwise) carrying upcoming kernels starting from rc6
  • Whether preview kernels in next should be rebuilt with debug options disabled

Any thoughts on those?

@jeremycline

  • The FCOS next stream (which previews the next Fedora release between Bodhi enablement and release, and mirrors testing otherwise) carrying upcoming kernels starting from rc6
  • Whether preview kernels in next should be rebuilt with debug options disabled

RC builds have debugging disabled, so as long as you only use those (and not the daily git snapshots between them) you'd get kernels without debug options on. There is a bit of a snag here, though: once a release happens, Rawhide moves on, so the day after 4.20 comes out, Rawhide will be 4.21-rc0. At the moment, stable updates before the rebase happens (4.20.1, maybe 4.20.2) aren't built in Koji; they're built in a Copr repository and aren't Secure Boot signed.

I think Koji will happily produce real builds from any dist-git commit, though, so we could possibly just build the stabilization branch in Koji without stepping on the old stable kernel's toes. They obviously wouldn't end up in Bodhi so there'd need to be some other Koji tag they got placed in, I guess.

@dustymabe
Member

dustymabe commented Dec 18, 2018

Thanks @jeremycline - 👍 I think you have helped answer some questions. I'll review where I currently think we are:

  • The FCOS next stream (which previews the next Fedora release between Bodhi enablement and release, and mirrors testing otherwise) carrying upcoming kernels starting from rc6
    • Context provided by @jeremycline. For right now we (FCOS) will use judgement and manually decide when to switch the next stream to the new RC kernels from rawhide.
  • Whether preview kernels in next should be rebuilt with debug options disabled
    • Answered by @jeremycline. RC builds have debugging disabled already.

The remaining question we have is:

  • Secure Boot signing for builds of backported fixes (presumably via the koji ACLs)

From the documentation, it looks like only certain people have the ACLs to get a kernel build signed appropriately. Since we are volunteering to do or aid in backports, could we get someone like @bgilbert initiated (is there a process here?) and blessed with appropriate ACLs?

@jmflinuxtx

There is not a specific process in place, it is guarded enough that it is extremely rare to add anyone to the ACLs there. We would need to get a couple more people involved in that discussion, but it is a discussion we can have.

@dustymabe
Member

There is not a specific process in place, it is guarded enough that it is extremely rare to add anyone to the ACLs there. We would need to get a couple more people involved in that discussion, but it is a discussion we can have.

Thanks @jmflinuxtx. Could you make introductions for us, or tell us who to talk to in order to start the discussion?

@dustymabe
Member

Sent a follow-up email; will update next week.

@dustymabe
Member

Followed up with @jmflinuxtx, @vathpela, and @nirik. We got @bgilbert ACLs for building/signing kernel packages. We'll test it once @jmflinuxtx has a candidate build to try it on.

Thanks all.

@Conan-Kudo

Just to chime in as another voice (perhaps an irrelevant one): Mageia (one of the distributions I actively work in) has explicitly elected to switch away from LTS kernels because they tend to have more breakage than normal stable kernels.

The unfortunate reality of LTS kernels is that certain people only come out of the woodwork to submit changes when an upcoming LTS release is announced, and those particular kernels have been worse than regular releases in recent years.

I am not at all surprised by @bgilbert's experience, as it mirrors what has been the case for Mageia during the Mageia 5 and Mageia 6 release cycles, which is why Mageia 7 is switching to the Fedora policy of just shipping the latest stable releases and tracking those.

I suspect that the Container Linux approach to handling kernel releases will matter a whole lot less with Fedora CoreOS simply because Container Linux followed a Mageia-like policy rather than a Fedora-like one. In my experience with my Fedora systems, it's been pretty rare to see such breakages with regular stable releases.

Moreover, by updating frequently to new stable kernels as they arrive, the behavior changes and such are going to be more incremental and easier to adapt to anyway, which should alleviate a large number of issues.
