
Remove usage of eventing duck v1alpha1 shapes #426

Closed
vaikas opened this issue Mar 3, 2021 · 27 comments · Fixed by #550
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vaikas
Contributor

vaikas commented Mar 3, 2021

Describe the bug
The v1alpha1 channelable shape has been deprecated forever and we need to remove it.
knative/eventing#5005

Looks like kafka still uses it and it's still storing channels as v1alpha1:
https://github.com/knative-sandbox/eventing-kafka/blob/master/config/channel/resources/kafkachannel-crd.yaml#L30

So, these will stop being reconciled once 5005 goes in. We should fix this asap so we can unblock the upstream work.
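For context, an illustrative sketch (not a verbatim excerpt) of what the linked CRD stanza expresses: v1alpha1 is still marked as the storage version, so KafkaChannel objects keep being persisted with the deprecated duck shape.

# Illustrative sketch of the relevant part of kafkachannel-crd.yaml:
versions:
  - name: v1alpha1
    served: true
    storage: true   # KafkaChannel objects are still persisted as v1alpha1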

@matzew

A clear and concise description of what the bug is.

Expected behavior
Using supported ducks for Channelable.

To Reproduce
Steps to reproduce the behavior.

Knative release version
head

Additional context
Add any other context about the problem here such as proposed priority

vaikas added the kind/bug label on Mar 3, 2021
@vaikas
Contributor Author

vaikas commented Mar 3, 2021

We will also need something like this to migrate storage versions from v1alpha1 to v1beta1:
knative/eventing#3262

That one was a "pre-install" job, which we do not use anymore, but we'd need something like that as a post-install.
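
For reference, a minimal sketch of what such a post-install job could look like, assuming a plain kubectl-based re-save of every KafkaChannel (the job name, namespace, service account, and image below are placeholders, not the actual eventing manifests); re-writing each object unchanged makes the API server persist it under the CRD's current storage version:

apiVersion: batch/v1
kind: Job
metadata:
  name: kafkachannel-storage-version-migration   # placeholder name
  namespace: knative-eventing
spec:
  template:
    spec:
      # Placeholder service account; it needs get/list/update on kafkachannels.
      serviceAccountName: storage-version-migration
      restartPolicy: OnFailure
      containers:
        - name: migrate
          image: bitnami/kubectl:latest            # placeholder image with kubectl available
          command:
            - /bin/sh
            - -c
            # An unchanged replace is enough: the API server re-persists each
            # object under the storage version currently set on the CRD.
            - kubectl get kafkachannels.messaging.knative.dev --all-namespaces -o json | kubectl replace -f -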

@matzew
Contributor

matzew commented Mar 8, 2021

@vaikas 🙈 that pre-install job triggers bad memories

@vaikas
Contributor Author

vaikas commented Mar 10, 2021

So here are a couple of PRs. Note that you must start using the v1beta1 channelable, as in this PR, so that any channels created during the migration get the proper duck type and get reconciled properly.

knative/eventing#3169

Here's the post-install script:
https://github.com/knative/eventing/pull/3168/files

For Brokers, we moved from v1alpha1 -> v1beta1 by installing this CRD first:
knative/eventing#3282

So you could do something like that for Kafka: first apply a CRD that flips the bits (the storage version) while the conversion between v1alpha1 and v1beta1 still works.

Then run the pre-install storage migrator, then upgrade.
knative/eventing#3110
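
To make the "flip the bits" step concrete, here is a rough, illustrative sketch of what the KafkaChannel CRD's versions stanza could look like after that first apply; the webhook service name, namespace, and permissive schemas are placeholders, not the actual eventing-kafka CRD:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: kafkachannels.messaging.knative.dev
spec:
  group: messaging.knative.dev
  names:
    kind: KafkaChannel
    plural: kafkachannels
    singular: kafkachannel
  scope: Namespaced
  conversion:
    # Conversion stays in place so v1alpha1 <-> v1beta1 round-trips keep working
    # while clients and stored objects are migrated.
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          name: kafka-webhook          # placeholder webhook service
          namespace: knative-eventing
  versions:
    - name: v1alpha1
      served: true                     # still served, so existing clients keep working
      storage: false                   # no longer the storage version
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
    - name: v1beta1
      served: true
      storage: true                    # new writes are persisted as v1beta1
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true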

@slinkydeveloper
Contributor

Thanks for the pointers. I'm going to tackle it now.

Since this issue is a duplicate of #71, I'm going to close it.

@slinkydeveloper
Contributor

Well, maybe the issue is different?

@slinkydeveloper
Contributor

slinkydeveloper commented Mar 10, 2021

@vaikas I'm trying to find out how we can do this migration without pre-install jobs and smoothly across releases. So what I've done is:

My question is: can we keep the code and the v1alpha1 version in the CRD (exactly as in #446) during the lifecycle of 0.22, and then do the final code and CRD YAML cleanup (issue #71) in 0.23? All that code should just remain unused during the 0.22 lifecycle, right? This way we can ensure the transition is smooth across releases, without pre-install hacks.

@vaikas
Contributor Author

vaikas commented Mar 10, 2021

I do not think there's a way to do this smoothly without a pre-install job or some special handling for the Kafka channel in the subscription controller. The problem is that the Subscription reconciler will treat the Channel differently (because the shapes are different) based on the label on the Channel. Once #5005 goes into eventing, it will not have any understanding of the old channel types, so it will not see the subscribers and it will produce wrong patches that will fail.
Since the Channel fetched is of type v1alpha1, we do not even get a chance to modify it via the conversion webhook as things stand (I need to think about this a little more). One avenue here might be to change the storage version to v1beta1; then, if the Channel is fetched via v1alpha1, the conversion webhook would run and you might be able to make it look like v1beta1, but then you'd have to change the shape of the channel in the v1alpha1 API to have the new fields. This is what the ChannelableCombined duck type did in the eventing ducks to work around this. This personally sounds pretty sketchy, but I just wanted to jot it down.
Another avenue is to hard-code and special-case the Kafka channels in the eventing subscription reconciler so that it has knowledge of only the Kafka channels and handles them as v1alpha1 channels for one more release. This is yucky, but it may be the least yucky option :)

Does any of that make sense @slinkydeveloper

@slinkydeveloper
Contributor

slinkydeveloper commented Mar 10, 2021

One avenue here might be to change the storage version to v1beta1; then, if the Channel is fetched via v1alpha1, the conversion webhook would run and you might be able to make it look like v1beta1, but then you'd have to change the shape of the channel in the v1alpha1 API to have the new fields. This is what the ChannelableCombined duck type did in the eventing ducks to work around this.

I think I lost you in this, but it seems similar to my idea:

This way we need no hacks, no pre-install jobs, and code should continue to work as usual

Also ahmed is better than me at explaining things :) #448 (comment)

but then you'd have to change the shape of the channel in the v1alpha1 API to have the new fields

WDYM by this one in particular? v1alpha1 and v1beta1 kafka channels look identical, except for their ducks

@vaikas
Contributor Author

vaikas commented Mar 10, 2021

I think that will work, but you will have some downtime: after you upgrade eventing and before you complete the storage-version migration, the channels will not get reconciled. If this is acceptable, then that should work indeed. One problem is that if the migration job fails for some reason, you might be in a bad state. It's mosdef worth experimenting to see if this is acceptable. The one thing I am curious about is whether there are cases where we will lose information, as in: will the subscription controller look at a channel that does not appear to have subscriptions, update it, and will badness happen? Because it's a patch I think not, but it would require some investigation for sure.

v1alpha1 and v1beta1 are different because, at the duck level, v1alpha1 has an extra level of indirection, so the subscription controller will not find the fields it's expecting. So if it looks for, say, spec.subscribers on a Channel (as in v1beta1) while v1alpha1 has it at spec.subscriptionspec.subscribers (I don't remember the exact field names, but for illustration purposes), then the subscription controller will look at a channel, see that there are no subscribers, and try to produce a patch against the incorrect duck, which will fail.
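
As a rough illustration of that extra level of indirection (field names and values here are approximate placeholders, not taken verbatim from the API types), the two duck shapes look something like this:

# v1alpha1 duck shape: the subscriber list sits behind an intermediate object.
apiVersion: messaging.knative.dev/v1alpha1
kind: KafkaChannel
metadata:
  name: testchannel
spec:
  subscribable:
    subscribers:
      - uid: example-subscription-uid
        subscriberUri: http://subscriber.example.svc.cluster.local
---
# v1beta1 duck shape: the subscriber list lives directly under spec, which is
# where the post-#5005 subscription reconciler expects to find and patch it.
apiVersion: messaging.knative.dev/v1beta1
kind: KafkaChannel
metadata:
  name: testchannel
spec:
  subscribers:
    - uid: example-subscription-uid
      subscriberUri: http://subscriber.example.svc.cluster.local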

@slinkydeveloper
Contributor

slinkydeveloper commented Mar 10, 2021

If this is acceptable, then that should work indeed

I guess the tradeoff here is either this or the pre-install... which tradeoff do we want?

One problem is that if the migration job fails for some reason, you might be in a bad state
The one thing I am curious about is whether there are cases where we will lose information, as in: will the subscription controller look at a channel that does not appear to have subscriptions, update it, and will badness happen? Because it's a patch I think not, but it would require some investigation for sure.
then the subscription controller will look at a channel, see that there are no subscribers, and try to produce a patch against the incorrect duck, which will fail.

OK, I see your point. So what you're saying is that in theory it might work, but in practice some things might corrupt the CR, right? I guess I need to try the process on my machine and see what happens. If it doesn't fail, I assume it just works and we can proceed... @matzew @devguyio @aliok how does that sound?

@vaikas
Contributor Author

vaikas commented Mar 10, 2021

@slinkydeveloper yes, I'm worried that when the Channel gets written back to storage, it's lossy and you'll lose all the subscription information from the channel. Now, this should get reconciled later; the questions are:

  1. Is this true for realsies
  2. How long is the lag, and is that acceptable?
  3. Will this cause actual data loss? In that case it's not just a matter of a resource being stale for the time being; if data is dropped on the floor, that is bad.

@aliok
Member

aliok commented Mar 11, 2021

@vaikas thanks a lot for the detailed comments, really appreciated.

I have some bad memories with pre-install, so I will be a little biased :)

Another avenue is to hard-code and special-case the Kafka channels in the eventing subscription reconciler so that it has knowledge of only the Kafka channels and handles them as v1alpha1 channels for one more release. This is yucky, but it may be the least yucky option :)

IMO this workaround is the safest and easiest path, although it is yucky :)

Other options:

  • Pre-install caused us some problems in the past. This would be my least preferred option.
  • Doing something similar to the ChannelableCombined duck type in KafkaChannel is also a workaround, and it's more complex to implement than the other options above.

Also, let me ping @travis-minke-sap although we might not need to touch the pkg/ of Kafka Channel implementations.

@slinkydeveloper
Contributor

I think I've found the way to go, but it includes a hack in the subscription controller. I have a gist to reproduce the update process: https://gist.github.com/slinkydeveloper/59ab25d63fe8b14c298bbc1d9839fb62

@aliok
Member

aliok commented Mar 17, 2021

@slinkydeveloper Thanks for the detailed gist. I will give it a try.

@slinkydeveloper
Contributor

slinkydeveloper commented Mar 17, 2021

@vaikas @aliok In a "normal" situation, how is the subscription ref supposed to be updated?

Aka, who is responsible for updating this:

Spec:
  Channel:
    API Version:  messaging.knative.dev/v1alpha1
    Kind:         KafkaChannel
    Name:         testchannel
  Subscriber:
    Ref:
      API Version:  v1
      Kind:         Service
      Name:         sacura
      Namespace:    sacura

To this:

Spec:
  Channel:
    API Version:  messaging.knative.dev/v1beta1
    Kind:         KafkaChannel
    Name:         testchannel
  Subscriber:
    Ref:
      API Version:  v1
      Kind:         Service
      Name:         sacura
      Namespace:    sacura

The hack I've implemented with knative/eventing#5085 makes the reconciler work, but it doesn't update the subscription's channel API version... Who's supposed to do that? The user? Some update script? One of our reconcilers? 😄

@aliok
Member

aliok commented Mar 17, 2021

I don't know where this is updated. I've never seen any release notes about any user action nor any upgrade script around that. Maybe the subscription reconciler should do it?

@aliok
Member

aliok commented Mar 17, 2021

I was just checking the gist.

Is this a good summary?

  • We keep KafkaChannel v1alpha1 having duck v1alpha1 --> we cannot change this.
  • We migrate KafkaChannel v1alpha1 resources to v1beta1, but we still keep served=true on v1alpha1. That means the reconciler will still see the CR as v1alpha1 until the post-install is finished.
  • In the subscription reconciler, we fetch v1alpha1 KafkaChannels as v1beta1 to solve the problem above, which would only happen for a limited time frame.

@slinkydeveloper
Contributor

Yep correct! And then after the 0.22 release, we clean up hacks and v1alpha1 code.

Although I think, as a last bit, we still need to solve the update of the refs. Somehow it doesn't sound right that the user has to do that by hand.

@aliok
Member

aliok commented Mar 17, 2021

The procedure in the gist works. I also removed / added the subscription after the procedure to see what happens.

Another test case I did was doing the post-install first, and then ko apply'ing all the others. That also works fine. Red Hat QE might get more creative testing this, but my initial tests succeeded.

sacura logs - first part:

sacura-7cm55 sacura 2021/03/17 14:42:19 metrics {
sacura-7cm55 sacura   "latencies": {
sacura-7cm55 sacura    "total": 591097242063,
sacura-7cm55 sacura    "mean": 4925810,
sacura-7cm55 sacura    "50th": 1819480,
sacura-7cm55 sacura    "90th": 4243028,
sacura-7cm55 sacura    "95th": 8642483,
sacura-7cm55 sacura    "99th": 42304481,
sacura-7cm55 sacura    "max": 2588310740,
sacura-7cm55 sacura    "min": 525891
sacura-7cm55 sacura   },
sacura-7cm55 sacura   "bytes_in": {
sacura-7cm55 sacura    "total": 0,
sacura-7cm55 sacura    "mean": 0
sacura-7cm55 sacura   },
sacura-7cm55 sacura   "bytes_out": {
sacura-7cm55 sacura    "total": 49199180,
sacura-7cm55 sacura    "mean": 409.99316666666664
sacura-7cm55 sacura   },
sacura-7cm55 sacura   "earliest": "2021-03-17T14:32:19.44241941Z",
sacura-7cm55 sacura   "latest": "2021-03-17T14:42:19.437456075Z",
sacura-7cm55 sacura   "end": "2021-03-17T14:42:19.439676092Z",
sacura-7cm55 sacura   "duration": 599995036665,
sacura-7cm55 sacura   "wait": 2220017,
sacura-7cm55 sacura   "requests": 120000,
sacura-7cm55 sacura   "rate": 200.00165445868603,
sacura-7cm55 sacura   "throughput": 199.94924754061643,
sacura-7cm55 sacura   "success": 0.9997416666666666,
sacura-7cm55 sacura   "status_codes": {
sacura-7cm55 sacura    "0": 2,
sacura-7cm55 sacura    "202": 119969,
sacura-7cm55 sacura    "404": 29
sacura-7cm55 sacura   },
sacura-7cm55 sacura   "errors": [
sacura-7cm55 sacura    "404 Not Found",
sacura-7cm55 sacura    "Post \"http://testchannel-kn-channel.sacura.svc.cluster.local/\": dial tcp 0.0.0.0:0-\u003e10.107.8.126:80: connect: connection refused",
sacura-7cm55 sacura    "Post \"http://testchannel-kn-channel.sacura.svc.cluster.local/\": read tcp 10.244.0.22:45969-\u003e10.107.8.126:80: read: connection reset by peer"
sacura-7cm55 sacura   ]
sacura-7cm55 sacura  }
sacura-7cm55 sacura 2021/03/17 14:42:19 context canceled

What is the frequency of Sacura sending messages?

@slinkydeveloper
Contributor

What is the frequency of Sacura sending messages?

It's in the config; it seems to me it's running at 200 rps. Worst case, my assumption is that the system is not available for ~3 seconds (looking at the sacura log + kubectl describe).

@aliok
Member

aliok commented Mar 17, 2021

I can see this above: "404": 29. So maybe it is actually shorter.

@aliok
Member

aliok commented Mar 17, 2021

I can see this above: "404": 29. So maybe it is actually shorter.

Ah, this is the 2nd case I mentioned above (doing the post-install first). In that case, there are possibly fewer problems.

@slinkydeveloper
Contributor

Ok so I've found how to upgrade the subscription ref:

So now the sub channel ref is automatically bumped:

Spec:
  Channel:
    API Version:  messaging.knative.dev/v1beta1
    Kind:         KafkaChannel
    Name:         testchannel

@slinkydeveloper
Contributor

The whole migration path goes as follows:

  • User sets up Knative Eventing 0.22
  • User sets up eventing-kafka 0.22
    • post-install script taps every KafkaChannel, causing it to be stored as v1beta1, because the CRD definition no longer allows storing v1alpha1
    • post-install script taps every Subscription, causing the update of its Spec.Channel.APIVersion field to v1beta1 through the defaulting chain (see the sketch below)
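
Under the same assumptions as the channel migration job sketched earlier (placeholder names and image; the actual script is the one in the linked gist), the Subscription "tap" could look roughly like this: an unchanged re-write sends each Subscription back through the webhook, and the defaulting chain bumps spec.channel.apiVersion to v1beta1.

apiVersion: batch/v1
kind: Job
metadata:
  name: subscription-ref-migration          # placeholder name
  namespace: knative-eventing
spec:
  template:
    spec:
      # Placeholder service account; it needs get/list/update on subscriptions.
      serviceAccountName: storage-version-migration
      restartPolicy: OnFailure
      containers:
        - name: tap-subscriptions
          image: bitnami/kubectl:latest      # placeholder image with kubectl available
          command:
            - /bin/sh
            - -c
            # An unchanged replace is enough: the defaulting webhook runs on the
            # write and updates spec.channel.apiVersion to messaging.knative.dev/v1beta1.
            - kubectl get subscriptions.messaging.knative.dev --all-namespaces -o json | kubectl replace -f -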

@matzew
Contributor

matzew commented Mar 18, 2021

I see the updated PR for eventing - will follow that

@slinkydeveloper
Contributor

@vaikas can you please check the solution proposed here?

@vaikas
Contributor Author

vaikas commented Mar 18, 2021

@slinkydeveloper looks pretty hacky but if it works, I'm fine with it.

@slinkydeveloper slinkydeveloper linked a pull request Apr 19, 2021 that will close this issue
matzew added a commit to matzew/eventing-kafka that referenced this issue Nov 12, 2021
Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>