-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
panic in bpfman-agent due to err not being returned #330
Conversation
ab2040b
to
25444f6
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #330 +/- ##
==========================================
+ Coverage 27.60% 27.63% +0.03%
==========================================
Files 87 87
Lines 7499 7498 -1
==========================================
+ Hits 2070 2072 +2
+ Misses 5228 5226 -2
+ Partials 201 200 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
25444f6
to
0bf84bb
Compare
@Billy99 You need to rebase this to pick up the changes in the Tcx pr so that integration test will pass. |
It looks like you just did that. |
Originally reported in [bpfman:#1311](bpfman/bpfman#1311) and tracked in bpfman#328, `GetID()` and `reconcileBpfProgram()` both have instances where they detect errors but return nil for err, causing the calling functions to continue and eventually panic. Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if those cases the where err was ignored for the reconciler to retry again ?
Good point @msherif1234 , I'm glad you brought that up. I meant to mention something similar when I pushed the PR but it failed some tests early and I got sidetracked. If that is the case, I think it should be up to the calling function to decide on the retry because they have more context and can just ignore the error. Let me know if you have any suggestions, you and @anfredette know the agent code a little better than me. |
I'm looking at it now. I had a similar thought/concern. We need to make sure that we retry when appropriate. It's also the case that a delete can fail if the program has already been deleted or was never installed, so we need to handle that too. |
@@ -217,7 +217,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context, | |||
id, err := bpfmanagentinternal.GetID(bpfProgram) | |||
if err != nil { | |||
r.Logger.Error(err, "Failed to get bpf program ID") | |||
return bpfmaniov1alpha1.BpfProgCondNotLoaded, nil | |||
return bpfmaniov1alpha1.BpfProgCondNotLoaded, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the ID didn't get set by now, how is it going to get set? I don't think it is. Did the users scenario hit this case? However, we think the program is loaded becasue it's in the loadedBpfPrograms list, and the element in that list is the gobpfman.ListResponse_ListResult returned by bpfman, and that has all the info on the program, including the ID. So, maybe we should try to pull it out of there.
@@ -232,7 +232,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context, | |||
r.Logger.V(1).Info("bpf program is in wrong state, unloading and reloading", "reason", reasons, "bpfProgram Name", bpfProgram.Name, "bpf program ID", id) | |||
if err := bpfmanagentinternal.UnloadBpfmanProgram(ctx, r.BpfmanClient, *id); err != nil { | |||
r.Logger.Error(err, "Failed to unload BPF Program") | |||
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, nil | |||
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems reasonable to return an error here. However, as the code stands, I think we will only try this one more time (unless a change to another program kicks off the reconciler again.) If it fails the second time, we'll stop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've talked about this being an issue, so I wanted to flag it for discussion. Maybe we should keep trying, or at least try a few times. We could count how many times it has failed in the CRD. Maybe do some type of retry with a back off so we keep trying every once in a while. If we could get a failure reason from bpfman, maybe we can determine whether it's a permanent or possibly transient failure to guide whether/when to do a retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I think these changes are all good. I also think we should evaluate whether we should have retries when the bpfman calls fail, and how to do it.
@@ -251,7 +251,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context, | |||
r.Logger.Info("Calling bpfman to unload program on node", "bpfProgram Name", bpfProgram.Name, "Program ID", id) | |||
if err := bpfmanagentinternal.UnloadBpfmanProgram(ctx, r.BpfmanClient, *id); err != nil { | |||
r.Logger.Error(err, "Failed to unload Program") | |||
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, nil | |||
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for the UnloadBpfmanProgram() case above.
@@ -267,7 +267,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context, | |||
r.progId, err = bpfmanagentinternal.LoadBpfmanProgram(ctx, r.BpfmanClient, loadRequest) | |||
if err != nil { | |||
r.Logger.Error(err, "Failed to load Program") | |||
return bpfmaniov1alpha1.BpfProgCondNotLoaded, nil | |||
return bpfmaniov1alpha1.BpfProgCondNotLoaded, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for the UnloadBpfmanProgram() case above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the logs, I think this is the only case that was hit by the user. It failed, r.progId is set to nil, no error returned to calling function and r.progId being nil caused a panic later. I returned err on all other cases just to be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm marking this as "Request changes" so it doesn't get merged quite yet, but I didn't actually request any changes -- just added a few comments/questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/LGTM
Originally reported in bpfman:#1311 and tracked in #328,
GetID()
andreconcileBpfProgram()
both have instances where they detect errors but return nil for err, causing the calling functions to continue and eventually panic.Resolves: #328