Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic in bpfman-agent due to err not being returned #330

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

Billy99
Copy link
Contributor

@Billy99 Billy99 commented Oct 23, 2024

Originally reported in bpfman:#1311 and tracked in #328, GetID() and reconcileBpfProgram() both have instances where they detect errors but return nil for err, causing the calling functions to continue and eventually panic.

Resolves: #328

Copy link

codecov bot commented Oct 23, 2024

Codecov Report

Attention: Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.

Project coverage is 27.63%. Comparing base (aa3840d) to head (0bf84bb).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
controllers/bpfman-agent/common.go 20.00% 4 Missing ⚠️
controllers/bpfman-agent/internal/bpfman-core.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #330      +/-   ##
==========================================
+ Coverage   27.60%   27.63%   +0.03%     
==========================================
  Files          87       87              
  Lines        7499     7498       -1     
==========================================
+ Hits         2070     2072       +2     
+ Misses       5228     5226       -2     
+ Partials      201      200       -1     
Flag Coverage Δ
unittests 27.63% <28.57%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@anfredette
Copy link
Contributor

@Billy99 You need to rebase this to pick up the changes in the Tcx pr so that integration test will pass.

@anfredette
Copy link
Contributor

@Billy99 You need to rebase this to pick up the changes in the Tcx pr so that integration test will pass.

It looks like you just did that.

Originally reported in
[bpfman:#1311](bpfman/bpfman#1311) and tracked
in bpfman#328, `GetID()` and `reconcileBpfProgram()` both have instances where
they detect errors but return nil for err, causing the calling functions
to continue and eventually panic.

Signed-off-by: Billy McFall <22157057+Billy99@users.noreply.github.com>
Copy link
Contributor

@msherif1234 msherif1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if those cases the where err was ignored for the reconciler to retry again ?

@Billy99
Copy link
Contributor Author

Billy99 commented Oct 23, 2024

I wonder if those cases the where err was ignored for the reconciler to retry again ?

Good point @msherif1234 , I'm glad you brought that up. I meant to mention something similar when I pushed the PR but it failed some tests early and I got sidetracked. If that is the case, I think it should be up to the calling function to decide on the retry because they have more context and can just ignore the error. Let me know if you have any suggestions, you and @anfredette know the agent code a little better than me.

@anfredette
Copy link
Contributor

I'm looking at it now. I had a similar thought/concern. We need to make sure that we retry when appropriate. It's also the case that a delete can fail if the program has already been deleted or was never installed, so we need to handle that too.

@@ -217,7 +217,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context,
id, err := bpfmanagentinternal.GetID(bpfProgram)
if err != nil {
r.Logger.Error(err, "Failed to get bpf program ID")
return bpfmaniov1alpha1.BpfProgCondNotLoaded, nil
return bpfmaniov1alpha1.BpfProgCondNotLoaded, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ID didn't get set by now, how is it going to get set? I don't think it is. Did the users scenario hit this case? However, we think the program is loaded becasue it's in the loadedBpfPrograms list, and the element in that list is the gobpfman.ListResponse_ListResult returned by bpfman, and that has all the info on the program, including the ID. So, maybe we should try to pull it out of there.

@@ -232,7 +232,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context,
r.Logger.V(1).Info("bpf program is in wrong state, unloading and reloading", "reason", reasons, "bpfProgram Name", bpfProgram.Name, "bpf program ID", id)
if err := bpfmanagentinternal.UnloadBpfmanProgram(ctx, r.BpfmanClient, *id); err != nil {
r.Logger.Error(err, "Failed to unload BPF Program")
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, nil
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to return an error here. However, as the code stands, I think we will only try this one more time (unless a change to another program kicks off the reconciler again.) If it fails the second time, we'll stop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've talked about this being an issue, so I wanted to flag it for discussion. Maybe we should keep trying, or at least try a few times. We could count how many times it has failed in the CRD. Maybe do some type of retry with a back off so we keep trying every once in a while. If we could get a failure reason from bpfman, maybe we can determine whether it's a permanent or possibly transient failure to guide whether/when to do a retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think these changes are all good. I also think we should evaluate whether we should have retries when the bpfman calls fail, and how to do it.

@@ -251,7 +251,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context,
r.Logger.Info("Calling bpfman to unload program on node", "bpfProgram Name", bpfProgram.Name, "Program ID", id)
if err := bpfmanagentinternal.UnloadBpfmanProgram(ctx, r.BpfmanClient, *id); err != nil {
r.Logger.Error(err, "Failed to unload Program")
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, nil
return bpfmaniov1alpha1.BpfProgCondNotUnloaded, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for the UnloadBpfmanProgram() case above.

@@ -267,7 +267,7 @@ func (r *ReconcilerCommon) reconcileBpfProgram(ctx context.Context,
r.progId, err = bpfmanagentinternal.LoadBpfmanProgram(ctx, r.BpfmanClient, loadRequest)
if err != nil {
r.Logger.Error(err, "Failed to load Program")
return bpfmaniov1alpha1.BpfProgCondNotLoaded, nil
return bpfmaniov1alpha1.BpfProgCondNotLoaded, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for the UnloadBpfmanProgram() case above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the logs, I think this is the only case that was hit by the user. It failed, r.progId is set to nil, no error returned to calling function and r.progId being nil caused a panic later. I returned err on all other cases just to be safe.

Copy link
Contributor

@anfredette anfredette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm marking this as "Request changes" so it doesn't get merged quite yet, but I didn't actually request any changes -- just added a few comments/questions.

Copy link
Contributor

@anfredette anfredette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@mergify mergify bot merged commit 3babb63 into bpfman:main Oct 23, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

panic in bpfman-agent when trying to load XDP program
3 participants