-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add both hardware and nic mac allocation retry #239
Conversation
pkg/sriov/sriov.go
Outdated
if err = s.nLink.LinkSetHardwareAddr(linkObj, hwaddr); err != nil { | ||
return fmt.Errorf("failed to restore original effective netlink MAC address %s: %v", hwaddr, err) | ||
} | ||
err = utils.Retry(5, 200*time.Millisecond, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the CNI at least output a warning message if this fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we return error sorry I miss that :)
64567bd
to
06aef2e
Compare
Pull Request Test Coverage Report for Build 4607738372
💛 - Coveralls |
/lgtm |
pkg/sriov/sriov.go
Outdated
// reset effective MAC address | ||
if conf.MAC != "" { | ||
hwaddr, err := net.ParseMAC(conf.OrigVfState.EffectiveMAC) | ||
if err != nil { | ||
return fmt.Errorf("failed to parse original effective MAC address %s: %v", conf.OrigVfState.EffectiveMAC, err) | ||
} | ||
|
||
if err = s.nLink.LinkSetHardwareAddr(linkObj, hwaddr); err != nil { | ||
err = utils.Retry(5, 200*time.Millisecond, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we try to do things in reverse(ish) of setupVF ?
that is:
- bring interface down
- restore mac
- set netdev orig name
- move ifc back to host ns
WDYT ?
this way you may not need the extra call to LinkByName
In addition, can you add same comment for the Retry call as in SetupVF ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I change the solution a bit when you have time please take a look.
in general I prefer to leave the rename before because if for some reason, in any case, we failed to reset the mac address the container runtime will remove the network namespace and the VF will go back to the node with the pod nic name.
how it's today it will always have the right vf name even if for some driver issues we are not able to recover the right mac address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general I prefer to leave the rename before because if for some reason, in any case, we failed to reset the mac address the container runtime will remove the network namespace and the VF will go back to the node with the pod nic name.
how it's today it will always have the right vf name even if for some driver issues we are not able to recover the right mac address.
ok
a4f2384
to
3cd9071
Compare
pkg/sriov/sriov.go
Outdated
so to cover both cases we will retry the set function | ||
and also check the mac address on the VF after the configuration. | ||
*/ | ||
err = utils.Retry(10, 100*time.Millisecond, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have two places where we set the MAC netdev mac address (here and in L104)
shall we use the same logic in both ? it seems the "fail silently" part may happen on set as well WDYT ?
that is just have a : setVFEffectiveMAC(netdevName, MAC)
and call it with utils.Retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done can you please have a look?
461ce7b
to
a43748c
Compare
@adrianchiris all green now please take a look when you have some time :) |
/test-all |
pkg/sriov/sriov.go
Outdated
if err = s.nLink.LinkSetHardwareAddr(linkObj, hwaddr); err != nil { | ||
return fmt.Errorf("failed to restore original effective netlink MAC address %s: %v", hwaddr, err) | ||
} | ||
err = utils.SetVFEffectiveMAC(s.nLink, conf.OrigVfState.HostIFName, conf.OrigVfState.EffectiveMAC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this bit not conditioned by conf.MAC
as before ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that conf.OrigVfState.EffectiveMAC
is saved at L88 and is conditioned by conf.MAC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need it as we always have a mac address it can be the original one or the one saved in line 88
from main.go
err = sm.FillOriginalVfInfo(netConf)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FillOriginalVfInfo does not save effective MAC, see [1][2]
also in case conf.MAC
was not specified i dont think we need to restore it as we did not set it to begin with.
my POV here is why add un-needed steps that can theoretically fail.
[1]
Line 304 in 77c66ac
func (s *sriovManager) FillOriginalVfInfo(conf *sriovtypes.NetConf) error { |
[2]
Line 22 in 77c66ac
func (vs *VfState) FillFromVfInfo(info *netlink.VfInfo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right nice catch!
ff3bd14
to
ed3500f
Compare
Hi @adrianchiris let me know if the changes are good now. thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of comments, otherwise we are good 2 go.
pkg/sriov/sriov.go
Outdated
if err = s.nLink.LinkSetHardwareAddr(linkObj, hwaddr); err != nil { | ||
return fmt.Errorf("failed to restore original effective netlink MAC address %s: %v", hwaddr, err) | ||
} | ||
err = utils.SetVFEffectiveMAC(s.nLink, conf.OrigVfState.HostIFName, conf.OrigVfState.EffectiveMAC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FillOriginalVfInfo does not save effective MAC, see [1][2]
also in case conf.MAC
was not specified i dont think we need to restore it as we did not set it to begin with.
my POV here is why add un-needed steps that can theoretically fail.
[1]
Line 304 in 77c66ac
func (s *sriovManager) FillOriginalVfInfo(conf *sriovtypes.NetConf) error { |
[2]
Line 22 in 77c66ac
func (vs *VfState) FillFromVfInfo(info *netlink.VfInfo) { |
@adrianchiris done :) |
Hi @SchSeba
The network card is Intel x710. To workaround the above warning message, a manual reset of VF MAC is done on node:
I wonder if this PR should set VF MAC to |
Hi @korroot there is another solution that I think is better. just enable in the sriov-cni the trust:on it will allow the vpp application to control the administrative mac address |
I have |
@@ -1,6 +1,7 @@ | |||
package sriov | |||
|
|||
import ( | |||
"github.com/k8snetworkplumbingwg/sriov-cni/pkg/utils" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: perhaps it goes better with the other github imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/utils/utils.go
Outdated
|
||
linkObj, err := netLinkManager.LinkByName(netDeviceName) | ||
if err != nil { | ||
return fmt.Errorf("failed to get netlink device with name %s: %q", linkObj.Attrs().Name, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would not reference linkObj if err occurred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use another name for the local var, then you can still use linkObj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right I add a orgLinkObj that will never change even if there was an error in the LinkByName functions
@SchSeba, there is a retry in the sriov.go file line 359 that in my view, shouldn't be there. The retry is intended for LinkSetHardwareAddr not for LinkSetVfHardwareAddr. I think that it was mistankely added the other way around (ReleaseVF and ResetVFConfig). Do you agree? Should we remove it in this PR as well? |
Hi @mlguerrero12 thanks for the comment that retry was to fix another issue not related to this one where the mac address change in the vf administrative mac address failed |
Hi @korroot this sounds like a bug in the driver as using trust: on should allow the vf to change the mac address. can you try to check a newer kernel version maybe?? |
It seems a bit odd that a previous PR added the retry in the add command for "LinkSetHardwareAddr" and in the del command for "LinkSetVfHardwareAddr". Both have the same comment which it seems to indicate issues with setting the netdev VF MAC address (LinkSetHardwareAddr function). Now, in this PR, you are adding the retry in the del command for "LinkSetHardwareAddr" which was missing. So, the only one that doesn't have the retry now is in the add command for "LinkSetVfHardwareAddr". What am I missing? c.c. @cgoncalves |
Just to explain myself better. First PR was meant to solve the issue while setting the netdev VF MAC address (LinkSetHardwareAddr) after setting the MAC via PF (LinkSetVfHardwareAddr). Add command (ApplyVFConfig and SetupVF functions)
Del Command (ResetVFConfig and ReleaseVF functions)
This PR adds the missing retry for del command LinkSetHardwareAddr. So now we have. Add command (ApplyVFConfig and SetupVF functions)
Del Command (ResetVFConfig and ReleaseVF functions)
So, we either delete the retry for del command LinkSetVfHardwareAddr or add it for add command LinkSetVfHardwareAddr to have consistency. Please correct me if I'm seeing things wrong. |
Thanks for the amazing review @mlguerrero12 ! I will add it also for the add so we consist with all the others :) |
@SchSeba we had a problem with MAC address setting on this setup:
The errors in
As you can see, the device is set to be trusted but this doesn't prevent the error from happening. |
Hi @ivan4th thanks for the update! you are right about putting the mac address under the PF to be another possible solution is if you know what is going to be the mac address that vpp will allocate you can configure it in the pod definition this way it will match. or you can also do it the other way around to tell vpp to use the mac address allocated by this sriov-cni if it's possible. The last option will be to expose the option to have mac address and adminMacAddress in the network attachment for the sriov-cni this way you will be able to configure the |
@ivan4th I put it on the next community meeting agenda https://docs.google.com/document/d/1sJQMHbxZdeYJPgAWK1aSt6yzZ4K_8es7woVIrwinVwI/edit# if you can join to talk about the use-case it will be good :) |
@SchSeba thanks, I will join the meeting. Our use case is an in-house VPP-based network layer for Kubernetes, in some ways resembling but also rather unlike Calico-VPP, and the assumption was that VPP has full control over the VFs it gets, including setting the MAC address according to one configured in some CRs of our own (these correspond to VPP's idea of network interfaces) |
@mlguerrero12 can you take another look? |
My only remark is about the comment "Some NIC drivers (i.e. i40e/iavf) set VF MAC ......". Please delete it in the retries for LinkSetVfHardwareAddr and add it to SetVFEffectiveMAC function. That comment is for LinkSetHardwareAddr not LinkSetVfHardwareAddr. |
@mlguerrero12 done :) |
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has evolved since its conception.
it does not deal only with MAC reset on cmd DEL.
it affects how we set admin and effective mac on both ADD and DEL flows.
commit msg (and PR topic) should be updated to reflect this.
this will also help get reviewers in the proper mindset when reviewing this.
pkg/utils/utils.go
Outdated
return fmt.Errorf("failed to parse MAC address %s: %v", macAddress, err) | ||
} | ||
|
||
OrgLinkObj, err := netLinkManager.LinkByName(netDeviceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: O -> o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pkg/utils/utils.go
Outdated
@@ -284,6 +284,51 @@ func CleanCachedNetConf(cRefPath string) error { | |||
return nil | |||
} | |||
|
|||
/* | |||
SetVFEffectiveMAC will try to set the mac address on a specific VF interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation looks odd for the first line. should it be just a single space ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you use /*
for a function title it put the first line like that :)
I check it to //
pkg/sriov/sriov.go
Outdated
@@ -231,7 +210,10 @@ func (s *sriovManager) ApplyVFConfig(conf *sriovtypes.NetConf) error { | |||
return fmt.Errorf("failed to parse MAC address %s: %v", conf.MAC, err) | |||
} | |||
|
|||
if err = s.nLink.LinkSetVfHardwareAddr(pfLink, conf.VFID, hwaddr); err != nil { | |||
err = utils.Retry(5, 200*time.Millisecond, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thats a new one since i last reviewed (i think .. :) )
I assume its to align with how we do things in ResetVFConfig.
its unclear why we are retrying here... (and also after removing the comment at L350 also in resetVFConfig)
either:
- define a wrapper func "SetVFHardwareAddress" in utils which will do this and add comment to that method. (like u did with setting effective mac)
- add comments here and below to explain why its being retired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was to be equal with the other functions base on this comment #239 (comment)
- I create the function and also add unit tests for that new function
- I add a small comment
I hope this time the PR is really done :) |
This commit add a retry function for all the mac address allocation functions. this is needed because some of the drivers have async mac allocation or get a device or resouce busy errors until the vf device is fully configured on the system so instead of failing the CNI we just wait 100 miliseconds Signed-off-by: Sebastian Sch <sebassch@gmail.com>
if err != nil { | ||
return fmt.Errorf("failed to get netlink device with name %s: %q", orgLinkObj.Attrs().Name, err) | ||
} | ||
if linkObj.Attrs().Vfs[vfID].Mac.String() != macAddress { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can safely assume that Vfs[vfID] will always exist during the del command. The config is read from the cache and the vf might not be present. This will panic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no way we use the vfID many times before we call this function in the ResetVFConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you don't set vlan or spoofchk, it is possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is right but today we have the assumption that the cache is configured as expected.
but I agree we can have a check in the cmfDel before we call any function that we have all the stuff needed before we call the release one.
we can have a follow-up PR for it if that sounds good to you :)
LGTM |
LGTM, @SchSeba please add an issue to followup on to check vfid exists as stated here: #239 (comment) feel free to merge this one once created. |
created #271 |
This commit add a retry function for all the mac address allocation functions.
this is needed because some of the drivers have async mac allocation or get a device or resouce busy errors until the vf device is fully configured on the system so instead of failing the CNI we just wait 100 miliseconds
Signed-off-by: Sebastian Sch sebassch@gmail.com