[EVPN]Handling error scenarios during route programming and IMR add #2670

Merged on Mar 7, 2023 (1 commit)

Conversation

dgsudharsan
Collaborator

What I did
This PR handles error scenarios in the following cases:

  1. A route is propagated through FRR, but SONiC has not yet processed the VRF-VNI mapping. The route needs to be skipped.
  2. A VNI is configured as L3 in SONiC but misconfigured as L2 in FRR, which leads to a remote IMR add. The IMR is not added, because under the recently updated design the local VLAN-VNI map doesn't exist.
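
The two cases can be read as two guards on the same question: is this VNI currently known as an L3 VNI? Below is a minimal sketch of that decision, not the orchagent code. VrfVniTable, routeAction and imrAction are hypothetical names introduced only for illustration; the real checks live in the orchs touched by this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the VRF-VNI state: the set of VNIs mapped to a VRF.
class VrfVniTable {
public:
    void addL3Vni(uint32_t vni) { l3_vnis_.insert(vni); }
    bool isL3Vni(uint32_t vni) const { return l3_vnis_.count(vni) != 0; }
private:
    std::set<uint32_t> l3_vnis_;
};

enum class Action { Program, Defer };

// Case 1: a route carrying a VNI is programmed only once the
// VRF-VNI mapping for that VNI has been processed.
Action routeAction(const VrfVniTable& table, uint32_t vni) {
    return table.isL3Vni(vni) ? Action::Program : Action::Defer;
}

// Case 2: a remote IMR add is not programmed while the VNI is an
// L3 VNI locally, since no local VLAN-VNI map exists for it.
Action imrAction(const VrfVniTable& table, uint32_t vni) {
    return table.isL3Vni(vni) ? Action::Defer : Action::Program;
}
```

In this sketch the two guards are mirror images of each other: the same VNI that unblocks a route blocks an IMR add.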

Why I did it
These scenarios result in SAI failures, which need to be avoided.

How I verified it
Added unit tests (UT) to verify.

Details if related

@dgsudharsan
Collaborator Author

@srj102 Can you please review?

@liat-grozovik
Collaborator

@prsunny can you please help to review/approve?

@srj102
Contributor

srj102 commented Feb 27, 2023

@tapashdas

@@ -2376,6 +2376,13 @@ bool EvpnRemoteVnip2pOrch::addOperation(const Request& request)
        return false;
    }

    VRFOrch* vrf_orch = gDirectory.get<VRFOrch*>();
    if (vrf_orch->isL3VniVlan(vni_id))
Contributor

  1. Wouldn't this check also apply to fdborch?
  2. If the VNI becomes an L3 VNI after IMR routes are installed, OA will remove the VLAN-VNI mappings and add the VRF-VNI mapping, leaving the SAI with MAC/IMR routes that have no corresponding VLAN-VNI mappings.
  3. What happens without this fix? Do we see SAI errors?

Collaborator Author

  1. In FDB orch, we have a check on VLAN membership, which prevents the FDB entry from being added because the IMR add is prevented.
  2. Yes, that's true. But with the current behavior, removing the VLAN-VNI mapping fails at the SAI level because of existing references. Do you think we should add a check in orchagent? I am not sure whether this check is valid for other platforms, so I preferred not to add it.
  3. Without the fix, SAI returns an error and orchagent crashes.

    if (vrf_orch->isL3VniVlan(vni_id))
    {
        SWSS_LOG_WARN("Ignoring remote VNI add for L3 VNI:%d, remote:%s", vni_id, remote_vtep.c_str());
        return false;
Contributor

Should this be return true instead of return false?

The log says "Ignoring", but in reality the add is being deferred.
Also, holding a large number of IMR routes in the m_tosync queue and processing them in a continuous loop will load the OA unnecessarily.

Collaborator Author

I can correct the log. I prefer to defer rather than ignore. If we ignore, we might miss those remotes. Since this is a corner-case error scenario, I don't think it will overwhelm orchagent during regular processing.
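
The defer-versus-ignore distinction in this thread comes down to what the handler's return value does to the pending queue. The following is a minimal sketch under stated assumptions, not the orchagent implementation: PendingQueue and drainOnce are hypothetical names, and the queue only stands in for the m_tosync queue mentioned above. Returning false keeps an entry queued so it is retried on the next pass; returning true erases it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Hypothetical pending queue keyed by entry name (stand-in for m_tosync).
using PendingQueue = std::map<std::string, int>;

// One processing pass over the queue.
// The handler returns true when the entry is finished (it is erased) and
// false when it must stay queued and be retried on a later pass.
// Returns how many entries were visited in this pass.
std::size_t drainOnce(PendingQueue& queue,
                      const std::function<bool(const std::string&, int)>& handler) {
    std::size_t visited = 0;
    for (auto it = queue.begin(); it != queue.end();) {
        ++visited;
        if (handler(it->first, it->second)) {
            it = queue.erase(it);  // done: drop the entry
        } else {
            ++it;                  // deferred: keep it for the next pass
        }
    }
    return visited;
}
```

The reviewer's load concern is visible here: every deferred entry is visited again on each pass until its handler finally returns true.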


if (!l3Vni)
{
    it++;
Contributor

Similar comment as for the IMR handling. Not removing this from the m_tosync queue in a scale scenario with a misconfiguration will load the OA unnecessarily.

Collaborator Author

Again, completely ignoring will lead to the message being lost. Since this is an error scenario, I would prefer retrying here. It won't affect regular use cases.

Contributor

Do we see this in startup scenarios as well? In that case the retry becomes necessary.

For example, in an L3VNI scenario, FRR is configured correctly and SWSS is configured for both L2 and L3 VNIs.

The vrforch processes the L3VNI configuration after receiving the update from fpmsyncd.

Collaborator Author

@srj102 It is quite possible, but I haven't encountered it. I believe my change would handle that scenario as well.
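
For the startup ordering described in this thread, a route's VNI labels can be checked against the L3 VNIs known so far, and the route retried until every label resolves. A minimal sketch follows. The comma-separated label format and the helper names splitLabels and allLabelsAreL3Vni are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a comma-separated label string (the format is an assumption).
std::vector<std::string> splitLabels(const std::string& labels) {
    std::vector<std::string> out;
    std::stringstream ss(labels);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            out.push_back(item);
        }
    }
    return out;
}

// Hypothetical helper: true only when every label is a known L3 VNI.
// A caller would keep the route queued for retry while this returns false.
// std::stoul throws on non-numeric input; the sketch does not handle that.
bool allLabelsAreL3Vni(const std::string& labels,
                       const std::set<uint32_t>& l3_vnis) {
    for (const auto& label : splitLabels(labels)) {
        const uint32_t vni = static_cast<uint32_t>(std::stoul(label));
        if (l3_vnis.count(vni) == 0) {
            return false;
        }
    }
    return true;
}
```

With this shape, a route that arrives before its VRF-VNI mapping simply fails the check on early passes and succeeds once the mapping has been processed.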

for (auto &vni_str: vni_labelv)
{
    vni = static_cast<uint32_t>(std::stoul(vni_str));
    if (!m_vrfOrch->isL3VniVlan(vni))
Contributor

If the VRF-VNI configuration is removed from OA but not from FRR after routes are installed, the routes remain in the HW.
Is this scenario handled?

Collaborator Author

In our case, the VNI removal leads to error logs in SAI since it has references. Handling this scenario again depends on maintaining references, which I believe we don't do. IMO, I am fixing the scenarios that are easier to fix with the current implementation and will then document the ones that are harder to fix.

Contributor

Yes, that's correct: we don't maintain the routes in the OA, and handling this will be harder. If this leads to a SAI error and an OA crash, we will need to see whether it can be handled at the SAI level.

Collaborator Author

Currently the SAI error does not crash orchagent, since we are not handling the status for this call, so we only get a syslog error. For now I consider this lower priority and will document it.

@prsunny prsunny merged commit a2c9a61 into sonic-net:master Mar 7, 2023
AntonHryshchuk added a commit to AntonHryshchuk/sonic-buildimage that referenced this pull request Mar 8, 2023
Update sonic-swss submodule pointer to include the following:
* a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([sonic-net#2670](sonic-net/sonic-swss#2670))
* 115efe8 [bfdorch] add default TOS value for BFD session ([sonic-net#2689](sonic-net/sonic-swss#2689))
* a198289 [orchagent, SRv6]: create seglist support to set sid list type ([sonic-net#2406](sonic-net/sonic-swss#2406))

Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
@dgsudharsan dgsudharsan deleted the evpn_fix_1 branch March 9, 2023 02:03
dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this pull request Mar 14, 2023
Update sonic-swss submodule pointer to include the following:
* 98a16cf [ACL] Write ACL table/rule creation status into STATE_DB ([sonic-net#2662](sonic-net/sonic-swss#2662))
* a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([sonic-net#2670](sonic-net/sonic-swss#2670))
* 115efe8 [bfdorch] add default TOS value for BFD session ([sonic-net#2689](sonic-net/sonic-swss#2689))
* a198289 [orchagent, SRv6]: create seglist support to set sid list type ([sonic-net#2406](sonic-net/sonic-swss#2406))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
prsunny pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Mar 14, 2023
Update sonic-swss submodule pointer to include the following:
* 98a16cf [ACL] Write ACL table/rule creation status into STATE_DB ([#2662](sonic-net/sonic-swss#2662))
* a2c9a61 [EVPN]Handling error scenarios during route programming and IMR add ([#2670](sonic-net/sonic-swss#2670))
* 115efe8 [bfdorch] add default TOS value for BFD session ([#2689](sonic-net/sonic-swss#2689))
* a198289 [orchagent, SRv6]: create seglist support to set sid list type ([#2406](sonic-net/sonic-swss#2406))
StormLiangMS pushed a commit that referenced this pull request Mar 19, 2023
[EVPN]Handling error scenarios during route programming and IMR add (#2670)