Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify all the RouteTables that point at the main kernel table #8418

Merged
merged 7 commits into from
Aug 8, 2024

Conversation

fasaxc
Copy link
Member

@fasaxc fasaxc commented Jan 18, 2024

Description

Replace the plethora of RouteTable instances that we used to program routes into the main kernel routing table with a single unified RouteTable.

  • Instead of handling conflicts by trying to program both routes into the kernel, detect conflicts in user space and program only the winning route in the kernel. The previous behaviour resulted in a messy retry loop, spewing errors until the conflict was resolved.
  • Given each source of routes its own RouteClass with an associated priority to allow clean conflict resolution. for example, a local workload route has a higher priority than an IPAM block "blackhole" route. Bridge the gap from old API to new by introducing a ClassView object that exposes the old API to the "manager" layer objects. They each act as if they have their own RouteTable but the ClassView adds the RouteClass in order to disambiguate routes.
  • Align route programming with the indexing approach in the kernel. In the kernel, routes are indexed on table, CIDR, ToS and metric. Previously, we would read back routes and only take account of the CIDR. This resulted in the possibility of aliasing: we would read back two routes and only see one CIDR.
  • Identifying which routes belong to Calico in the main routing table is very messy and requires heuristics that were spread across several components. Unify those heuristics into a single "ownership policy" abstraction. The main routing table gets its own ownership policy.
  • Fix that conntrack cleanup wasn't properly handled when routes moved between RouteTables. conntrack cleanup would only be sequenced correctly when routes moved within a RouteTable. For example, if a route moved from one workload to another, we'd correctly sequence removal of the route, conntrack cleanup and adding the new route. However, if a route moved from local workload to remote VXLAN via the tunnel, we'd do the cleanup in parallel with programming the new route resulting in potential glitches.
  • Move conntrack cleanup to a new object to encapsulate the fiddly calculation.

Related issues/PRs

CORE-9902
CORE-10126
CI-1357

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Felix's route programming now handles routing conflicts deterministically, prioritising routes based on their type.  Conntrack cleanup has been improved; cleanup is now correctly sequenced with route programming when IP addresses move from local to remote workloads. 

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@marvin-tigera marvin-tigera added this to the Calico v3.28.0 milestone Jan 18, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jan 18, 2024
@fasaxc fasaxc force-pushed the route-unification branch from f67efab to ae7138b Compare January 19, 2024 11:36
@fasaxc fasaxc force-pushed the route-unification branch 3 times, most recently from 5622c58 to ff922e2 Compare February 15, 2024 14:58
@fasaxc fasaxc added docs-not-required Docs not required for this change and removed docs-pr-required Change is not yet documented labels Feb 15, 2024
@fasaxc fasaxc force-pushed the route-unification branch 8 times, most recently from 351a716 to a7caec2 Compare May 3, 2024 10:50
@radTuti radTuti modified the milestones: Calico v3.28.0, Calico v3.28.1 May 3, 2024
@fasaxc fasaxc force-pushed the route-unification branch 3 times, most recently from 3f7cadc to d4276c4 Compare June 17, 2024 10:51
@fasaxc fasaxc force-pushed the route-unification branch 2 times, most recently from ddaff28 to 937977c Compare June 21, 2024 13:09
@fasaxc fasaxc mentioned this pull request Jul 4, 2024
3 tasks
@fasaxc fasaxc force-pushed the route-unification branch from fbf4100 to 02a009f Compare July 11, 2024 12:52
@fasaxc fasaxc modified the milestones: Calico v3.28.1, Calico v3.29.0 Jul 11, 2024
@fasaxc fasaxc force-pushed the route-unification branch from a050676 to 0dece97 Compare July 17, 2024 14:11
@fasaxc fasaxc force-pushed the route-unification branch from 0dece97 to 4ea7466 Compare July 17, 2024 14:36
@fasaxc fasaxc changed the title [WIP] Unify all the RouteTables that point at the main kernel table Unify all the RouteTables that point at the main kernel table Jul 17, 2024
@fasaxc
Copy link
Member Author

fasaxc commented Jul 17, 2024

Still filling in the last bits of coverage but I think this is ready for a first pass now...

@fasaxc fasaxc marked this pull request as ready for review July 17, 2024 17:13
@fasaxc fasaxc requested a review from a team as a code owner July 17, 2024 17:13
@fasaxc fasaxc force-pushed the route-unification branch from 44cbc45 to 91c4262 Compare July 18, 2024 10:32
@fasaxc fasaxc mentioned this pull request Jul 18, 2024
3 tasks
@fasaxc fasaxc force-pushed the route-unification branch from 91c4262 to 4860486 Compare July 18, 2024 12:47
fasaxc added 4 commits July 19, 2024 11:15
Replace the plethora of RouteTable instances that we used to program routes
into the main kernel routing table with a single unified RouteTable.

- Instead of handling conflicts by trying to program both routes into the
  kernel, detect conflicts in user space and program only the winning route
  in the kernel.  The previous behaviour resulted in a messy retry loop,
  spewing errors until the conflict was resolved.

- Given each class of routes its own `RouteClass` with an associated priority
  to allow clean conflict resolution.  for example, a local workload route has
  a higher priority than an IPAM block "blackhole" route.

- Align route programming with the indexing approach in the kernel.  In
  the kernel, routes are indexed on table, CIDR, ToS and metric.
  Previously, we would read back routes and only take account of the
  CIDR.  This resulted in the possibility of aliasing: we would read
  back two routes and only see one CIDR.

- Identifying which routes belong to Calico in the main routing table
  is very messy and requires heuristics that were spread across several
  components.  Unify those heuristics into a single "ownership policy"
  abstraction.  The main routing table gets its own ownership policy.

- Fix that conntrack cleanup wasn't properly handled when routes moved
  between RouteTables.  conntrack cleanup would only be sequenced
  correctly when routes moved within a RouteTable.  For example, if a
  route moved from one workload to another, we'd correctly sequence
  removal of the route, conntrack cleanup and adding the new route.
  However, if a route moved from local workload to remote VXLAN via the
  tunnel, we'd do the cleanup in parallel with programming the new route
  resulting in potential glitches.

- Move conntrack cleanup to a new object to encapsulate the fiddly
  calculation.  To align with Conntrack, this is still CIDR based instead of
  taking into account the ToS and Priority.  For now, we assume that cleanups
  may happen for alternate-priority routes but that the RouteTable won't
  program more than one route per CIDR.
@fasaxc fasaxc force-pushed the route-unification branch from 4860486 to 83881c6 Compare July 19, 2024 10:16
@fasaxc fasaxc force-pushed the route-unification branch from 83881c6 to d01fa34 Compare July 19, 2024 15:01
Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was easier to review than I feared, looks good to me!

// StartConntrackCleanupAndReset.
//
// No cleanup is triggered if there is a desired route for the given CIDR
// and that desirec route is staying on this interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: desirec

for {
select {
case <-ch:
logrus.WithField("ip", ipAddr).Info("Done waiting for pending conntrack deletion to finish")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth including how long it took in both of these logs?

@fasaxc
Copy link
Member Author

fasaxc commented Aug 8, 2024

/merge-when-ready

@marvin-tigera
Copy link
Contributor

OK, I will merge the pull request when it's ready, leave the commits as is when I merge it, and leave the branch after I've merged it.

@marvin-tigera
Copy link
Contributor

Removing "merge-when-ready" label due to new commits

@fasaxc fasaxc force-pushed the route-unification branch from 115f80e to 5cfb93f Compare August 8, 2024 13:28
@fasaxc fasaxc merged commit 6d43538 into projectcalico:master Aug 8, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants