Frontend: Cleanup Monitor() function #417

uablrek · 2023-05-17T12:00:51Z

Description

There seems to be a bug in here somewhere, but I couldn't follow all of the logic, so I made a general cleanup.

One thing that might have been a bug is the "break" in

Meridio/cmd/frontend/internal/frontend/service.go

Line 314 in 60a45b5

break

The Monitor can exit more or less silently, and announceFe will then never be called at intervals again.

Issue link

May fix the NSP-FE issue described in #355 (comment)

Checklist

Purpose
- Bug fix
- New functionality
- Documentation
- Refactoring
- CI
Test
- Unit test
- E2E Test
- Tested manually
Introduce a breaking change
- Yes (description required)
- No

uablrek · 2023-05-17T12:01:45Z

I added timeouts of 5s (hard-coded) for communication with the NSP

zolug · 2023-05-17T12:19:25Z

cmd/frontend/internal/frontend/service.go

 					}
+					// refresh NSP entry (even if announceFrontend() above fails?)


Yes, even if the send towards NSP fails, the FE still has external connectivity. The FE should keep trying to inform NSP about it.

That question is still in place. Would be better to remove, in order to avoid confusion.
Without the retry block, the FE would not be able to announce its connectivity without a prior state change (-> down, up).
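The retry behaviour discussed in this thread can be sketched as follows. This is a minimal illustration under assumptions, not the code in service.go: `refreshLoop`, `demo` and the stand-in `announce` callback are hypothetical names, and the interval is arbitrary. The point is only that a failed announce is logged and tried again at the next tick instead of ending the loop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// refreshLoop calls announce once per interval until ctx is done.
// A failed announce is only logged; the next tick tries again.
func refreshLoop(ctx context.Context, interval time.Duration, announce func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := announce(ctx); err != nil {
				fmt.Println("announce failed, retrying at next tick:", err)
			}
		}
	}
}

// demo runs the loop with a stand-in announce (hypothetical) that fails
// `failures` times and then succeeds, at which point the demo cancels the loop.
func demo(failures int) (attempts, successes int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	refreshLoop(ctx, time.Millisecond, func(context.Context) error {
		attempts++
		if attempts <= failures {
			return errors.New("NSP unreachable")
		}
		successes++
		cancel() // stop the demo once the NSP has been informed
		return nil
	})
	return attempts, successes
}

func main() {
	attempts, successes := demo(2)
	fmt.Println("attempts:", attempts, "successes:", successes)
}
```

Without the retry, the two initial failures in `demo` would leave the NSP uninformed until the next state change.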

zolug · 2023-05-17T12:23:07Z

cmd/frontend/internal/frontend/service.go

+				// although configured at least one IP family has no connectivity
+				if hasConnectivity {
+					denounce = true
+					// should this be moved to "if denounce" above?


It could be moved there, yes. It reflects the status of external connectivity and is used by probes.

zolug · 2023-05-17T12:26:04Z

cmd/frontend/internal/frontend/service.go

 		// status of external connectivity; requires 1 GW per IP family if configured to be reachable
-		init, noConnectivity := true, true
+		hasConnectivity := false
 		connectivityMap := map[string]bool{}
 		delay := 3 * time.Second // when started grant more time to write the whole config (minimize intial link flapping)
 		_ = fes.WaitStart(ctx)


The following code from the start of Monitor() should be moved after WaitStart() as well. The error should not happen in that case.

	lp, err := fes.routingService.LookupCli()
	if err != nil {
		errCh <- fmt.Errorf("routing service cli not found: %v", err)
		return
	}

Can WaitStart() be moved to before the go func()? UPDATED: yes it can

It would block the main thread of the FE until bird/birdc is available. Maybe not a huge deal at the moment, but not a desired side-effect imho.

You are right. I will move them back into the go function.

zolug · 2023-05-17T12:32:45Z

cmd/frontend/internal/frontend/service.go

-					logger.Info("protocol output", "out", protocolOut)
-					//linkCh <- "No protocols match"
+				logger.Error(err, "protocol output", "out", strings.Split(protocolOut, "\n"))
+				denounce = true


IMHO not likely to happen, but I'm not in favor of doing denounce in this case.
Not sure what could lead to such errors, but I wouldn't risk link flapping. It simply means, we couldn't check the status, but it might be "up".
Next retry is due in 1 second. It might fetch the status properly. (If it's a persistent issue the entry will timeout at NSP eventually.)
Maybe if the same problem persisted for several subsequent iterations...

I will add a counter for the "session errors" and denounce after X tries (X configurable)

Btw, whenever the code deems denounce is required, the VIPs must be denounced as well. (Unless the FE is terminating/restarting).
Otherwise, for example if the connection is still UP according to bird, then BGP will still attract traffic.

Similarly, health.SetServingStatus should be adjusted.
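The "denounce after X tries" idea from this thread could look roughly like the sketch below. It is an assumption-laden illustration: `errorGate`, `observe` and `errCheck` are made-up names, and whether the real code resets the counter on success or compares with `>=` is not stated in the discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// errCheck is a placeholder for whatever error the status check returns.
var errCheck = errors.New("status check failed")

// errorGate counts consecutive status-check failures (illustrative only).
type errorGate struct {
	max   int // consecutive failures tolerated before a denounce
	count int
}

// observe records one check result and reports whether a denounce is due.
func (g *errorGate) observe(checkErr error) bool {
	if checkErr == nil {
		g.count = 0 // assumed: a successful check resets the streak
		return false
	}
	g.count++
	return g.count >= g.max
}

func main() {
	gate := &errorGate{max: 3}
	// Two failures, one success, then three failures in a row.
	results := []error{errCheck, errCheck, nil, errCheck, errCheck, errCheck}
	for i, r := range results {
		fmt.Printf("check %d: denounce=%v\n", i+1, gate.observe(r))
	}
}
```

With this sequence only the last check reports that a denounce is due, so a short burst of failed checks does not cause link flapping.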

zolug · 2023-05-17T12:34:03Z

cmd/frontend/internal/frontend/service.go

+			bfdOut, err := fes.routingService.ShowBfdSessions(ctx, lp, `NBR-BFD`)
+			if err != nil {
+				logger.Error(err, "BFD output", "out", strings.Split(bfdOut, "\n"))
+				denounce = true


Again, I don't think we should do denounce.

zolug · 2023-05-17T12:39:01Z

cmd/frontend/internal/frontend/service.go

+					logger.Error(err, "denounceFrontend")
+				}
+				connectivityMap = map[string]bool{}
+				delay = 3 * time.Second // Avoid flapping (more?)


Not sure about this one. An env var would be nice.
It also overrides the initial delay value (if they would differ).

Honestly I have no idea what those errors I called "session errors" are or what can cause them. So, please take a close look, and suggest what to do differently if needed.

zolug · 2023-05-17T12:44:42Z

cmd/frontend/internal/frontend/service.go

+				connectivityMap = map[string]bool{}
+				delay = 3 * time.Second // Avoid flapping (more?)
+			}
+
 			select {
 			case <-ctx.Done():
 				logger.Info("Shutting down")


In a pending PR I added a denounce right before return (with short timeout) so that any watcher could be informed ASAP without blocking too long (if NSP was not available).

OK, I will add that, but with a brand new context.
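The reason for a brand new context is that by the time `ctx.Done()` fires, the monitor's own context is already cancelled, so any call made with it fails immediately. A minimal sketch, assuming a stand-in `denounce` function (hypothetical, not the real `denounceFrontend`) and an arbitrary one-second limit:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// denounce stands in for the real RPC (hypothetical): it succeeds only
// if the context it is given stays usable for the simulated round trip.
func denounce(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

// demo simulates shutdown: the parent context is cancelled first, then the
// denounce is attempted with the parent and with a fresh, bounded context.
func demo() (withParent, withFresh error) {
	parent, cancelParent := context.WithCancel(context.Background())
	cancelParent() // shutdown has been requested

	withParent = denounce(parent)

	fresh, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	withFresh = denounce(fresh)
	return withParent, withFresh
}

func main() {
	withParent, withFresh := demo()
	fmt.Println("with cancelled parent:", withParent)
	fmt.Println("with fresh context:", withFresh)
}
```

The timeout on the fresh context is what keeps an unreachable NSP from blocking shutdown indefinitely.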

uablrek · 2023-05-17T14:16:20Z

Updated. I will merge and force-push after all is settled. I have not tested setting the env variables. I will wait with that until I have fixed all review comments.
Also, the announce log is temporarily at DEBUG; I must remember to set it back to TRACE.
BTW is the 5s hard-coded timeout for announce/denounce ok?

uablrek · 2023-05-17T14:20:41Z

I am unsure if the operator:

V := something

creates a new variable that must later be GC'ed. Anyway, IMHO a var ( ) block makes the code clearer in this case.
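On the `:=` question: in Go a short variable declaration is shorthand for a `var` declaration with an initializer, so both forms declare the same kind of variable. Whether a value ends up on the heap is decided by the compiler's escape analysis, not by the declaration syntax. A small sketch with made-up function names:

```go
package main

import (
	"fmt"
	"time"
)

// shortForm declares its local with a short variable declaration.
func shortForm() time.Duration {
	delay := 3 * time.Second
	return delay
}

// varForm declares the same local in a var ( ) block.
func varForm() time.Duration {
	var (
		delay = 3 * time.Second
	)
	return delay
}

func main() {
	fmt.Println(shortForm() == varForm())
}
```

The choice between the two is therefore a matter of readability, which is the argument made for the var ( ) block here.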

zolug · 2023-05-18T13:16:19Z

cmd/frontend/internal/frontend/service.go

 			if strings.Contains(protocolOut, bird.NoProtocolsLog) {
 				logger.Info("protocol output", "out", protocolOut)
-				denounce = true
+				sessionErrors++


This is not a "session error". It indicates that there are no gateways configured. Either not added yet (in which case it doesn't matter if it was a session error or not), or all the gateways had been removed from the Attractor (FE config).
A denounce is required.

zolug · 2023-05-18T13:19:07Z

cmd/frontend/internal/frontend/service.go

+		denounce bool = true // Always denounce on container start
+	)
+
+	if s := os.Getenv("DELAY_NO_CONNECTIVITY"); s != "" {


The env variables are rather hidden here. Easy to miss, especially since envconfig package is used all across Meridio.
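One way to make such settings visible is to read them all into a single config struct at startup. The sketch below uses only the standard library and is not the envconfig-based approach used elsewhere in Meridio; the struct, the `MAX_SESSION_ERRORS` name, the defaults and the duration format are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config collects the Monitor tunables in one visible place.
// Field names, the second variable name and both defaults are assumptions.
type Config struct {
	DelayNoConnectivity time.Duration // from DELAY_NO_CONNECTIVITY
	MaxSessionErrors    int           // from MAX_SESSION_ERRORS (hypothetical name)
}

// loadConfig reads the settings through getenv so that it can be exercised
// without touching the process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{DelayNoConnectivity: 3 * time.Second, MaxSessionErrors: 3}
	if s := getenv("DELAY_NO_CONNECTIVITY"); s != "" {
		d, err := time.ParseDuration(s) // assumes values such as "3s"
		if err != nil {
			return cfg, fmt.Errorf("DELAY_NO_CONNECTIVITY: %w", err)
		}
		cfg.DelayNoConnectivity = d
	}
	if s := getenv("MAX_SESSION_ERRORS"); s != "" {
		n, err := strconv.Atoi(s)
		if err != nil {
			return cfg, fmt.Errorf("MAX_SESSION_ERRORS: %w", err)
		}
		cfg.MaxSessionErrors = n
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Keeping the parsing in one function also means a malformed value is reported at startup rather than silently ignored inside Monitor().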

zolug · 2023-05-18T13:28:22Z

cmd/frontend/internal/frontend/service.go

+		// sessionErrors may be temporary
+		sessionErrors int
+		// maxSessionErrors is max sessionErrors before a denounce
+		maxSessionErrors int = 3


No clue why these errors pop up. But I would prefer a bit higher value, maybe 5.

zolug · 2023-05-18T13:32:51Z

cmd/frontend/internal/frontend/service.go

@@ -330,6 +333,12 @@ func (fes *FrontEndService) Monitor(ctx context.Context, errCh chan<- error) {
 				if refreshCancel != nil {
 					refreshCancel()
 				}
+				if hasConnectivity {
+					// (logging will not work inside denounceFrontend())
+					if err := denounceFrontend(context.Background(), fes.targetRegistryClient); err != nil {


I wonder if blocking the shutdown for 5 seconds could be a bit too much if e.g. NSP is not around.
Anyways, we can revisit it, if required...

zolug · 2023-05-18T15:14:20Z

cmd/frontend/internal/frontend/service.go

+					refreshCancel()
+					refreshCancel = nil
+				}
+				if err := denounceFrontend(ctx, fes.targetRegistryClient); err != nil {


Just realized that this will fail most of the time on startup. The connection establishment with NSP might be delayed.
Yet, part of the intention to call denounce on start was, to inform watchers after abrupt termination of the fe process.
In the baseline the initial 3 seconds delay more or less ensured the connection with NSP was up.

I think either the goroutine at start should wait for the nsp connection to become ready, or some simple but lame init delay could be used.

or some simple but lame init delay could be used.

I will go for this option as a start: the same 3s delay as before.

It seems that in the baseline it wasn't the delay but rather the gRPC option WaitForReady that allowed the initial (or any) denounce to succeed. It blocks RPCs until the connection is ready.

Even with the 3 seconds delay, denounce fails on xcluster for me.

I think the easiest way forward would be if denounce did not have a hard-coded timeout; instead it would be the caller's responsibility to pass a feasible context. Or maybe just increase the hard-coded value in denounce to 30 seconds, and in case of shutdown pass a context with a lower timeout.

I think it may be best just to remove the timeout in announceFrontend/denounceFrontend entirely and make a comment. Not adding a ctx with timeout at all in the calls. I.e. same as it was before. Timeouts would be hard to decide, especially if they are dynamic. The passed ctx can still be canceled from main.

Or do you think there is a case where the calls can hang for an unreasonable time (whatever that is)?

and in case of shutdown pass a context with lower timeout.

I missed this. Of course, a timeout is needed on shutdown

You're right. Except for the shutdown, I don't see much need for any timeout at the moment.
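The outcome of this thread, no timeout inside announce/denounce and a deadline chosen by the caller only on shutdown, can be sketched like this. The `call` function and the durations are hypothetical stand-ins; in the real code the waiting is done by the gRPC client.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// call stands in for announce/denounce (hypothetical). It has no timeout of
// its own: it waits until the peer is ready or the caller's context ends.
func call(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo shows both callers: normal operation without a deadline, and
// shutdown with a short caller-supplied deadline.
func demo() (normal, shutdown error) {
	// Normal operation: the peer becomes ready late, and the call waits for it.
	ready := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(ready)
	}()
	normal = call(context.Background(), ready)

	// Shutdown: the peer never becomes ready, so the caller bounds the wait.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	shutdown = call(ctx, make(chan struct{}))
	return normal, shutdown
}

func main() {
	normal, shutdown := demo()
	fmt.Println("normal:", normal)
	fmt.Println("shutdown timed out:", errors.Is(shutdown, context.DeadlineExceeded))
}
```

A context passed from main can still cancel the unbounded call, which matches the observation that the passed ctx remains cancellable.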

uablrek · 2023-05-22T06:09:24Z

I added a pointer to the config struct in the FrontEndService struct. IMO that's better than copying all the config vars. Not only is unnecessary code removed, it also becomes clear that the value is actually configurable.

And fix an unwanted break

zolug reviewed May 17, 2023

LionelJouin added kind/bug Something isn't working component/front-end labels May 17, 2023

uablrek requested a review from zolug May 17, 2023 15:52

zolug reviewed May 18, 2023

uablrek requested a review from zolug May 22, 2023 06:09

zolug approved these changes May 23, 2023

Frontend: Cleanup Monitor() function

10256c5

And fix an unwanted break

uablrek force-pushed the uablrek-monitor-cleanup branch from ac7ac4d to 10256c5 Compare May 23, 2023 07:44

uablrek merged commit 21f3cbf into master May 23, 2023

uablrek deleted the uablrek-monitor-cleanup branch May 23, 2023 08:13
