Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frontend: Cleanup Monitor() function #417

Merged
merged 1 commit into from
May 23, 2023
Merged

Conversation

uablrek
Copy link
Contributor

@uablrek uablrek commented May 17, 2023

Description

There seems to be a bug in here somewhere. But I couldn't follow all logic so I made a general cleanup.

One thing that might have been a bug is the "break" in

The Monitor can exit more or less silently and announceFe will not be called in intervals ever after.

Issue link

May fix the NSP-FE issue described in #355 (comment)

Checklist

  • Purpose
    • Bug fix
    • New functionality
    • Documentation
    • Refactoring
    • CI
  • Test
    • Unit test
    • E2E Test
    • Tested manually
  • Introduce a breaking change
    • Yes (description required)
    • No

@uablrek
Copy link
Contributor Author

uablrek commented May 17, 2023

I added timeouts of 5s (hard-coded) for communication with the NSP

}
// refresh NSP entry (even if announceFrontend() above fails?)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, even if the send towards NSP fails, the FE still has external connectivity. The FE should keep trying to inform NSP about it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That question is still in place. Would be better to remove, in order to avoid confusion.
Without the retry block, the FE would not be able to announce its connectivity without a prior state change (-> down, up).

// although configured at least one IP family has no connectivity
if hasConnectivity {
denounce = true
// should this be moved to "if denounce" above?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be moved there, yes. It reflects the status of external connectivity and is used by probes.

// status of external connectivity; requires 1 GW per IP family if configured to be reachable
init, noConnectivity := true, true
hasConnectivity := false
connectivityMap := map[string]bool{}
delay := 3 * time.Second // when started grant more time to write the whole config (minimize intial link flapping)
_ = fes.WaitStart(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following code from the start of Monitor() should be moved after WaitStart() as well. The error should not happen that case.

lp, err := fes.routingService.LookupCli()
	if err != nil {
		errCh <- fmt.Errorf("routing service cli not found: %v", err)
		return
	}

Copy link
Contributor Author

@uablrek uablrek May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can WaitStart() be moved to before the go func()? UPDATED: yes it can

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would block the main thread of the FE until bird/birdc is available. Maybe not a huge deal at the moment, but not a desired side-effect imho.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I move them back into the go function.

logger.Info("protocol output", "out", protocolOut)
//linkCh <- "No protocols match"
logger.Error(err, "protocol output", "out", strings.Split(protocolOut, "\n"))
denounce = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO not likely to happen, but I'm not in favor of doing denounce in this case.
Not sure what could lead to such errors, but I wouldn't risk link flapping. It simply means, we couldn't check the status, but it might be "up".
Next retry is due in 1 second. It might fetch the status properly. (If it's a persistent issue the entry will timeout at NSP eventually.)
Maybe if the same problem persisted for several subsequent iteration...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a counter for the "session errors" and denounce after X tries (X configurable)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, whenever the code deems denounce is required, the VIPs must be denounced as well. (Unless the FE is terminating/restarting).
Otherwise, for example if the connection is still UP according to bird, then BGP will still attract traffic.

Similarly, health.SetServingStatus should be adjusted.

bfdOut, err := fes.routingService.ShowBfdSessions(ctx, lp, `NBR-BFD`)
if err != nil {
logger.Error(err, "BFD output", "out", strings.Split(bfdOut, "\n"))
denounce = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think we should do denounce.

logger.Error(err, "denounceFrontend")
}
connectivityMap = map[string]bool{}
delay = 3 * time.Second // Avoid flapping (more?)
Copy link
Collaborator

@zolug zolug May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this one. An env var would be nice.
It also overrides the initial delay value (if they would differ).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I have no idea what those errors I called "session errors" are and what that can cause them. So, please take a close look, and suggest what to do differently if needed

connectivityMap = map[string]bool{}
delay = 3 * time.Second // Avoid flapping (more?)
}

select {
case <-ctx.Done():
logger.Info("Shutting down")
Copy link
Collaborator

@zolug zolug May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a pending PR I added a denounce right before return (with short timeout) so that any watcher could be informed ASAP without blocking too long (if NSP was not available).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I add that. But with a brand new context

@LionelJouin LionelJouin added kind/bug Something isn't working component/front-end labels May 17, 2023
@uablrek
Copy link
Contributor Author

uablrek commented May 17, 2023

Updated. I will merge and force-push after all is settled. I have not tested to set the env variables. I wait with that until I have fixed all review comments.
Also the announce log is temporary on DEBUG, I must remember to set it back to TRACE.
BTW is the 5s hard-coded timeout for announce/denounce ok?

@uablrek
Copy link
Contributor Author

uablrek commented May 17, 2023

I am unsure if the operator:

V := something

creates a new variable that must later be GC'ed. Anyway IMHO a var ( ) make the code clearer in this case

@uablrek uablrek requested a review from zolug May 17, 2023 15:52
if strings.Contains(protocolOut, bird.NoProtocolsLog) {
logger.Info("protocol output", "out", protocolOut)
denounce = true
sessionErrors++
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a "session error". It indicates that there are no gateways configured. Either not added yet (in which case it doesn't matter if it was a session error or not), or all the gateways had been removed from the Attractor (FE config).
A denounce is required.

denounce bool = true // Always denounce on container start
)

if s := os.Getenv("DELAY_NO_CONNECTIVITY"); s != "" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The env variables are rather hidden here. Easy to miss, especially since envconfig package is used all across Meridio.

// sessionErrors may be temporary
sessionErrors int
// maxSessionErrors is max sessionErrors before a denounce
maxSessionErrors int = 3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No clue why these errors pop up. But I would prefer a bit higher value, maybe 5.

@@ -330,6 +333,12 @@ func (fes *FrontEndService) Monitor(ctx context.Context, errCh chan<- error) {
if refreshCancel != nil {
refreshCancel()
}
if hasConnectivity {
// (logging will not work inside denounceFrontend())
if err := denounceFrontend(context.Background(), fes.targetRegistryClient); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if blocking the shutdown for 5 seconds could be a bit too much if e.g. NSP is not around.
Anyways, we can revisit it, if required...

refreshCancel()
refreshCancel = nil
}
if err := denounceFrontend(ctx, fes.targetRegistryClient); err != nil {
Copy link
Collaborator

@zolug zolug May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized, that this will fail most of the times on startup. The connection establishment with NSP might be delayed.
Yet, part of the intention to call denounce on start was, to inform watchers after abrupt termination of the fe process.
In the baseline the initial 3 seconds delay more or less ensured the connection with NSP was up.

I think either the goroutine at start should wait for the nsp connection to become ready, or some simple but lame init delay could be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or some simple but lame init delay could be used.

I go for this option as a start. Same as before, a 3s delay

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems that in the baseline it wasn't the delay rather the grpc option WaitForReady that alowed the initial (or any) denounce to succeed. It blocks RPCs until the connection is ready.

Even with the 3 seconds delay, denounce fails on xcluster for me.

I think the easiest way forward would be if denounce did not have hard coded timeout, instead it would be the callers responsibility to pass a feasible context. Or maybe just increase the hard-coded value in denounce to 30 seconds, and in case of shutdown pass a context with lower timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be best just to remove the timeout in announceFrontend/denounceFrontend entirely and make a comment. Not adding a ctx with timeout at all in the calls. I.e. same as it was before. Timeouts would be hard to decide, especially if they are dynamic. The passed ctx can still be canceled from main.

Or do you think there is a case where the calls can hang for an unreasonable time (whatever that is)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in case of shutdown pass a context with lower timeout.

I missed this. Of course, a timeout is needed on shutdown

Copy link
Collaborator

@zolug zolug May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Expect for the shutdown, I don't see much need for any timeout at the moment.

@uablrek
Copy link
Contributor Author

uablrek commented May 22, 2023

I added a pointer to the config struct in FrontEndService struct. IMO that's better than to copy all config vars. Not only unnecessary code is removed, but also that the value is actually a configurable becomes clear.

@uablrek uablrek requested a review from zolug May 22, 2023 06:09
And fix an unwanted break
@uablrek uablrek force-pushed the uablrek-monitor-cleanup branch from ac7ac4d to 10256c5 Compare May 23, 2023 07:44
@uablrek uablrek merged commit 21f3cbf into master May 23, 2023
@uablrek uablrek deleted the uablrek-monitor-cleanup branch May 23, 2023 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/front-end kind/bug Something isn't working
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants