Frontend: Cleanup Monitor() function #417
Conversation
I added timeouts of 5s (hard-coded) for communication with the NSP.

}
// refresh NSP entry (even if announceFrontend() above fails?)
Yes, even if the send towards the NSP fails, the FE still has external connectivity. The FE should keep trying to inform the NSP about it.
That question is still in place; it would be better to remove it to avoid confusion.
Without the retry block, the FE would not be able to announce its connectivity without a prior state change (-> down, up).
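A minimal sketch of the retry idea under discussion (the ticker interval is an assumption; announceFrontend, fes.targetRegistryClient, hasConnectivity, and logger are names taken from this PR):

```go
// Keep refreshing the NSP entry on every tick while the FE has external
// connectivity, even if a previous announce failed; a single failed send
// must not stop future refresh attempts.
for {
	select {
	case <-ctx.Done():
		return
	case <-time.After(refreshInterval): // interval value is an assumption
		if hasConnectivity {
			if err := announceFrontend(ctx, fes.targetRegistryClient); err != nil {
				logger.Error(err, "announceFrontend")
			}
		}
	}
}
```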
// although configured at least one IP family has no connectivity
if hasConnectivity {
	denounce = true
	// should this be moved to "if denounce" above?
It could be moved there, yes. It reflects the status of external connectivity and is used by probes.
// status of external connectivity; requires 1 GW per IP family if configured to be reachable
init, noConnectivity := true, true
hasConnectivity := false
connectivityMap := map[string]bool{}
delay := 3 * time.Second // when started grant more time to write the whole config (minimize initial link flapping)
_ = fes.WaitStart(ctx)
The following code from the start of Monitor() should be moved after WaitStart() as well. The error should not happen in that case.
lp, err := fes.routingService.LookupCli()
if err != nil {
errCh <- fmt.Errorf("routing service cli not found: %v", err)
return
}
Can WaitStart() be moved to before the go func()? UPDATED: yes it can
It would block the main thread of the FE until bird/birdc is available. Maybe not a huge deal at the moment, but not a desired side-effect imho.
You are right. I will move them back into the go function.
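A sketch of the agreed ordering, using the names from this thread: the CLI lookup stays inside the goroutine but runs only after WaitStart(), so bird/birdc has had time to come up and the main thread of the FE is not blocked:

```go
go func() {
	// Wait for the routing suite to start before looking up its CLI;
	// doing this inside the goroutine keeps the FE main thread unblocked.
	_ = fes.WaitStart(ctx)
	lp, err := fes.routingService.LookupCli()
	if err != nil {
		errCh <- fmt.Errorf("routing service cli not found: %v", err)
		return
	}
	// ... monitoring loop using lp ...
}()
```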
logger.Info("protocol output", "out", protocolOut) | ||
//linkCh <- "No protocols match" | ||
logger.Error(err, "protocol output", "out", strings.Split(protocolOut, "\n")) | ||
denounce = true |
IMHO not likely to happen, but I'm not in favor of doing a denounce in this case.
Not sure what could lead to such errors, but I wouldn't risk link flapping. It simply means we couldn't check the status, but it might be "up".
The next retry is due in 1 second. It might fetch the status properly. (If it's a persistent issue, the entry will time out at the NSP eventually.)
Maybe if the same problem persisted for several subsequent iterations...
I will add a counter for the "session errors" and denounce after X tries (X configurable)
Btw, whenever the code deems a denounce is required, the VIPs must be denounced as well (unless the FE is terminating/restarting).
Otherwise, if for example the connection is still UP according to bird, BGP will still attract traffic.
Similarly, health.SetServingStatus should be adjusted.
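A sketch of the combined denounce path asked for here. denounceVIPs and the exact health.SetServingStatus signature are hypothetical; the thread only requires that a denounce also withdraws the VIPs and updates the serving status:

```go
if denounce {
	if err := denounceFrontend(ctx, fes.targetRegistryClient); err != nil {
		logger.Error(err, "denounceFrontend")
	}
	// Withdraw the VIPs too; otherwise, if the session is still UP
	// according to bird, BGP would keep attracting traffic to this FE.
	if err := denounceVIPs(ctx); err != nil { // denounceVIPs: hypothetical helper
		logger.Error(err, "denounceVIPs")
	}
	// Reflect the lost external connectivity towards the probes as well.
	health.SetServingStatus(ctx, egressServiceID, false) // signature assumed
}
```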
bfdOut, err := fes.routingService.ShowBfdSessions(ctx, lp, `NBR-BFD`)
if err != nil {
	logger.Error(err, "BFD output", "out", strings.Split(bfdOut, "\n"))
	denounce = true
Again, I don't think we should do a denounce here.
	logger.Error(err, "denounceFrontend")
}
connectivityMap = map[string]bool{}
delay = 3 * time.Second // Avoid flapping (more?)
Not sure about this one. An env var would be nice.
It also overrides the initial delay value (if they were to differ).
ok
Honestly, I have no idea what those errors I called "session errors" are or what can cause them. So please take a close look and suggest what to do differently if needed.
connectivityMap = map[string]bool{}
delay = 3 * time.Second // Avoid flapping (more?)
}

select {
case <-ctx.Done():
	logger.Info("Shutting down")
In a pending PR I added a denounce right before return (with short timeout) so that any watcher could be informed ASAP without blocking too long (if NSP was not available).
Ok, I'll add that, but with a brand new context.
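A minimal sketch of that shutdown path: the incoming ctx is already canceled at this point, so the final denounce gets a brand new context with a short timeout (the 2s value is an assumption; the thread only says "short"):

```go
case <-ctx.Done():
	logger.Info("Shutting down")
	// Fresh context: the parent ctx is canceled, so it cannot be reused.
	denounceCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := denounceFrontend(denounceCtx, fes.targetRegistryClient); err != nil {
		logger.Error(err, "denounceFrontend")
	}
	return
```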
Updated. I will merge and force-push after all is settled. I have not tested setting the env variables; I'll wait with that until I have fixed all review comments.
I am unsure if the operator `V := something` creates a new variable that must later be GC'ed. Anyway IMHO a
if strings.Contains(protocolOut, bird.NoProtocolsLog) {
	logger.Info("protocol output", "out", protocolOut)
	denounce = true
	sessionErrors++
This is not a "session error". It indicates that there are no gateways configured. Either not added yet (in which case it doesn't matter if it was a session error or not), or all the gateways had been removed from the Attractor (FE config).
A denounce is required.
	denounce bool = true // Always denounce on container start
)

if s := os.Getenv("DELAY_NO_CONNECTIVITY"); s != "" {
The env variables are rather hidden here and easy to miss, especially since the envconfig package is used all across Meridio.
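A sketch of the envconfig alternative hinted at here, assuming the kelseyhightower/envconfig package used elsewhere in Meridio; only DELAY_NO_CONNECTIVITY appears in the diff, the other field name is illustrative:

```go
import (
	"fmt"
	"time"

	"github.com/kelseyhightower/envconfig"
)

// monitorConfig gathers the Monitor() tunables in one visible place
// instead of scattered os.Getenv calls.
type monitorConfig struct {
	DelayNoConnectivity time.Duration `envconfig:"DELAY_NO_CONNECTIVITY" default:"3s"`
	MaxSessionErrors    int           `envconfig:"MAX_SESSION_ERRORS" default:"5"` // name is an assumption
}

func loadMonitorConfig() (*monitorConfig, error) {
	var cfg monitorConfig
	if err := envconfig.Process("", &cfg); err != nil {
		return nil, fmt.Errorf("failed to process monitor config: %w", err)
	}
	return &cfg, nil
}
```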
// sessionErrors may be temporary
sessionErrors int
// maxSessionErrors is max sessionErrors before a denounce
maxSessionErrors int = 3
No clue why these errors pop up. But I would prefer a bit higher value, maybe 5.
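A sketch of how the counter could gate the denounce, using the variable names from the diff (resetting the counter on a successful check is an assumption, not shown here):

```go
if err != nil {
	// Tolerate transient birdc failures: a failed status check does not
	// necessarily mean the link is down, so don't flap immediately.
	sessionErrors++
	if sessionErrors >= maxSessionErrors {
		denounce = true
	}
} else {
	sessionErrors = 0 // assumption: reset after a successful check
}
```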
@@ -330,6 +333,12 @@ func (fes *FrontEndService) Monitor(ctx context.Context, errCh chan<- error) {
if refreshCancel != nil {
	refreshCancel()
}
if hasConnectivity {
	// (logging will not work inside denounceFrontend())
	if err := denounceFrontend(context.Background(), fes.targetRegistryClient); err != nil {
I wonder if blocking the shutdown for 5 seconds could be a bit too much if e.g. the NSP is not around.
Anyway, we can revisit it if required...
	refreshCancel()
	refreshCancel = nil
}
if err := denounceFrontend(ctx, fes.targetRegistryClient); err != nil {
Just realized that this will fail most of the time on startup; the connection establishment with the NSP might be delayed.
Yet part of the intention of calling denounce on start was to inform watchers after an abrupt termination of the FE process.
In the baseline, the initial 3-second delay more or less ensured the connection with the NSP was up.
I think either the goroutine at start should wait for the NSP connection to become ready, or some simple but lame init delay could be used.
or some simple but lame init delay could be used.
I'll go for this option as a start. Same as before, a 3s delay.
Seems that in the baseline it wasn't the delay but rather the grpc option WaitForReady that allowed the initial (or any) denounce to succeed. It blocks RPCs until the connection is ready.
Even with the 3-second delay, denounce fails on xcluster for me.
I think the easiest way forward would be if denounce did not have a hard-coded timeout; instead it would be the caller's responsibility to pass a feasible context. Or maybe just increase the hard-coded value in denounce to 30 seconds, and in case of shutdown pass a context with a lower timeout.
I think it may be best to just remove the timeout in announceFrontend/denounceFrontend entirely and add a comment, i.e. not add a ctx with a timeout in the calls at all, same as it was before. Timeouts would be hard to decide, especially if they are dynamic. The passed ctx can still be canceled from main.
Or do you think there is a case where the calls could hang for an unreasonable time (whatever that is)?
and in case of shutdown pass a context with lower timeout.
I missed this. Of course, a timeout is needed on shutdown
You're right. Except for the shutdown, I don't see much need for any timeout at the moment.
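A sketch of the resolution converged on here: no hard-coded timeout inside the helper; gRPC's WaitForReady blocks the RPC until the NSP connection is ready, and the caller's context (cancelable from main, or with a short timeout on shutdown) bounds the call. The Register call shape is an assumption:

```go
// announceFrontend without an internal timeout; the caller owns the ctx.
func announceFrontend(ctx context.Context, client nspAPI.TargetRegistryClient) error {
	// WaitForReady makes the RPC wait for the connection instead of
	// failing fast while the NSP is still coming up.
	_, err := client.Register(ctx, &nspAPI.Target{ /* FE target, fields omitted */ }, grpc.WaitForReady(true))
	return err
}
```

On shutdown the caller would then pass e.g. context.WithTimeout(context.Background(), 2*time.Second) so the final denounce cannot block the exit for long.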
I added a pointer to the config struct in
And fixed an unwanted break.
(force-pushed from ac7ac4d to 10256c5)
Description
There seems to be a bug in here somewhere, but I couldn't follow all the logic, so I made a general cleanup.
One thing that might have been a bug is the "break" in Meridio/cmd/frontend/internal/frontend/service.go (Line 314 in 60a45b5): the Monitor can exit more or less silently, and announceFe will not be called at intervals ever after.
Issue link
May fix the NSP-FE issue described in #355 (comment)
Checklist