Runtime link stability timeout for main/backup #1775
Removed early exit from activation phase.
bf1dd06
to
07a8981
I think you are forgetting something here. If you need an "initial measurement period" to decide a parameter that should remain valid for a long time while a given link is in operation, then this parameter should be valid under any conditions, no matter what characteristics particular links have.

If you limit RTT measurements to active links only, the following situation is likely: you have 3 equivalent links, one with a lower RTT connected and active, and 2 backup links currently idle. The RTT on the backup links is unknown because they aren't active. When the main link goes down, which is detected early through the exceeded stability timeout, the first backup link is activated. However, the RTT on that link proves to be quite large and the ACK intervals get stretched way beyond the current stability timeout, which triggers activation of the second backup link. That one might also have a higher RTT, and as a result the 1st (best) link is activated as well, because the stability timeout has not yet converged towards the worse conditions of the 2 backup links (while activation of the 1st link wasn't expected, because the links are equivalent).

I think the only solution would be to always forcefully activate a link when it is first connected, and keep it alive during the initial measurement period. SRT has always had a cache mechanism that remembers the last characteristics of a link, so this forced activation wouldn't have to be done every time the link is reconnected, only when it is connected for the very first time. It is, however, necessary in order to have at least a rough measurement of the RTT for that link, especially if it would make the results worse than those of the current links. The RTT measured on that link should set the minimum conditions for STO on the whole group, even while this link isn't active and its RTT isn't being measured.

Caching the RTT value should ensure that the forced activation of the link happens only the very first time the application uses this link. If this is implemented, and every idle link always has its last remembered RTT recorded, and the highest RTT among all links always dictates the minimum value for STO (provided it is still below the maximum required by latency), then switching to a higher-RTT link should never trigger activation of another link when that link takes over and initially starts exceeding STO. Also, the STO should be modified immediately upon reconnection of a link that does not take over the activation (that is, one that is idle in the beginning), so that the group is ready to let this link take over when the currently active link goes down, even if this means giving the currently active link more stability tolerance than its own RTT requires.
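The proposal above can be sketched roughly as follows. This is only an illustration of the idea, not code from SRT: the LinkInfo struct, both function names, and the use of 2 * RTT + 4 * RTTVar as the per-link requirement are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-link record (not an SRT type).
struct LinkInfo
{
    bool    active;            // currently carrying payload
    int64_t cached_rtt_us;     // last remembered RTT, or -1 if never measured
    int64_t cached_rttvar_us;  // last remembered RTT variance
};

// A link with no cached RTT would need the one-time forced activation.
bool needsInitialMeasurement(const LinkInfo& link)
{
    return link.cached_rtt_us < 0;
}

// Group-wide STO: the worst requirement over all links with a cached RTT
// (active or idle), raised to the minimum and capped by the latency.
int64_t groupStabilityTimeoutUs(const std::vector<LinkInfo>& links,
                                int64_t min_sto_us, int64_t latency_us)
{
    int64_t worst_us = 0;
    for (const LinkInfo& link : links)
    {
        if (needsInitialMeasurement(link))
            continue; // nothing known about this link yet
        worst_us = std::max(worst_us, 2 * link.cached_rtt_us + 4 * link.cached_rttvar_us);
    }
    return std::min(std::max(min_sto_us, worst_us), latency_us);
}
```

With one active link at 20 ms RTT and one idle link cached at 80 ms RTT, the idle link dictates the group timeout, so a switch to it would not immediately look like instability.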
        is_stable = false;
    }
}
const int is_stable = sendBackup_CheckRunningLinkStable(u, currtime, d->weight);
How do you protect yourself against rapid changes in this value when, for example, RTT starts to increase dramatically?
RTT changes are smoothed by the IIR filter (first on the receiver side, then again on the sender side), so the fluctuations won't be that high. And if some fluctuations still fall outside the stability threshold, that is an indicator that something unstable is happening on the link.
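To illustrate the smoothing effect, here is a minimal sketch of a first-order IIR (exponentially weighted) RTT estimator. The 7/8 and 3/4 weights are the ones commonly used in TCP-style estimators and are an assumption here; the RttEstimator type is hypothetical and not taken from the SRT sources.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical smoothed RTT estimator (first-order IIR filter).
struct RttEstimator
{
    int64_t rtt_us;     // smoothed RTT
    int64_t rttvar_us;  // smoothed RTT deviation

    void update(int64_t sample_us)
    {
        // Update the deviation first, against the previous smoothed RTT.
        const int64_t deviation_us = std::abs(sample_us - rtt_us);
        rttvar_us = (3 * rttvar_us + deviation_us) / 4;
        // Then move the smoothed RTT 1/8 of the way towards the sample.
        rtt_us = (7 * rtt_us + sample_us) / 8;
    }
};
```

Under these assumed weights, a single sample of 500 ms against a smoothed RTT of 100 ms raises the estimate only to 150 ms, while the deviation term grows sharply and widens any timeout derived from it.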
Fixed naming.
That is the purpose of the activation phase, which lasts for a minimum of 60 ms (see below) and during which we don't yet use the dynamic stability timeout. It was observed that during this period the RTT estimate more or less converges to the real value.
Both are subject to separate PRs to minimize the number of changes by each small step.
The existing option SRTO_GROUPSTABTIMEO sets a timeout for the last response time of a group member link. The same value is used for all member links.
An optimal timeout value depends on the network conditions (RTT, loss rate), and can be calculated at runtime.
This PR replaces the usage of SRTO_GROUPSTABTIMEO with a dynamic link stability timeout calculated at runtime.
When an idle or newly connected link is activated, it first transitions to an "activation phase" and stays there for SRTO_PEERLATENCY + 50 ms, with a minimum allowed value of 60 ms and a maximum allowed value of SRTO_PEERIDLETIMEO.
It is expected that at the start of sending, RTT, RTTVar and link stability are not yet known. Therefore a link needs to stay active for this period to judge whether it is stable or not.
After the activation phase ends, the link stability timeout is determined as 2 × RTT + 4 × RTTVar, with a minimum allowed value of 60 ms and a maximum allowed value of SRTO_LATENCY.
Fixes #1768
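The two phases described above could be expressed as a small decision function. This is a hedged sketch only: LinkState and checkLink are hypothetical names, the timeouts are passed in as plain parameters, and the sketch simply withholds judgment during the activation phase, which may differ from what the actual implementation does.

```cpp
#include <cstdint>

// Hypothetical link states for the illustration.
enum class LinkState { Activating, Stable, Unstable };

// Decide the state of a running link.
//   since_activation_us    - time since the link was activated
//   since_last_response_us - time since the peer last responded on this link
//   activation_period_us   - length of the activation phase
//   dynamic_sto_us         - dynamic link stability timeout
LinkState checkLink(int64_t since_activation_us, int64_t since_last_response_us,
                    int64_t activation_period_us, int64_t dynamic_sto_us)
{
    // During the activation phase RTT and RTTVar are not yet trusted,
    // so the dynamic timeout is not applied.
    if (since_activation_us < activation_period_us)
        return LinkState::Activating;

    // Afterwards, silence longer than the dynamic timeout marks the link unstable.
    return since_last_response_us > dynamic_sto_us ? LinkState::Unstable
                                                   : LinkState::Stable;
}
```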
TODO
SRTO_GROUPSTABTIMEO to set the minimum timeout value -> moved to "Main/backup: Repurpose SRTO_GROUPSTABTIMEO to set the minimum timeout value" #1792.
5 * COMM_SYN_INTERVAL_US
The weight parameter of sendBackup_CheckRunningLinkStable is only needed for tracing. To remove? Decision: leaving for now. Will be useful for further testing.
Initial Link Stability Timeout = max(Minimum Link Stability Timeout, SRT Latency), where Minimum Link Stability Timeout = 60 ms
Activation Period = Initial Link Stability Timeout + 5 * SYN, where SYN = 10 ms
Dynamic Link Stability Timeout = min(max(Minimum Link Stability Timeout, 2 * RTT + 4 * RTTVar), SRT Latency)
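As a rough illustration, the three formulas above translate directly into code. The constant and function names below are made up for this sketch and are not the identifiers used in the PR; all values are in microseconds, and SYN is taken to be 10 ms.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed constants for the sketch (microseconds).
const int64_t MIN_STABILITY_TIMEOUT_US = 60000; // 60 ms
const int64_t SYN_INTERVAL_US          = 10000; // 10 ms

// Initial Link Stability Timeout = max(minimum, latency)
int64_t initialStabilityTimeoutUs(int64_t latency_us)
{
    return std::max(MIN_STABILITY_TIMEOUT_US, latency_us);
}

// Activation Period = initial timeout + 5 * SYN
int64_t activationPeriodUs(int64_t latency_us)
{
    return initialStabilityTimeoutUs(latency_us) + 5 * SYN_INTERVAL_US;
}

// Dynamic Link Stability Timeout = min(max(minimum, 2*RTT + 4*RTTVar), latency)
int64_t dynamicStabilityTimeoutUs(int64_t rtt_us, int64_t rttvar_us, int64_t latency_us)
{
    const int64_t estimate_us = 2 * rtt_us + 4 * rttvar_us;
    return std::min(std::max(MIN_STABILITY_TIMEOUT_US, estimate_us), latency_us);
}
```

For a latency of 120 ms this gives an activation period of 170 ms; with RTT = 30 ms and RTTVar = 5 ms the dynamic timeout is 80 ms, and it is capped at 120 ms once 2 * RTT + 4 * RTTVar exceeds the latency.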