Runtime link stability timeout for main/backup #1775

Merged
merged 7 commits into from
Feb 8, 2021

Conversation

maxsharabayko
Collaborator

@maxsharabayko maxsharabayko commented Jan 29, 2021

The existing option SRTO_GROUPSTABTIMEO sets a timeout for the last response time of a group member link.
The same value is used for all member links.

An optimal timeout value depends on the network conditions (RTT, loss rate), and can be calculated at runtime.

This PR replaces the usage of SRTO_GROUPSTABTIMEO with a dynamic link stability timeout calculated at runtime.

When an idle or newly connected link is activated, it first transitions to an "activation phase" and stays there for SRTO_PEERLATENCY + 50 ms, with a minimum allowed value of 60 ms and a maximum allowed value of SRTO_PEERIDLETIMEO.
At the start of sending, RTT, RTTVar and link stability are not yet known. Therefore a link needs to stay active for this period before it can be judged stable or not.

After the activation phase ends, the link stability timeout is determined as 2 × RTT + 4 × RTTVar, with a minimum allowed value of 60 ms and a maximum allowed value of SRTO_LATENCY.
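A minimal sketch of the clamped timeout described above. The function name and the use of `std::chrono` are illustrative assumptions, not the code added by this PR. It also assumes the upper bound wins when the latency is below 60 ms, which matches the `min(max(...), ...)` form given in the TODO list.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::microseconds;
using std::chrono::milliseconds;

// Hypothetical helper (not the function introduced by this PR).
// Computes 2 * RTT + 4 * RTTVar and clamps the result to [60 ms, latency].
inline microseconds dynamicStabilityTimeout(microseconds rtt,
                                            microseconds rttvar,
                                            microseconds latency)
{
    const microseconds min_timeout = milliseconds(60);
    const microseconds estimate = 2 * rtt + 4 * rttvar;
    // Apply the lower bound first, then cap the result by the latency.
    return std::min(std::max(min_timeout, estimate), latency);
}
```

With a latency of 120 ms, an RTT of 40 ms and an RTTVar of 5 ms give 100 ms, while an RTT of 10 ms and an RTTVar of 2 ms fall back to the 60 ms floor.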

Fixes #1768

TODO

  • Repurpose SRTO_GROUPSTABTIMEO to set the minimum timeout value -> moved to Main/backup: Repurpose SRTO_GROUPSTABTIMEO to set the minimum timeout value #1792.
  • PR [core] Minor: renamed CUDT m_tsTmpActiveSince to m_tsFreshActivation #1774 extracts some minor renaming. Update this PR once that one is merged.
  • Replace 50ms in activation period with 5 * COMM_SYN_INTERVAL_US
  • The weight parameter of sendBackup_CheckRunningLinkStable is only needed for tracing. To remove?
    Decision: leaving for now. Will be useful for further testing.
  • Remove PEERIDLETIMEO from initial stability timeout.
    Initial Link Stability Timeout = max(Minimum Link Stability Timeout, SRT Latency), where Minimum Link Stability Timeout = 60 ms
    Activation Period = Initial Link Stability Timeout + 5 * SYN, where SYN = 10 ms
    Dynamic Link Stability Timeout = min(max(Minimum Link Stability Timeout, 2 * RTT + 4 * RTTVar), SRT Latency)
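The initial timeout and activation period from the last TODO item can be sketched as follows. This is only an illustration: the constant and function names are hypothetical and not taken from the SRT sources, and SYN is assumed to be the 10 ms interval mentioned above.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::milliseconds;

// Illustrative constants; the names are assumptions, not SRT identifiers.
const milliseconds MIN_LINK_STABILITY_TIMEOUT(60);
const milliseconds SYN_INTERVAL(10);

// Initial Link Stability Timeout = max(Minimum Link Stability Timeout, SRT Latency)
inline milliseconds initialStabilityTimeout(milliseconds latency)
{
    return std::max(MIN_LINK_STABILITY_TIMEOUT, latency);
}

// Activation Period = Initial Link Stability Timeout + 5 * SYN
inline milliseconds activationPeriod(milliseconds latency)
{
    return initialStabilityTimeout(latency) + 5 * SYN_INTERVAL;
}
```

For a latency of 120 ms this gives an activation period of 170 ms. For a latency of 20 ms the 60 ms floor applies and the period is 110 ms.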

@maxsharabayko maxsharabayko added Type: Enhancement Indicates new feature requests [core] Area: Changes in SRT library core labels Jan 29, 2021
@maxsharabayko maxsharabayko added this to the v1.4.3 milestone Jan 29, 2021
@maxsharabayko maxsharabayko self-assigned this Jan 29, 2021
@ethouris
Collaborator

ethouris commented Feb 4, 2021

I think you are forgetting something here. If you need to have some "initial measurement period" to decide the parameter that should remain valid for a long time of functioning of a given link, then this should be the parameter valid in any kind of conditions, no matter what characteristics particular links have.

If you limit RTT measurements only to active links, then the following situation is likely: you have 3 equivalent links, one with lower RTT connected and active and 2 backup links currently idle. The RTT on the backup links is currently unknown because they aren't active. When the main link goes down, which is detected early through the exceeded stability timeout, it activates the backup link. However, the RTT on that link proves to be quite large and the ACK intervals get stretched way beyond the current stability timeout, which triggers activation of the second backup link. That one might also have a higher RTT, as a result of which this will also activate the 1st - best - link, as the stability timeout couldn't yet converge towards the worse conditions of the 2 other backup links (while activation of the 1st link wasn't expected because the links are equivalent).

I think the only solution would be to always forcefully activate a link when it is first connected, and keep it alive during the initial measurement period. SRT has always had a cache mechanism that remembers the last characteristics of a link, so this forced activation wouldn't have to be done every time the link is reconnected, only when it is connected for the very first time. This is, however, necessary to have at least a rough measurement of the RTT for that link, especially if this should make the results worse than for the current links. The RTT measured on that link should set the minimum conditions for STO across the whole group, even if this link isn't active and its RTT isn't being measured. Caching the RTT value should ensure that the forced activation of the link happens only the very first time the application uses this link.

If this is implemented, and it is ensured that every idle link always has its last remembered RTT recorded, and the highest RTT of all links always dictates the minimum value for STO (provided it is still below the maximum required by latency), then switching to that link shall never trigger activation of another link when a higher-RTT link takes over and starts exceeding STO in the beginning. Also, the STO should be modified immediately upon reconnection of a link that would not take over the activation (that is, it would be idle in the beginning), so that the group is ready to let this link take over when the currently active link goes down, even if this means giving it more stability tolerance than its RTT value requires.

is_stable = false;
}
}
const int is_stable = sendBackup_CheckRunningLinkStable(u, currtime, d->weight);
Collaborator

How do you protect yourself against rapid changes in this value in case when, for example, RTT starts to increase dramatically?

Collaborator Author

RTT changes are smoothed by the IIR filter (first on the receiver side, then again on the sender side). So there won't be that high fluctuation. And if some fluctuations fall outside the stability threshold, that is an indicator that something unstable is happening on the link.
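To illustrate why the smoothed value cannot jump abruptly, here is a sketch of a single IIR smoothing stage. The weights (1/8 for RTT, 1/4 for RTTVar) and the struct name are assumptions for illustration; the actual estimator in the SRT sources may differ, and the second stage on the sender side is not shown.

```cpp
#include <cmath>

// Hypothetical estimator state; names and weights are illustrative assumptions.
struct RttEstimator
{
    double rtt_us;
    double rttvar_us;

    // One IIR (exponentially weighted) update step with a new RTT sample.
    void update(double sample_us)
    {
        // The variance tracks the deviation of the sample from the current estimate.
        rttvar_us = (3.0 * rttvar_us + std::fabs(rtt_us - sample_us)) / 4.0;
        // The estimate moves only 1/8 of the way towards the new sample.
        rtt_us = (7.0 * rtt_us + sample_us) / 8.0;
    }
};
```

Under these assumed weights, a sample of 200 ms arriving at an estimate of 100 ms raises the smoothed RTT to only 112.5 ms, so a single outlier moves the timeout by a limited amount.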

Fixed naming.
@maxsharabayko
Collaborator Author

If you limit RTT measurements only to active links, then the following situation is likely: you have 3 equivalent links, one with lower RTT connected and active and 2 backup links currently idle. The RTT on the backup links is currently unknown because they aren't active. When the main link goes down, which is detected early through the exceeded stability timeout, it activates the backup link. However, the RTT on that link proves to be quite large and the ACK intervals get stretched way beyond the current stability timeout, which triggers activation of the second backup link. That one might also have a higher RTT, as a result of which this will also activate the 1st - best - link, as the stability timeout couldn't yet converge towards the worse conditions of the 2 other backup links (while activation of the 1st link wasn't expected because the links are equivalent).

That is the purpose of the activation phase, which lasts for a minimum of 60 ms (see below) and during which the dynamic stability timeout is not yet used. It was observed that during this period the RTT estimate more or less converges to the real value.
And here is where we have several improvements planned:

  • improve RTT convergence;
  • do not silence parallel links if the higher weight link is still in the "activation phase".

Both are subject to separate PRs to minimize the number of changes by each small step.

Initial Link Stability Timeout = max(Minimum Link Stability Timeout, SRT Latency),
where Minimum Link Stability Timeout = 60 ms
Activation Period = Initial Link Stability Timeout + 5 * SYN, where SYN = 10 ms

@maxsharabayko maxsharabayko marked this pull request as ready for review February 5, 2021 16:35
@maxsharabayko maxsharabayko merged commit 7656759 into Haivision:master Feb 8, 2021
@maxsharabayko maxsharabayko deleted the develop/main-backup branch February 8, 2021 09:08
Labels
[core] Area: Changes in SRT library core Type: Enhancement Indicates new feature requests
Development

Successfully merging this pull request may close these issues.

[BUG] Main/backup: The activation period ends earlier than expected
2 participants