
Fix dynamic discovery timeout to not retry sending requests, but wait for the same request to complete #2337

Merged
leszko merged 2 commits into livepeer:master on Mar 24, 2022

Conversation

@leszko (Contributor) commented Mar 24, 2022

What does this pull request do? Explain your changes. (required)
The last change to the dynamic timeout caused an issue in teststreams, described on Discord.

The dynamic timeout PR changed the discovery between B<>O to work as follows:

  1. B tries to discover Os within 500 ms
  2. If no Os are found, increase the timeout to 1s and send the requests again
  3. If no Os are found, increase the timeout to 2s and send the requests again

The problem is that if O responds in, say, 1.5s, then discovery now takes: 0.5s + 1s + 1.5s = 3s.
So even though the response arrives in 1.5s, the overall time from the black-box perspective is 3s. That pollutes the orch teststream data.

This PR changes the dynamic timeout so that it does not send new requests, but instead waits for the initial requests to complete (as originally proposed by @yondonfu).
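To illustrate the idea, here is a minimal Go sketch of the "extend the wait, don't resend" behavior. It is not the actual go-livepeer code; the `send` callback, the `orchInfo` type, and the exact loop shape are assumptions for the example, with the 500ms/1s/2s timeouts taken from the description above:

```go
package main

import (
	"context"
	"time"
)

// orchInfo stands in for an orchestrator discovery response.
type orchInfo struct{ URI string }

// discoverWithDynamicTimeout sends the discovery requests exactly once and,
// whenever the current timeout fires with no response, keeps waiting on the
// same in-flight requests with a doubled timeout instead of re-sending them.
func discoverWithDynamicTimeout(ctx context.Context, send func(chan<- *orchInfo)) []*orchInfo {
	results := make(chan *orchInfo, 64)
	send(results) // requests go out once; responses arrive on results

	timeout := 500 * time.Millisecond
	maxTimeout := 2 * time.Second
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	var found []*orchInfo
	for {
		select {
		case o := <-results:
			// A response arrived; for this sketch one O is enough.
			return append(found, o)
		case <-timer.C:
			if timeout >= maxTimeout {
				return found // give up after the maximum timeout
			}
			timeout *= 2
			timer.Reset(timeout) // extend the wait; do NOT resend the requests
		case <-ctx.Done():
			return found
		}
	}
}
```

With this shape, an O that answers after 1.5s is seen by B after roughly 1.5s (the 0.5s and 1s timers simply expire and extend the wait), rather than after the 3s accumulated by re-sending.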

Specific updates (required)

How did you test each of these updates (required)

Tested with local geth. Introduced artificial delay in O and checked the logs in B.

Does this pull request close any open issues?

Checklist:

@leszko leszko requested a review from yondonfu March 24, 2022 12:38
@leszko leszko changed the title Fix dynamic timeout to not retry sending requests, but wait for the same request to complete Fix dynamic discovery timeout to not retry sending requests, but wait for the same request to complete Mar 24, 2022
@yondonfu (Member) commented

> The problem is that if O responds in, say, 1.5s, then discovery now takes: 0.5s + 1s + 1.5s = 3s.
> So even though the response arrives in 1.5s, the overall time from the black-box perspective is 3s. That pollutes the orch teststream data.

Hm yeah I think this makes sense.

orch-tester relies on metrics scraped from B for the avg upload, download and round-trip times. When B receives a segment, it will record the start time here and then run discovery here. At the beginning of a stream, discovery will block until the working O set is populated (during the stream, discovery happens in the background to re-populate the working set), so if discovery takes longer, that would be reflected in the times recorded for the first segment of the stream.

Additionally, discovery will block again if the working O set is empty for any subsequent segment, which I think might happen in the scenario where there is only 1 O available and it doesn't return a response fast enough before the next segment arrives [1]; in this scenario discovery would add to the recorded times as well.

[1] I think the threshold is 2x the segment duration, so if the segment duration is 2s and the previous segment has taken longer than 4s, then we have to re-run discovery; but if it took less than 4s, we just send the next segment to the same O.

// If no new sessions are available, re-use last session when oldest segment is in-flight for < 2 * segDur
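As a hedged illustration of that threshold (the function and parameter names below are made up for the example, not the actual go-livepeer code):

```go
package main

import "time"

// shouldReuseLastSession mirrors the comment above: when no new sessions are
// available, keep using the last O as long as its oldest in-flight segment has
// been pending for less than 2x the segment duration; otherwise discovery has
// to be re-run, which can block and add to the times recorded for the segment.
func shouldReuseLastSession(oldestSegStart time.Time, segDur time.Duration) bool {
	return time.Since(oldestSegStart) < 2*segDur
}
```

For a 2s segment duration, this means the previous O keeps being reused as long as its oldest in-flight segment has been pending for less than 4s, matching the example above.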

Two resolved review threads on discovery/discovery.go
@leszko leszko requested a review from yondonfu March 24, 2022 14:52
@yondonfu (Member) left a comment


LGTM after squashing

@leszko leszko force-pushed the rl/fix-dynamic-discovery-timeout branch from 81a6a8a to 1ee7a27 on March 24, 2022 16:09
@leszko leszko merged commit ce5ab12 into livepeer:master Mar 24, 2022
@leszko leszko deleted the rl/fix-dynamic-discovery-timeout branch March 24, 2022 16:27
@leszko leszko mentioned this pull request May 10, 2022