Lazy progress + Progress Groups #2513
Conversation
I don't think we should push terms like "lazy" to the users. They shouldn't need to think about the internal implementation. I would suggest 2 changes:
@sipsma Any update on this?
I've been working on it intermittently but have been very busy with other stuff the last few weeks. I have more time this week, so I will aim to get some updated code out for review by the end of the week (let me know if there are any time constraints for the v0.10 release I should know about). In general, I'm structuring it in two commits; the first is the baseline behavior that at least results in merge+diff unlazies showing up under the right vertex (basically the same idea as how lazy blob refs work). After that I'll layer on some of the fancier progress grouping you were describing for merges specifically.
This is just the order of items, right? Because I don't think we need anything else special if we have grouping support.
For Merge, yes. For Diff, I think grouping support won't cover all the cases (it would only help when a diff is plugged into a merge, which, while a majority of the cases, is not all of them), and we will need the first commit.
Targeting v0.10 RC for next week
Updated w/ the 2 commits described before; they don't overlap in terms of code, so I'm happy to split them up into separate PRs if desired, but I wanted to develop them together because they do have indirect impacts on each other and I wanted to ensure the final output looked okay.
[screenshot] is what I get every time on the first build. Otherwise, groups seem to work in #2596, but the issue with the out-of-order items remains and needs to be addressed separately (see progress.mov). Haven't reviewed the updated code yet.
Just to confirm, the issue is that steps aren't appearing in the right order? So, like in the screenshot/video?
The issue with the first screenshot (no progress group there) is that the … In the video (this uses mergeop), yes.
I can't reproduce it locally. I'm starting buildkit w/ …
I ran 5 times and each time the output looks the same, with everything being complete (blue): …
Are you doing anything different? It could also possibly be a timing thing that just varies host to host. I wonder if the client progressui is being shut down before the final vertex …
I'll take a look at this; my initial suspicion is that it's a result of the code handling all the progress group updates after handling the non-group vertexes.
I don't think it is progress groups. It is expected that the copy to scratch runs early, as it doesn't depend on other commands. But we should figure out how to avoid this and keep the order that makes sense in the build graph. I'll see if I can debug the first case more on my side.
Ah right. It might not be that hard to add another field to LLB metadata that specifies the order in which progress vertexes (and groups) should appear. The progressui trace can then just sort its vertex list by that index.
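To make the idea concrete, here's a minimal sketch of sorting a trace's vertex list by such an index. The `progressOrder` field and the surrounding types are hypothetical illustrations, not buildkit's actual API:

```go
package main

import (
	"fmt"
	"sort"
)

// vertex is a simplified stand-in for a progressui trace vertex; the
// progressOrder field is the hypothetical new LLB metadata index.
type vertex struct {
	name          string
	progressOrder int
}

// sortByProgressOrder stably sorts the trace's vertex list by the
// LLB-provided index so the display matches the build definition.
func sortByProgressOrder(vertexes []*vertex) {
	sort.SliceStable(vertexes, func(i, j int) bool {
		return vertexes[i].progressOrder < vertexes[j].progressOrder
	})
}

func main() {
	vs := []*vertex{
		{name: "copy /out", progressOrder: 3},
		{name: "build base", progressOrder: 1},
		{name: "merge layers", progressOrder: 2},
	}
	sortByProgressOrder(vs)
	for _, v := range vs {
		fmt.Println(v.progressOrder, v.name)
	}
}
```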
I actually hit the first case, where the …
But when it doesn't complete they are: …
Not sure what's happening differently that causes the vertexes received to be different in the broken case. But the reason this happens now is because the commit to support start/stop intervals in progress results in all intervals being merged, and a complete time is never received for the start time of … It seems like figuring out why the vertexes received by the progress code are different in the broken case would be a good next step. Let me know if there are any known cases where that's expected to happen.
Is this all from the same vertex? If it is, then it already looks wrong that it switches so much. A single … If the server is already sending the wrong data, then an additional backward-compatible way to fix it would be that if there is no progress group, then for a single vertex a completed time always marks it as finished.
Yeah, those are all from just the vertex for the …
Pushed an extra commit with this workaround for now.
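For illustration, a rough sketch of what that kind of client-side workaround could look like; the types and field names below are simplified stand-ins, not the actual progressui code:

```go
package progress

import "time"

// displayVertex is a simplified stand-in for the progressui vertex state.
type displayVertex struct {
	progressGroupID string     // empty when the vertex has no progress group
	completed       *time.Time // last completed time received, if any
	groupDone       bool       // whether the whole group has reported done
}

// isFinished applies the backward-compatible rule: for an ungrouped
// vertex, any completed time marks it finished, even if a stray
// "started" update never received a matching completion.
func isFinished(v *displayVertex) bool {
	if v.progressGroupID == "" {
		return v.completed != nil
	}
	// Grouped vertexes defer to the group's own bookkeeping (elided here).
	return v.groupDone
}
```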
I looked at the server side. The reason it goes active 5 times is: …
As usual, now I'm having trouble triggering the race, so I don't know what path produces the started update without a matching completed time.
I may have figured out the underlying issue. I added logging to the … I think the issue is on this line. I made a change to just use a random ID in those … Just need to figure out a better solution than using a random ID now; will push this fix soon. EDIT: I also just realized this code in flightcontrol could be involved too, so I need to determine whether that's a relevant factor or not.
Force-pushed from 07579c6 to 29a5ceb
Pushed the fix (and removed the previous commit with the workaround in the progress ui). I updated to just include the start time (in unix nanos) with the ID, so we can still dedupe multiple updates within the same start/stop interval, but won't dedupe across start/stop intervals.
solver/jobs.go (outdated)
```go
pw.Write(v.Digest.String(), *v)
// id dedupes progress updates; use of the start time ensures that distinct
// stop/start intervals of the vertex don't get deduped with one another.
id := v.Digest.String() + strconv.Itoa(int(v.Started.UnixNano()))
```
I'm not sure about using the started time there. Might as well just use random then. If this is for the second cache key, then what if we just pass a suffix for that case? Then at least the caller side still controls the ID. Could even be variadic, I think.
My thinking was that we care about preserving vertex updates from different start/stop intervals, but it's okay to dedupe multiple updates within a single interval, so the start time seemed like the right key to use in every possible situation. That being said, putting in suffixes set by the caller works perfectly fine too, so I updated with that.
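A sketch of what the caller-controlled suffix approach might look like; the helper name and separator are made up for illustration and are not buildkit's actual progress API:

```go
package main

import (
	"fmt"
	"strings"
)

// progressID builds a progress-update ID from the vertex digest plus any
// caller-supplied suffixes, so e.g. successive CacheMap runs can pass the
// cache-key index and avoid deduping distinct updates against each other.
func progressID(digest string, suffixes ...string) string {
	if len(suffixes) == 0 {
		return digest
	}
	return digest + "::" + strings.Join(suffixes, "::")
}

func main() {
	fmt.Println(progressID("sha256:abc"))                  // sha256:abc
	fmt.Println(progressID("sha256:abc", "cachemap", "1")) // sha256:abc::cachemap::1
}
```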
```go
// ...If an interval's stop time is nil, it is treated as positive infinity and
// consumes any intervals after it. Intervals with nil start times are ignored
// and not returned.
func mergeIntervals(intervals []interval) []interval {
```
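For readers following along, here's a minimal sketch of interval merging matching that doc comment; the `interval` type is inferred from the diff and the body is an illustration, not necessarily the PR's actual implementation:

```go
package progress

import (
	"sort"
	"time"
)

// interval mirrors the shape implied by the diff: a start/stop pair where
// a nil stop means the interval is still open.
type interval struct {
	start *time.Time
	stop  *time.Time
}

// mergeIntervals merges overlapping intervals per the doc comment above:
// intervals with nil start times are dropped, and a nil stop time acts as
// positive infinity, consuming everything after it.
func mergeIntervals(intervals []interval) []interval {
	in := make([]interval, 0, len(intervals))
	for _, iv := range intervals {
		if iv.start != nil {
			in = append(in, iv)
		}
	}
	sort.Slice(in, func(i, j int) bool {
		return in[i].start.Before(*in[j].start)
	})

	var out []interval
	for _, iv := range in {
		if len(out) == 0 {
			out = append(out, iv)
			continue
		}
		last := &out[len(out)-1]
		if last.stop == nil {
			// Open interval consumes all later intervals.
			continue
		}
		if iv.start.After(*last.stop) {
			// Disjoint: start a new merged interval.
			out = append(out, iv)
			continue
		}
		// Overlapping: extend the current interval's stop time.
		if iv.stop == nil || iv.stop.After(*last.stop) {
			last.stop = iv.stop
		}
	}
	return out
}
```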
Instead of these calculations, wouldn't it be easier to just add the last `completed-started` diff every time the vertex goes to the completed state? If the vertex is active, then the current time is `now-started+previous_sum`. Not a blocker, but it looks like it would be faster and simpler.
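A sketch of this suggested running-sum alternative, with illustrative names; it stores only an accumulated duration rather than the full interval list:

```go
package progress

import "time"

// vertexTimer tracks total time with a running sum instead of storing
// every interval: sum accumulates finished intervals, started is non-nil
// while the vertex is active.
type vertexTimer struct {
	sum     time.Duration
	started *time.Time
}

// update processes one vertex status update.
func (t *vertexTimer) update(started, completed *time.Time) {
	if started == nil {
		return
	}
	if completed != nil {
		// Vertex went to completed: fold the interval into the sum.
		t.sum += completed.Sub(*started)
		t.started = nil
		return
	}
	t.started = started
}

// total returns the sum plus now-started while the vertex is active.
func (t *vertexTimer) total(now time.Time) time.Duration {
	if t.started != nil {
		return t.sum + now.Sub(*t.started)
	}
	return t.sum
}
```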
I was trying to handle as many cases as possible, including ones where progress updates come into the client out of order in arbitrary ways. I don't think that's expected at the moment, but I wanted to be safe against unknown bugs or possible behavior changes in the future. So I'd prefer to leave this as-is given that the extra complication is fairly minimal and self-contained, but if you still prefer the other version let me know and I'll update.
Do you think the extra complexity might become an issue for big builds? Does this get called repeatedly, every 100ms, in tty mode?
It will get called for each job representing a vertex update (i.e. not a status update, log, etc.) on each UI refresh. I just quickly ran a benchmark using the complex test case in the unit tests I added and got 775 ns/op (in a Linux VM running on an M1 MacBook). That test case has 15 start/stop intervals to merge, whereas in the real world we'd usually expect at most 5. But round that up to 1 microsecond anyway, then assume that an extremely large build has 100 vertex jobs per update, and it still only takes 100 microseconds. Let me know if you think the estimate is off, but I'd be biased towards thinking the performance shouldn't make a difference here.
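Roughly how such a number could be measured with a standard Go benchmark over the `mergeIntervals` sketch above; the 15-interval test-case construction here is illustrative, not the PR's actual unit test. Run with `go test -bench=MergeIntervals`:

```go
package progress

import (
	"testing"
	"time"
)

// BenchmarkMergeIntervals exercises mergeIntervals (from the sketch above)
// with 15 overlapping intervals, mirroring the "complex" case described.
func BenchmarkMergeIntervals(b *testing.B) {
	base := time.Now()
	intervals := make([]interval, 15)
	for i := range intervals {
		start := base.Add(time.Duration(i) * time.Second)
		stop := start.Add(1500 * time.Millisecond) // overlaps the next interval
		intervals[i] = interval{start: &start, stop: &stop}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		mergeIntervals(intervals)
	}
}
```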
```go
localTimeDiff time.Duration
vertexes      []*vertex
byDigest      map[digest.Digest]*vertex
nextIndex     int
updates       map[digest.Digest]struct{}
modeConsole   bool
groups        map[string]*vertexGroup // group id -> group
```
If all items were "groups", and a vertex without a progress group were just a group with only one item, I think some duplication could be avoided and it would maybe be easier to understand. Not needed atm unless you think it is easy, but I think it's worth considering as an optimization.
We can assume that vertex digests and progress group ids never collide.
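A sketch of that unification, with simplified types: every vertex lands in some group, and an ungrouped vertex gets an implicit single-member group keyed by its digest:

```go
package progress

// vertex is a minimal stand-in for a trace vertex.
type vertex struct {
	digest string
	name   string
}

// group unifies grouped and ungrouped vertexes: an ungrouped vertex lives
// in an implicit single-member group keyed by its digest, relying on the
// assumption that digests and progress group ids never collide.
type group struct {
	id      string
	name    string
	members []*vertex
}

// groupFor returns (creating if needed) the group a vertex belongs to and
// records the vertex as a member.
func groupFor(groups map[string]*group, v *vertex, groupID, groupName string) *group {
	key, name := groupID, groupName
	if key == "" {
		// No progress group: fall back to a one-vertex group keyed by digest.
		key, name = v.digest, v.name
	}
	g, ok := groups[key]
	if !ok {
		g = &group{id: key, name: name}
		groups[key] = g
	}
	g.members = append(g.members, v)
	return g
}
```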
I was thinking about ways to refactor the existing code more and fit it in with the groups better, but ultimately became biased against it because it felt too risky to significantly impact the existing code even more right before the release, especially given that most of the testing of the progress output is not automated. So I'd prefer to save that for a follow-up, probably after the release is done.
This is easier to read than quotes, which result in lots of escaping in progress output. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Now, when a merge or diff ref is unlazied, the progress will show up under the vertex for the merge/diff ref. Additionally, any ancestors of the op that also need to be unlazied as part of unlazying the merge/diff will show status updates under its vertex in the progress. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This allows clients to specify that LLB states should be grouped in progress output under a custom name. Status updates for all vertexes in the group will show up under a single vertex in the output. The intended use cases are for Dockerfile COPY's that use MergeOp as a backend and for grouping some other internal vertexes during frontend builds. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Previously, vertexes sent to progress might have multiple intervals in which they started+stopped, but the output only knew about the most recent interval. Now, all the intervals are tracked and the total time spent on each (after accounting for overlaps) is displayed. This is implemented both for normal vertexes and progress group vertexes. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
This fixes an issue where if a vertex rapidly had progress updates (such as the case where CacheMap runs in succession for multiple indexes), then progress updates were all using the same ID and could incorrectly override one another, resulting in missing vertex updates. Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Are you working on this?
I haven't been, but if you're good with that approach I'm happy to implement it. I should have time to send out a PR early next week.
@sipsma I'm not sure if there is a way to just do it automatically. We could compute how far the vertex is from the final build result, which should be correct, but that doesn't define how the multiple inputs of a mergeop should be sorted. Atm the most critical need is something that would put … Note that there are some sorting fixes already in place, iirc. It should be that when there is a lot of screen space, it pushes the active items down so they are visible. That should still take precedence over the order defined in LLB. We need to avoid the case where the user doesn't see what is currently running because the whole screen is filled with already-completed progress rows.
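One way such a precedence rule could be expressed as a stable sort (field names are illustrative, not the actual progressui code):

```go
package progress

import "sort"

// row is a simplified display row; llbOrder would come from the
// hypothetical order index in LLB metadata discussed earlier.
type row struct {
	completed bool
	llbOrder  int
}

// sortRows puts completed rows first (so they scroll off the top) and
// sinks active rows to the bottom where they stay visible, with the
// LLB-defined order breaking ties within each class.
func sortRows(rows []*row) {
	sort.SliceStable(rows, func(i, j int) bool {
		if rows[i].completed != rows[j].completed {
			return rows[i].completed
		}
		return rows[i].llbOrder < rows[j].llbOrder
	})
}
```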
Makes sense. Taking a look at this today.
Example output for the following LLB (built w/ a local export in order to force unlazying): …
Here's the output when using tty mode: …
And here's the output when using plain mode: …