Build stalls early #991
This is running buildkitd from the moby/buildkit:v0.5.0 image. If I wait long enough the connection is eventually dropped. This issue is transient and I do not have a good way of reliably reproducing it. It has happened on several different images I've tried to build.

The CLI invocation looked like

Below is the output after sending SIGABRT to the buildkitd daemon:

Comments
Upgraded to 5.1, still happens
This is the client output I get prior to this happening.
Looking at the goroutines, it appeared that the build was stalled on a runc execution. With @tonistiigi's help we saw that it was the
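How the goroutine dump was captured isn't shown in the thread; the issue description mentions sending SIGABRT to buildkitd, which makes the Go runtime print every goroutine's stack before exiting. As a rough sketch (not buildkitd code; the SIGUSR1 handler is an illustrative assumption), a long-running Go daemon can also produce the same dump without exiting:

```go
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	// Print the stacks of all goroutines to stderr whenever SIGUSR1 arrives,
	// without killing the process (SIGABRT/SIGQUIT also dump stacks but exit).
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	go func() {
		for range sigs {
			// debug=2 prints a full stack trace for every goroutine.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the daemon's real work
}
```

In a dump like this, a stalled build shows up as a goroutine blocked in the corresponding exec call.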
Created #1003 to mitigate the hang part; grpc does not return errors from
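The identifier this sentence trails off on is missing from the extraction. As a generic, assumed illustration of the failure class (errors from a gRPC stream being swallowed rather than surfaced), and not the actual change made in #1003, a receive loop needs to distinguish a clean close from a transport error:

```go
package example

import (
	"errors"
	"fmt"
	"io"

	"google.golang.org/grpc"
)

// drainStream is a hypothetical helper: it reads a gRPC client stream to the
// end and returns transport errors to the caller instead of dropping them.
// io.EOF means the server closed the stream cleanly; anything else is a
// failure the caller should see.
func drainStream(stream grpc.ClientStream, newMsg func() interface{}) error {
	for {
		msg := newMsg()
		if err := stream.RecvMsg(msg); err != nil {
			if errors.Is(err, io.EOF) {
				return nil
			}
			return fmt.Errorf("stream receive failed: %w", err)
		}
	}
}
```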
Building buildkitd from master, I now get the message
The embedded error message is expanded below for convenience.
Please post more logs from before this message, and enable debug mode if you haven't already.
Trying that. Without debug mode on, the next message was just from startup. It might end up being tomorrow before I have it, though.
@msg555 It seems some grpc logs may not show up. Please also set
Let me know if you want more log data; going further back gets into the previous build, which could perhaps be relevant.
@msg555 Ok, that looks interesting. Do you have gc configured on this machine? Can you post the config?
I have not set up a GC policy.
The issue is that there is a 10 sec timeout (it seems hardcoded) on grpc for transferring the first bytes. In your logs the container that starts the frontend seems to take more than 10 sec to start. During the same time the containerd gc is running. Note that this isn't technically the buildkit gc: buildkit marks objects as unused, and in the background the containerd gc then actually deletes them. In the logs you can see

From these logs my initial thought was that the containerd gc run that takes 14.7 sec holds some lock that prevents the frontend container from starting. But it looks like it is always called in the background, and according to @dmcgowan the db should not be locked for that long. I tried to add some sleeps in the remove path to confirm this, and even when I made the gc artificially slow it did not stop new containers from being created.

There is probably a way to delay starting the server until the container has come up, but I'm still not quite sure why you are hitting this. It could be that the gc running in parallel is a coincidence in this log; maybe you can try to get a second one for comparison. It could also be that the machine is just so constrained that it reaches the timeout.

Another thing: although you provided a stack trace, it is from the time when the timeout had already been reached and all the slow processes had completed. If you can, it would be nice to get a stack trace from the time when this error was generated. Could you make a change that prints the stack at the https://github.com/moby/buildkit/pull/1003/files#diff-35ffd957a392f33dd60921b3168d8fd9R312 line? If there is a lock that is preventing the frontend from starting, it should be visible in that trace.
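The requested change isn't shown in the thread; a minimal sketch of that kind of instrumentation, using a hypothetical call site rather than the actual line referenced above, might look like:

```go
package main

import (
	"log"
	"runtime"
)

// dumpAllStacks logs the stack of every goroutine in the process. Calling it
// right where the "frontend grpc closed" error is generated shows what all
// other goroutines were doing at that exact moment, e.g. whether anything
// was holding a lock that kept the frontend container from starting.
func dumpAllStacks(reason string) {
	buf := make([]byte, 1<<20) // 1 MiB is usually enough for all goroutines
	n := runtime.Stack(buf, true)
	log.Printf("stack dump (%s):\n%s", reason, buf[:n])
}

func main() {
	// Example call site; in the real change this would sit on the error path.
	dumpAllStacks("frontend grpc closed")
}
```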
@tonistiigi I've attached the last several seconds of logs from three different failures. They each generated different error messages, but I wouldn't be surprised if they all have the same underlying cause. frontend_grpc_closed_1.log
Tested with #1007 and I'm getting similar issues to before. I basically kept running my build until I hit the familiar frontend grpc closed error. These are all the errors I hit, and the traces gathered in the meantime.
@msg555 The logs posted for the frontend issue in the last comment seem to be lacking the stack trace debug output from when the issue appears, like there was in #991 (comment).
Ran some more tests on the branch https://github.com/msg555/buildkit/commits/test_052019 (master + async_remove + extra logging) and have two more logs. The incidence rate seems to be way down on this branch (maybe from the async_remove?) but it still does occasionally fail. Here are the two logs I extracted from 30 builds: frontend_grpc_closed.log. I'll start testing on #1013.
The failure rate was looking pretty low when running on https://github.com/msg555/buildkit/commits/test_052119. However, after adding