Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash with --experimental_remote_download_outputs=toplevel #8508

Closed
keith opened this issue May 29, 2019 · 32 comments
Closed

Crash with --experimental_remote_download_outputs=toplevel #8508

keith opened this issue May 29, 2019 · 32 comments
Assignees
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@keith
Copy link
Member

keith commented May 29, 2019

Description of the problem / feature request:

When using --experimental_remote_download_outputs=toplevel, and having a cache return a 500, bazel crashes. I would expect it to recover in this case and fall back to retrying the request, or worst case doing the build locally. Without this flag it does seem to be resilient to this.

ERROR: /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/com_google_protobuf/BUILD:388:1: Linking of rule '@com_google_protobuf//:protoc' failed due to unexpected I/O exception: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>

com.google.devtools.build.lib.remote.blobstore.http.HttpException: 504 Gateway Time-out
<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.15.3</center>
</body>
</html>

	at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:116)
	at com.google.devtools.build.lib.remote.blobstore.http.HttpDownloadHandler.channelRead0(HttpDownloadHandler.java:41)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:438)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:253)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1476)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1225)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1272)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:337)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:359)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:345)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Unknown Source)

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

bazel run TARGET when pointing at an unhealthy http cache and passing --experimental_remote_download_outputs=toplevel

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

release 0.26.0

cc @buchgr

@aiuto aiuto added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels May 31, 2019
@buchgr
Copy link
Contributor

buchgr commented Jun 3, 2019

@keith there's no retries for the http protocol yet. I believe this error happens independent of --experimental_remote_download_outputs=toplevel no?

@buchgr buchgr self-assigned this Jun 3, 2019
@keith
Copy link
Member Author

keith commented Jun 3, 2019 via email

@buchgr
Copy link
Contributor

buchgr commented Jun 3, 2019

@keith to fallback to local execution. it's a bug either way.

@buchgr
Copy link
Contributor

buchgr commented Jul 9, 2019

I can't reproduce this with 0.27.1 ... can you?

buchgr@buchgr0:~/code/bazel2$ bazel build src:bazel --remote_http_cache=http://localhost:9090 --experimental_remote_download_outputs=toplevel --experimental_inmemory_jdeps_files --experimental_inmemory_dotd_files
INFO: Invocation ID: 0ee5a3cd-bdeb-42ad-b149-5b7dc94cbe01
INFO: Analyzed target //src:bazel (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
WARNING: Reading from Remote Cache:
500 Internal Server Error
timed out
[54 / 89] 22 actions running
    Building src/main/java/com/google/devtools/build/lib/remote/libremote.jar (25 source files) and running annotation processors (OptionProcessor); 3s remote-cache, worker
    Compiling Java headers src/main/java/com/google/devtools/build/lib/exec/local/liblocal-hjar.jar (5 files); 2s remote-cache, linux-sandbox

@keith
Copy link
Member Author

keith commented Jul 9, 2019

Do you have a specific fix in mind that would have addressed this? I'll have to merge this again on our side to test well enough to see if it breaks again

@keith
Copy link
Member Author

keith commented Jul 9, 2019

It looks like I can't test this in our case because we hit this:

ERROR: /root/.cache/bazel/_bazel_root/c99f8fb4c845b3aae6d69f1a3c75aa35/external/com_google_protobuf/BUILD:130:1: C++ compilation of rule '@com_google_protobuf//:protobuf' failed due to unexpected I/O exception: Failed to fetch file with hash '1ef4741cf10441feeca5c8dde2b6c754f0cbd96fcae697d54a604757289a5e9b' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.
java.io.IOException: Failed to fetch file with hash '1ef4741cf10441feeca5c8dde2b6c754f0cbd96fcae697d54a604757289a5e9b' because it does not exist remotely. --experimental_remote_outputs=minimal does not work if your remote cache evicts files during builds.

So if you think this is fixed on your side I guess we can close this

@buchgr
Copy link
Contributor

buchgr commented Jul 10, 2019

@keith in order to remove this error you need to clear your remote cache. which remote cache are you using?

@keith
Copy link
Member Author

keith commented Jul 11, 2019

This is from an internal cache implementation. But in theory anything could be evicted from other caches during a build as well right?

@buchgr
Copy link
Contributor

buchgr commented Jul 11, 2019

@keith
Copy link
Member Author

keith commented Jul 11, 2019

@buchgr do you know how bazel would handle errors from the cache, even if the key itself hasn't been evicted, it's just temporarily inaccessible? Specifically if we returned a 500 in this case would this be better than returning a 404?

@keith
Copy link
Member Author

keith commented Jul 16, 2019

@buchgr I enabled this flag set on envoy and we've started seeing this crash as well https://circleci.com/gh/envoyproxy/envoy/249586 we're pointing to the GCS remote cache. Any ideas on how we can mitigate this?

@buchgr
Copy link
Contributor

buchgr commented Jul 16, 2019

@keith disable garbage collection on GCS. The issue is that GCS does not understand the action graph of what it's caching and evicts entries purely based on time. So it might have a cache entry for an action A1 but deleted all the outputs of this action. If an action A2 now needs the outputs of A1, then Bazel can't download them and also can't re-run action A1 and thus has to print this error.

I ll have to write a blog post about this and discussing potential ways for mitigation. For GCS specifically the only currently viable strategy is to disable garbage collection and wipe the whole cache from time to time. I imagine it would be easy enough to write a cloud function (or something) that reads in the action graph from GCS and properly evicts items from time to time.

@keith
Copy link
Member Author

keith commented May 14, 2020

Pretty sure this is still a problem FWIW

@meisterT meisterT reopened this May 15, 2020
@meisterT
Copy link
Member

Thanks Keith, we should probably document what Jakob described.

@meisterT meisterT added the help wanted Someone outside the Bazel team could own this label May 15, 2020
@bjacklyn
Copy link

bjacklyn commented Aug 7, 2020

Is there any active work on bits without the bytes to make it gracefully handle download errors like the normal build does?

I just uprev'd to v3.4.1 and tried experimenting with this feature (--remote_download_toplevel and --remote_download_minimal). I am able to produce the below crash 100% of the time from fully deleted ~/.cache/bazel directory. However my stacktrace is quite different, does this look like the same issue?

It is not a garbage collection issue in my case because we only cleanup on a cron once/month. Although I suspect it is a similar problem, for example:

  • The AC exists but the CAS doesn't for a file because our nginx kills uploads that are taking too long/too large, or
  • The CAS exists but during download something in our unreliable network kills the download
ERROR: <snip>/BUILD.bazel:499:10: Couldn't build file <snip>/binary: Linking of rule '//<snip>:binary' failed due to Exec failed due to IOException: null
com.google.devtools.build.lib.remote.BulkTransferException
	at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
	at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:228)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:127)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
	at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
	at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:332)
	at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:785)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:948)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:907)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:425)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:926)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:310)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
	Suppressed: io.netty.channel.ExtendedClosedChannelException
		at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

@ulfjack
Copy link
Contributor

ulfjack commented Aug 12, 2020

I'm collecting all issues for build-without-the-bytes in the tracking issue (#6862). I'm making slow progress getting through the backlog. If you have issues that are easy to repro, I can take a look.

@ulfjack
Copy link
Contributor

ulfjack commented Sep 5, 2020

I am unable to reproduce this issue.

@keith
Copy link
Member Author

keith commented Sep 8, 2020

FYI I still see a similar stack to one above here:

WARNING: Reading from Remote Cache:
UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
load_shed
ERROR: /Users/iosci/buildkite/src/iosci84/lyft/ios/Modules/AppShortcuts/BUILD:25:10: Bundling, processing and signing AppShortcutsTests.__internal__.__test_bundle failed due to Exec failed due to IOException: null
com.google.devtools.build.lib.remote.BulkTransferException
	at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
	at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
	at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.start(LocalSpawnRunner.java:315)
	at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.runOnce(LocalSpawnRunner.java:217)
	at com.google.devtools.build.lib.exec.local.LocalSpawnRunner$SubprocessHandler.run(LocalSpawnRunner.java:191)
	at com.google.devtools.build.lib.exec.local.LocalSpawnRunner.exec(LocalSpawnRunner.java:141)
	at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:241)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)

but I can file a new issue if you think it's unrelated

@ulfjack
Copy link
Contributor

ulfjack commented Sep 8, 2020

Is this with the gRPC remote cache? AFAIK, the HTTP remote cache only supports HTTP 1.1. Also, can you provide the full stack trace? I think the interesting part is below where you cut it off.

@keith
Copy link
Member Author

keith commented Sep 8, 2020

This is with GCP RBE via gRPC. Super long log: https://gist.github.com/keith/53adb5908b9419443d682398d7dcb832

@ulfjack
Copy link
Contributor

ulfjack commented Sep 11, 2020

This seems to be working as expected from Bazel's point of view. You should bring this up with, err, RBE team? @EricBurnett @bergsieker

So, I created a remote execution service that artificially causes GOAWAY like the one you are seeing. Bazel retries the error 5 times (for a total of 6 calls) with exponentially increasing delays between calls. The default exponential backoff configuration is still fairly aggressive, with delays of 100ms, 200ms, 400ms, 800ms, and 1600ms. If the remote system is overloaded for a period of time, then it is not entirely surprising if it returns GOAWAY 6 times in ~3.1 seconds.

If this is a common issue, then it would look like a service misconfiguration to me. For example, the service could reduce the number of concurrent streams the client is allowed to open. Or it could use flow control to reduce the rate of incoming packets.

We could also increase the backoff times in Bazel. They look pretty aggressive to me.

@bjacklyn
Copy link

@ulfjack I have a reproducible example, although I haven't been able to narrow it down to a small example repro I can share here. In order to hopefully collect more information I built bazel from source and replaced the netty-all-4.1.34.jar with the latest netty-all-4.1.52.jar and the new netty, however it provides a similar stacktrace without much info.
https://mvnrepository.com/artifact/io.netty/netty-all/4.1.52.Final

200911 22:33:11.745:WT 280 [com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run] Aborting evaluation while evaluating ActionLookupData{actionLookupKey=<snip> BuildConfigurationValue.Key[fef9552d036c94cb695ac8e87cfd860d9e76bea766e0fdd0a035c9050fb3dad6], actionIndex=598}
com.google.devtools.build.lib.skyframe.ActionExecutionFunction$ActionExecutionFunctionException: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: Linking of rule '<snip>' failed: I/O exception during sandboxed execution: null
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:329)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.google.devtools.build.lib.actions.AlreadyReportedActionExecutionException: Linking of rule '<snip>' failed: I/O exception during sandboxed execution: null
        ... 6 more
Caused by: com.google.devtools.build.lib.actions.UserExecException: I/O exception during sandboxed execution: null
        at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:96)
        at com.google.devtools.build.lib.sandbox.SandboxModule$SandboxFallbackSpawnRunner.exec(SandboxModule.java:469)
        at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:241)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
        at com.google.devtools.build.lib.rules.cpp.CppLinkAction.beginExecution(CppLinkAction.java:332)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:785)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:948)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:907)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:425)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:926)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:310)
        ... 5 more
Caused by: com.google.devtools.build.lib.remote.BulkTransferException
        at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:229)
        at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:121)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:225)
        at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.runSpawn(AbstractSandboxSpawnRunner.java:122)
        at com.google.devtools.build.lib.sandbox.AbstractSandboxSpawnRunner.exec(AbstractSandboxSpawnRunner.java:91)
        ... 21 more
        Suppressed: io.netty.channel.StacklessClosedChannelException
                at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
        Suppressed: io.netty.channel.StacklessClosedChannelException
                at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)

I've been able to confirm that HttpCacheClient in its get() method is trying to writeAndFlush() to a channel that is already closed after a succeedAndReset() call on that channel has closed it. I would expect that it shouldn't be possible to acquire a closed channel, but it seems like there is a race condition between a channel being released/closed and re-acquired/used.

@bjacklyn
Copy link

@ulfjack -- I have created a small repo that reproduces the issue by building an open source project (protobufs): https://github.com/bjacklyn/bazel-remote-download-failure-repro

I can file a separate github issue for this. It would be great if you can confirm that you are able to reproduce what I am seeing ^ though. There seem to be a lot of dials and knobs in order to hit the issue, but if my understanding is correct the root cause seems straight forward -- the channel is closed after it has been released which causes the next guy who acquires it to fail.

@ulfjack
Copy link
Contributor

ulfjack commented Oct 5, 2020

@bjacklyn, is that related to this bug? I fixed a bug in the HTTP cache uploader / downloader that was due to a race condition where it tried to acquire a closed channel. That fix is here: e6089c9

Can you still repro the issue after that change?

@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) type: bug and removed help wanted Someone outside the Bazel team could own this type: documentation (cleanup) labels Dec 9, 2020
@coeuvre
Copy link
Member

coeuvre commented Dec 9, 2020

Bazel should run the local action to rebuild the missing entry. I will work on this.

@coeuvre coeuvre self-assigned this Dec 9, 2020
@rmohr
Copy link

rmohr commented Mar 16, 2021

Hit this on 3.7.2 with --remote_download_outputs=toplevel:

20:35:27: ERROR: /home/prow/go/src/github.com/kubevirt/kubevirt/pkg/util/cluster/BUILD.bazel:3:11: GoCompilePkg pkg/util/cluster/go_default_library.a failed due to Exec failed due to IOException: null
20:35:27: com.google.devtools.build.lib.remote.BulkTransferException
20:35:27: 	at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:125)
20:35:27: 	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy$SpawnExecutionContextImpl.prefetchInputs(AbstractSpawnStrategy.java:232)
20:35:27: 	at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:243)
20:35:27: 	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:135)
20:35:27: 	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:102)
20:35:27: 	at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
20:35:27: 	at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
20:35:27: 	at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
20:35:27: 	at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
20:35:27: 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:859)
20:35:27: 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1019)
20:35:27: 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:978)
20:35:27: 	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
20:35:27: 	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
20:35:27: 	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:469)
20:35:27: 	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:845)
20:35:27: 	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:314)
20:35:27: 	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
20:35:27: 	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
20:35:27: 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
20:35:27: 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
20:35:27: 	at java.base/java.lang.Thread.run(Unknown Source)
20:35:27: 	Suppressed: java.io.IOException: Failed to fetch file with hash '1a57a9024a7f44913cbb44b5453177ceeba65e50f83ae3f9e47774a5662ab3db' because it does not exist remotely. --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.
20:35:27: 		at com.google.devtools.build.lib.remote.RemoteActionInputFetcher.prefetchFiles(RemoteActionInputFetcher.java:129)
20:35:27: 		... 21 more

We use greenhouse from kubernetes/test-infra as a cache. It also only evicts based on last-access time without anderstanding the action graph. It indeed looks like this only happens when a build is started and while it is running, the cache evicts entries.

@brentleyjones
Copy link
Contributor

We are still hitting this with 4.0.0, with Google RBE gRPC remote cache:

ERROR: /Users/REDACTED/BUILD:3:13: Bundling REDACTED failed: Exec failed due to IOException: null

@meisterT
Copy link
Member

@coeuvre is there a fix that we can cherrypick in 4.1?

@coeuvre
Copy link
Member

coeuvre commented Apr 1, 2021

Sorry to be late. It looks like there are different issues posted on this thread.

  1. AC exists, but failed to download from CAS. Crash with --experimental_remote_download_outputs=toplevel #8508 (comment), Crash with --experimental_remote_download_outputs=toplevel #8508 (comment)
    • This requires action rewinding and I will try to find time working on this in Q2.
  2. load_shed issue. Crash with --experimental_remote_download_outputs=toplevel #8508 (comment)
    • The changes to correctly handle GOWAY errors are included in 4.1.0rc1. (gRPC dynamic connection pool)
  3. HTTP cache issue. Crash with --experimental_remote_download_outputs=toplevel #8508 (comment)

@brentleyjones, can you please share more details about the error? (e.g. enable --verbose_failures)

@keith Do you mind we close this issue and track it in #8250?

@brentleyjones
Copy link
Contributor

@brentleyjones, can you please share more details about the error? (e.g. enable --verbose_failures)

That was with --verbose_failures on. It's odd that it was a one-liner without a stacktrace.

@keith
Copy link
Member Author

keith commented Apr 1, 2021

happy to track wherever you'd prefer!

@coeuvre
Copy link
Member

coeuvre commented Apr 7, 2021

@brentleyjones Feel free to open a new issue with your problem.

Let's track this issue in #8250. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 We'll consider working on this in future. (Assignee optional) team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

9 participants