Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some DexShardsToMerge/DexMerger outputs do not get downloaded when using BwtB #17603

Closed
lukaciko opened this issue Feb 27, 2023 · 17 comments
Closed
Labels
awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug

Comments

@lukaciko
Copy link

Description of the bug:

When using Bazel 6.0 or later to build an android_binary (with D8, not defining dex_shards), a remote cache and setting --remote_download_toplevel the builds usually fail on the MergeDexZips action with errors like:

ERROR: ... Merging dex shards for //<foo>/app:app failed: (Exit 1): singlejar_local failed: ... No such file or directory
singlejar_local: src/tools/singlejar/input_jar.cc:23: Cannot open input jar bazel-out/darwin_arm64-fastbuild/bin/<foo>/dexfiles/app/3.shard.zip: No such file or directory

Looking at output files and timing profiles this seems to be because not all files of the dexfiles tree artifact are created, either by being downloaded from the remote cache by running the action locally.

This tree artifact is created by the DexShardsToMerge, which uses SpawnActionTemplate to create DexMerger actions. We can see this using aquery:

SpawnActionTemplate with output TreeArtifact <foo>/app/dexfiles/app
  Mnemonic: DexShardsToMerge
  Target: //<foo>/app:app
  Configuration: darwin_arm64-fastbuild
  Execution platform: @local_config_platform//:host
  Inputs: [bazel-out/darwin_arm64-fastbuild/bin/<foo>/app/dexsplits/app, bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/bazel_tools/tools/android/d8_dexmerger, bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/bazel_tools/tools/android/d8_dexmerger.jar, bazel-out/darwin_arm64-opt-exec-2B5CBBC6/internal/_middlemen/external_Sbazel_Utools_Stools_Sandroid_Sd8_Udexmerger-runfiles]
  Outputs: [bazel-out/darwin_arm64-fastbuild/bin/<foo>/app/dexfiles/app (TreeArtifact)]

It looks like this might happen when some files in dexfiles tree artifact get a remote cache hit and others do not. The ones that do not get a hit get created locally, the ones that do get a hit never get downloaded.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

No response

Which operating system are you running Bazel on?

MacOS 13.2.1 & Ubuntu Focal Fossa

What is the output of bazel info release?

release 6.0.0-dev

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

This is a our patched version of the 6.0.0 release binary. We apply a few patches to improve performance of Android builds, fix dexing issues, (#16369 (comment)) and some other issues. We build it with:

bazelisk build //src:bazel --compilation_mode=opt --embed_label="6.0.0-dev" --stamp

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

The last commit on the 6.0.0 release branch fixed a seemingly related issue: 86883db.

There is an ongoing work on BwtB #10880.

I have tried reverting that commit and creating the binary on top of recent master (5900d6d), but ran into the same issue.

Any other information, logs, or outputs that you want to share?

In the timing profile I see that Remote.download and Remote.downloadInMemoryOutput get logged for the files that do not get created. There is a start time for these, but no wall duration.

@lukaciko
Copy link
Author

@tjgq @coeuvre does this sound familiar?

@tjgq
Copy link
Contributor

tjgq commented Feb 27, 2023

This is probably related to #16333. I'm working on a PR to handle partial trees correctly in the prefetcher; expect it to be submitted within a few days.

@sgowroji sgowroji added type: bug team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Feb 27, 2023
@tjgq
Copy link
Contributor

tjgq commented Mar 1, 2023

@lukaciko I think 52deefe should have fixed this, but I can't be 100% sure. Would you be able to build Bazel at head and check whether the issue is gone? If not, can you please provide a reproducer?

@tjgq tjgq added the awaiting-user-response Awaiting a response from the author label Mar 1, 2023
@lukaciko
Copy link
Author

lukaciko commented Mar 2, 2023

@tjgq this seems to fix the issue. I didn't test the head, but your patch on top of 6.1.0rc2. Thank you for looking into it this quickly!

@tjgq
Copy link
Contributor

tjgq commented Mar 2, 2023

Thanks for confirming! I think it's too late to get this into 6.1, but I'll cherry-pick it into 6.2.

@tjgq tjgq closed this as completed Mar 2, 2023
@lukaciko
Copy link
Author

lukaciko commented Mar 6, 2023

@tjgq with the patch we're getting errors on AndroidAapt2 actions:

Processing Android resources for <> failed: Exec failed due to IOException: 2 errors during bulk transfer:
java.io.IOException: <>/bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/androidx_browser_browser/res (File exists)
java.io.IOException: <>/bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/com_google_android_support_wearable/res (File exists)

I modified the binary to print one of the stack traces:

com.google.devtools.build.lib.unix.NativePosixFiles.stat(Native Method)
com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:209)
com.google.devtools.build.lib.unix.UnixFileSystem.isWritable(UnixFileSystem.java:274)
com.google.devtools.build.lib.vfs.Path.isWritable(Path.java:807)
com.google.devtools.build.lib.remote.AbstractActionInputPrefetcher.finalizeDownload(AbstractActionInputPrefetcher.java:526)
com.google.devtools.build.lib.remote.AbstractActionInputPrefetcher.lambda$downloadFileNoCheckRx$4(AbstractActionInputPrefetcher.java:433)
io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:107)
...

I see there's a TODO on the isWritable() call from the stack trace:

// TODO(tjgq): Fix the TOCTTOU race between isWritable and setWritable. This requires keeping
// track of the original directory permissions. Note that nested artifacts are relatively rare
// and will eventually be disallowed (see issue #16729).
if (!parentDir.isWritable()) {

Is that the issue we're seeing and is there a way to work around it? The _aar/unzipped/ directories are tree artifacts created by AarImport.java.

@tjgq
Copy link
Contributor

tjgq commented Mar 6, 2023

@lukaciko The TODO only applies if there are nested artifacts (a declare_file or a declare_directory inside another declare_directory). Does your build succeed with the --experimental_strict_conflict_checks flag? (Please do a clean build, because incrementality is broken for that flag.)

If the build succeeds with --experimental_strict_conflict_checks, then something else is the problem.

@tjgq
Copy link
Contributor

tjgq commented Mar 6, 2023

(To clarify: if there are nested artifacts, that flag will produce an error in the analysis phase, before you get the execution phase error reported above.)

@tjgq
Copy link
Contributor

tjgq commented Mar 6, 2023

I'm also wondering why the reported error is File exists. EEXIST would make sense for mkdir(2) or symlink(2), but not for stat(2) as the stack trace claims.

@lukaciko
Copy link
Author

lukaciko commented Mar 6, 2023

I apologise, the stack trace was from another error probably caused by me messing with the output base. This should be the correct stack trace:

java.io.IOException: /private/var/tmp/_bazel_lukaciko/6e28d036e693e1d95f82047d5dad0b71/execroot/client_android/bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/androidx_media_media/res (File exists)
com.google.devtools.build.lib.unix.UnixFileSystem.createWritableDirectory(UnixFileSystem.java:354)
com.google.devtools.build.lib.vfs.Path.createWritableDirectory(Path.java:447)
com.google.devtools.build.lib.remote.AbstractActionInputPrefetcher.finalizeDownload(AbstractActionInputPrefetcher.java:517)
com.google.devtools.build.lib.remote.AbstractActionInputPrefetcher.lambda$downloadFileNoCheckRx$4(AbstractActionInputPrefetcher.java:433)
io.reactivex.rxjava3.internal.operators.completable.CompletablePeek$CompletableObserverImplementation.onComplete(CompletablePeek.java:107)
io.reactivex.rxjava3.internal.operators.completable.CompletableCreate$Emitter.onComplete(CompletableCreate.java:65)
com.google.devtools.build.lib.remote.util.RxFutures$OnceCompletableOnSubscribe$1.onSuccess(RxFutures.java:85)
...

Building with --experimental_strict_conflict_checks (after a bazel clean) does not fail on our project.

We see this error rather consistently in CI, but it's harder to reproduce locally.

@tjgq
Copy link
Contributor

tjgq commented Mar 6, 2023

Can you figure out what kind of action creates the tree artifact (something like bazel aquery 'outputs(bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/androidx_media_media, deps(//your:target))' should work, assuming androidx_media_media is the tree artifact root)? What I'm trying to establish is whether this tree artifact is materialized as a symlink to another tree artifact (the answer will point to one of two code paths).

Also, does your CI build incrementally? (i.e., does it start with an empty output tree, or does it reuse the output tree from a previous build)?

@tjgq
Copy link
Contributor

tjgq commented Mar 7, 2023

@lukaciko Could you try patching #17678 and let me know if it fixes this issue?

@lukaciko
Copy link
Author

lukaciko commented Mar 7, 2023

Still seeing the issue. Our CI builds incrementally - we have a output base that's reused from previous builds. We do not use disk_cache on CI, just remote_cache & use BwtB. Should we clean outputs / invalidate caches on these internal bazel changes?

This is the action creating the tree artifact:

action 'Extracting AAR Resources for @maven//:androidx_media_media'
  Mnemonic: AarResourcesExtractor
  Target: @maven//:androidx_media_media
  Configuration: android-arm64-v8a-fastbuild
  Execution platform: @local_config_platform//:host
  ActionKey: 5ca496296c180ae21582dcbba7f9771573a8feb49cfb07ce7b58813d9e6f866b
  Inputs: [bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/v1/https/artifactory.spotify.net/artifactory/android-repo/androidx/media/media/1.5.0/media-1.5.0.aar, bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/bazel_tools/tools/android/aar_resources_extractor, bazel-out/darwin_arm64-opt-exec-2B5CBBC6/internal/_middlemen/external_Sbazel_Utools_Stools_Sandroid_Saar_Uresources_Uextractor-runfiles, external/bazel_tools/tools/android/aar_resources_extractor.py]
  Outputs: [bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/assets/androidx_media_media (TreeArtifact), bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/data-binding-br/androidx_media_media (TreeArtifact), bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/data-binding-setter_store/androidx_media_media (TreeArtifact), bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/androidx_media_media (TreeArtifact)]
  Environment: [PATH=/bin:/usr/bin:/usr/local/bin]
  Command Line: (exec bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/bazel_tools/tools/android/aar_resources_extractor \
    --input_aar \
    bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/v1/https/artifactory.spotify.net/artifactory/android-repo/androidx/media/media/1.5.0/media-1.5.0.aar \
    --output_res_dir \
    bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/resources/androidx_media_media \
    --output_assets_dir \
    bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/assets/androidx_media_media \
    --output_databinding_br_dir \
    bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/data-binding-br/androidx_media_media \
    --output_databinding_setter_store_dir \
    bazel-out/android-arm64-v8a-fastbuild/bin/external/maven/_aar/unzipped/data-binding-setter_store/androidx_media_media)
# Configuration: 9212dfaa1946445fad71fe73bfb2f3561a48d8e83192556b2014cbe7c96fcb11
# Execution platform: @local_config_platform//:host

@tjgq
Copy link
Contributor

tjgq commented Mar 7, 2023

It is certainly our intention that incremental builds should never crash, regardless of the preexisting state of the output tree, and including when you upgrade Bazel to a new version.

My best guess is that the createWritableDirectory call (which is actually a stat(2) optionally followed by a mkdir(2) if the directory doesn't yet exist, or a chmod(2) if it does) is racing with another createWritableDirectory call (or a createSymbolicLink call, although I'm doubtful since no symlink seems to be involved) for the same path. But the locking done by DirectoryContext is supposed to prevent that race, as long as there are no nested artifacts.

It sounds like this happens pretty consistently - how hard would it be to provide a minimal repro? I've tried to reproduce it with another Android project, but no luck so far. Alternatively, I could try to give you a patch to augment the stack trace with additional information about the state of the filesystem and other concurrent prefetches, so we can better understand the conditions under which this happens.

@tjgq
Copy link
Contributor

tjgq commented Mar 7, 2023

@lukaciko I've updated #17678 with further modifications (hopefully) fixing the remaining race conditions. Would you mind trying it again?

@lukaciko
Copy link
Author

@tjgq it looks like the issue is gone. Took me a bit of time - ran into an issue when trying it on the 6.1 branch, I believe it was because I didn't include 94c519b. Ending up testing the latest master and that worked. I see this should make it into the 6.2 release, looking forward to that. Thanks!

@tjgq
Copy link
Contributor

tjgq commented Mar 10, 2023

Great to hear! Yes, this fix will be in 6.2.

FYI, I suspect there's one remaining correctness issue in the presence of "templated tree artifacts" (in an Android context, that only applies to dex merging) + BwtB, as hinted at by this comment. I don't know how easy it is to trigger in practice, but I'm also planning to fix it for 6.2, regardless. Just wanted to mention it in case you come across some other weirdness in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-Remote-Exec Issues and PRs for the Execution (Remote) team type: bug
Projects
None yet
Development

No branches or pull requests

3 participants