Add build/test Windows CI #11009

Closed
ScottTodd opened this issue Nov 2, 2022 · 8 comments
Labels: enhancement ➕ (New feature or request) · infrastructure (Relating to build systems, CI, or testing) · platform/windows 🚪 (Windows-specific build, execution, benchmarking, and deployment)

@ScottTodd (Member)

GitHub Actions managed runner images look pretty well set up for development already: https://github.com/actions/runner-images/blob/main/images/win/Windows2022-Readme.md

We could use Docker to manage other dependencies as needed (CUDA/Vulkan SDK, Python packages, etc.).

I'd probably start with build_all and test_all from https://github.com/iree-org/iree/blob/main/.github/workflows/ci.yml, which make use of a common set of build and test scripts.

After getting something working, we can try larger runners (self-hosted or managed), depending on how slow the builds are.

I don't expect much will be shared between pre/post-submit CI and release builds.
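
For reference, the rough shape of what such a job would run on a Windows machine (a sketch only; the flags and directory layout here are illustrative assumptions, not the contents of the actual CI scripts):

```bash
# Illustrative sketch of a Windows build/test job, assuming a CMake + Ninja
# build inside a developer prompt (vcvarsall) with ctest for the test phase.
cmake -G Ninja -S . -B ../iree-build \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DIREE_BUILD_COMPILER=ON
cmake --build ../iree-build
ctest --test-dir ../iree-build --output-on-failure --timeout 900
```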

ScottTodd added the enhancement ➕ (New feature or request), infrastructure (Relating to build systems, CI, or testing), and platform/windows 🚪 (Windows-specific build, execution, benchmarking, and deployment) labels Nov 2, 2022
ScottTodd self-assigned this Nov 2, 2022
ScottTodd reopened this Nov 2, 2022
ScottTodd added a commit that referenced this issue Nov 3, 2022
Progress on #11009

Tested here:
https://github.com/ScottTodd/iree/actions/runs/3380820232/jobs/5614038219

Post-submit only for now, so we can monitor it. I saw one build hang in
the "Configuring MSVC" step:
https://github.com/ScottTodd/iree/actions/runs/3380762369/jobs/5613916852
ScottTodd added a commit that referenced this issue Nov 7, 2022
…1032)

Progress on #11009, depends on
#11048

Changes:
* `build_runtime` + `test_runtime` -> `build_test_runtime` (overhead from repository cloning, artifact upload, and artifact download was taking longer than just running the tests from the same job)
* `build_runtime_windows` -> `build_test_runtime_windows`
  * Runs on `managed-windows-cpu` (larger build machine)
* Runs tests, instead of just builds (now that all runtime tests pass on Windows)
* Runs on presubmit now too, instead of just postsubmit (the build appears to be stable)

Sample run:
https://github.com/iree-org/iree/actions/runs/3412369869/jobs/5677798847
@ScottTodd (Member Author) commented Nov 7, 2022

Some tests are failing on Windows. These should either be fixed or disabled prior to adding CI:

https://github.com/iree-org/iree/actions/runs/3414442364/jobs/5682427788

ScottTodd added a commit that referenced this issue Nov 8, 2022
Progress on #11009 (see some test
runs of a `build_test_all_windows` CI job on
#11050)

* Tag `simple_embedding_vulkan_test` as using Vulkan, so it can be filtered
* Initialize a struct in pack_test.cc
* Fix PYTHONPATH separator for iree_local_py_test on Windows (while keeping CMake happy)
ScottTodd added a commit that referenced this issue Nov 8, 2022
Progress on #11009

This fixes these two tests on Windows:
* `iree/compiler/Dialect/HAL/Target/LLVM/test/smoketest_system.mlir.test`
* `iree/tests/e2e/regression/libm_linking.mlir.test`

Developers should run vcvarsall before ctest, or configure their IDE as
needed. For VSCode, I have this set:
```json
    "cmakeExplorer.extraCtestEnvVars": {
      "VCTOOLSINSTALLDIR": "C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\Preview\\VC\\Tools\\MSVC\\14.31.31103\\",
      "UNIVERSALCRTSDKDIR": "C:\\Program Files (x86)\\Windows Kits\\10\\",
      "UCRTVersion": "10.0.19041.0",
    },
```
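
Running ctest from a plain shell instead of VSCode needs the same variables exported first; a hedged equivalent (the paths are just the machine-specific examples above, not defaults):

```bash
# Sketch: export the MSVC/SDK locations that vcvarsall would normally set,
# then run ctest. Paths are the machine-specific examples from the settings
# above, not values to copy verbatim.
export VCTOOLSINSTALLDIR="C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\Preview\\VC\\Tools\\MSVC\\14.31.31103\\"
export UNIVERSALCRTSDKDIR="C:\\Program Files (x86)\\Windows Kits\\10\\"
export UCRTVersion="10.0.19041.0"
ctest --test-dir ../iree-build --output-on-failure
```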
@ScottTodd (Member Author)

These tests (also mentioned above, with links to issues) are still failing:

iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task
iree/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task
iree/tests/e2e/regression/check_regression_llvm-cpu_lowering_config.mlir
iree/tests/e2e/tosa_ops/check_vmvx_local-sync_microkernels_fully_connected.mlir
iree/runtime/bindings/python/vm_types_test

but I'd still like to use ctest_all.sh for Windows CI.

I see a few ways to skip those tests / mark them as XFAIL, but I'm not sure which to use.

  • Add a nowindows tag to the individual test cases
  • Exclude entire suites of tests on Windows (e.g. all of e2e_matmul)
  • Add explicit filters to ctest_all, like how we filter some tests for ASan:
    # These tests currently have asan failures
    declare -a excluded_tests=(
    # TODO(#5716): Fix flaky ASan crash in these tests
    "iree/tests/e2e/models/collatz.mlir.test"
    "iree/tests/e2e/models/edge_detection.mlir.test"

Adding explicit filters in the script seems the easiest option, but then developers running ctest manually will still see failures. On the other hand, leaving the failures visible locally would make it easier to work on fixing them.
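
A rough sketch of the third option (explicit filters in `ctest_all.sh`), modeled on the existing ASan exclusion list; the variable names and regex handling below are assumptions for illustration, not the final script change:

```bash
# Sketch: exclude known-failing Windows tests via a ctest -E regex, mirroring
# how the ASan exclusions work. Test names are from the list above (excerpt).
declare -a excluded_tests=(
  "iree/tests/e2e/matmul/e2e_matmul_direct_i8_small_ukernel_vmvx_local-task"
  "iree/tests/e2e/matmul/e2e_matmul_direct_f32_small_ukernel_vmvx_local-task"
  "iree/tests/e2e/regression/check_regression_llvm-cpu_lowering_config.mlir"
)
# Join the names into one anchored regex: ^(a|b|c)$
excluded_tests_regex="^($(IFS="|"; echo "${excluded_tests[*]}"))$"
ctest --test-dir "${BUILD_DIR?}" --output-on-failure -E "${excluded_tests_regex}"
```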

ScottTodd added a commit that referenced this issue Nov 11, 2022
Progress on #11009 (see other
implementation ideas for filtering at
#11009 (comment))
ScottTodd added a commit that referenced this issue Nov 12, 2022
Progress on #11009

Sample successful run:
https://github.com/iree-org/iree/actions/runs/3448018286/jobs/5754640528

At first this will only run on postsubmit, then we can move this to
presubmit if it is stable and we have enough resources. We should also
keep an eye on how long this takes to run (checkout seems to take
substantially longer than on Linux), and also consider using Docker to
manage the dependencies and environment.

Notes:
* This does not test Vulkan or CUDA - we'd want a runner with a physical GPU (or SwiftShader) for that
* This downloads CUDA deps on demand via CMake (so there's some additional network activity)
@ScottTodd (Member Author)

We have two CI jobs now: build_test_runtime_windows and build_test_all_windows. The runtime build runs on presubmit, but the "all" build is a bit too slow right now to comfortably run on presubmit (it only runs on postsubmit right now - a step above running on a cron, at least).

Timing varies from run to run, but we're looking at roughly:

* 1m: check out repo
* 5m: update submodules
* 20-30m: build project
* 30s: run tests

I'd like to aim for 15 minutes total, but could tolerate 25 minutes. We have longer builds (especially multistage builds like those involved in benchmarking), but spending so much time downloading and rebuilding is wasteful.

To speed up the build, I just evaluated ccache on Windows with MSVC and was able to get it working on my local machine. We normally use GCS to host our remote cache though, which GitHub-hosted runners don't have write access to:

export CCACHE_REMOTE_STORAGE="http://storage.googleapis.com/iree-sccache/ccache"
# Note ccache is read-only here since this job runs on a GitHub-hosted runner
# and GitHub runners don't have write access to GCS

To speed up the submodule checkout, I've tried a few different variations on https://github.com/actions/checkout settings and explicit git submodule update settings. The next step might be to try using the cache action: https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows, or a Docker image with some initial submodule data populated.
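
For reference, the explicit submodule update being experimented with boils down to something like this (the exact flag combination is still in flux, so treat these as illustrative):

```bash
# Sketch: shallow, parallel submodule update in place of actions/checkout's
# built-in submodule handling. Flag choices here are illustrative, not final.
git config --global core.longpaths true   # avoid MAX_PATH issues on Windows
git submodule sync --recursive
git submodule update --init --depth 1 --jobs 8
```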

ScottTodd added a commit that referenced this issue Jan 14, 2023
One step towards enabling [ccache](https://ccache.dev/) on our Windows
CI, but there are a few details to still work through:
#11009 (comment).

On my machine, I see these results (sample size 1):

> clean build (no cache): 528 seconds
> ```
> λ ccache --show-stats
> Cacheable calls:   3942 / 3943 (99.97%)
>   Hits:               2 / 3942 ( 0.05%)
>     Direct:           0 /    2 ( 0.00%)
>     Preprocessed:     2 /    2 (100.0%)
>   Misses:          3940 / 3942 (99.95%)
> Uncacheable calls:    1 / 3943 ( 0.03%)
> Local storage:
>   Cache size (GB): 2.21 / 5.00 (44.21%)
>   Cleanups:          16
> ```
> clean build (with cache): 96 seconds
> ```
> λ ccache --show-stats
> Cacheable calls:   3942 / 3943 (99.97%)
>   Hits:            3939 / 3942 (99.92%)
>     Direct:        3939 / 3939 (100.0%)
>     Preprocessed:     0 / 3939 ( 0.00%)
>   Misses:             3 / 3942 ( 0.08%)
> Uncacheable calls:    1 / 3943 ( 0.03%)
> Local storage:
>   Cache size (GB): 2.21 / 5.00 (44.23%)
> ```

My only changes to enable ccache were:
* Download ccache.exe and put it on my `PATH`
* Configure CMake with `-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` (added to `"cmake.configureArgs": [ ... ]` in VSCode settings)
ScottTodd added a commit that referenced this issue Jan 23, 2023
This uses GitHub's [actions/cache](https://github.com/actions/cache)
together with [ccache](https://ccache.dev/) to speed up our
`build_test_all_windows` GitHub Actions CI job. I also tested caching
with the `build_test_runtime_windows` job, but benefits were negligible
there.

We use ccache for our CMake Linux jobs, but those jobs are running on
self-hosted runners and not GitHub-managed runners. The self-hosted
runners have write access to the GCS bucket we store our remote cache
in, while the GitHub-managed runners do not. The
[actions/cache](https://github.com/actions/cache) action can be used to
similarly store a remote cache, though with more steps in the job
definitions.

Git submodules have been taking much longer to update on Windows than on
Linux (6-10 minutes vs 1-2 minutes). We can similarly use the cache
action to start the `.git/modules/` directory with an initial set of
files, at least until git or
[actions/checkout](https://github.com/actions/checkout) improves
behavior on Windows.

This uses two caches: one for git submodules and one for ccache. The
caches are always read/restored, and they are only written/saved when
`needs.setup.outputs.write-caches` is true (currently only when running
workflows on postsubmit).

Note: we have 10GB of cache per repository, which is space for about 4
commits worth of cache entries at current sizes (2.4GB). I'm using
`ccache_all_windows_${{ github.sha }}` as the primary key for immutable
cache entries, then `ccache_all_windows` as the "restore" key pattern,
which will match the most recently added cache entry. Cache entries can
be managed at https://github.com/iree-org/iree/actions/caches.

Progress on #11009. Once this
lands we can probably move the `build_test_all_windows` job to run on
presubmit.

## Experimental results:

Note: these are best-case results. I've also observed many cache misses
where hits would be expected, so more analysis will be needed.

### `build_test_runtime_windows`

cache size: 27MB (git) + 59MB (ccache) = 86MB (total)

Configuration | Logs | total time | submodule checkout timing | build timing
------------- | ---- | ---------- | ------------------------- | ------------
baseline | [logs](https://github.com/iree-org/iree/actions/runs/3956450018/jobs/6775683130) | 4m 20s | 35s | 1m 13s
new (cache miss) | [logs](https://github.com/iree-org/iree/actions/runs/3963023407/jobs/6790395857) | 5m 26s | 39s | 1m 50s
new (cache hit) | [logs](https://github.com/iree-org/iree/actions/runs/3963233498/jobs/6790837992) | 4m 5s | 20s | 42s

### `build_test_all_windows`

cache size: 230MB (git) + 2167MB (ccache) = 2397MB (total)

Configuration | Logs | total time | submodule checkout timing | build timing
------------- | ---- | ---------- | ------------------------- | ------------
baseline | [logs](https://github.com/iree-org/iree/actions/runs/3956450018/jobs/6775681967) | 31m 16s | 5m 58s | 22m 46s
new (cache miss) | [logs](https://github.com/iree-org/iree/actions/runs/3963023407/jobs/6790395752) | 30m 8s | 6m 30s | 14m 55s
new (cache hit) | [logs](https://github.com/iree-org/iree/actions/runs/3963233498/jobs/6790837849) | 14m 9s | 1m 15s | 4m 34s

Note: 5 minutes of the total time is spent uploading cache data, which
will only happen on postsubmit.
ScottTodd added a commit that referenced this issue Mar 3, 2023
As we're using shallow clones, this is providing dubious value. The 1.7GB of cache data is also cutting into our repo limit of 10GB, which may be better used by ccache (recent runs are reliably getting only 16% cache hits... we need to investigate that too).

Test run:
https://github.com/openxla/iree/actions/runs/4326504405/jobs/7554009641

Progress on #11009
ScottTodd added a commit that referenced this issue Mar 3, 2023
We've been going over the max cache size (see "cleanups" in [these
logs](https://github.com/openxla/iree/actions/runs/4326924360/jobs/7554950036#step:9:8614)):
```
+ ccache --show-stats
Cacheable calls:   4826 / 4885 (98.79%)
  Hits:             594 / 4826 (12.31%)
    Direct:          83 /  594 (13.97%)
    Preprocessed:   511 /  594 (86.03%)
  Misses:          4232 / 4826 (87.69%)
Uncacheable calls:   59 / 4885 ( 1.21%)
Local storage:
  Cache size (GB): 2.65 / 3.00 (88.43%)
  Cleanups:          85
```

[ccache](https://ccache.dev/) by default compresses with zstd level 1,
but we can increase that:
https://ccache.dev/manual/4.2.1.html#config_compression_level
> As a rule of thumb, use level 5 or lower since higher levels may slow
down compilations noticeably.

On my dev machine, level 5 saves 600MB:
```
λ ccache --recompress 5
Recompressing... 100.0% [=========================================================================]

Original data:          24.6 GB
Old compressed data:     4.4 GB (18.1% of original size)
  - Compression ratio: 5.537 x  (81.9% space savings)
New compressed data:     3.8 GB (15.5% of original size)
  - Compression ratio: 6.439 x  (84.5% space savings)
Size change:          -623.5 MB
```

Sample run:
https://github.com/openxla/iree/actions/runs/4327549301/jobs/7556311688#step:9:8617
(starting from existing compression level cache)

---

Progress on #11009
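
For anyone reproducing the change above, the compression level can be raised with standard ccache options (whether the CI job sets it exactly this way isn't shown here):

```bash
# Sketch: bump ccache's zstd compression level. CCACHE_COMPRESSLEVEL applies to
# newly written entries; --recompress rewrites existing ones (as shown above).
export CCACHE_COMPRESSLEVEL=5
ccache --recompress 5
ccache --show-stats
```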
@ScottTodd (Member Author)

> I'd like to aim for 15 minutes total, but could tolerate 25 minutes.

We're at around 17-20 minutes now:

* 1m: check out repo
* 1m: fetch ccache (from GitHub)
* 5-8m: update submodules
* 6m: build project
* 30s: run tests
* 3m: save ccache (to GitHub)

The submodule update taking so long is unfortunate (it's only slower on the larger GitHub-managed runners, but we can't build in any reasonable time on the standard sized runners). We asked GitHub support if they could help investigate that, but we haven't heard back from their engineering team yet.


Beyond that, I'd like to tweak the ccache setup slightly before turning this build on for presubmits. The current behavior uses the IREE commit as the cache key, which results in every commit creating a new cache entry (evicting previous cache entries, since each entry is ~3.4GB and we have 10GB total). If there is no exact match, the cache fetch falls back to the latest cache entry. This is reasonable for postsubmit (where code is always moving forward), but could be improved for presubmit (where base commits can vary). We could change the cache key to use the LLVM commit instead, which would increase the likelihood that a PR would be able to fetch a relevant cache entry.

On our Linux runners, we use a remote ccache in one of our cloud buckets (without the 10GB limit). We can't currently do that for Windows since the GitHub-managed runners don't have write access to those buckets.
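
A sketch of how that key could be derived (the LLVM submodule lives at `third_party/llvm-project`; the key format mirrors the current `ccache_all_windows_*` naming):

```bash
# Sketch: key the ccache entry on the pinned LLVM commit rather than the IREE
# commit, so PRs with different base commits can still restore a relevant cache.
LLVM_SHA="$(git rev-parse HEAD:third_party/llvm-project)"
CACHE_KEY="ccache_all_windows_${LLVM_SHA}"
echo "Would restore/save cache entry: ${CACHE_KEY}"
```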

@ScottTodd (Member Author)

Windows CI cache behavior regressed around 2eb6450 (from #12562).

Before:

+ ccache --show-stats
Cacheable calls:   4857 / 4916 (98.80%)
  Hits:            4813 / 4857 (99.09%)
    Direct:        4465 / 4813 (92.77%)
    Preprocessed:   348 / 4813 ( 7.23%)
  Misses:            44 / 4857 ( 0.91%)
Uncacheable calls:   59 / 4916 ( 1.20%)
Local storage:
  Cache size (GB): 3.62 / 4.00 (90.45%)

After:

+ ccache --show-stats
Cacheable calls:   3915 / 4917 (79.62%)
  Hits:            3868 / 3915 (98.80%)
    Direct:        3519 / 3868 (90.98%)
    Preprocessed:   349 / 3868 ( 9.02%)
  Misses:            47 / 3915 ( 1.20%)
Uncacheable calls: 1002 / 4917 (20.38%)
Local storage:
  Cache size (GB):  4.0 /  4.0 (99.88%)
  Cleanups:           1
  Hits:            3868 / 3915 (98.80%)
  Misses:            47 / 3915 ( 1.20%)

We can check which calls couldn't be cached by inspecting the ccache log file. Ideally the cache size would be smaller too... looks like we're hitting the limit we set again (4GB this time).
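
Something like the following would capture that log in a CI run (CCACHE_LOGFILE is a standard ccache option; the grep pattern is an approximation of the log format):

```bash
# Sketch: record a ccache log during the build, then summarize result lines to
# see which calls were uncacheable or missed.
export CCACHE_LOGFILE="${RUNNER_TEMP:-/tmp}/ccache.log"
cmake --build ../iree-build
grep -i "result:" "${CCACHE_LOGFILE}" | sort | uniq -c | sort -rn | head
```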

qedawkins pushed a commit to qedawkins/iree that referenced this issue Apr 2, 2023
qedawkins pushed a commit to qedawkins/iree that referenced this issue Apr 2, 2023
qedawkins pushed a commit to qedawkins/iree that referenced this issue Apr 2, 2023
@ScottTodd (Member Author)

actions/checkout@ef43818 may improve checkout time. Time to run some more experiments :)

@ScottTodd (Member Author)

Nope, still catastrophically slow :(

GMNGeoffrey added a commit that referenced this issue Apr 20, 2023
The GitHub-provided `actions/checkout` action is for some reason unusably slow on the large managed Windows runners. We assumed this was because of some tricky IO issue or something, but I decided to just try using `git` commands directly, and lo, the checkout time goes from 10 minutes to 1.5 🚀

With the caching improvements from
#13183, this gets the Windows build
down under 10 minutes, which means we can run it on presubmit (left for
a future PR).

Part of #11009

Tested:
Enabled this workflow on push to my branch:
https://github.com/openxla/iree/actions/runs/4750681034/jobs/8439091687

skip-ci: this only affects the Windows job, which isn't run on presubmit
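
For reference, "directly using `git` commands" amounts to roughly the following (a sketch; the exact commands in the landed workflow may differ):

```bash
# Sketch of a manual shallow checkout replacing actions/checkout on the Windows
# runners. The GITHUB_* variables are the standard Actions environment variables.
git init "${GITHUB_WORKSPACE}"
cd "${GITHUB_WORKSPACE}"
git remote add origin "https://github.com/${GITHUB_REPOSITORY}.git"
git fetch --depth 1 origin "${GITHUB_SHA}"
git checkout --force FETCH_HEAD
```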
jpienaar pushed a commit that referenced this issue May 1, 2023
jpienaar pushed a commit that referenced this issue May 1, 2023
jpienaar pushed a commit that referenced this issue May 1, 2023
jpienaar pushed a commit that referenced this issue May 1, 2023
jpienaar pushed a commit that referenced this issue May 1, 2023
rengolin pushed a commit to plaidml/iree that referenced this issue May 2, 2023
rengolin pushed a commit to plaidml/iree that referenced this issue May 2, 2023
rengolin pushed a commit to plaidml/iree that referenced this issue May 2, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023
NatashaKnk pushed a commit to NatashaKnk/iree that referenced this issue Jul 6, 2023
@ScottTodd (Member Author)

Going to call this fixed, though the compiler build is still postsubmit only.
