Pack pipeline cache contents using tar/7z #10925
I've tried to manually pack and unpack cache files to avoid #10841, but strangely the … but its file attributes say that it is a file.
@jneira - to make it easier to test the performance of tar/zip, I created pipeline step templates that handle tarring/untarring files cached with the CacheBeta task. Feel free to give it a try. If you run into a problem, please report it at willsmythe/caching-templates. Disclaimer: this is not an official solution from Microsoft; it simply wraps the Microsoft-provided task. Alternatively, point me to your repo and I can take a look ...
@willsmythe thanks! Actually I am implementing the template steps manually (AFAIU: tar the original folder, put the tar file in another folder, and cache the latter), so maybe I'll give it a try.
Finally I've been able to cache the tar files with bash script steps. As I only need packing to temporarily work around #10841 on Linux and macOS, I will keep the manual hack for now. Thanks anyway @willsmythe

```yaml
# .....
variables:
  STACK_ROOT: /home/vsts/.stack
steps:
- task: CacheBeta@0
  inputs:
    key: |
      "cache"
      $(Agent.OS)
      $(Build.SourcesDirectory)/$(YAML_FILE)
    path: .azure-cache
    cacheHitVar: CACHE_RESTORED
  displayName: "Download cache"
- bash: |
    mkdir -p $STACK_ROOT
    tar -xzf .azure-cache/stack-root.tar.gz -C /
    mkdir -p .stack-work
    tar -xzf .azure-cache/stack-work.tar.gz
  displayName: "Unpack cache"
  condition: eq(variables.CACHE_RESTORED, 'true')
# ....
- bash: |
    mkdir .azure-cache
    tar -czf .azure-cache/stack-root.tar.gz $STACK_ROOT
    tar -czf .azure-cache/stack-work.tar.gz .stack-work
  displayName: "Pack cache"
```

The final cached build is https://dev.azure.com/jneira/haskell-ide-engine/_build/results?buildId=179
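The pack/unpack steps from that workaround can be exercised locally. A minimal sketch (all paths and file contents here are illustrative stand-ins for `$STACK_ROOT` and `.stack-work`, not the actual build layout):

```shell
#!/bin/sh
set -e
# Stand-ins for the two cached folders
mkdir -p demo/stack-root demo/work/.stack-work
echo "pkg-db" > demo/stack-root/db.txt
echo "build-artifact" > demo/work/.stack-work/out.txt

# "Pack cache": one tarball per cached folder, stored under .azure-cache
mkdir -p demo/work/.azure-cache
tar -czf demo/work/.azure-cache/stack-root.tar.gz -C demo stack-root
( cd demo/work && tar -czf .azure-cache/stack-work.tar.gz .stack-work )

# "Unpack cache": extract into a fresh location, as the restore step would
mkdir -p restore
tar -xzf demo/work/.azure-cache/stack-root.tar.gz -C restore
( cd restore && tar -xzf ../demo/work/.azure-cache/stack-work.tar.gz )
cat restore/stack-root/db.txt
cat restore/.stack-work/out.txt
```

Note the `-C` flags keep archive entries relative, so the tarballs restore cleanly regardless of where they are extracted.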
I tried out 7z, zip, and tar using the Archive task on node_modules. For zip and tar the performance is worse; for 7zip it is marginally better. [Timing table comparing: without extra task, with tar (no compression), with 7zip, with zip — values not recovered]
This feature is merged and will be available in the v2.157.0 agent, which should be rolling out everywhere this week. The functionality is currently "opt in": you need to set the AZDEVOPS_PIPELINECACHE_PACK variable. IMPORTANT: this variable is only checked on "cache save", which only runs if needed (i.e. a cache entry with the same key doesn't already exist) and the build status is successful. On "cache restore", regardless of this variable's value, the cache's contents are untarred whenever the cache entry metadata indicates the contents are packed.
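In pipeline YAML, opting in would look something like the fragment below. The variable name comes from the proposal in this issue; the exact value the agent expects is assumed here:

```yaml
# Opt in to cache-content packing (agent v2.157.0+).
# Only consulted on "cache save"; restore always unpacks when the
# entry metadata says the contents are packed.
variables:
  AZDEVOPS_PIPELINECACHE_PACK: 'true'   # value assumed; any truthy setting may differ
```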
We've been using 7z compression (as above) since it has the best timings. However, running the 7z task to compress after a cache miss is expensive, and the problem is that we get cache misses all the time: we might get a cache hit and then 20 minutes later a cache miss, and we end up compressing again and again on different pipelines for the exact same cache key, which in the end makes the builds a lot slower than if we weren't 7z-ing. Is it worth me trying the built-in tar? Is it likely quicker than my custom setup above, where I have two separate jobs to untar/fetch from Azure?
@fadnavistanmay Close this out when we've deployed to all rings |
We're rolling out tarring as the default with agent 2.160. If this is what you want, you can just remove this env var.
This was released as part of agent 2.160.0. Thanks!
Basic information
Question, bug, or feature? : Feature
Task name: CacheBeta/Cache
Environment
Hosted
Description
To improve cache restore/save performance, especially for caches with a large number of small files (like node_modules), the Cache task should have built-in support for "packing" the cache contents, meaning all files under the specified "path" are consolidated into a single file, and only this file is stored in the cache on the server.
Why?
For performance reasons, "tar" should be used on Linux and macOS, and "7z" on Windows.
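A quick local illustration of the consolidation idea: a node_modules-like tree of many small files becomes a single archive, so the cache service handles one upload/download instead of hundreds of per-file transfers. (The directory layout here is invented for the demo.)

```shell
#!/bin/sh
set -e
# Simulate a cache path with many small files
mkdir -p cache-demo/node_modules
i=0
while [ "$i" -lt 200 ]; do
  echo "module $i" > "cache-demo/node_modules/file$i.js"
  i=$((i + 1))
done

# Unpacked: 200 individual objects to transfer.
# Packed: one archive to transfer.
tar -czf cache-demo/contents.tar.gz -C cache-demo node_modules
ls cache-demo/node_modules | wc -l
ls -l cache-demo/contents.tar.gz
```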
Turning on cache content packing
For now, the option for packing a cache's contents should be controlled via an environment variable (e.g. AZDEVOPS_PIPELINECACHE_PACK), with a decision coming later about whether to always pack or give developers the option (likely via an input on the task).
Changes to the generated cache fingerprint
Since packing changes the actual contents of the cache (i.e. a single tar or 7z file versus many individual files), the task (technically the agent plugin) needs to append an appropriate segment to the developer-provided key to ensure a different fingerprint is produced (which logically makes sense, since the cache's contents on the server are different from the "same" cache whose contents weren't packed). We should establish a "namespace" for these key segments injected by the task, and then define different key segments for the different pack formats, for example:

- microsoft.azure.pipelines.caching.pack=tar (on POSIX)
- microsoft.azure.pipelines.caching.pack=7z (on Windows)

The naming convention for key segments follows the convention for Docker labels and gives us room to support other key segments in the future. Developers should be blocked from specifying key segments in this namespace.
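The injection step can be sketched as follows. This is a hedged approximation of the proposed behavior, not the agent plugin's real code; the segment names come from the issue, while the key format and helper function are invented for illustration:

```shell
#!/bin/sh
# Pick the pack key segment for a given OS name (as from `uname -s`
# on POSIX, or a non-matching value on Windows).
pack_segment_for() {
  case "$1" in
    Linux|Darwin) echo 'microsoft.azure.pipelines.caching.pack=tar' ;;
    *)            echo 'microsoft.azure.pipelines.caching.pack=7z'  ;;
  esac
}

# Hypothetical developer-provided key, with the injected segment appended
# before the fingerprint is computed.
user_key='"cache" | Linux | yaml-file-hash'
echo "$user_key | $(pack_segment_for "$(uname -s)")"
```

Because the injected segment participates in the fingerprint, a packed cache and an unpacked cache with the same developer key resolve to different entries, exactly as the paragraph above requires.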
All of this should be somewhat transparent to the developer (but should still be reported in the logs so developers understand why turning pack on or off changes the cache's identifier). Developers should continue to use variables like $(Agent.OS) in their cache key when they know the cache's contents are different for different OSes (and not just rely on the auto-injected pack key segment creating this differentiation).
Runtime behavior
When cache packing is enabled:
On restore
The task (technically the agent plugin) should append an appropriate key segment to the developer-provided key (and optional "restore keys") based on the preferred pack technology for the environment (tar on posix, 7z on Windows).
This generated fingerprint will then be looked up on the server as usual. If there is a cache hit, the downloaded contents will be appropriately unpacked and dropped into the developer-specified path.
On save
Like during restore, the task should append an appropriate key segment based on the preferred pack technology. If a cache with this key doesn't already exist on the server, the task should pack the files in the specified path and upload this single file as the contents of the new cache.
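The restore and save flows together can be sketched end to end. Here a local directory stands in for the cache server, and a plain file-existence check stands in for the server-side key lookup; all names and the key format are illustrative, not the agent's real implementation:

```shell
#!/bin/sh
set -e
# Local stand-in for the cache server and a fingerprint with the
# injected pack segment already appended.
SERVER=./fake-server
KEY='demo.pack=tar'
mkdir -p "$SERVER" workdir/path
echo "artifact" > workdir/path/a.txt

# On restore: on a hit, unpack the single archive into the target path.
if [ -f "$SERVER/$KEY.tar.gz" ]; then
  mkdir -p restored
  tar -xzf "$SERVER/$KEY.tar.gz" -C restored
  echo "restored $KEY"
else
  echo "cache miss for $KEY"
fi

# On save: only if no entry exists for this key, pack the path and
# upload the single archive as the new cache's contents.
if [ ! -f "$SERVER/$KEY.tar.gz" ]; then
  tar -czf "$SERVER/$KEY.tar.gz" -C workdir/path .
  echo "saved $KEY"
fi
```

On the first run this misses and saves; a second run of the same script would hit and skip the save, mirroring the "only runs if needed" behavior described for cache save.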