
Proposals/pipeline caching #113

Merged
merged 13 commits into from
Apr 26, 2019

Conversation

mitchdenny
Member

This is our draft proposal for Pipeline Caching. The contents have been adapted from our internal planning docs but this will be the spec for the feature moving forward. Comments and suggestions welcome. You can see from the commit history that we started with a somewhat complex model and worked to simplify it so that it was easy to adopt. Features specified in the more complex model may be added later.

@mitchdenny mitchdenny self-assigned this Jan 31, 2019
@mitchdenny mitchdenny added the enhancement New feature or request label Jan 31, 2019
Member

@baywet baywet left a comment


It'd be interesting to be able to label caches. For example, if in the same repository I have a front end application (JS) and a back end app (dotnet), maybe I'll want to use a single build definition, and I want to be able to label those caches with unique names for the build to avoid restoring the wrong cache in the wrong place.

@JonathanGiles

I can't wait to see Maven caching for Java builds! Great news!

@MichelZ

MichelZ commented Feb 6, 2019

While this strategy is certainly something, it involves a lot of manual work (determine what to cache, cache it, restore cache).

Why not do it the same way other CI providers do? Create a permanent disk per pipeline, or something similar, and mount this disk to the hosted agent on every build. This disk could cache packages, could host source code (source code retrieval sometimes takes a while, too), and whatever else you need.

@Bjego

Bjego commented Feb 6, 2019

It'd be interesting to be able to label caches. For example, if in the same repository I have a front end application (JS) and a back end app (dotnet), maybe I'll want to use a single build definition, and I want to be able to label those caches with unique names for the build to avoid restoring the wrong cache in the wrong place.

You can hash your cache; I think that's a clean way of using it multiple times. For example, hash your package.json for node_modules and your csproj files for NuGet.

I like this approach. It's a clean and simple way to include caching if needed.

  1. Load content from cache
  2. Download missing content
  3. Persist content in the cache

Also, a caching time of 7 days is pretty long. It could be 48 hours. If you don't build your code often (multiple times a day), you could even wait 2 or 3 minutes more to get your build done.
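The hash-keyed, load/download/persist flow described above is close to what the feature eventually exposed; a minimal sketch using the later Cache@2 task syntax (cache path, key segments, and the yarn example are illustrative assumptions, not from this thread):

```yaml
steps:
  - task: Cache@2
    displayName: Cache yarn packages
    inputs:
      # key segments are pipe-separated; a changed yarn.lock busts the cache
      key: 'yarn | "$(Agent.OS)" | yarn.lock'
      # fallback prefixes tried on a miss, most recent upload wins
      restoreKeys: |
        yarn | "$(Agent.OS)"
      path: $(Pipeline.Workspace)/.yarn-cache
  - script: yarn install --frozen-lockfile
    displayName: Install dependencies
```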

@dimitriy-k

@eps1lon For now I use the RestoreAndSaveCache@1 task, and it cannot cache outside of the project root, so that's why copying to the project root is needed. Not sure if CacheBeta@0 can work outside of the project root. And as you see, it probably works, but not with the correct permissions :(

@ulrikstrid
Contributor

I'm seeing some weird errors in my cache using esy. The cache is pretty big (almost 4 GB), and this in turn seems to lead to build errors down the line.

https://dev.azure.com/strid/reason-native-web/_build/results?buildId=827&view=logs&j=a5e52b91-c83f-5429-4a68-c246fc63a4f7

@lukeapage

@eps1lon we get the best gains installing Cypress inside node_modules and then caching all of node_modules. Can't comment on attributes, as we've been using Windows fine. (Note 2: to get node_modules fast we also 7zip it before caching.)

@eps1lon

eps1lon commented Aug 13, 2019

@eps1lon we get the best gains installing Cypress inside node_modules and then caching all of node_modules.

@lukeapage This is for an experimental project where I switched to yarn v2. There are no more node_modules/ and I'm sitting at 6s install times before Cypress.

Do you include the Cypress binary in your node_modules/ or does that still need to be rebuilt on every run?

@ruffsl

ruffsl commented Aug 23, 2019

I gave this a quick read-through, but what kinds of key lookup behaviors will be supported? Could each cache key be namespaced, with retrieval prefix-matched? I know some other CI services only support write-only caches, so one must append an epoch counter to the end of the key/namespace to guarantee a new cache is written when desired, even when the derived key prefix is deterministically the same.

https://circleci.com/docs/2.0/caching/#restoring-cache

I think one reason for write-only caches is that if the key is exactly the same as an existing cache, the CI worker knows it can skip the store_cache step, saving the time that would have been spent uploading or even compressing/hashing the cache blob.

I'm not sure how viable this would be, but it might be cool if the caching were file-diff aware, so that it rsync'ed only the files that differ from the closest cache; thus only those changed files would need to be compressed and transported over the network. The store and restore commands would use the key prefix lookup to find the closest cache's digest manifest, and use that to figure out what needs to be pushed or pulled to the appended cache layer. The CI backend could flatten the trailing layers to remove expired versions of the cache in a sliding-window fashion.

@johnterickson

@ruffsl You are exactly right on all parts! 🥇

  • Keys are namespaced by branches.
  • "key" is an exact lookup, but (soon-to-be-documented) restoreKeys are prefix-matched, taking the most recently uploaded entry when the prefix matches multiple entries.
  • The cache is write-only, and thus save can be skipped.
  • We already do variable-length chunk dedup à la rsync. You should see a summary at the end. In this example, 1.3 GB of local content turned into only 303 MB of transfer to the server through a combination of finding identical chunks and compressing new chunks. This is the same tech that Pipeline Artifacts and Universal Packages use.
Upload statistics:
Total Content: 1,365.2 MB
Physical Content Uploaded: 288.4 MB
Logical Content Uploaded: 591.6 MB
Compression Saved: 303.2 MB
Deduplication Saved: 773.7 MB
Number of Chunks Uploaded: 6,842

https://dev.azure.com/codesharing-su0/cachesandbox/_build/results?buildId=15838&view=logs&j=1600059a-7091-5d41-ccb9-416f6972ece0&t=ca51db84-207b-49cc-a0dd-6580daad2d98&l=70

@marceloavf

How do you install Cypress inside node_modules, @lukeapage?

@lukeapage

@marceloavf @eps1lon
Yes..
We set the variable CYPRESS_CACHE_FOLDER to the working directory's node_modules/.cypress_cache.

We also use CYPRESS_INSTALL_BINARY: 0 when a job doesn't need Cypress; we then have a template to do the install with caching, which uses this variable in the cache key.

@marceloavf

Tried it @lukeapage but I got these errors:

The cypress npm package is installed, but the Cypress binary is missing.

We expected the binary to be installed here: /home/vsts/work/1/s/node_modules/.cypress_cache/3.2.0/Cypress/Cypress

And I added these:

variables:
  CYPRESS_CACHE_FOLDER: $(System.DefaultWorkingDirectory)/node_modules/.cypress_cache
  CYPRESS_INSTALL_BINARY: 0

@elvirdolic

Can someone post a working example with Cypress and yarn? When I use the task, my yarn task still takes 4 minutes after the cache is downloaded, and if I use cacheHitVar, my yarn task which runs e2e can't be executed because yarn can't be found.

  condition: and(always(), eq(variables['E2E_ENABLED'], 'true'))
  pool:
    name: Hosted Windows 2019 with VS2019
  steps:
    - checkout: self
      clean: false 
    - task: CacheBeta@0
      displayName: Cache yarn
      inputs:
        key: yarn.lock
        path: $(YARN_CACHE_FOLDER)
        cacheHitVar: CACHE_RESTORED
    - script: yarn --frozen-lockfile
    - script: 'yarn pr-ci-e2e' 
      displayName: 'yarn pr-ci-e2e'
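One thing to note about the snippet above: caching $(YARN_CACHE_FOLDER) (yarn's package cache) does not produce a node_modules directory, so the install step still has to run on every build and cacheHitVar does not help there. Skipping the install only makes sense when node_modules itself is the cached path, e.g. something along these lines (a sketch; key syntax and paths are assumptions):

```yaml
    - task: CacheBeta@0
      displayName: Cache node_modules
      inputs:
        key: yarn.lock
        path: $(System.DefaultWorkingDirectory)/node_modules
        cacheHitVar: CACHE_RESTORED
    # skip the install only when node_modules itself was restored
    - script: yarn --frozen-lockfile
      condition: ne(variables['CACHE_RESTORED'], 'true')
    - script: 'yarn pr-ci-e2e'
      displayName: 'yarn pr-ci-e2e'
```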

@marceloavf

marceloavf commented Sep 7, 2019

@elvirdolic

I use this variable at the top of the build:

variables:
  CYPRESS_CACHE_FOLDER: $(System.DefaultWorkingDirectory)/node_modules/.cypress_cache

trigger:
  batch: true
  branches:
    include:
      - master

pool:
  vmImage: 'ubuntu-latest'

steps:
  - checkout: self
    persistCredentials: true
    # Clean in on-premises agent
    #clean: true

  - task: NodeTool@0
    inputs:
      versionSpec: '10.x'
    displayName: 'Install Node.js'

  - task: RestoreAndSaveCache@1
    inputs:
      keyfile: '**/yarn.lock, !**/node_modules/**/yarn.lock, !**/.*/**/yarn.lock'
      targetfolder: '**/node_modules, !**/node_modules/**/node_modules'
      vstsFeed: 'f53a47a0-c7d0-4506-af54-dcc005003ee8'

  - bash: yarn install --frozen-lockfile
    displayName: 'yarn install'
    condition: ne(variables['CacheRestored'], 'true')

...

@dimitriy-k

dimitriy-k commented Sep 10, 2019

Tried again to use CacheBeta@0 instead of RestoreAndSaveCache@1: the resulting cache size is 498 MB (vs 244 MB) and it takes more time to download (+20 sec) ... reverted.

@cforce

cforce commented Sep 13, 2019

How can I make use of this feature in a Release pipeline (not a YAML-based Build pipeline)?

@wichert

wichert commented Sep 14, 2019

@cforce You can't. At least for now this is only for build pipelines.

@marcussonestedtbarium

Nice task! However, the namespacing per branch by default is a bit too strict. It'd be nice to provide a custom branch "key" so that feature branches can re-use the cache.

See f.ex. microsoft/azure-pipelines-tasks#11314.

@johnterickson

@marceloavf TARing support is in progress, so our task should behave more like RestoreAndSaveCache in that respect.

@cforce, @wichert is correct that this only works in build pipelines at the moment. What's the scenario you're thinking about?

@marcussonestedtbarium I responded in the issue you linked to 👍

@cforce

cforce commented Sep 18, 2019

@cforce, @wichert is correct that this only works in build pipelines at the moment. What's the scenario you're thinking about?

Running a test-suite against a deployment afterwards in the Release pipeline, where the test-suite is built from the current state of the source code and uses a lot of Maven artifacts (unchanged, same version) across a series of releases. This needs Maven release artifact caching to prevent unnecessarily downloading them again. So I need a "Maven Caching" feature usable in Release Pipelines, either separate or as part of a Maven build task parameter.

@jhlegarreta jhlegarreta mentioned this pull request Oct 2, 2019
@TaiNguyen2406

Can you add caching for Docker build images?

@johnterickson

Hi @underwaterhp93 - Please follow microsoft/azure-pipelines-tasks#11034 (comment)

@fadnavistanmay

Hi,

With the agent 2.160.0 release, we are rolling out TARing by default. The long-time ask of preserving symbolic links and file attributes is being addressed. If you specifically do not want TARing, set AZP_CACHING_CONTENT_FORMAT to Files. We'll be documenting this new environment variable.
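Per the comment above, opting out of TARing would presumably look like this in a pipeline (only the variable name and the Files value come from the comment; placing it under `variables` is an assumption):

```yaml
variables:
  # fall back to the per-file cache format instead of a tarball
  AZP_CACHING_CONTENT_FORMAT: Files
```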

Thanks

@Kleptine

Kleptine commented Apr 7, 2020

Hi, is there any way to have the Cache task cache files locally on a self-hosted agent? Our caches are very large (multiple GB of game data), and our main pain point is waiting for these caches to download and upload. We run a self-hosted agent, so it seems we should be able to have the agent save the cache locally instead.

@JustinGrote

@Kleptine do you reset your self-hosted agent each time? If not, why not just implement your own caching: copy the data to a temp folder, check a timestamp, and copy (or better, hardlink/symlink) it back?

@Kleptine

Kleptine commented Apr 7, 2020

We'd like to take advantage of some of the cache-busting/tagging options of the Cache task. A more recent cache from a different branch isn't compatible with other branches.

Is the Cache Task open source, or is there code we could work from to implement our own?

@vtbassmatt
Member

@Kleptine

Kleptine commented Apr 7, 2020

Thanks for the reference. Is there a clean way to modify the agent source and re-deploy it to our self-hosted agent? (i.e., is there a build script somewhere that generates the installation tar?)

For what it's worth, we would rather not have to modify the agent source to enable this. :)

I saw that there is a line for "Local Caching Saved". For our runs it is always 0, is there something we haven't enabled?

@vtbassmatt
Member

The caching system was really designed to be used with the caching infrastructure we have in the cloud. It does things like file- and block-level deduping, makes immutability guarantees, and other things that we haven't replicated with any kind of local infrastructure. Developer Community is the best place to file a feature request for agent-local caching.

You are free to grab the agent source and do what you want, but you won't be in a supported state. For instance, when new features require an updated agent, we send down an agent-update request. This will, in the best case, clobber your custom agent or, in the worst case, knock it offline.

@Kleptine

Kleptine commented Apr 9, 2020

Thanks for the info. I'll add a feature request.

@juangps96

juangps96 commented May 2, 2023

Dear @mitchdenny @

What was the feature request for agent-local caching?
