
Caching mechanism for MLGraph #807

Open · anssiko opened this issue Jan 22, 2025 · 11 comments

Comments

@anssiko
Member

anssiko commented Jan 22, 2025

[ Spun off from issue #780 ]

@reillyeon mentioned to me the idea of introducing a caching mechanism for MLGraph (e.g. save a compiled graph for later use and avoid repeated graph compilation). Such a mechanism might help here.

I'd like to discuss this proposal from @reillyeon a bit more. I believe the group's current working assumption is that graph compilation could take a long time and caching would improve the user experience on subsequent visits to the same site (cross-origin is a harder, separate problem). Depending on the size of the model, underlying implementation and other factors, this could be a significant performance and UX improvement.

@reillyeon have you thought about this more since you came up with the idea? Known implementation blockers?

@bbernhar what could we learn from the WebGPU compilation caches for shaders, pipelines? I see some toggles in the Dawn code to control caching, and you've done work in this space (e.g. https://issues.chromium.org/issues/41479574), suggesting you might have insights to share.

Also paging @huningxin @fdwr and @RafaelCintron for thoughts. Interested in all insights in the spirit of brainstorming.

A few additional questions:

Do we expect the caching of compiled graphs to be purely an implementation detail?

Privacy impact? We already discuss caching-related timing attack vectors in the privacy considerations and reference the WebGPU compilation cache considerations. Depending on which way we go, we might want to revise these considerations.

@reillyeon
Contributor

My current thought is that this would be an explicit API:

partial interface MLContext {
  Promise<sequence<DOMString>> listGraphs();
  Promise<MLGraph> loadGraph(DOMString key);
  Promise<undefined> saveGraph(DOMString key, MLGraph graph);
  undefined deleteGraph(DOMString key);
};

GPU shader caching is implicit, however the difference is that a shader program is a small input and so it's easy for the site to regenerate the shader so the browser can hash it to compare with the cache. ML models on the other hand are large because of the weights. Loading all the weights just to discover that a cached version of the model is available would be a waste of time and resources.
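
For illustration, here is a rough usage sketch of the API above (assuming an existing MLContext named context; the key string and the buildMyGraph() helper are placeholders, not part of the proposal):

// Sketch only: "my-model-v2" and buildMyGraph() are hypothetical.
const key = "my-model-v2";
let graph;
const cachedKeys = await context.listGraphs();
if (cachedKeys.includes(key)) {
  // Cache hit: skip downloading the weights and rebuilding the graph.
  graph = await context.loadGraph(key);
} else {
  // Cache miss: build as usual, then save the compiled graph for next time.
  graph = await buildMyGraph(new MLGraphBuilder(context));
  await context.saveGraph(key, graph);
}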

@inexorabletash
Member

Implicitly, the cache/keys would be partitioned by the origin, same as other storage APIs, which would mitigate privacy concerns. Clearing storage would wipe this cache too.

We'd probably want to collaborate with Storage API experts on topics such as quota and buckets.

@zolkis
Collaborator

zolkis commented Jan 22, 2025

The idea has been discussed from various angles and it's quite complex. Personally, I tentatively concluded that implicit caching could be the least problematic and the most efficient option, and it would also make cross-origin caching possible. But then apps would have no control over it. The explicit API idea has been lingering for a long time and it's good to discuss it.

If we have solid use cases for an explicit caching API, the one proposed above is simple enough.

As graphs are context-bound, I guess reusability is scoped to a given context. But once we have generic contexts, we should figure out if and how graphs could be bound to contexts and moved/copied between contexts.

Within one app (and context), is listing graph keys even needed, given that the app can keep the keys around itself?

But when there are multiple active contexts, in addition to listing the keys, don't we also need an API (an optional argument) for filtering/matching some capabilities and properties of the (sub)graph? Possible privacy issues there?

@reillyeon
Contributor

I'm curious about your thoughts on how an implicit caching solution would avoid requiring the developer to fully construct the graph only for the implementation to discard that work and load the cached version instead. That seems very wasteful.

Reusability across contexts will likely be implementation-specific.

For example, when building a Core ML model the goal is to cache the compiled model (.mlmodelc package). This package is generic across CPU, GPU and NPU. The target compute units are only specified when loading the model. In that case, listGraphs() would return the same keys for all MLContext instances. The TFLite models Chromium generates today are similarly hardware-agnostic.

If a backend did generate artifacts which were only compatible with some contexts (e.g. as OpenVINO model caching does) then listGraphs() would only return the graph when the query comes from a compatible context.

A goal of model caching is to allow the site and browser to discard as much unnecessary information about the graph as possible. For example, if the underlying platform repacks weights to better match the hardware configuration, it shouldn't be necessary to store both the original weights and the packed weights since only the latter are used during inference.

listGraphs() and loadGraph() may also fail if the cached model is from an older version of the platform framework and the graph needs to be rebuilt.

listGraphs() is mainly a developer convenience but you're right that developers can and should remember graph keys in their own storage and so it's not strictly necessary. There are no privacy issues because graphs cached in this way are origin-scoped, similar to other local storage APIs such as IndexedDB and Local Storage.
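
As a sketch of the recovery path when loadGraph() rejects (for example because the cached artifact came from an older framework version), assuming a placeholder rebuildGraph() helper that runs the site's normal build path:

async function loadOrRebuild(context, key, rebuildGraph) {
  try {
    return await context.loadGraph(key);
  } catch (e) {
    // The cached artifact is stale or incompatible: drop it, rebuild from
    // source, and refresh the cache entry for the next visit.
    context.deleteGraph(key);
    const graph = await rebuildGraph(context);
    await context.saveGraph(key, graph);
    return graph;
  }
}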

@bbernhar

@bbernhar what could we learn from the WebGPU compilation caches for shaders, pipelines?

GPU caching is most effective when the entire state can be constructed upfront, avoiding the need for dynamic compilation. However, in practice, requiring all state upfront often fails to eliminate UX hitches. It can also result in the pre-compilation of so many shader state combinations that cache sizes become unmanageable.

Originally, it was assumed that WebGPU app developers would design their apps with pipelines in mind and adapt to achieve their goals using fewer state combinations (a "pipeline" includes shader code and some state). Even when only a limited set of dynamic states was specified, the range of values often remained effectively unbounded. This forced implementations to precompile these as static pipelines. Consequently, pipeline caches ended up containing numerous slightly different optimized permutations of the same shader code, differing only in minor pipeline state variations. This left web applications reliant on the runtime/driver to deduplicate these permutations. Drivers, in turn, had to track state changes from one pipeline to the next, creating significant CPU overhead.

As a result, many app developers have abandoned monolithic caches in favor of reverting to the older hash-and-cache approach. This method involves setting various pieces of state independently, hashing them all for a GPU call, and using the hash as a key in an app-managed cache. It's also important to note that if the cache cannot serialize permutations of these shaders, JIT compilation won't be effectively captured. FWIW, implicit caching is still the norm for drivers, even though explicit caches exist.
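
As a rough sketch of that hash-and-cache pattern (generic JavaScript, not tied to any particular GPU API; compilePipeline() is a placeholder for the actual compile step):

const pipelineCache = new Map();

function getPipeline(device, state) {
  // Any stable serialization of the state that affects compilation can act
  // as the key; a real app would use a cheaper hash than JSON.stringify.
  const key = JSON.stringify(state);
  let pipeline = pipelineCache.get(key);
  if (!pipeline) {
    pipeline = compilePipeline(device, state);
    pipelineCache.set(key, pipeline);
  }
  return pipeline;
}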

@zolkis
Collaborator

zolkis commented Jan 30, 2025

To confine the scope, it would be nice to share code examples as developer scenarios.

The proposed API (a working hypothesis) is a straightforward "save, then load a whole graph" approach and could be a clean addition.

But I couldn't help wondering further whether builders could be initialized with (the underlying resources of) an MLGraph, allowing compositions to use sub-graphs (other MLGraph objects). Calling build() on the composite graph would then reuse whatever resources can be reused, or otherwise rebuild them.

(Currently MLOperands are the intermediary graphs in a builder, and MLGraph as a compiled graph is considered immutable. However, the algorithms operate on the underlying resources of an MLGraph, and those could also be used as operands when composing graphs. It all depends on how we'd spec the algorithms. Maybe this is overkill (or nonsense).)

@reillyeon
Contributor

The challenge there is whether that capability can be mapped to the underlying frameworks we are using, which all prefer being asked to compile a single graph. Reusing pieces of an already built graph in a larger graph is not possible, at least not efficiently, within these frameworks.

@mmccool

mmccool commented Jan 31, 2025

I wrote a proposal for this also. Although it's not complete yet (I was working on an implementation to try to flush out some issues), I decided to post it here as well: https://github.com/webmachinelearning/hybrid-ai/blob/main/proposals/cache.md

It addresses many of the same points as the proposal above. In summary, my design uses "hashes" as names and is explicit for various reasons explained in the document, but it has only two methods. I also list a number of issues and corner cases that I think will arise in practice, some of which overlap with the discussion above. My general opinion is that the cache API should be flexible enough to allow various implementations, including "no cache", caches of just inputs, caches of compiled models, etc. I would also personally like to see it extendable (in theory) to cross-site caches if we can figure out a means to protect privacy (I talked about some partial solutions in the TPAC breakout last year).

This is not a complete proposal and I'd like to consolidate it with the one above and make various improvements. For one, I don't have an IDL definition of my proposed API yet...

@mmccool

mmccool commented Jan 31, 2025

I also designed my cache proposal to work with another feature that I think would be interesting, adapters. I wrote a proposal for that as well, here: https://github.com/webmachinelearning/hybrid-ai/blob/main/proposals/adapters.md

Since it's a different topic though I will open another feature-request issue for discussion (see issue webmachinelearning/proposals#6). The reason adapters are related to caching is that you need to retrieve an existing model to add an adapter to it to create a new model. So the base model needs to be already "there" somehow. Applying the adapter while you build the model defeats the purpose of having a small download for the adapters.

BTW, one more feature my cache API design targets: it would be useful to pull "built-in" models out of the "cache", i.e. models installed by other means, not necessarily by the developer of the particular site. This is similar to cross-site caching, but, as one special case, if the set of "built-in" models is tied directly to the browser version, it provides no additional information for fingerprinting. I think this would be one useful use case for adapters, too.

@mmccool

mmccool commented Jan 31, 2025

Another comment: the proposal from @reillyeon also includes a mechanism to list graphs. I think a minimal API that just checks if a single model is in the cache or not is good enough, is more extensible, and gives more freedom to the implementation. It also avoids the problem of people snooping around for fingerprinting purposes (we can limit the number of probes, etc) if we ever DO want to extend it to the cross-site use case or other cases where there may be a lot of models in the cache.

IMO issues like version management, equivalent models (e.g. with different quantization strategies or other optimizations), etc. should be handled by a library on top of the base cache functionality, which should just check if a specific model is in the cache, or not. I think the purpose of the WebNN standard should be to provide fundamental capabilities that can be built upon by other code.
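
Purely as a hypothetical sketch of what that minimal shape could look like in use (the method names and modelHash below are invented for illustration and are not part of any actual IDL yet):

// Hypothetical method names, for illustration only.
let graph = await context.loadCachedGraph(modelHash); // null on a cache miss
if (!graph) {
  graph = await buildMyGraph(new MLGraphBuilder(context));
  await context.storeCachedGraph(modelHash, graph);
}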

@anssiko
Member Author

anssiko commented Feb 4, 2025

Thank you for your feedback, everyone.

@mmccool will revise the caching explainer, focusing on the implementation perspective and using "must be implementable on top of existing major platform APIs and backends" as a guide.

Further feedback welcome on LiteRT, Core ML and other backends' constraints to help improve the explainer.

(I have moved the adapters explainer to the separate proposals repo. It has a broader scope than WebNN API.)
