
Opportunistic Caching #681

Open
mrocklin opened this issue Nov 16, 2016 · 9 comments
@mrocklin
Member

Currently we clean up intermediate results quickly if they are not necessary for any further pending computation. This is good because it minimizes the memory footprint on the workers, often allowing us to process larger-than-distributed-memory computations.

However, this can sometimes be inefficient for interactive workloads when users submit related computations one after the other, so that the scheduler has no opportunity to plan ahead, and instead needs to recompute an intermediate result that was previously computed and garbage collected.

We could hold on to some of these results in the hope that the user will request them again. This trades active memory for potential CPU time; a rough scoring heuristic is sketched after the list below. Ideally we would hold onto results that:

  1. Have a small memory footprint
  2. Take a long time to compute
  3. Are likely to be requested again (evidenced by recent behavior)
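
For illustration, here is a minimal sketch of a cachey-style score combining those three signals. The function name, arguments, and weighting are assumptions made for this example, not cachey's actual API:

def cache_score(nbytes, compute_time, hits):
    # Illustrative heuristic only: favor results that are expensive to
    # compute (2), cheap to store (1), and requested repeatedly (3)
    if nbytes == 0:
        return float('inf')  # free to keep
    return (compute_time / nbytes) * (1 + hits)

Keys would then be retained in descending score order, so bulky or cheap-to-recompute results are evicted first.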

We did this for the single-machine scheduler.

We could do this fairly easily in the distributed scheduler by creating a SchedulerPlugin that watches all computations, selects which results to keep based on logic similar to what currently lives in cachey, and creates a fake Client to hold an active reference to those keys on the scheduler.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

To be explicit, the mechanism to keep data on the cluster might look like this:

from distributed.diagnostics.plugin import SchedulerPlugin

class CachingPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.scheduler.add_plugin(self)

    def transition(self, key, start, finish, nbytes=None, startstops=None, *args, **kwargs):
        # When a task lands in memory, decide whether to pin it with a
        # reference held by a fake client
        if start == 'processing' and finish == 'memory' and should_keep(nbytes, startstops, **kwargs):
            self.scheduler.client_desires_keys(keys=[key], client='fake-caching-client')
        # Drop the fake client's references to keys no longer worth caching
        no_longer_desired_keys = self.cleanup()
        self.scheduler.client_releases_keys(keys=no_longer_desired_keys,
                                            client='fake-caching-client')

client.run_on_scheduler(lambda dask_scheduler: CachingPlugin(dask_scheduler))
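
Note that should_keep and cleanup are left undefined above. As a rough sketch, should_keep might compare compute time against result size; this assumes the older tuple-based startstops format of (action, start, stop) records, and the cutoff is an arbitrary choice for the example:

def should_keep(nbytes, startstops, **kwargs):
    # Hypothetical policy, not part of the snippet above: keep results
    # that took a long time to compute relative to their size
    if not nbytes or not startstops:
        return False
    compute_time = sum(stop - start
                       for action, start, stop in startstops
                       if action == 'compute')
    return compute_time / nbytes > 1e-8  # seconds per byte; arbitrary cutoff

A matching cleanup method would track the cached keys and their scores, returning the lowest-scoring keys once their total size exceeds some memory budget.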

@collinwo

Thanks. I followed the above example to write a customized cache plugin for our own Dask grid, and I'm now testing it.

@IPetrik
Contributor

IPetrik commented Jun 13, 2019

To be explicit, the mechanism to keep data on the cluster might look like this:

from distributed.diagnostics.plugin import SchedulerPlugin

class CachingPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.scheduler.add_plugin(self)

    def transition(self, key, start, finish, nbytes=None, startstops=None, *args, **kwargs):
        if start == 'processing' and finish == 'memory' and should_keep(nbytes, startstops, **kwargs):
            self.scheduler.client_desires_keys(keys=[key], client='fake-caching-client')
        no_longer_desired_keys = self.cleanup()
        self.scheduler.client_releases_keys(keys=no_longer_desired_keys,
                                            client='fake-caching-client')

client.run_on_scheduler(lambda dask_scheduler: CachingPlugin(dask_scheduler))

@mrocklin is the scheduler API explained somewhere? Can you provide more explanation of how this works? What do client_desires_keys and client_releases_keys do?

@TomAugspurger
Member

TomAugspurger commented Jun 13, 2019 via email

@GenevieveBuckley
Contributor

I recently talked with the ilastik team; one of the wishlist items they brought up was two-level caching (to disk or in RAM) that would work with Dask distributed.

@emilmelnikov this issue is likely the best place for discussion

@kephale

kephale commented Feb 24, 2023

@GenevieveBuckley We also care about this in napari now. With the large remote datasets (hundreds of GB and up) we're fetching, it would be great to have some on-disk persistence, given that the datasets do not change and we frequently want to revisit the same dataset.

@GenevieveBuckley
Contributor

Have you talked to @dcherian about this, @kephale?

@kephale

kephale commented Feb 28, 2023

I will now. Thank you @GenevieveBuckley :)

@jakirkham
Member

This has been brought up in other issues (and is somewhat tangential to this one), but I would recommend looking at graphchain.
