
Opportunistic Caching #681

Open
mrocklin opened this issue Nov 16, 2016 · 9 comments
@mrocklin
Member

Currently we clean up intermediate results quickly if they are not necessary for any further pending computation. This is good because it minimizes the memory footprint on the workers, often allowing us to process larger-than-distributed-memory computations.

However, this can sometimes be inefficient for interactive workloads when users submit related computations one after the other, so that the scheduler has no opportunity to plan ahead, and instead needs to recompute an intermediate result that was previously computed and garbage collected.

We could hold on to some of these results in the hope that the user will request them again. This trades active memory for potential CPU time; a rough scoring heuristic is sketched after the list below. Ideally we would hold onto results that:

  1. Have a small memory footprint
  2. Take a long time to compute
  3. Are likely to be requested again (evidenced by recent behavior)
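
For illustration, here is a minimal sketch of a cachey-style score combining those three signals. The function name, arguments, and weighting are assumptions made for this example, not cachey's actual API:

def cache_score(nbytes, compute_time, hits):
    # Illustrative heuristic only: favor results that are expensive to
    # compute (2), cheap to store (1), and requested repeatedly (3)
    if nbytes == 0:
        return float('inf')  # free to keep
    return (compute_time / nbytes) * (1 + hits)

Keys would then be retained in descending score order, so bulky or cheap-to-recompute results are evicted first.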

We did this for the single-machine scheduler.

We could do this fairly easily in the distributed scheduler by creating a SchedulerPlugin that watches all computations, selects which results to keep based on logic similar to what currently lives in cachey, and creates a fake Client to hold an active reference to those keys on the scheduler.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

To be explicit, the mechanism to keep data on the cluster might look like this:

from distributed.diagnostics.plugin import SchedulerPlugin

class CachingPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.scheduler.add_plugin(self)

    def transition(self, key, start, finish, nbytes=None, startstops=None, *args, **kwargs):
        # When a task lands in memory, decide whether to pin it with a
        # reference held by a fake client
        if start == 'processing' and finish == 'memory' and should_keep(nbytes, startstops, **kwargs):
            self.scheduler.client_desires_keys(keys=[key], client='fake-caching-client')
        # Drop the fake client's references to keys no longer worth caching
        no_longer_desired_keys = self.cleanup()
        self.scheduler.client_releases_keys(keys=no_longer_desired_keys,
                                            client='fake-caching-client')

client.run_on_scheduler(lambda dask_scheduler: CachingPlugin(dask_scheduler))
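
Note that should_keep and cleanup are left undefined above. As a rough sketch, should_keep might compare compute time against result size; this assumes the older tuple-based startstops format of (action, start, stop) records, and the cutoff is an arbitrary choice for the example:

def should_keep(nbytes, startstops, **kwargs):
    # Hypothetical policy, not part of the snippet above: keep results
    # that took a long time to compute relative to their size
    if not nbytes or not startstops:
        return False
    compute_time = sum(stop - start
                       for action, start, stop in startstops
                       if action == 'compute')
    return compute_time / nbytes > 1e-8  # seconds per byte; arbitrary cutoff

A matching cleanup method would track the cached keys and their scores, returning the lowest-scoring keys once their total size exceeds some memory budget.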

@collinwo

Thanks. I followed the above example to write a customized cache plugin for our own Dask grid, and I'm now testing it.

@IPetrik
Contributor

IPetrik commented Jun 13, 2019

To be explicit, the mechanism to keep data on the cluster might look like this:

from distributed.diagnostics.plugin import SchedulerPlugin

class CachingPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.scheduler.add_plugin(self)

    def transition(self, key, start, finish, nbytes=None, startstops=None, *args, **kwargs):
        if start == 'processing' and finish == 'memory' and should_keep(nbytes, startstops, **kwargs):
            self.scheduler.client_desires_keys(keys=[key], client='fake-caching-client')
        no_longer_desired_keys = self.cleanup()
        self.scheduler.client_releases_keys(keys=no_longer_desired_keys,
                                            client='fake-caching-client')

client.run_on_scheduler(lambda dask_scheduler: CachingPlugin(dask_scheduler))

@mrocklin is the scheduler API explained somewhere? Can you provide more explanation of how this works? What do client_desires_keys and client_releases_keys do?

@TomAugspurger
Member

TomAugspurger commented Jun 13, 2019 via email

@GenevieveBuckley
Contributor

I recently talked with the ilastik team; one of the wishlist items they brought up was two-level caching (to disk or in RAM) that would work with Dask distributed.

@emilmelnikov this issue is likely the best place for discussion

@kephale

kephale commented Feb 24, 2023

@GenevieveBuckley We also care about this in napari now. With the large remote datasets (hundreds of GB and up) we're fetching, it would be great to have some on-disk persistence, given that the datasets do not change and we frequently want to revisit the same dataset.

@GenevieveBuckley
Contributor

Have you talked to @dcherian about this, @kephale?

@kephale

kephale commented Feb 28, 2023

I will now. Thank you @GenevieveBuckley :)

@jakirkham
Member

This has been brought up in other issues (and is somewhat tangential to this one), but I would recommend looking at graphchain.
