NOAA NCEI might not like it if we fire off hundreds of simultaneous requests to their servers. We would like to limit the concurrency of this step if possible.
From an API perspective, the question is:
Should a user have to specify concurrency limits as part of the recipe?
Alternatively, should we try to auto-detect if flows access certain resources (e.g. specific FTP servers) and then automatically enforce concurrency limits?
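In terms of implementation, Prefect Cloud has a perfect solution: https://docs.prefect.io/orchestration/concepts/task-concurrency-limiting.html
However, this only works with Prefect Cloud. Some questions about this option are:
Are we okay with getting locked into a Prefect Cloud feature?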
What convention do we use for the task tags to indicate concurrency? For example, I could imagine a tag like www.ncei.noaa.gov, allowing us to limit concurrency for all requests to that server from all flows simultaneously! That would be pretty useful.
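One way to picture the hostname-as-tag convention is the minimal sketch below. It derives a tag from each input URL using only the standard library; the helper names `host_tag` and `group_by_host` and the example URLs are hypothetical, and nothing here is part of Pangeo Forge or Prefect.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_tag(url):
    """Return the server hostname of `url`, for use as a concurrency tag."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"cannot derive a host tag from {url!r}")
    return host


def group_by_host(urls):
    """Group URLs by host tag so one limit can be applied per server."""
    groups = defaultdict(list)
    for url in urls:
        groups[host_tag(url)].append(url)
    return dict(groups)
```

With a convention like this, every task that touches the same server would carry the same tag, whichever flow it belongs to.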
If we don't want to get locked into cloud features, @jcrist made the following suggestion on the Prefect slack:
I'd handle this with a distributed.Semaphore within your tasks for now. Alternatively, you could make use of dask's worker resources. Tasks tagged with tags of the form dask-resource:KEY=N will each take N amount of KEY resource. So you could limit active download tasks by creating a resource for downloading then tagging download tasks to mark that they require that resource.
That would mean that the total concurrency limit scales with the number of workers (so it isn't absolute across the whole run), but would also work and wouldn't block other tasks from running like the Semaphore would.
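The semaphore approach can be sketched roughly as follows. To keep the example runnable without a cluster, it substitutes a `threading.Semaphore` and a fake fetch function for the real thing; on Dask, one would presumably pass something like `distributed.Semaphore(max_leases=3, name="ncei")` instead, since it can also be used as a context manager. The names `limited_download`, `ConcurrencyProbe`, and `run_demo`, and the URLs, are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def limited_download(url, semaphore, fetch):
    """Run fetch(url) while holding one lease on the semaphore."""
    with semaphore:
        return fetch(url)


class ConcurrencyProbe:
    """Fake fetch that makes no network call and records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __call__(self, url):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for the actual transfer
        with self._lock:
            self.active -= 1
        return url


def run_demo(n_urls=20, max_leases=3, n_threads=10):
    """Submit many downloads from a larger pool; at most max_leases run at once."""
    semaphore = threading.Semaphore(max_leases)
    probe = ConcurrencyProbe()
    # Hypothetical URLs; nothing is actually requested.
    urls = [f"https://www.ncei.noaa.gov/file-{i}" for i in range(n_urls)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(
            pool.map(lambda u: limited_download(u, semaphore, probe), urls)
        )
    return results, probe.peak
```

Note that a thread waiting on the semaphore still occupies its slot in the pool, which mirrors the drawback mentioned above: tasks blocked on the semaphore keep other tasks from running.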
The situation described in pangeo-forge/staged-recipes#108 (comment) adds another dimension to the concurrency story. That recipe pulls data over OPeNDAP. With OPeNDAP, the data loading happens during the store_chunk stage, not the cache_input stage, so a limit applied only to cache_input would not throttle requests to the server.
If we follow the path outlined in #245, we may end up making significant changes to how Pangeo Forge works internally. That should give us the ability to attach a concurrency restriction on any stage of the pipeline. In pseudocode it may look something like