Limit concurrency for caching #389
Comments
Something that I just learned about in Beam is resource hints (https://beam.apache.org/documentation/runtime/resource-hints/). It sounds like this could pair really well with breaking out file caching.
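For context, Beam resource hints attach to a single transform rather than to the whole pipeline, so they would pair naturally with a broken-out caching transform. A minimal sketch, assuming a hypothetical `cache_file` function (the `min_ram` hint is from the linked docs; everything else here is illustrative):

```python
import apache_beam as beam

def cache_file(url: str) -> str:
    """Hypothetical stand-in for a file-caching step."""
    return url  # a real implementation would download `url` to cache storage

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["https://example.com/a.nc"])  # assumed input URLs
        # The hint is scoped to this one transform, so a broken-out caching
        # step could request resources independently of the rest of the DAG.
        | beam.Map(cache_file).with_resource_hints(min_ram="2GB")
    )
```

Note that resource hints tune worker resources per transform; they don't by themselves cap concurrency, so they would complement rather than replace the throttling discussed in this issue.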
Noting that pangeo-forge/paleo-pism-feedstock#2 is blocked by this. It looks like we will not be able to deploy a production run of that feedstock until we have some way to limit concurrency during the caching stage. cc @jkingslake

Here's an example that could help: for this kind of problem we use this

Belated thanks for sharing this example, @alxmrs. 🙏

Hi all, any updates on this issue? As noted above, it is stopping progress on pangeo-forge/paleo-pism-feedstock#2.

For those still following this, I am now working on a fix in #557.

This is fixed by #557. @jkingslake, please feel free to ping me on your recipe thread if you'd like to work together to revive it in light of this fix. And thanks for your patience!
Original issue description:

On the first production run of https://github.com/pangeo-forge/terraclimate-feedstock, Dataflow autoscaled the cluster to 1000 workers in response to the slow throughput of caching ~882 inputs (totaling ~1.9 TB).
We should be able to limit concurrency for caching, given that the source file servers will generally be bandwidth-constrained. Dataflow provides a `max_num_workers` option to cap the size of the worker pool, but this issue is separate from that concern: concurrency should be limited only for the caching step, and then we should support larger scale-out after data is cached. There must be a more formal discussion of this somewhere in the Beam docs, but for now the most direct discussion I've found is in the replies to https://stackoverflow.com/a/65634538, which suggest `GroupByKey` might be used to achieve this (see the sketch below).

I believe this will require pulling caching out from `OpenURLWithFSSpec`. Currently, if a `cache` argument is provided to `OpenURLWithFSSpec`, the input is cached and then immediately opened from the cache:

> pangeo-forge-recipes/pangeo_forge_recipes/openers.py, lines 31 to 32 in bdb32f2
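To make the StackOverflow suggestion concrete, here is a minimal sketch of the keyed-throttling pattern, with a hypothetical `cache_file` function (none of these names are pangeo-forge-recipes API):

```python
import random

import apache_beam as beam

MAX_CONCURRENT_DOWNLOADS = 10  # assumed cap, tuned to the source server

def cache_file(url: str) -> str:
    """Hypothetical caching step: download `url`, return the cached path."""
    return url

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["https://example.com/a.nc"])  # assumed input URLs
        # Scatter the inputs across a fixed number of keys.
        | beam.Map(lambda url: (random.randrange(MAX_CONCURRENT_DOWNLOADS), url))
        # GroupByKey forces a shuffle; all URLs sharing a key arrive as one
        # iterable, so at most ~MAX_CONCURRENT_DOWNLOADS groups run the
        # caching step concurrently, no matter how far the runner scales out.
        | beam.GroupByKey()
        # URLs within a group are cached sequentially.
        | beam.FlatMap(lambda kv: map(cache_file, kv[1]))
    )
```

The trade-off is that each key serializes all of its downloads, so the cap should balance politeness to the source server against total caching time.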
In order to limit concurrency for the caching, but not for the opening, I believe caching will need to be its own transform, the output of which is then passed to `OpenURLWithFSSpec`, which does not do any caching.

cc @rabernat @alxmrs, xref #376
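As a rough illustration of that split, a sketch with hypothetical names throughout (this is not the actual implementation that later landed in #557):

```python
import apache_beam as beam

class CacheURLs(beam.PTransform):
    """Hypothetical standalone caching transform: takes (index, url) pairs,
    caches each file, and emits (index, cached_url). A concurrency limit
    (e.g. the GroupByKey pattern sketched above) would live inside this
    transform only, leaving the opening step free to scale out."""

    def __init__(self, cache):
        super().__init__()
        self.cache = cache  # assumed to expose a `cache_file(url)` method

    def expand(self, pcoll):
        return pcoll | beam.MapTuple(
            lambda index, url: (index, self.cache.cache_file(url))
        )

# Usage sketch: url_pairs | CacheURLs(cache) | OpenURLWithFSSpec(),
# where OpenURLWithFSSpec would no longer need a `cache` argument.
```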