Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure to only use provenance copies when using product.wasderivedfrom() #1984

Merged
merged 5 commits into from
May 2, 2023

Conversation

schlunma
Copy link
Contributor

@schlunma schlunma commented Mar 22, 2023

Description

This PR makes sure to only use provenance copies (product.copy_provenance()) as input for product.wasderivedfrom(). This prevents recursively adding large amounts of ancestors which blows up memory usage and run time.

Closes #1981


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number pull requests:

@schlunma schlunma added the enhancement New feature or request label Mar 22, 2023
@schlunma schlunma added this to the v2.9.0 milestone Mar 22, 2023
@schlunma schlunma self-assigned this Mar 22, 2023
@codecov
Copy link

codecov bot commented Mar 22, 2023

Codecov Report

Merging #1984 (491f7cb) into main (9406b7e) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1984   +/-   ##
=======================================
  Coverage   92.80%   92.80%           
=======================================
  Files         236      236           
  Lines       12445    12446    +1     
=======================================
+ Hits        11549    11550    +1     
  Misses        896      896           
Impacted Files Coverage Δ
esmvalcore/preprocessor/_mask.py 86.40% <100.00%> (+0.05%) ⬆️

@valeriupredoi
Copy link
Contributor

@schlunma cheers, Manu! I have just tested this with your example recipe in #1981 - varying the number of models leads to a very tiny increase in total max memory, this time round barely linear, if at all, with data, not to Powah 4 😁 ie 0.9G for 14 models that were otherwise needing 16G 🍺

@bouweandela
Copy link
Member

This could be implemented more efficiently by making fewer copies. For example, this implementation seems to copy all input file provenance for all output files, leading to n x (n-1) copies for the multi dataset mask (assuming all datasets are masked), where n is the number of datasets. However, only the input file provenance needs to be copied, which would be just n copies.

@valeriupredoi
Copy link
Contributor

@bouweandela would you be able to suggest a code snippet for @schlunma to use and expand on pls 🍺 I am not 100% sure of the severity of memory issue ie how many particular cases are affected, so I'd feel more comfortable if this was plugged in, even if not totally efficient, definitely waay more effeicient in terms of memory consumption compared to what we have now 😁

@schlunma
Copy link
Contributor Author

This could be implemented more efficiently by making fewer copies. For example, this implementation seems to copy all input file provenance for all output files, leading to n x (n-1) copies for the multi dataset mask (assuming all datasets are masked), where n is the number of datasets. However, only the input file provenance needs to be copied, which would be just n copies.

Makes sense, though the performance gain is probably not noticeable for common recipes. Implemented in 341c650.

@valeriupredoi
Copy link
Contributor

good work, gents! @bouweandela could you maybe pls check the new commits, review, and merge this then 🍺

@valeriupredoi
Copy link
Contributor

@bouweandela frenly yodeling ping
BouweYodel

@valeriupredoi
Copy link
Contributor

@bouweandela one for you here, bud - quick one, get you back to work after the Easter break 😁

Copy link
Member

@bouweandela bouweandela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now, thanks @schlunma!

@bouweandela bouweandela merged commit 6be36f7 into main May 2, 2023
@bouweandela bouweandela deleted the fix_wasderivedfrom branch May 2, 2023 09:59
@bouweandela bouweandela added bug Something isn't working preprocessor Related to the preprocessor and removed enhancement New feature or request labels May 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working preprocessor Related to the preprocessor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Provenance tracking in mask_multimodel is slow and needs a lot of memory
3 participants