Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change volume stats to handle and output masked array result #618

Merged
merged 12 commits into from
May 27, 2020

Conversation

valeriupredoi
Copy link
Contributor

@valeriupredoi valeriupredoi commented Apr 20, 2020

Before you start, please read CONTRIBUTING.md.

Tasks

  • Create an issue to discuss what you are going to do, if you haven't done so already (and add the link at the bottom)
  • This pull request has a descriptive title that can be used in a changelog
  • Add unit tests
  • Public functions should have a numpy-style docstring so they appear properly in the API documentation. For all other functions a one line docstring is sufficient.
  • If writing a new/modified preprocessor function, please update the documentation
  • Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
  • Codacy code quality checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.

no, it's just being petty, it can bugger off

  • Please use yamllint to check that your YAML files do not contain mistakes
  • If you make backward incompatible changes to the recipe format, make a new pull request in the ESMValTool repository and add the link below

If you need help with any of the tasks above, please do not hesitate to ask by commenting in the issue or pull request.


Closes #611

@valeriupredoi valeriupredoi added the enhancement New feature or request label Apr 20, 2020
@valeriupredoi valeriupredoi requested review from ledm and sloosvel April 20, 2020 12:49
@valeriupredoi valeriupredoi changed the title Change volume stats to handle and outpt masked array result Change volume stats to handle and output masked array result Apr 20, 2020
@sloosvel
Copy link
Contributor

Thanks @valeriupredoi . I'm still getting NaNs though. This is what the last elements of column and depth_volume look like respectively when debugging:

column:
72:array(0.13211952, dtype=float32)
73:array(0.13278724, dtype=float32)
74:masked_array(data=--, mask=True, fill_value=1e+20, dtype=float64)

depth_volume:
72:4726634600000000.0
73:3180468400000000.0
74:masked

I think it's because of the masked depth_volume, because I changed the last value to 0 just to test with a non-masked weight and the results I got are not nans anymore. So it looks like np.ma.average is not able to handle masked weights after all.

@valeriupredoi
Copy link
Contributor Author

@sloosvel indeed, np.ma.average() is being silly:

In [10]: np.ma.average(np.ma.array([1, 2, 3, 5]), weights=np.ma.array([2,2,2,2], mask=[True, True, True, False]))
Out[10]: 5.0

In [11]: np.ma.average(np.ma.array([1, 2, 3, 5]), weights=np.ma.array([2,2,2,2], mask=[True, True, True, True]))
Out[11]: nan

and it looks like the actual logic is correct from the documentation since the statistic is computed using sum(weights) as denominator (obv, weighted mean heh) so in case of a fully masked weights array we are a bit stuck - should probably perform a simple mean in that case since weights are completely irrelevant but that could assign a bit more weight to that particular column- I am not sure how to approach from the statistical point of vew - @ledm and advice?

@valeriupredoi
Copy link
Contributor Author

or we can just completely mask that column?

@valeriupredoi
Copy link
Contributor Author

OK I masked it completely to get rid of nan's

@sloosvel
Copy link
Contributor

I think the routine is computing the 2D average first and then the average over all levels. So the output of averaging column should be the total average. What these last commits seem to be doing is just set the total result from nan to masked, which would mean that the output would still be empty.

Anyway, I think it's a bit strange to have a completely masked level in volcello.

@valeriupredoi
Copy link
Contributor Author

so that's the average volume per time point with computed average over a set of columns at different depths, weighted by each column's volume; the code changes mask those values that have weights masked equivalent to saying if the column volume is not viable as a weight there (coz it's masked) it means I may not have enough information about that particular column, and extrapolating this - if all my columns have masked values I don't have enough information to compute an average volume at that time point so I am masking it - I think it's fine

@sloosvel
Copy link
Contributor

I don't know that's seems overly complicated to me. I took a closer look and it turns out that depth_volume is a list. I changed its type to np.ma.array and I don't get nans anymore.

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented Apr 23, 2020

@sloosvel this is the generalized approach, note that even if depth_volume is a masked array and its values are all masked you will still get nan's ie np.ma.average(np.ma.array([1, 2, 3, 5]), weights=np.ma.array([2,2,2,2], mask=[True, True, True, True])) returns nan

@valeriupredoi
Copy link
Contributor Author

any news on this @sloosvel 🍺

@sloosvel
Copy link
Contributor

I am not sure if a case in which all depth_volume values are masked should be taken into account. That would mean that the volcello files consist of only masked values. Which would make the data not really useful.

There are several models that have the last layer of the ocean data fully masked. This routine is not handling well this case because the depth_volume used in np.average (or np.ma.average) has a masked value in the last level and is declared as a list instead of being a masked array. If the type is changed to masked array before the final averaging, I get proper results instead of a result with just NaN values.

@valeriupredoi
Copy link
Contributor Author

cool! then apply whatever changes are needed here, job done by the sounds of it 🍺

@jvegreg
Copy link
Contributor

jvegreg commented May 5, 2020

@ledm @bouweandela Can you please review this? It is blocking us

…to change_volume_stats_to_masked_result
Copy link
Contributor Author

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok by me, shame I can't approve my own PR that is in fact not my PR 😁 @sloosvel does your implementation sort out all them nan'd?

@sloosvel
Copy link
Contributor

sloosvel commented May 5, 2020

I added a test for nans and looks like it's passing

@valeriupredoi
Copy link
Contributor Author

valeriupredoi commented May 5, 2020

Indeed it do 😁 What's with all the deleted blank lines? New Flake8 conventions for docstrings? - there has to be a blank line between docstring body and eg Parameters----

@sloosvel
Copy link
Contributor

sloosvel commented May 6, 2020

I copy-pasted code and probably messed the format while doing so, will correct now.

@bouweandela
Copy link
Member

bouweandela commented May 19, 2020

@ledm @bouweandela Can you please review this? It is blocking us

@jvegasbsc Please feel free to help out with reviewing, it is impossible for me to review every single pull request in a timely manner.

@sloosvel
Copy link
Contributor

sloosvel commented May 20, 2020

What if all values for a particular time step are masked? Are you sure that never happens?

@bouweandela Well if you want to consider that case I think the easiest is to save the result in line 275 in np.ma.array(result) instead of np.array(result), let me know what you think. I don't know whether or not this happens very often or rarely. It's not the case for which I opened the issue. The main problem that I was having is that the average was being computed with lists that where not handling well the presence of a masked value in the volume. So I changed the implementation to use masked arrays instead.

ya that was my previous implementation that looked a bit overkill (inc a test for it) but @sloosvel assures me that can't happen

@valeriupredoi Your implementation was not working for the datasets that I was testing. Because the average was still being computed with lists instead of masked arrays, so the result was a NaN that was getting masked at the very end. I cannot have average_volume = NaN (which is what happens when using master) or average_volume = masked (which is what happens when using your implementation) for all ocean variables in all of our EC-Earth experiments, it does not make sense and it's not the result that we get using other old packages in the department.

@bouweandela
Copy link
Member

Well if you want to consider that case I think the easiest is to save the result in line 275 in np.ma.array(result) instead of np.array(result), let me know what you think.

Yes, that would be a good solution.

@bouweandela
Copy link
Member

@mattiarighi Could you please test and merge when you're happy with the result?

@mattiarighi
Copy link
Contributor

@sloosvel can you provide a test case (specific model and variable(s) to test)?

@sloosvel
Copy link
Contributor

@mattiarighi Sorry for the delay, our data it's on its way to be published but it's not available yet. The HadGEM3-GC31-LL historical data could be another example of a volcello with a fully masked level, but I'm having concatenation errors with this dataset.

@mattiarighi
Copy link
Contributor

The changes in code look fine to me.
If this solves your issue please approve the PR and I will merge.

@mattiarighi mattiarighi merged commit c80c8e4 into master May 27, 2020
@mattiarighi mattiarighi deleted the change_volume_stats_to_masked_result branch May 27, 2020 08:55
@bouweandela bouweandela added the preprocessor Related to the preprocessor label Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request preprocessor Related to the preprocessor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty cubes after volume_statistics for certain datasets
5 participants