Lazy computation for integration with xarray / dask #56
Comments
Is the calculation applied elementwise (e.g. could I loop through all positions of an array and get the same output as if I fed in the full array)?
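One way to check the elementwise question empirically is to compare a full-array call against chunk-wise and per-element calls. The sketch below uses a placeholder function, `toy_calc`, which is hypothetical and not real carbonate chemistry; the same comparison could be run against the actual PyCO2SYS call, but that is not shown here.

```python
import numpy as np


def toy_calc(alkalinity, dic):
    """Hypothetical stand-in for an elementwise calculation (not real chemistry)."""
    return np.log10(alkalinity / dic)


rng = np.random.default_rng(0)
alkalinity = rng.uniform(2200.0, 2400.0, size=12)
dic = rng.uniform(1900.0, 2100.0, size=12)

# 1. Feed the full arrays in one call.
full = toy_calc(alkalinity, dic)

# 2. Split the inputs into three chunks, compute each, and stitch back together.
chunked = np.concatenate(
    [
        toy_calc(a, d)
        for a, d in zip(np.array_split(alkalinity, 3), np.array_split(dic, 3))
    ]
)

# 3. Loop through every position individually.
looped = np.array([toy_calc(a, d) for a, d in zip(alkalinity, dic)])

# If the function is elementwise, all three approaches agree.
same_result = np.allclose(full, chunked) and np.allclose(full, looped)
print(same_result)
```

If the three results agree for the real function too, applying it block by block (which is what dask does) should be safe.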
This is a lovely interim solution, @jbusecke, thanks! Here's a short example with a simple wrapper function for PyCO2SYS, where
This could be generalized to output the full "results" dictionary as an xarray Dataset, but it just needs a bit of work to carry over names, dimensions, etc.
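For readers following along, here is a minimal sketch of what such a wrapper might look like. It is not the example posted in this thread: `toy_calc` and `lazy_calc` are hypothetical names, and `toy_calc` is a placeholder standing in for the real PyCO2SYS call. The sketch assumes the wrapped function is elementwise and that dask is installed. Array inputs are passed positionally so that `xr.apply_ufunc` chunks them together; `kwargs` are handed to the function unchanged, so only scalars go there.

```python
import numpy as np
import xarray as xr


def toy_calc(alkalinity, dic, temperature, offset=0.0):
    """Hypothetical elementwise stand-in for the real calculation."""
    return (alkalinity - dic) * (1.0 + 0.001 * temperature) + offset


def lazy_calc(alkalinity, dic, temperature, offset=0.0):
    """Wrap toy_calc so it runs lazily, block by block, on dask-backed arrays."""
    return xr.apply_ufunc(
        toy_calc,
        alkalinity,
        dic,
        temperature,
        kwargs={"offset": offset},  # scalar options only
        dask="parallelized",
        output_dtypes=[float],
    )


# Small synthetic dataset, chunked so that the variables are dask arrays.
ds = xr.Dataset(
    {
        "alkalinity": ("x", np.linspace(2200.0, 2400.0, 8)),
        "dic": ("x", np.linspace(1900.0, 2100.0, 8)),
        "temperature": ("x", np.full(8, 25.0)),
    }
).chunk({"x": 4})

result = lazy_calc(ds.alkalinity, ds.dic, ds.temperature)
is_lazy = result.chunks is not None  # still a dask array, nothing computed yet
values = result.compute().values  # triggers the actual computation
```

The result keeps the dimensions and coordinates of the inputs, which is part of what would need carrying over when generalizing to a full Dataset of outputs.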
Hey @gmacgilchrist, sorry for the delay. I'm a fairly novice xarray user and don't have a better solution than what you and @jbusecke have already come up with. I think it would be a great thing to add to the package. I understand your example code above well enough to generalise it to the full results dict but am lacking in time to implement this at the moment, so if you do have time to get it set up or started that would be very welcome. Otherwise, I'll get round to it... eventually!
I'm happy to help out here! @gmacgilchrist, have you made any headway on this?
I haven't revisited this yet, and don't expect that I will get to it in the next couple of weeks. But I would definitely value your help @lukegre when I get back to it.
Do you have any sense for how many people are using this package to work with model data? I'm interested to know in general if other folk have come up against memory/computation issues. I know that @ognancy4life came up against some similar issues, but I'd like to get a sense for how broadly useful it would be to make the code workable with big data, or if ~99% of users are working with observations and would likely never come up against such problems.
In the end, even though I got the computation to run lazily (based on @jbusecke's solution) so that it played nicely with
Hi all, thanks for following up on this! My understanding is that most CO2 system calculations in earth system models use an incomplete or approximated CO2 system for this very reason. I hope we can arrive at a solution for running PyCO2SYS on xarray data arrays, because I do like to have the full system and all the bells and whistles like uncertainty estimates, but a full CO2 system calculation may not be necessary in most cases. Have you all checked whether any of the other "lighter" CO2 system calculators, like mocsy, might be available in Python and/or more easily adapted to play nicely with xarray? I know there are others I've run across when reading ocean BGC model documentation papers, but their names are escaping me.
Hi @gmacgilchrist and others,
I still don't know the answer to this.
Although you can cut the number of variables calculated by only returning the outputs you're interested in, probably at least half of the calculated variables are essential for most calculations (e.g. total salt contents and equilibrium constants) and so unavoidable. Could the fact that these intermediate variables are stored within dicts be contributing to the issue?
Anyway, if you are in fact only interested in one or two output variables, it will probably be most efficient to write new functions that only do the bare minimum calculation and can be used with xarray's
    pH = xr.apply_ufunc(
        pyco2.minimal.pH_from_alkalinity_dic,
        ds.alkalinity,
        ds.dic,
        kwargs={"temperature": ds.temperature, "salinity": ds.salinity},
        dask="parallelized",
    )
The function
PS: you can roughly halve the computation time for your example script (from 23 Aug) by using
I am attempting to do computations of the carbonate system in model output. I use xarray to load and analyze these data with lazy computations, meaning that data are only loaded into memory as a final step, and then only the data required for the computation. This makes parallelization of the computation easy when working with large datasets.
This functionality doesn't seem to play nicely with PyCO2SYS (or vice versa!), which instead loads all the data into memory during the application of the sys command. I'm not sure how to put together a good minimum working example to show the issue here, because it relies on loading a dataset. I can try to put something together if the issue is not clear.
I'm wondering if there is a plan or interest to integrate the functionality of PyCO2SYS with that of xarray? That is, to allow the calculations inherent in PyCO2SYS to be performed lazily. I would be happy to help implement this, although I am no expert. On the other hand, I would be happy to hear if there is a workaround that I am missing.
Tagging @ognancy4life and @jbusecke as folk who might be interested in this problem.
P.S. Thanks so much for implementing this package - it's a great resource!