-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove use of CDO for selecting the right variable in native model fixes #1180
Comments
@pp-mo wrote: Hi @senesis @zklaus I was prompted to investiagte that approach by @abooton, who I think discussed it with @zklaus . |
@senesis replied: Sorry for this late reply. I am not sure I fully understand everything in SciTools/iris#4176. Assuming that the question still is "would a simple (and fast) load of the single targeted variable fit the need ?" , I am afraid the answer is no, because the variables which are auxiliary coordinates for the targeted variables should be loaded too, as they might be useful at a later stage (and by the way, I agree that using CDO for selecting the variable is then not correct either). Further, @valeriupredoi wrote in #ESMValTool/2141 that the use of iris.load_raw is a strong design principle for ESMValTool, and, if I understand well, this seems to complicate finding a working solution. |
@pp-mo wrote:
That is absolutely not what this is doing : The approach in SciTools/iris#4176 does still analyse the whole file, but it limits which actual cube(s) are generated -- which basically is the slow bit. So I do think that this approach (if done right) should be a safe bet, and a great fit for your needs as described. To be clear on the method, each loaded Iris cube relates to a specific CF data-variable, and it is those which we are selecting on -- so cubes will still have all the usual information from other (types of) file variables, such as coordinates, ancillary variables etc etc. The remaining questions for me are ...
|
@senesis wrote: These are good news ! From my point of view, :
|
@bouweandela wrote:
If this is a feature you need, could you please open an issue for it? A discussion in a merged pull request is easily lost over time. Regarding the relative priority of new features: from experience, I can tell that those features that have someone who is actively working on them, tend to get the highest priority. |
@pp-mo wrote:
If it helps, I can affirm that 'iris.load_raw' is almost never any different from 'iris.load', for netcdf source data. Aside : it might seem that 'iris.load' should be capable of merging e,g, data supplied in several files each containing a 'month' of daily data. But it can't. because that would be a concatenate and not a merge operation. We do have quite a few outstanding issues suggesting this could/should change, but no progress at present : SciTools/iris#3344 SciTools/iris#2587 SciTools/iris#3234 |
To move forward with this: @senesis, what are the load times you experience with Iris 3.0.2 for a single variable (in the data-variable sense, i.e. including all coordinates, auxiliary variables, etc.) from a multi-variable file? How does that compare with CDO? What speed do we need to achieve? |
Using a typical IPSL-CM multi-variable file (the one with 314 variables used by pp-mo there) in a basic recipe (namely a simplified example_python.yml, with a single dataset, and only the map part) , I get :
The figures quoted by pp-mo there, namely dividing by 90 the load time for Iris, would be excellent. Staying above 1 s would be a burden : because they are monthly files, a lot of files have to loaded for each recipe. |
Great, thanks. @pp-mo, sorry if you already said this, but things have become a bit scattered and I may have missed something. Is there an easy way for us to test your suggested fix? |
Well, there's nothing in a channel ATM. As given, it can only give speedup for a load with a single NameConstraint, |
Sounds good. I will have a look at this after the pending release of ESMValTool at the end of this month. |
N.B. @senesis @bouweandela @zklaus we just cut the Iris 3.1 release candidate. The SciTools/iris#4176 solution is in there ... So please check out : This should appear in Iris 3.1, which we will finalise in a couple of weeks' time. |
Moving this to v2.6 since there is not open PR yet. |
According to this Iris issue comment (and the few following ones), there is no solution left for optimizing Iris load on multi-variable files. So, regarding IPSLCM fixes, I propose to carry on using CDO but using its python API, through this patch : use_cdo_api.patch.txt , which I tested. Unfortunately, the execution time is 50% to 100% larger when using the CDO API (w.r.t. launching CDO as a subprocess) |
Hm. This execution time penalty sounds less than ideal. Let's keep this issue open with no concrete plans for now and see if we cannot coax better performance out of iris. |
I'm a little confused, are you still not benefitting from SciTools/iris#4176 function ? If I understood the problem right, that really ought to "just work" ™️ + fix this problem. |
Thanks, @pp-mo, for pointing that out. I had not done the time testing myself before, but did so now, using the notebook and file that @senesis posted in the iris issue discussion. It was indeed rapid at 0.5s. (I used iris version 3.2.1). Since I cannot exclude the possibility that I did something strange with an oldish notebook or mixed up the files, @senesis, could you please check once more the timing? |
There is still a debate regarding the applicability of using iris.load with constraints in ESMValCore, between @valeriupredoi , who raised concerns here and @zklaus , who answered here |
Using your notebook, @senesis, I confirmed the fast times for all three functions ( |
Not sure I understand your goal : is it
|
My main goal is to get rid of CDO. As I understand it, the only reason to introduce it, was a performance problem within Iris. This does not seem to exist any longer, hence no need to resort to command line trickery. Can you confirm that the problem is gone? |
Using Iris3.1 and the quoted notebook, the load time for a single var from an IPSL-CM-like multi-var file is :
Using ESMValTool 2.5, the time for running a simple recipe loading a variable from an actual IPSL-CM mult-var file is
The explanation for the difference in the penalty between the notebook context and the ESMValTool real use context is that the notebook case uses a one-month long file |
Let's focus first on Iris itself, and then see if ESMValTool throws further spanners into the machine. This may be an opportunity to improve the experience for all users of ESMValTool. I just repeated the test with the same iris version of 3.1.0, using the file you provided in the description of SciTools/iris#4134 (this one). So overall, it seems iris and CDO performance are more or less on par. At these timescales, I think it is a bit difficult to say more. It might be interesting to look at a more realistic test file to get a better understanding. Your numbers, @senesis, seem to suggest that the constraint is quite helpful, but I don't really understand what you mean by "without constraint". Isn't the whole point to use a constraint to get only the part you are interested in? Could you elaborate a bit on what you mean by "without constraint"? |
@zklaus wrote :
Settings : working on
This means that, for IPSL operations, either on Ciclad cluster or on a system closer to the data store, we could get rid of using CDO if using Iris with a By the way : loading with no constraint takes 11s at first load, and ~9.5 s on next loads; using the constraint is thus certainly highly valuable on such large files; thanks to @pp-mo and the Iris team !
@valeriupredoi wrote there :
So until further notice I take for granted that using a |
Thanks, @senesis, I think things are clear enough now. Iris is performing well enough, we would need to change parts of ESMValTool to take full advantage of this. I believe that may be worthwhile, since the use-case of extracting variables from large files with may variables is common enough, usually occurring for native model output at least in all of the models that we are starting to support (IPSL, ICON, EC-Earth) and also in some observational datasets, so that support for this might allow native obs support for those. |
I changed the title to be a bit more specific, because I believe CDO is also used for other purposes. Note that in some cases all cubes need to be loaded from a file and not just the requested variable. This happens if files do not follow the CF conventions and a coordinate is interpreted as a cube. In the fix preprocessor step, the coordinate cube is then changed to a coordinate and attached to the cube containing the variable. Therefore we would need something a bit smarter than just applying a constraint at load time. |
Out of performance concerns about Iris' loading of individual variables from multi-variable files, the IPSLCM support as implemented in the corresponding fix file, makes optional use of CDO via a pipe interface.
Ideally, we would achieve acceptable performance with Iris alone and remove CDO from ESMValCore. If that turns out to be impossible, we might rewrite the use of CDO with their Python bindings to address concerns about the use of the shell type interface.
This issue is precipitated by a discussion in the PR that introduced IPSLCM support to ESMValCore, #1153.
The text was updated successfully, but these errors were encountered: