Implement importable config object #785

stefsmeets · 2020-09-16T09:53:59Z

Hi all, we started working on an importable config object. This PR can be used to track the progress. For more discussion see this issue: #498

Tasks

Create an issue to discuss what you are going to do, if you haven't done so already (and add the link at the bottom)
This pull request has a descriptive title that can be used in a changelog
Add unit tests
Public functions should have a numpy-style docstring so they appear properly in the API documentation. For all other functions a one line docstring is sufficient.
~~- [ ] If writing a new/modified preprocessor function, please update the documentation~~
Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
Codacy code quality checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.
Please use yamllint to check that your YAML files do not contain mistakes
~~- [ ] If you make backward incompatible changes to the recipe format, make a new pull request in the ESMValTool repository and add the link below~~

The development in this PR is quite complete. Tests for all new functionality has been added. Here is a list of things still to do:

To do

Documentation
Merge master 😬
Fix Codacy errors
Fix CI errors
Check config-default.yml for which variables we want to keep (would be nice if someone could have a look at this)
Consider renaming config_user variable to session to fit the config API

stefsmeets · 2020-09-16T09:57:31Z

This enables the following:

from esmvalcore import config

# the next line is optional, but could be used to overwrite the default config
config.load('~/.esmvaltool/config-user.yml')

config.plot_dir
>> PosixPath('/home/stef/r/esmvalcore/esmvaltool_output/session_20200916_095523/plots')

The config is loaded whenever ESMValCore is imported, but in this way the keys can be easily accessed. The config is initialized upon import, so that the state is shared between modules.

valeriupredoi · 2020-09-16T17:07:39Z

careful with this - heavy object construction may lead to the objects being un-picklable and the mpi will either have a hard time sending data to processes or it will just not be able to pickle - configuration is done only once but if the config object is sent over to multiple processes and accessed by them it's definitely something to be careful with (a dictionary is easier sent than a class)

stefsmeets · 2020-09-17T07:45:40Z

Thanks for the heads-up @valeriupredoi , that is good to keep in mind. I'm not too worried about that at the moment, because most of the state is stored as a dictionary and many python objects are easily pickled/unpickled.

valeriupredoi · 2020-09-17T08:28:55Z

yeah sure - me neither, but just something we should keep in mind before we make too many objects and too many recursive calls, I have just went through the hoops with another code that was 100% recursive objects that was impossible to parallelize w/o major code changes 😁

stefsmeets · 2020-09-17T10:13:17Z

I have been looking at the matplotlib source code for some inspiration of how to approach this. Essentially, they define their config (rcParams) as follows:

configDefault = load_config_from_file(`default_config.yaml`)
configDefault.update(_hardcoded_defaults)

config = Config()  # first initialize as an empty mapping
config.update(configDefault)
config.update(load_config_from_file(`config.yaml`)

configOrig= rcParams.copy()  # keep a reserve copy

Config is defined as a MutableMapping (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/__init__.py#L639):

class Config(MutableMapping, dict):  # mapping to the configuration file.
    validators = _validators
    def __setitem__(self, item)  # validate items on the way in
    def __getitem__(self, item)  # return items from the mapping

def load_config_from_file(path='path_to.yaml') -> Config:
    ...

Each item is validated through __setitem__, where _validators is a look-up table with the function names to validate (https://github.com/matplotlib/matplotlib/blob/a4128a71374a51eee6a2fac226e3ac09fcc2acd9/lib/matplotlib/rcsetup.py#L1008):

_validators = {
   'variable_name`: callable,
   'nested.variable_name`: callable,
}

Nested variable names are a bit of an issue that I'm not sure how to handle. I think we could simply flatten them for internal use.

Additionally, I think we should consider implementing another class named Locations or something to generate things such as the work/plot/session/run/etc directories and make the Config object solely responsible for the configuration.

stefsmeets · 2020-09-17T15:05:05Z

I've been trying to see how far I get with this idea of having an importable config object (esmvalcore.config).

At the moment I have based my implementation on a mutable mapping mimicking rcParams in matplotlib. Every item added to the mapping is validated/cleaned before the variable is updated (esmvalcore._config_validators). This really cleans up the config loading, because it offloads all the validation to the config object.

The nice thing is that the CLI parameters can go through the same interface, making use of the same validators if necessary.

For the nested hierarchy that we have in the configs, I had to flatten the dictionary loaded from the yaml (i.e. dct['a']['b'] -> dct['a.b']). I actually quite like the way this works, and it also makes accessing items a bit easier. The downside is that every reference to the config has to be updated (but that probably needs to happen anyway with this PR 😅). Another option is to recreate the old config from the new config object and pass that around.

I have also added a separate importable locations class (esmvalcore.locations), making use of the pathlib library. The idea is that the user locations are isolated from the config, and can change if a new session is initiated.

stefsmeets · 2020-09-18T08:16:34Z

I added the options from the CLI to the config. Having the validators is prevents you from making mistakes if you use a wrong command line option or variable. For example, esmvaltool run recipe.yml --asdf now results in KeyError.

stefsmeets · 2020-09-18T09:29:43Z

Got the code to a working state with a basic recipe. The main difficulty was in the rootpath that is being passed around when looking for files. I added a function to recreate the rootpath dictionary group to get it to work.

jvegreg · 2020-10-05T11:14:58Z

Actually, since all the command line variables are added to the config, should we set their defaults in the config file directly? This maybe more elegant than checking if a certain variable exists in the config. We could instead just set their defaults False or None

Good idea. It also goes in line with the old idea of having a esmvaltool debug command that is just a run with different flags.

stefsmeets · 2020-10-05T14:27:15Z

The last commit fixes the remaining failing tests (mostly path issues). Thanks @bouweandela !

@bouweandela had a nice idea how to distiguish between the config and a session. The difference is that config is a global and thus importable. A session object inherits all the variables from the config, but in addition it has a name (i.e. that of the recipe) and has additional information about the session directories (i.e. run_dir / preproc_dir).

The config can be used to start a new session:

from esmvalcore import config

session = config.start_session('recipe_name')

session == config  # True, since both inherit from dict

session['remove_preproc_dir'] = False
session == config  # False

This means that many session objects can exist, but they must all be started from the global config. This has the advantage that changes to a session variables can be independent from the global config, i.e. when running multiple recipes at once.

…/ ._utils.py

stefsmeets · 2020-10-06T14:59:36Z

Some good progress on this PR, it is nearly complete. I added tests for the new functionality, and failing tests are now passing on my PC.

There was no test to check that the latest version is being resolved correctly if 'latest' is not in the path

Peter9192 · 2020-10-07T14:53:03Z

I noticed that ESMValTool prints the config file it uses to the command line as the first log message when you run the tool. It would be nice if this would be the full path, for the default config-user.yml may be located in several different places. Would it be possible to change this here?

stefsmeets · 2020-10-09T13:11:57Z

I noticed that ESMValTool prints the config file it uses to the command line as the first log message when you run the tool. It would be nice if this would be the full path, for the default config-user.yml may be located in several different places. Would it be possible to change this here?

It does not do this for me. Could you explain what you mean or copy the output here?

bouweandela · 2020-10-12T17:25:55Z

Hi @stefsmeets, I tried to review this, but the number of changes is so large that it becomes very difficult to review:

43 files changed, 2338 insertions(+), 1238 deletions(-)

Do you think it would be possible to implement this with a smaller number of changes?

I noticed that there are a lot of formatting changes too, so maybe you could first make a pull request to just reformat the files you plan to change in this pull request, to at least get rid of all the cosmetic changes?

stefsmeets · 2020-10-13T08:18:28Z

Hi @stefsmeets, I tried to review this, but the number of changes is so large that it becomes very difficult to review:
43 files changed, 2338 insertions(+), 1238 deletions(-)
Do you think it would be possible to implement this with a smaller number of changes?

I noticed that there are a lot of formatting changes too, so maybe you could first make a pull request to just reformat the files you plan to change in this pull request, to at least get rid of all the cosmetic changes?

I was a little bit worries about that too, but I think we missed that opportunity to implement this in smaller steps once we started working on the config file parameters. Many things are so tightly coupled, that it is difficult to split this up.

With cosmetic changes, do you mean the auto-formatting? There may have been some 'collateral damage' from yapf/isort, is this what you mean?

I can split it up in several pull requests, but it would take me a couple of days. It would have to be then at least 4 pull requests, probably in this order:

Reformat relevant files
Refactor data finder / drs structure
Add validated config object
Update config file parameters

Peter9192

Hey @stefsmeets here's my "high-level" feedback to the PR :-)

I very much like the improved DRS handling!!

One suggestion would be to split the input and output data in the DRS. Then you could change 'ProjectData' to 'InputData', which I think is more intuitive. output_file could have a single default for all projects, or a dict with editable output_names for different projects (not sure whether that's necessary/desired functionality).

The API works nicely, but it does require some documentation in order to know how to use it. I think it would be nice if we could make it a little bit more intuitive. Here's some things I noticed when playing around with the API:

If you print the configuration, it would be nice if all ProjectData objects are grouped together.
It think if the ProjectData objects would have some str or repr method, that would make it more intuitive to understand what they are, and how to use/update them.
Should we expose all the methods of the cfg object? E.g. clear. I used this, and now I cannot get the default cfg back. Why would someone want to use pop or popitem? And what is the use case for find_all? I suppose it would be better to add things later than to have to remove them again.
I'm not so keen on the from _projects import searchlocation. Perhaps we could add a method to projectdata like cmip6.searchlocations.add() that does the same thing?
Why does get_input_filelists return 3 lists?
It might also be nice if it could return smt like: 'No input data found for ... in ....'
The projectdata has a method called get_cmor_table, but then you get CMIPInfo object, which again has a method get_table. That's not very intuitive. Perhaps we can make this an attribute or property instead of a method?

And here's some questions/comments related to the changes in the code:

I like how you split up the cfg developer. But considering that some people apparently used to change all of these things sometimes, would it make sense to move cmor_config.yml (and perhaps also institutes.yml) from cmor to the new config directory?
_exceptions.py: since this came up in a recent discussion, do you think this can help to get cleaner error messages for missing data?
esmvalcore._main.py: here you update the session with a couple of 'standard' command line options. But when we started using fire, I think the idea was that you could pass any parameter that's in the config file also via the command line, and it will be updated accordingly. Does that still work?
Related: I can't help but feeling that some of the config options are really very specific runtime arguments. If the session is very tightly coupled to a specific run, would it make sense to pass those runtime options only to the session, and not to the config?
I wonder how the refactored data finder functions impact cases like Where to find the correct frequency for ERA5 hourly variables? ESMValTool#1884

One final remark: I really like the overview of the main changes in this PR that you provided in a comment on your example notebook. Perhaps we can change the top post of this PR to include that summary? Then it will be easier to find how/why this changed for people who are tracing the changes back to this PR.

stefsmeets · 2020-11-06T08:13:23Z

Just a note to keep an eye on ESMValGroup/ESMValTool#1707 , which may benefit from this PR

stefsmeets · 2020-11-20T13:52:05Z

This PR is going nowhere, so I'm closing it and moving forward with a different attempt. Closing this in favour of #868.
I may split other useful bits in different PRs.

First try at making an importable config item

d72f9be

stefsmeets changed the title ~~First try at making an importable config item~~ Implement importable config object Sep 16, 2020

stefsmeets added 8 commits September 17, 2020 15:09

Implement config object with validators

7048db2

Implement locations object which defines the session/sub directories

08b988f

Fix import

dba667a

Add missing file with locations

0c15d66

Refactor read_config_user_file

5e3f40f

Clean _config.py

47bba25

Make it possible to update the config file

468647b

Add validators for CLI and modify ESMValTool.run to use config object

7fbaf22

Offload more validators and add CLI options to config

5c0ccf6

stefsmeets added 4 commits September 18, 2020 10:59

Fix bugs

825dc2f

Comment out example variables in config

be3bde4

Make sure files get found

6398e0b

Update syntax

a7d9b8a

stefsmeets mentioned this pull request Sep 18, 2020

Add YAML schema for config-user.yml #125

Closed

stefsmeets added 2 commits September 18, 2020 13:48

Re-use matplotlib validation code

10a5130

Bug fixes and tweaks to validation code

7668fe5

nielsdrost mentioned this pull request Sep 18, 2020

Config check subcommand #789

Open

stefsmeets added 3 commits September 18, 2020 15:13

Use locations object where possible

ccafb60

Rename Config (for cli utilities) -> ConfigUtils to avoid confusion

b407007

Use pathlib to find recipe

03ab9e1

stefsmeets added 5 commits October 5, 2020 13:34

Add tests for deprecate function

f85892a

Add string conversion for Path object for failing tests

c9ac0f7

Fix typo

16abf43

Merge session and config object and fix related tests

aa0a726

Add more tests for validators / config / session

60db027

stefsmeets added 8 commits October 6, 2020 14:46

Add test for config class

df5e41e

Pass config_user instead of importing it after main

442f612

Refactor data_finder

23bedb3

Refactor _config.py, move configure_logging to its own file

fd31a72

Fix KeyError (grab output_file from correct location)

0cc586a

Refactor _config.py, split remaining functions into ._diagnostics.py …

9a63212

…/ ._utils.py

Rename config -> CFG, configuration -> config

56157dc

Add hidden defaults to config

a7b57c5

stefsmeets added 2 commits October 7, 2020 09:25

Add test for resolving latest version

3add954

There was no test to check that the latest version is being resolved correctly if 'latest' is not in the path

Choose better names for variables

cafa861

stefsmeets requested a review from nielsdrost October 7, 2020 07:53

stefsmeets mentioned this pull request Oct 22, 2020

Have a checker that inspects auxiliary_data_dir and exit if desired data is not there #842

Open

Peter9192 reviewed Oct 30, 2020

View reviewed changes

stefsmeets mentioned this pull request Nov 20, 2020

Implement importable config object in experimental API submodule #868

Merged

8 tasks

stefsmeets closed this Nov 20, 2020

stefsmeets mentioned this pull request Jan 18, 2021

How do we treat reanalysis with members? #945

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement importable config object #785

Implement importable config object #785

stefsmeets commented Sep 16, 2020 •

edited

Loading

stefsmeets commented Sep 16, 2020 •

edited

Loading

valeriupredoi commented Sep 16, 2020

stefsmeets commented Sep 17, 2020

valeriupredoi commented Sep 17, 2020

stefsmeets commented Sep 17, 2020 •

edited

Loading

stefsmeets commented Sep 17, 2020 •

edited

Loading

stefsmeets commented Sep 18, 2020

stefsmeets commented Sep 18, 2020

jvegreg commented Oct 5, 2020

stefsmeets commented Oct 5, 2020 •

edited

Loading

stefsmeets commented Oct 6, 2020 •

edited

Loading

Peter9192 commented Oct 7, 2020

stefsmeets commented Oct 9, 2020

bouweandela commented Oct 12, 2020

stefsmeets commented Oct 13, 2020

Peter9192 left a comment

stefsmeets commented Nov 6, 2020

stefsmeets commented Nov 20, 2020

Implement importable config object #785

Implement importable config object #785

Conversation

stefsmeets commented Sep 16, 2020 • edited Loading

stefsmeets commented Sep 16, 2020 • edited Loading

valeriupredoi commented Sep 16, 2020

stefsmeets commented Sep 17, 2020

valeriupredoi commented Sep 17, 2020

stefsmeets commented Sep 17, 2020 • edited Loading

stefsmeets commented Sep 17, 2020 • edited Loading

stefsmeets commented Sep 18, 2020

stefsmeets commented Sep 18, 2020

jvegreg commented Oct 5, 2020

stefsmeets commented Oct 5, 2020 • edited Loading

stefsmeets commented Oct 6, 2020 • edited Loading

Peter9192 commented Oct 7, 2020

stefsmeets commented Oct 9, 2020

bouweandela commented Oct 12, 2020

stefsmeets commented Oct 13, 2020

Peter9192 left a comment

Choose a reason for hiding this comment

stefsmeets commented Nov 6, 2020

stefsmeets commented Nov 20, 2020

stefsmeets commented Sep 16, 2020 •

edited

Loading

stefsmeets commented Sep 16, 2020 •

edited

Loading

stefsmeets commented Sep 17, 2020 •

edited

Loading

stefsmeets commented Sep 17, 2020 •

edited

Loading

stefsmeets commented Oct 5, 2020 •

edited

Loading

stefsmeets commented Oct 6, 2020 •

edited

Loading