Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement importable config object #785

Closed
wants to merge 113 commits into from
Closed

Conversation

stefsmeets
Copy link
Contributor

@stefsmeets stefsmeets commented Sep 16, 2020

Hi all, we started working on an importable config object. This PR can be used to track the progress. For more discussion see this issue: #498

Tasks

  • Create an issue to discuss what you are going to do, if you haven't done so already (and add the link at the bottom)
  • This pull request has a descriptive title that can be used in a changelog
  • Add unit tests
  • Public functions should have a numpy-style docstring so they appear properly in the API documentation. For all other functions a one line docstring is sufficient.
    - [ ] If writing a new/modified preprocessor function, please update the documentation
  • Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
  • Codacy code quality checks pass. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.
  • Please use yamllint to check that your YAML files do not contain mistakes
    - [ ] If you make backward incompatible changes to the recipe format, make a new pull request in the ESMValTool repository and add the link below

The development in this PR is quite complete. Tests for all new functionality has been added. Here is a list of things still to do:

To do

  • Documentation
  • Merge master 😬
  • Fix Codacy errors
  • Fix CI errors
  • Check config-default.yml for which variables we want to keep (would be nice if someone could have a look at this)
  • Consider renaming config_user variable to session to fit the config API

@stefsmeets stefsmeets changed the title First try at making an importable config item Implement importable config object Sep 16, 2020
@stefsmeets
Copy link
Contributor Author

stefsmeets commented Sep 16, 2020

This enables the following:

from esmvalcore import config

# the next line is optional, but could be used to overwrite the default config
config.load('~/.esmvaltool/config-user.yml')

config.plot_dir
>> PosixPath('/home/stef/r/esmvalcore/esmvaltool_output/session_20200916_095523/plots')

The config is loaded whenever ESMValCore is imported, but in this way the keys can be easily accessed. The config is initialized upon import, so that the state is shared between modules.

@valeriupredoi
Copy link
Contributor

careful with this - heavy object construction may lead to the objects being un-picklable and the mpi will either have a hard time sending data to processes or it will just not be able to pickle - configuration is done only once but if the config object is sent over to multiple processes and accessed by them it's definitely something to be careful with (a dictionary is easier sent than a class)

@stefsmeets
Copy link
Contributor Author

Thanks for the heads-up @valeriupredoi , that is good to keep in mind. I'm not too worried about that at the moment, because most of the state is stored as a dictionary and many python objects are easily pickled/unpickled.

@valeriupredoi
Copy link
Contributor

yeah sure - me neither, but just something we should keep in mind before we make too many objects and too many recursive calls, I have just went through the hoops with another code that was 100% recursive objects that was impossible to parallelize w/o major code changes 😁

@stefsmeets
Copy link
Contributor Author

stefsmeets commented Sep 17, 2020

I have been looking at the matplotlib source code for some inspiration of how to approach this. Essentially, they define their config (rcParams) as follows:

configDefault = load_config_from_file(`default_config.yaml`)
configDefault.update(_hardcoded_defaults)

config = Config()  # first initialize as an empty mapping
config.update(configDefault)
config.update(load_config_from_file(`config.yaml`)

configOrig= rcParams.copy()  # keep a reserve copy

Config is defined as a MutableMapping (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/__init__.py#L639):

class Config(MutableMapping, dict):  # mapping to the configuration file.
    validators = _validators
    def __setitem__(self, item)  # validate items on the way in
    def __getitem__(self, item)  # return items from the mapping

def load_config_from_file(path='path_to.yaml') -> Config:
    ...

Each item is validated through __setitem__, where _validators is a look-up table with the function names to validate (https://github.com/matplotlib/matplotlib/blob/a4128a71374a51eee6a2fac226e3ac09fcc2acd9/lib/matplotlib/rcsetup.py#L1008):

_validators = {
   'variable_name`: callable,
   'nested.variable_name`: callable,
}

Nested variable names are a bit of an issue that I'm not sure how to handle. I think we could simply flatten them for internal use.

Additionally, I think we should consider implementing another class named Locations or something to generate things such as the work/plot/session/run/etc directories and make the Config object solely responsible for the configuration.

@stefsmeets
Copy link
Contributor Author

stefsmeets commented Sep 17, 2020

I've been trying to see how far I get with this idea of having an importable config object (esmvalcore.config).

At the moment I have based my implementation on a mutable mapping mimicking rcParams in matplotlib. Every item added to the mapping is validated/cleaned before the variable is updated (esmvalcore._config_validators). This really cleans up the config loading, because it offloads all the validation to the config object.

The nice thing is that the CLI parameters can go through the same interface, making use of the same validators if necessary.

For the nested hierarchy that we have in the configs, I had to flatten the dictionary loaded from the yaml (i.e. dct['a']['b'] -> dct['a.b']). I actually quite like the way this works, and it also makes accessing items a bit easier. The downside is that every reference to the config has to be updated (but that probably needs to happen anyway with this PR 😅). Another option is to recreate the old config from the new config object and pass that around.

I have also added a separate importable locations class (esmvalcore.locations), making use of the pathlib library. The idea is that the user locations are isolated from the config, and can change if a new session is initiated.

@stefsmeets
Copy link
Contributor Author

I added the options from the CLI to the config. Having the validators is prevents you from making mistakes if you use a wrong command line option or variable. For example, esmvaltool run recipe.yml --asdf now results in KeyError.

@stefsmeets
Copy link
Contributor Author

Got the code to a working state with a basic recipe. The main difficulty was in the rootpath that is being passed around when looking for files. I added a function to recreate the rootpath dictionary group to get it to work.

@jvegreg
Copy link
Contributor

jvegreg commented Oct 5, 2020

Actually, since all the command line variables are added to the config, should we set their defaults in the config file directly? This maybe more elegant than checking if a certain variable exists in the config. We could instead just set their defaults False or None

Good idea. It also goes in line with the old idea of having a esmvaltool debug command that is just a run with different flags.

@stefsmeets
Copy link
Contributor Author

stefsmeets commented Oct 5, 2020

The last commit fixes the remaining failing tests (mostly path issues). Thanks @bouweandela !

@bouweandela had a nice idea how to distiguish between the config and a session. The difference is that config is a global and thus importable. A session object inherits all the variables from the config, but in addition it has a name (i.e. that of the recipe) and has additional information about the session directories (i.e. run_dir / preproc_dir).

The config can be used to start a new session:

from esmvalcore import config

session = config.start_session('recipe_name')

session == config  # True, since both inherit from dict

session['remove_preproc_dir'] = False
session == config  # False

This means that many session objects can exist, but they must all be started from the global config. This has the advantage that changes to a session variables can be independent from the global config, i.e. when running multiple recipes at once.

@stefsmeets
Copy link
Contributor Author

stefsmeets commented Oct 6, 2020

Some good progress on this PR, it is nearly complete. I added tests for the new functionality, and failing tests are now passing on my PC.

There was no test to check that the latest version is being resolved
correctly if 'latest' is not in the path
@stefsmeets stefsmeets requested a review from nielsdrost October 7, 2020 07:53
@Peter9192
Copy link
Contributor

I noticed that ESMValTool prints the config file it uses to the command line as the first log message when you run the tool. It would be nice if this would be the full path, for the default config-user.yml may be located in several different places. Would it be possible to change this here?

@stefsmeets
Copy link
Contributor Author

I noticed that ESMValTool prints the config file it uses to the command line as the first log message when you run the tool. It would be nice if this would be the full path, for the default config-user.yml may be located in several different places. Would it be possible to change this here?

It does not do this for me. Could you explain what you mean or copy the output here?

@bouweandela
Copy link
Member

Hi @stefsmeets, I tried to review this, but the number of changes is so large that it becomes very difficult to review:

43 files changed, 2338 insertions(+), 1238 deletions(-)

Do you think it would be possible to implement this with a smaller number of changes?

I noticed that there are a lot of formatting changes too, so maybe you could first make a pull request to just reformat the files you plan to change in this pull request, to at least get rid of all the cosmetic changes?

@stefsmeets
Copy link
Contributor Author

Hi @stefsmeets, I tried to review this, but the number of changes is so large that it becomes very difficult to review:

43 files changed, 2338 insertions(+), 1238 deletions(-)

Do you think it would be possible to implement this with a smaller number of changes?

I noticed that there are a lot of formatting changes too, so maybe you could first make a pull request to just reformat the files you plan to change in this pull request, to at least get rid of all the cosmetic changes?

I was a little bit worries about that too, but I think we missed that opportunity to implement this in smaller steps once we started working on the config file parameters. Many things are so tightly coupled, that it is difficult to split this up.

With cosmetic changes, do you mean the auto-formatting? There may have been some 'collateral damage' from yapf/isort, is this what you mean?

I can split it up in several pull requests, but it would take me a couple of days. It would have to be then at least 4 pull requests, probably in this order:

  • Reformat relevant files
  • Refactor data finder / drs structure
  • Add validated config object
  • Update config file parameters

Copy link
Contributor

@Peter9192 Peter9192 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @stefsmeets here's my "high-level" feedback to the PR :-)

I very much like the improved DRS handling!!

One suggestion would be to split the input and output data in the DRS. Then you could change 'ProjectData' to 'InputData', which I think is more intuitive. output_file could have a single default for all projects, or a dict with editable output_names for different projects (not sure whether that's necessary/desired functionality).

The API works nicely, but it does require some documentation in order to know how to use it. I think it would be nice if we could make it a little bit more intuitive. Here's some things I noticed when playing around with the API:

  • If you print the configuration, it would be nice if all ProjectData objects are grouped together.
  • It think if the ProjectData objects would have some str or repr method, that would make it more intuitive to understand what they are, and how to use/update them.
  • Should we expose all the methods of the cfg object? E.g. clear. I used this, and now I cannot get the default cfg back. Why would someone want to use pop or popitem? And what is the use case for find_all? I suppose it would be better to add things later than to have to remove them again.
  • I'm not so keen on the from _projects import searchlocation. Perhaps we could add a method to projectdata like cmip6.searchlocations.add() that does the same thing?
  • Why does get_input_filelists return 3 lists?
  • It might also be nice if it could return smt like: 'No input data found for ... in ....'
  • The projectdata has a method called get_cmor_table, but then you get CMIPInfo object, which again has a method get_table. That's not very intuitive. Perhaps we can make this an attribute or property instead of a method?

And here's some questions/comments related to the changes in the code:

  • I like how you split up the cfg developer. But considering that some people apparently used to change all of these things sometimes, would it make sense to move cmor_config.yml (and perhaps also institutes.yml) from cmor to the new config directory?
  • _exceptions.py: since this came up in a recent discussion, do you think this can help to get cleaner error messages for missing data?
  • esmvalcore._main.py: here you update the session with a couple of 'standard' command line options. But when we started using fire, I think the idea was that you could pass any parameter that's in the config file also via the command line, and it will be updated accordingly. Does that still work?
  • Related: I can't help but feeling that some of the config options are really very specific runtime arguments. If the session is very tightly coupled to a specific run, would it make sense to pass those runtime options only to the session, and not to the config?
  • I wonder how the refactored data finder functions impact cases like Where to find the correct frequency for ERA5 hourly variables? ESMValTool#1884

One final remark: I really like the overview of the main changes in this PR that you provided in a comment on your example notebook. Perhaps we can change the top post of this PR to include that summary? Then it will be easier to find how/why this changed for people who are tracing the changes back to this PR.

@stefsmeets
Copy link
Contributor Author

Just a note to keep an eye on ESMValGroup/ESMValTool#1707 , which may benefit from this PR

@stefsmeets
Copy link
Contributor Author

This PR is going nowhere, so I'm closing it and moving forward with a different attempt. Closing this in favour of #868.
I may split other useful bits in different PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants