-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement importable config object #785
Conversation
This enables the following: from esmvalcore import config
# the next line is optional, but could be used to overwrite the default config
config.load('~/.esmvaltool/config-user.yml')
config.plot_dir
>> PosixPath('/home/stef/r/esmvalcore/esmvaltool_output/session_20200916_095523/plots') The config is loaded whenever ESMValCore is imported, but in this way the keys can be easily accessed. The config is initialized upon import, so that the state is shared between modules. |
careful with this - heavy object construction may lead to the objects being un-picklable and the mpi will either have a hard time sending data to processes or it will just not be able to pickle - configuration is done only once but if the |
Thanks for the heads-up @valeriupredoi , that is good to keep in mind. I'm not too worried about that at the moment, because most of the state is stored as a dictionary and many python objects are easily pickled/unpickled. |
yeah sure - me neither, but just something we should keep in mind before we make too many objects and too many recursive calls, I have just went through the hoops with another code that was 100% recursive objects that was impossible to parallelize w/o major code changes 😁 |
I have been looking at the matplotlib source code for some inspiration of how to approach this. Essentially, they define their config ( configDefault = load_config_from_file(`default_config.yaml`)
configDefault.update(_hardcoded_defaults)
config = Config() # first initialize as an empty mapping
config.update(configDefault)
config.update(load_config_from_file(`config.yaml`)
configOrig= rcParams.copy() # keep a reserve copy Config is defined as a MutableMapping (https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/__init__.py#L639): class Config(MutableMapping, dict): # mapping to the configuration file.
validators = _validators
def __setitem__(self, item) # validate items on the way in
def __getitem__(self, item) # return items from the mapping
def load_config_from_file(path='path_to.yaml') -> Config:
... Each item is validated through _validators = {
'variable_name`: callable,
'nested.variable_name`: callable,
} Nested variable names are a bit of an issue that I'm not sure how to handle. I think we could simply flatten them for internal use. Additionally, I think we should consider implementing another class named |
I've been trying to see how far I get with this idea of having an importable config object ( At the moment I have based my implementation on a mutable mapping mimicking The nice thing is that the CLI parameters can go through the same interface, making use of the same validators if necessary. For the nested hierarchy that we have in the configs, I had to flatten the dictionary loaded from the yaml (i.e. I have also added a separate importable locations class ( |
I added the options from the CLI to the config. Having the validators is prevents you from making mistakes if you use a wrong command line option or variable. For example, |
Got the code to a working state with a basic recipe. The main difficulty was in the |
Good idea. It also goes in line with the old idea of having a |
The last commit fixes the remaining failing tests (mostly path issues). Thanks @bouweandela ! @bouweandela had a nice idea how to distiguish between the The config can be used to start a new session: from esmvalcore import config
session = config.start_session('recipe_name')
session == config # True, since both inherit from dict
session['remove_preproc_dir'] = False
session == config # False This means that many session objects can exist, but they must all be started from the global config. This has the advantage that changes to a session variables can be independent from the global config, i.e. when running multiple recipes at once. |
Some good progress on this PR, it is nearly complete. I added tests for the new functionality, and failing tests are now passing on my PC. |
There was no test to check that the latest version is being resolved correctly if 'latest' is not in the path
I noticed that ESMValTool prints the config file it uses to the command line as the first log message when you run the tool. It would be nice if this would be the full path, for the default config-user.yml may be located in several different places. Would it be possible to change this here? |
It does not do this for me. Could you explain what you mean or copy the output here? |
Hi @stefsmeets, I tried to review this, but the number of changes is so large that it becomes very difficult to review:
Do you think it would be possible to implement this with a smaller number of changes? I noticed that there are a lot of formatting changes too, so maybe you could first make a pull request to just reformat the files you plan to change in this pull request, to at least get rid of all the cosmetic changes? |
I was a little bit worries about that too, but I think we missed that opportunity to implement this in smaller steps once we started working on the config file parameters. Many things are so tightly coupled, that it is difficult to split this up. With cosmetic changes, do you mean the auto-formatting? There may have been some 'collateral damage' from yapf/isort, is this what you mean? I can split it up in several pull requests, but it would take me a couple of days. It would have to be then at least 4 pull requests, probably in this order:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @stefsmeets here's my "high-level" feedback to the PR :-)
I very much like the improved DRS handling!!
One suggestion would be to split the input and output data in the DRS. Then you could change 'ProjectData' to 'InputData', which I think is more intuitive. output_file
could have a single default for all projects, or a dict with editable output_names for different projects (not sure whether that's necessary/desired functionality).
The API works nicely, but it does require some documentation in order to know how to use it. I think it would be nice if we could make it a little bit more intuitive. Here's some things I noticed when playing around with the API:
- If you print the configuration, it would be nice if all ProjectData objects are grouped together.
- It think if the ProjectData objects would have some str or repr method, that would make it more intuitive to understand what they are, and how to use/update them.
- Should we expose all the methods of the cfg object? E.g.
clear
. I used this, and now I cannot get the default cfg back. Why would someone want to usepop
orpopitem
? And what is the use case forfind_all
? I suppose it would be better to add things later than to have to remove them again. - I'm not so keen on the
from _projects import searchlocation
. Perhaps we could add a method to projectdata likecmip6.searchlocations.add()
that does the same thing? - Why does
get_input_filelists
return 3 lists? - It might also be nice if it could return smt like: 'No input data found for ... in ....'
- The projectdata has a method called
get_cmor_table
, but then you get CMIPInfo object, which again has a methodget_table
. That's not very intuitive. Perhaps we can make this an attribute or property instead of a method?
And here's some questions/comments related to the changes in the code:
- I like how you split up the cfg developer. But considering that some people apparently used to change all of these things sometimes, would it make sense to move cmor_config.yml (and perhaps also institutes.yml) from
cmor
to the newconfig
directory? _exceptions.py
: since this came up in a recent discussion, do you think this can help to get cleaner error messages for missing data?esmvalcore._main.py
: here you update the session with a couple of 'standard' command line options. But when we started usingfire
, I think the idea was that you could pass any parameter that's in the config file also via the command line, and it will be updated accordingly. Does that still work?- Related: I can't help but feeling that some of the config options are really very specific runtime arguments. If the session is very tightly coupled to a specific run, would it make sense to pass those runtime options only to the session, and not to the config?
- I wonder how the refactored data finder functions impact cases like Where to find the correct frequency for ERA5 hourly variables? ESMValTool#1884
One final remark: I really like the overview of the main changes in this PR that you provided in a comment on your example notebook. Perhaps we can change the top post of this PR to include that summary? Then it will be easier to find how/why this changed for people who are tracing the changes back to this PR.
Just a note to keep an eye on ESMValGroup/ESMValTool#1707 , which may benefit from this PR |
This PR is going nowhere, so I'm closing it and moving forward with a different attempt. Closing this in favour of #868. |
Hi all, we started working on an importable config object. This PR can be used to track the progress. For more discussion see this issue: #498
Tasks
- [ ] If writing a new/modified preprocessor function, please update the documentationyamllint
to check that your YAML files do not contain mistakes- [ ] If you make backward incompatible changes to the recipe format, make a new pull request in the ESMValTool repository and add the link belowThe development in this PR is quite complete. Tests for all new functionality has been added. Here is a list of things still to do:
To do
config-default.yml
for which variables we want to keep (would be nice if someone could have a look at this)config_user
variable tosession
to fit the config API