Streamline observations download #1657

jvegreg · 2020-05-13T10:29:45Z

The aim of this pull request is to streamline the procedure to download and reformat observational datasets for ESMValTool. I do not aim to have automated downloads for every dataset we are currently supporting, but to provide the framework and a good number of examples of implementation so it becomes easy to implement in the near future.

In particular, I have created three basic downloader classes for HTPPS (using wget), FTP (using the python library) and Climate Data Store (using cdsapi) downloads along with some example implementations for some datasets. I also provide some derived classes to simplify getting dataset from specific sources, like the NASA storage or the ESA-CCI CEDA anonymous FTP.

Regarding the interface, I plan to use the reworked CLI interface in ESMValGroup/ESMValCore#605 to refactor it so we get something similar to:

esmvaltool data prepare DATASET1,DATASET2 --start 1990 --end 2015
esmvaltool data download DATASET1,DATASET2 --start 1990 --end 2015
esmvaltool data format DATASET1,DATASET2 --start 1990 --end 2015

where prepare will download and format the data in one go. Hopefully, this will make life easier for those user that do not have access to JASMIN nor DKRZ.

Tasks

Create an issue to discuss what you are going to do, if you haven't done so already (and add the link at the bottom)
Give this pull request a descriptive title that can be used as a one line summary in a changelog
Add documentation
Make sure your code is composed of functions of no more than 50 lines and uses meaningful names for variables
Circle/CI tests pass. Status can be seen below your pull request. If the tests are failing, click the link to find out why.
Preferably Codacy code quality checks pass, however a few remaining hard to solve Codacy issues are still acceptable. Status can be seen below your pull request. If there is an error, click the link to find out why. If you suspect Codacy may be wrong, please ask by commenting.
Please use yamllint to check that your YAML files do not contain mistakes
(Only if really necessary) Add any additional dependencies needed for the diagnostic script to setup.py, esmvaltool/install/R/r_requirements.txt or esmvaltool/install/Julia/julia_requirements.txt (depending on the language of your script) and also to package/meta.yaml for conda dependencies (includes Python and others, but not R/Julia)
If new dependencies are introduced, check that the license is compatible with Apache2.0

bouweandela · 2020-05-13T11:24:09Z

Cool idea @jvegasbsc! Maybe we could choose a different name than obs or observations, because not everything we cmorize are observations? See also #1407.

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

jvegreg · 2020-05-13T13:37:21Z

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

I think the discoverability we will get with a single entry point is worth it. But my idea will be to find a way to allow ESMValTool to add commands to the interface without ESMValCore knowing anything about them, like a plug-in or similar

valeriupredoi · 2020-05-13T13:48:18Z

@jvegasbsc this is very cool! But I have an important question (maybe @mattiarighi can also shed more light on it too): what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

jvegreg · 2020-05-13T13:58:13Z

what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

In that case, the illegal part is to redistribute, not to download. We are not supporting that part, just the downloading procedure, which is completely fine

valeriupredoi · 2020-05-13T15:08:42Z

what are the licensing troubles we may get into if some random user downloads Tier 3 data via ESMValTool and starts redistributing it against the rules?

In that case, the illegal part is to redistribute, not to download. We are not supporting that part, just the downloading procedure, which is completely fine

there is a fine line between download and redistribution - we should make sure that the tool doesn't write any freshly downloaded Tier 3 files to any location before CMORizing them but rather downloads and CMORizes them on the fly; there is also the issue of clogging the disk if the tool keeps the originally downloaded files 🍺

jvegreg · 2020-05-14T08:57:34Z

I'm not too sure about inclusion as a subcommand in the esmvaltool command, because that is part of ESMValCore while the CMORizers live in ESMValTool.

I found a way to allow adding subcommands in esmvaltool without ESMValCore knowing anything about it. It is one of the ways that pytest uses to discover plugins.

Step 1: define a esmvaltool_commands entry point in setup.py:

'esmvaltool_commands': [
    'data = esmvaltool.cmorizers.obs.cmorize_obs:DataCommand'
]

In ESMValCore, scan for 'esmvaltool_commands' entry points and add the classes to the Fire interface:

for ep in iter_entry_points('esmvaltool_commands'):
    if hasattr(self, ep.name):
        logger.error('Registered command %s already exists', ep.name)
        continue
    self.__setattr__(ep.name, ep.load())

so we can decouple completely ESMValCore from this.

bouweandela · 2020-08-19T18:18:50Z

Would it be an idea to add a command like

esmvaltool download recipe_example.yml

that would download and reformat all data needed to run recipe_example.yml? That way users would not need to bother with figuring out what data is needed exactly to run a recipe.

Maybe that could be simplified even further to

esmvaltool run --download recipe_example.yml

to combine all steps of obtaining the data, formatting it and running into a single command.

remi-kazeroni · 2022-02-18T14:52:07Z

Thanks a lot for your suggestions and your help in getting rid of the Codacy issues @zklaus! Please let me know if you are fine with the changes. There is a final issue to fix but it depends on how we want the prepare command to work (see discussion). Both choices would reduce the number of Codacy issues,

I have rerun many test cases and everything works fine. Therefore, I would really like to have this PR merged soon and makig the new cmorizer interface available in v2.5.

Last point for @zklaus: Shall we revert the changes regarding the pylint&co configuration files? It seems to me that further discussion is needed in #2527 before this part can be finalized.

remi-kazeroni · 2022-02-22T16:53:36Z

Hi @zklaus, I have finished fixing the last (problematic) Codacy issue after deciding on the behaviour of the prepare command. I checked that the latest changes do not break anything w.r.t. the cmorizers. Do you think you would have a chance to look at the last changes over the next couple of days and approve if fine for you? Could you please also let me know if I can revert the changes regarding the configuration files for the linting tools?

We would still like to merge this PR this week, say Thursday, so that it lives in the main branch for a few days before the v2.5 release and we can see how the automatic testing engines react to that.

zklaus · 2022-02-23T11:48:56Z

@remi-kazeroni, nice work! I'll have very few minor comments, but regarding the reverting of the linter configs: Absolutely, do revert them. In this branch, they were only intended to help filter the barrage of codacy problems so that we could see what really needs addressing.

esmvaltool/cmorizers/data/cmorizer.py

esmvaltool/cmorizers/data/formatters/datasets/esacci_sst.py

Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>

zklaus · 2022-02-24T15:54:48Z

Stellar work, @remi-kazeroni !

remi-kazeroni · 2022-02-24T16:06:15Z

Thanks a lot @zklaus! All credits should go to @jvegreg for doing all the coding! 🍻

But I still wanted to address your final comment on the timerange selection. I agree the tool should return an error if the time range selection is wrong (wrong argument in the cli). The current assumption that a wrong timerange leads to the downloading/formatting of the whole datasets is not ideal in my opinion.

Please nobody merge now 😆

zklaus

Time range issue as described by @remi-kazeroni

remi-kazeroni · 2022-02-25T09:05:05Z

I have done some further tests on possible misuse of the time ranges arguments (--start, --end). For example, what happens with --start=1250? The download does not start but the error message is not uniform across datasets. This depends on the data source, the file naming (year included in the file name, folder name,...). Or what happens if a timerange is specified for a dataset where this is not needed (e.g. just one global file and no yearly files)? In these cases, the global files are downloaded.

In short: there may be room for further enhancements but I would leave that for future PRs since I have not found any problematic cases with the current handling of the time range.

remi-kazeroni · 2022-02-25T09:15:52Z

Not sure how to display the codacy report of the second to last commit, but the number of issue was 7 without any problematic issue left. The number grew to 310 after reverting the changes on .pylintrc and .prospector.yml made by @zklaus to help with the handling of Codacy issues.

To be able to merge this, I guess we just need a final approval from @zklaus and coordinate the merge button hitting with @schlunma (I'd need a bit of time to draft a message to open a Github discussion about the changes related to the new cmorizer interface).

Thanks to all who contributed, reviewed, tested!! 🥳 🍻

schlunma · 2022-02-25T09:19:16Z

Sounds great, @remi-kazeroni 👍 I will merge this PR shortly after you opened the discussion.

Thanks to everyone who contributed to this, this is an awesome addition 🎉

esmvaltool/cmorizers/data/cmorizer.py

zklaus · 2022-02-25T09:32:15Z

Not sure how to display the codacy report of the second to last commit, [...]

I'm not sure it's possible with Codacy. In principle, you can click on the little icon in front of the commit hash in the listing above (a green check mark, a red x, or a yellow dot for successful, failed, and ongoing checks, respectively). This will pop up a little dialog where you can click on details to see the specifics of a particular check. Have a look at the following screenshot.

The problem with Codacy is that it seems to show only the details for the last run, no matter which one you select. That's a bug in Codacy, I think.

zklaus · 2022-02-25T10:07:00Z

Great, thanks to everyone, particularly @remi-kazeroni for pushing this over the finish line! 🎉

There certainly is room for improvement, but as you said, let's take that up later.

remi-kazeroni · 2022-02-25T10:27:25Z

Thanks! @schlunma: could you please merge this now? It would help me to get the right links in the documentation to write the discussion.

valeriupredoi · 2022-02-25T10:45:47Z

@remi-kazeroni and the Streamliners band, legends! 🍺

jvegreg added enhancement is-enes observations labels May 13, 2020

jvegreg requested review from bouweandela, mattiarighi and valeriupredoi May 13, 2020 10:29

jvegreg mentioned this pull request May 14, 2020

Improve command line interface ESMValGroup/ESMValCore#605

Merged

9 tasks

This was referenced May 26, 2020

Remove _obs_ from cmorizers #1665

Closed

Add colortables as esmvaltool subcommand #1666

Merged

This was referenced Jun 3, 2020

Obs data for recipe_anav13jclim.yml #1669

Closed

Mention tools for downloading datasets in documentation #1311

Closed

Remove the word 'obs' from CMORizer scripts #1407

Closed

bouweandela mentioned this pull request Jul 22, 2020

Make all functions from cmorizers/obs/utilities.py public #1645

Closed

jvegreg force-pushed the refactor_downloads branch from c3d4dde to 9492d91 Compare August 17, 2020 11:08

jvegreg force-pushed the refactor_downloads branch from 5dec1b8 to 028d098 Compare August 20, 2020 08:55

Javier Vegas-Regidor added 4 commits August 20, 2020 10:59

Update docker and create action

2975dcb

Remove Julia

aee817c

Update dockerfiles

336eb1f

Update dockerfile

812d63d

bouweandela mentioned this pull request Sep 11, 2020

Monthly ESMValtool meeting September #1801

Closed

Javier Vegas-Regidor added 3 commits September 16, 2020 08:57

First refactor of download scripts

13c54c0

Add support for ESACCI data through ftp

c133be8

Add progress bar to ftp and esacci_fire

8194493

fix codacy esacci_watervapour

50b20ed

remi-kazeroni added 2 commits February 18, 2022 15:56

Merge branch 'main' into refactor_downloads

16c9645

change prepare behaviour

6db278d

zklaus requested changes Feb 23, 2022

View reviewed changes

esmvaltool/cmorizers/data/cmorizer.py Show resolved Hide resolved

esmvaltool/cmorizers/data/cmorizer.py Show resolved Hide resolved

esmvaltool/cmorizers/data/formatters/datasets/esacci_sst.py Outdated Show resolved Hide resolved

remi-kazeroni and others added 3 commits February 24, 2022 16:18

Update esmvaltool/cmorizers/data/formatters/datasets/esacci_sst.py

acacebb

Co-authored-by: Klaus Zimmermann <klaus.zimmermann@smhi.se>

fix and align exceptions

c6d715a

warning message added

7e8b14c

zklaus approved these changes Feb 24, 2022

View reviewed changes

zklaus requested changes Feb 24, 2022

View reviewed changes

remi-kazeroni added 2 commits February 25, 2022 09:30

exception for unsupported datetime formats addedd

1390702

revert pylint and prospector changes

1a2d1a7

zklaus requested changes Feb 25, 2022

View reviewed changes

esmvaltool/cmorizers/data/cmorizer.py Outdated Show resolved Hide resolved

fix exception

3dd2e1c

zklaus approved these changes Feb 25, 2022

View reviewed changes

schlunma merged commit 5aef74c into main Feb 25, 2022

schlunma deleted the refactor_downloads branch February 25, 2022 10:34

remi-kazeroni mentioned this pull request Mar 2, 2022

Update episode 9 to refer to the new cmorizer interface ESMValGroup/ESMValTool_Tutorial#232

Closed

remi-kazeroni mentioned this pull request Dec 1, 2022

Make CMORizer documentation available on readthedocs #1219

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Streamline observations download #1657

Streamline observations download #1657

jvegreg commented May 13, 2020 •

edited by remi-kazeroni

Loading

bouweandela commented May 13, 2020

jvegreg commented May 13, 2020

valeriupredoi commented May 13, 2020

jvegreg commented May 13, 2020

valeriupredoi commented May 13, 2020

jvegreg commented May 14, 2020

bouweandela commented Aug 19, 2020

remi-kazeroni commented Feb 18, 2022

remi-kazeroni commented Feb 22, 2022

zklaus commented Feb 23, 2022

zklaus commented Feb 24, 2022

remi-kazeroni commented Feb 24, 2022

zklaus left a comment

remi-kazeroni commented Feb 25, 2022

remi-kazeroni commented Feb 25, 2022

schlunma commented Feb 25, 2022

zklaus commented Feb 25, 2022 •

edited

Loading

zklaus commented Feb 25, 2022

remi-kazeroni commented Feb 25, 2022

valeriupredoi commented Feb 25, 2022

Streamline observations download #1657

Streamline observations download #1657

Conversation

jvegreg commented May 13, 2020 • edited by remi-kazeroni Loading

bouweandela commented May 13, 2020

jvegreg commented May 13, 2020

valeriupredoi commented May 13, 2020

jvegreg commented May 13, 2020

valeriupredoi commented May 13, 2020

jvegreg commented May 14, 2020

bouweandela commented Aug 19, 2020

remi-kazeroni commented Feb 18, 2022

remi-kazeroni commented Feb 22, 2022

zklaus commented Feb 23, 2022

zklaus commented Feb 24, 2022

remi-kazeroni commented Feb 24, 2022

zklaus left a comment

Choose a reason for hiding this comment

remi-kazeroni commented Feb 25, 2022

remi-kazeroni commented Feb 25, 2022

schlunma commented Feb 25, 2022

zklaus commented Feb 25, 2022 • edited Loading

zklaus commented Feb 25, 2022

remi-kazeroni commented Feb 25, 2022

valeriupredoi commented Feb 25, 2022

jvegreg commented May 13, 2020 •

edited by remi-kazeroni

Loading

zklaus commented Feb 25, 2022 •

edited

Loading