Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple cache mechanism to GoogleWTS class #1533

Merged
merged 10 commits into from
Nov 18, 2020

Conversation

adrien-berchet
Copy link
Contributor

@adrien-berchet adrien-berchet commented Apr 23, 2020

Rationale

During my development I made many tests implying fetching tiles from Google. In the end, my IP address got banned for some time...
I propose a very simple cache mechanism to store tiles so that it is not needed to request tiles from the server at each test.
With this, it is possible to write:
gt_cache = cimgt.GoogleTiles(cache="\tmp\cartopy_cache")
and all tiles downloaded downloaded by this GoogleWTS object will be stored in the given directory for later use.

Implications

  • Avoid being banned from tile providers during heavy load tests.
  • Allow to work offline if the tiles have already been cached.

lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/tests/test_img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/tests/test_img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/tests/test_img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/tests/test_img_tiles.py Outdated Show resolved Hide resolved
@SciTools-assistant SciTools-assistant added Blocked: CLA needed See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form and removed Blocked: CLA needed See https://scitools.org.uk. Submit the form at: https://scitools.org.uk/cla/v4/form labels Apr 23, 2020
@adrien-berchet
Copy link
Contributor Author

I don't get why coverage dropped so much between the commits b86702c and dddbb63?

@greglucas
Copy link
Contributor

@adrien-berchet, thanks for the PR!
I think this has been a highly desired thing for quite some time. Linking in issue #732, that looks like it has a potential solution as well. I haven't had the chance to review that solution or yours though to see what the differences are.

Coverage has been pretty flaky recently and should hopefully be more reliable when we drop Python 2 support. So, I wouldn't read into it too much right now.

@adrien-berchet
Copy link
Contributor Author

Thanks for your feedback.
I did not see #732 which provides a possible solution for this issue, sorry about that. The idea of relying on file names is interesting so I updated this PR to consider this solution and the cache mechanism is now much more simple.
Also, I now save the Numpy array instead of the image to avoid some conversions.

lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
@adrien-berchet
Copy link
Contributor Author

@greglucas I rebased the branch on master

Copy link
Contributor

@greglucas greglucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this looks really good. One question is whether we want to have a cache_path at all? Would it make sense to just put a kwarg cache=True and then use the data_dir that is already in Cartopy config:
https://scitools.org.uk/cartopy/docs/latest/cartopy.html#cartopy.config
Maybe make a new cache dir for your saved files underneath that? Then a user could easily clean up all local Cartopy files and not need to remember what cache_path they used.

@greglucas greglucas added this to the 0.19 milestone May 20, 2020
@QuLogic
Copy link
Member

QuLogic commented May 20, 2020

There was also #795, which doesn't seem to have been completed. I have not compared the implementations.

@adrien-berchet
Copy link
Contributor Author

Overall, I think this looks really good. One question is whether we want to have a cache_path at all? Would it make sense to just put a kwarg cache=True and then use the data_dir that is already in Cartopy config:
https://scitools.org.uk/cartopy/docs/latest/cartopy.html#cartopy.config
Maybe make a new cache dir for your saved files underneath that? Then a user could easily clean up all local Cartopy files and not need to remember what cache_path they used.

Indeed it would be possible to store the cache in data_dir but I am afraid that the cache size explodes if it is global and if we don't implement a kind of rotation mechanism (keep only the 1000 last tiles and remove older ones for example).
For this reason I feel it is more relevant to have project specific caches. But we could setup a temporary cache when no cache directory is specified, so the global cache is automatically cleared at each reboot.
What do you prefer?

@adrien-berchet
Copy link
Contributor Author

There was also #795, which doesn't seem to have been completed. I have not compared the implementations.

Indeed, I did not know it.
As far as I can see, it uses shelf to store the tiles while here we use numpy I/O functions. I never used shelf so I don't know which is better but I guess it is better to use numpy functions to store numpy arrays.

@greglucas
Copy link
Contributor

I have also never used shelve, so don't really have anything to compare it against, but it looks like it is more for an object store that you could reload and immediately access attributes/methods. Since you're just loading the data and not objects, I think the numpy save/load would be fine here.

What about adding a cache_dir key into the config dictionary? Each user could set that to whatever directory they'd like then, but maybe have a sensible default under or next to the data directory similar to Natural Earth?

config = {'pre_existing_data_dir': '',
'data_dir': _data_dir,
'repo_data_dir': os.path.join(os.path.dirname(__file__), 'data'),
'downloaders': {},
}

Then we could change the keyword argument input to the function to just be cache and allow cache=True which would go to the global location, or an explicitly mentioned location too cache=/tmp/dir. I'm just thinking that some people may not want to explicitly put the directory every time? But, it would also be much safer to require the explicit handling than blowing up someones disk with a large cache too.

@adrien-berchet adrien-berchet force-pushed the add_cache branch 3 times, most recently from 9757d12 to 3832c59 Compare July 15, 2020 17:40
@adrien-berchet
Copy link
Contributor Author

I rebased on master and added a cache_dir entry in the config dict. I set a default value to a temp directory, so the cache is cleared at each reboot when no directory is specified. WDYT about it @greglucas ?

Copy link
Contributor

@greglucas greglucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a pretty nice way to do it! The only downside is that I think it will use a new tempdir for each virtual environment a user would create I think because this is creating a random tempdir at each installation. That is one thing I like about the current Natural Earth data being downloaded to my user space rather than per each venv I create.

I'm fine with it as is, just want to make sure that is what you want as a user and this makes sense to others too.

I was also wondering why the new cache keyword argument didn't have any docstrings with it in all of these new places and it turns out we just don't have any docstrings there currently :(

"""Load the cache"""
if self.cache_path is not None:
if not os.path.exists(self._cache_dir):
os.makedirs(self._cache_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want emit a warning message telling them we are creating the cache_dir in case a user didn't realize they were making this dir?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could indeed. But maybe only in the case where cache=True, because when the user explicitly sets a directory path we can suppose he knows what he is doing.


config = {'pre_existing_data_dir': '',
'data_dir': _data_dir,
'cache_dir': _cache_dir,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an entry into the docstring below describing what this new entry is for.

@adrien-berchet
Copy link
Contributor Author

In fact, when cache=True, it will create only one cache directory that is common to all sessions in which cache=True since I just get the path of global tmp dir (/tmp, C:\temp or whatever) and create the cartopy_cache_dir inside it. So only one is created and the cache is shared among different sessions (except for those in which a custom directory is specified). I think this way we have a good balance between ease of use and disk space.

If you are ok with that I will add the docstrings (I was waiting the final version to write the doc).

@adrien-berchet
Copy link
Contributor Author

adrien-berchet commented Jul 16, 2020

Oh and I have another question @greglucas : since we provide a default path to a temp directory, should we set the default value of the cache parameter to True?

lib/cartopy/io/img_tiles.py Outdated Show resolved Hide resolved
lib/cartopy/tests/test_img_tiles.py Outdated Show resolved Hide resolved
Copy link
Contributor

@greglucas greglucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this out locally in two different virtual environments and you're right it does maintain the same temporary cache_dir in both of them.

I think that we should keep this as an additional option for now, and potentially default to it in a follow-up PR? It may be nice to set a global default somehow though, so you don't have to specify the cache keyword with every single call... Something along the lines of the mpl.rcParams way of doing things.

For now, I think this looks great! So I'll approve it and wait to see if anyone else has any feedback on it for a bit before merging.

@adrien-berchet
Copy link
Contributor Author

Ok, thanks! Let's wait for other feedbacks then :)

@greglucas
Copy link
Contributor

Thank you for rebasing, I just reran the doc CI build that was failing and everything looks good now.

@QuLogic
Copy link
Member

QuLogic commented Aug 20, 2020

I'm not so sure about tempdir, as that is shared by users, and this doesn't use a user-specific name, nor an exclusive creation mode. I think the user cache may be a better location? Though it's not consistent across platforms (~/.cache/cartopy on Linux, somewhere else on Windows.)

@adrien-berchet
Copy link
Contributor Author

adrien-berchet commented Aug 20, 2020

Hi @QuLogic
You're right, I didn't think about that. But is there any issue to share this cache among users? Anyway, if someone is concerned about cache privacy, it is possible to set a custom cache path to a private location. Maybe we could just speak about this issue in the doc?

@adrien-berchet
Copy link
Contributor Author

adrien-berchet commented Oct 19, 2020

Hi there, I just realized that this PR was still pending. I rebased and added a note about cache privacy in __init__.py.

@greglucas
Copy link
Contributor

👍 My approval still stands. I think this is good with the note you've added, especially since this isn't on by default and the user has to opt-in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants