Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make fsspec_open_kwargs, query_string_secrets, & is_opendap attributes of FilePattern #167

Merged
merged 107 commits into from
Sep 2, 2021

Conversation

cisaacstern
Copy link
Member

@cisaacstern cisaacstern commented Jul 16, 2021

Proposed feature description

Users may source data from APIs which authenticate via parameters (e.g., token, username, etc.) appended to a url path. This PR provides DropAPIParamsCache, a subclass of CacheFSSpecTarget that removes any of these (likely secret) API parameters from the cache filenames and logs generated at execution time.

An assumption here is that there are situations where someone may have license to build a zarr store from password-protected data, but they don't want their credentials being reproduced on the execution worker system(s). See below for motivating recipe.

If we want to move forward with this feature, the following TODO items remain (assuming we preserve all existing logging):

  • storage.file_opener would need a conditional public_fname assignment; something like this
  • recipes.xarray_zarr.open_input also needs a public_fname assignment before logging here

Motivating recipe

This feature draft is motivated by @paigem's recipe pangeo-forge/staged-recipes#56, which as described in the linked issue here pangeo-forge/staged-recipes#46, sources files from NCAR's Climate Data Gateway via an HTTP API.

Once signing into the web GUI with valid credentials, the user is given an API token to append to the data url. For example, the first source url for the above-linked recipe is:

https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=<TOKEN>

where <TOKEN> is a secret.

@cisaacstern cisaacstern requested review from rabernat and removed request for rabernat July 16, 2021 00:40
@cisaacstern cisaacstern marked this pull request as ready for review July 16, 2021 01:21
@cisaacstern
Copy link
Member Author

This recipe.py for pangeo-forge/staged-recipes#56 shows one way these secret API parameters could be appended to the file paths at execution time. Note the usage pattern for this recipe would be something like:

from recipe import instantiate_recipe

api_token = "my-secret-token"
authenticated_recipe = instantiate_recipe(api_params=api_token)

# etc.

Automating this would presumably require an update to the meta.yaml spec recipes section to permit something like:

recipes:
  - callable: "recipe:instantiate_recipe"
    args: "api_params:{{ pointer to GitHub Secrets }}" 

AFAIK, the "recipe secrets" discussed in #79 have mostly been thought of as usernames and passwords which can be passed as fsspec_open_kwargs at execution time. This scenario is a little different, because the credentials need to be appended to the url paths.

cc @sharkinsspatial

@rabernat
Copy link
Contributor

rabernat commented Jul 16, 2021

This scenario is a little different, because the credentials need to be appended to the url paths.

Just curious if it would also work if you put the token in the HTTP Authorization header, rather than in a query string. That is a very common way to do token-based API access.

If so, we could use fsspec_open_kwargs after all.

@cisaacstern
Copy link
Member Author

Just curious if it would also work if you put the token in the HTTP Authorization header, rather than in a query string.

As far as I can tell, this doesn't work for this particular source (NCAR Climate Data Gateway). I made a few attempts to pass this token to fsspec with various iterations of {"headers": {"Authorization": "<type> <credentials>"}}, none of which worked. I could do a more thorough job of documenting the Tracebacks I received if it would be useful, but perhaps a further clue resides in the fact that, when attempting to navigate to one of the source data urls without appending a query string, we are redirected to a GUI login page; as seen, e.g., here: https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc

This does not seem to match the usual pattern described in the MDN Web Docs, where:

usually, but not necessarily, after the server has responded with a 401 Unauthorized status and the WWW-Authenticate header.

@cisaacstern
Copy link
Member Author

If we do want to move forward with a version of this feature, the implementation could alternatively be as an AddAPIParamsCache subclass of CacheFSSpecTarget, which appends the query string to the filename before caching. Assuming this does also work, it has the advantage of not requiring the divergent meta.yaml pattern described above.

@rabernat
Copy link
Contributor

Let's talk this through at today's upcoming call.

@rabernat
Copy link
Contributor

From my point of view, it seems much simpler to define the FilePattern without these secrets in the query string, and then figure out a way to pass extra secrets (key-value pairs) to be encoded into the URL.

@cisaacstern
Copy link
Member Author

From my point of view, it seems much simpler to define the FilePattern without these secrets in the query string, and then figure out a way to pass extra secrets (key-value pairs) to be encoded into the URL.

Agreed. I think a subclass of CacheFSSpecTarget is the best way to do this, but open to all suggestions.

@rabernat
Copy link
Contributor

rabernat commented Jul 19, 2021

An alternative would be to put this in the file_opener function.

@contextmanager
def file_opener(
fname: str,
cache: Optional[CacheFSSpecTarget] = None,
copy_to_local: bool = False,
**open_kwargs,
) -> Iterator[Union[OpenFileType, str]]:

via an extra argument like query_string_secrets.

All remote file opening goes through that function, so it would make sense.

@rabernat
Copy link
Contributor

All remote file opening goes through that function

This is wrong. We also open files here:

class CacheFSSpecTarget(FlatFSSpecTarget):
"""Alias for FlatFSSpecTarget"""
def cache_file(self, fname: str, **open_kwargs) -> None:

@cisaacstern cisaacstern marked this pull request as draft July 20, 2021 17:14
@cisaacstern cisaacstern changed the title Proposed feature: option to drop API params from logs + cached filenames Add query_string_secrets to fname if provided Jul 29, 2021
@cisaacstern cisaacstern marked this pull request as ready for review July 31, 2021 22:35
@cisaacstern
Copy link
Member Author

@rabernat, this PR is ready for your re-review. Following reflection on your #167 (comment), I believe this feature can be implemented with the very light touch represented here. AFAICT, declaration of CacheFSSpecTarget.cache_file.input_opener is actually the only line where we need to add in query string secrets, so defining an input_fname to be passed to that line only, looks like it allows us to authenticate in the source file context, without secrets leaking to pangeo-forge-recipes logs or cached filenames.

This PR required some small changes to the test suite, to get test_storage.py to pass. This is my first foray into the tests, so certainly won't be surprised if you have some notes there.

I do wonder if there may be a way to bring some of the new names into closer alignment with the spirit of "Call things the same thing", as I have ended up with

  • query_string_secrets as the user-facing parameter in XarrayZarrRecipe (for expressiveness)
  • secrets as the inner implementation of that parameter in storage.py and test_storage.py (for conciseness)
  • fake_secrets for the fixture name in conftest.py (seemed descriptive)

...but maybe these are sufficient justifications for the naming heterogeneity.

And a final note: defining an _add_query_string_secrets function in storage.py is obviously a bit heavy-duty just to concatenate two strings, but I figured it's worth having so that we have a place to build out other ways of adding query string secrets in the future. In general, I think (?) it should not matter what order query string parameters are passed, so even if a particular url specifies other parameters aside from secrets, it will be fine to simply append the secrets at the end.

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great progress.

A few comments about the implementation and test logic.

pangeo_forge_recipes/storage.py Outdated Show resolved Hide resolved
pangeo_forge_recipes/storage.py Outdated Show resolved Hide resolved
pangeo_forge_recipes/storage.py Outdated Show resolved Hide resolved
tests/conftest.py Outdated Show resolved Hide resolved
tests/test_storage.py Outdated Show resolved Hide resolved
tests/test_storage.py Outdated Show resolved Hide resolved
@cisaacstern cisaacstern marked this pull request as draft August 2, 2021 16:55
@cisaacstern
Copy link
Member Author

Thanks for these thoughtful suggestions. For clarity, I've converted this back to a draft, and will re-mark it as 'Ready for review' once I've incorporated this round of feedback.

@cisaacstern cisaacstern changed the title Add query_string_secrets to fname if provided Make fsspec_open_kwargs, query_string_secrets, & is_opendap attributes of FilePattern Sep 1, 2021
@cisaacstern
Copy link
Member Author

cisaacstern commented Sep 1, 2021

I believe all of the items from #167 (comment) are now complete. A few notes:

Add additional attributes to the FilePattern class ... related to file opening / access / secrets.

AFAICT without changing the implementation of *combine_dims as *args in the FilePattern.__init__ call signature, our options for adding these kwargs are:
1. Declare them explicitly by name, requiring that they go before *combine_dims in the call signature (doesn't seem right; *combine_dims should come first)
2. Pass them after *combine_dims as **kwargs

I've opted for the latter, but reject all but the three supported kwargs, as shown here. Given that these are passed as **kwargs, is it inadvisable to list them as function params, as I've done here? From the user perspective, these act like explicit parameters, but from a Python internals standpoint, that may be a bit of a misnomer.

Edit: These questions resolved in #167 (comment)

Add a test for these new attributes in test_patterns.py.

I'm testing this within test_file_pattern_concat_merge here to ensure that instantiating and/or updating these kwargs isn't affecting the core functionality of FilePattern.

Update the narrative documentation for patterns to explain how to use the new attributes.

☑️ here

Refactor the XarrayZarrRecipe class to access these attributes from the FIlePattern

☑️ here

Update conftest.py and test_recipes.py to leverage this information, passing the secrets via the FilePattern fixtures

  • conftest.py: path fixtures (netcdf_local_paths and netcdf_http_paths) now return 5-tuples, with the added item being these new kwargs. They are always empty for local paths and populated according to the server's parametrization for http paths. These are then passed to the FilePattern instance here.
  • test_recipes.py: the beauty of putting this all into FilePattern is that the footprint in test_recipes.py is essentially nonexistent. All of the recipes already get a FilePattern, of course. And if that object comes from the netcdf_http_file_pattern fixture (as opposed to the local pattern fixture) and requires some form of authentication, then it comes prepackaged with all the information it needs. (Whether that be fsspec_open_kwargs or query_string_secrets.)

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great Charles! We are super close on this very complicated and much needed test refactor.

I just have one more round of comments and then I'm sure we'll be ready to merge. Thanks for your patience and perseverance here.

docs/recipe_user_guide/file_patterns.md Outdated Show resolved Hide resolved
docs/recipe_user_guide/file_patterns.md Outdated Show resolved Hide resolved
pangeo_forge_recipes/patterns.py Outdated Show resolved Hide resolved
pangeo_forge_recipes/patterns.py Show resolved Hide resolved
tests/test_patterns.py Outdated Show resolved Hide resolved
tests/test_patterns.py Outdated Show resolved Hide resolved
tests/test_recipes.py Outdated Show resolved Hide resolved
tests/test_recipes.py Outdated Show resolved Hide resolved
tests/test_references.py Outdated Show resolved Hide resolved
tests/test_references.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants