
Setting object_pairs_hook for json.loads #313

Closed · l0b0 opened this issue Apr 27, 2021 · 3 comments · Fixed by #309

Comments

l0b0 (Contributor) commented Apr 27, 2021

I'm working on a project that recursively reads, parses, and validates STAC files on AWS S3, builds a list of asset and metadata files, and copies all of them across to reference storage. In other words, if a file is not mentioned in the metadata, it will not be copied.

The STAC documentation specifies that assets in collections have unique keys. Unfortunately, the STAC JSON schema can't encode this and the Python JSON parser does not treat duplicate keys as an error.

Combining the above means the following scenario is likely to happen (and has already happened during a pre-production test):

  1. A dataset maintainer creates a STAC assets property with two or more identical names, for example, "assets": {"foo": {"href": "s3://bucket/first"}, "foo": {"href": "s3://bucket/second"}}.
  2. They import the dataset.

At this point there is no indication that anything went wrong, but the JSON parser has silently dropped one of the assets. Detecting and reporting such issues is highly desirable, and can be done by setting the object_pairs_hook in json.loads (implementation, test).
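
A minimal sketch of the idea (the helper name and the ValueError are illustrative choices, not pystac API): json.loads calls the object_pairs_hook with the (key, value) pairs of each object instead of building the dict itself, so the hook can raise rather than silently keep the last value.

import json

def reject_duplicate_keys(pairs):
    # Called once per JSON object, innermost objects first.
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key in JSON object: {key!r}")
        result[key] = value
    return result

raw = '{"assets": {"foo": {"href": "s3://bucket/first"}, "foo": {"href": "s3://bucket/second"}}}'
json.loads(raw, object_pairs_hook=reject_duplicate_keys)  # raises ValueError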

We're evaluating whether to use pystac for this project, and this is one of the features of our current code base we would like to keep. Would you be interested in a PR to enable this? It would presumably be very similar to the read_text_method mechanism you already have in STAC_IO.

lossyrob (Member) commented

Thanks for pointing this issue out. It's an interesting edge case; I'm curious how the dataset maintainer ended up with JSON with duplicate keys in the first place, as most JSON implementations won't allow this.

There's a change coming to STAC_IO to refactor it to be more flexible, after some difficulties dealing with authorization headers in pystac-client. This change will likely move to an object-based IO class that can be overridden from the default, instead of a top-level package STAC_IO. Not sure that's applicable here, but it's something to keep in mind while thinking through changes to the current STAC_IO structure.

The duplicate_object_names_report_builder looks like it does a fair bit of work, and would be a performance hit in cases where the user is more certain there aren't duplicate keys - so I'd prefer not to have this enabled by default. On the performance front, I'd also like to enable users to utilize orjson where it is installed - we've seen a lot of serialization/deserialization speedups using it, which can add up when processing millions of items.
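
A hedged sketch of that optional speedup (orjson.loads accepts str or bytes and returns the same Python structures as json.loads; note that orjson has no object_pairs_hook, so it can't be combined with the duplicate-key check above):

try:
    import orjson

    def fast_json_loads(text):
        # orjson is a compiled parser; noticeably faster at scale
        return orjson.loads(text)
except ImportError:
    import json

    def fast_json_loads(text):
        # standard-library fallback when orjson is not installed
        return json.loads(text)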

My current thinking around the STAC_IO refactor is to have an optional IO object passed into any read or write operation. If the user doesn't provide one, the operation will fall back to, in order:

  1. the IO implementation that was used for a previous IO operation against a parent/root (meaning STAC objects will keep track of and reuse the IO implementation that was used to create them, e.g. for reading links), or
  2. a default implementation, which can be passed into the root catalog or perhaps set at the package level if a user wants to set it once and forget about it.

(A sketch of this resolution order follows.)
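
A hypothetical sketch of that resolution order - all names here (resolve_stac_io, explicit_io, _stac_io) are assumptions for illustration, not the actual pystac API:

def resolve_stac_io(explicit_io, parent, default_io):
    if explicit_io is not None:
        return explicit_io                # 1. IO passed into this operation
    if parent is not None and getattr(parent, "_stac_io", None) is not None:
        return parent._stac_io            # 2. IO reused from a parent/root
    return default_io                     # 3. catalog- or package-level default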

In the above setting, we could provide a few standard implementations, e.g. one that uses orjson instead of json, and one that uses object_pairs_hook to detect duplicate keys.

This might be a bit hard to parse through without code to look at, but I was planning on tackling the IO refactor as part of a larger set of refactors in #309, which is making some bigger changes to the library in preparation for the final STAC 1.0 release. I can take this object_pairs_hook feature into consideration as that progresses and report back here with more concrete ideas on how to enable this as an option for IO in PySTAC.

l0b0 (Contributor, author) commented Apr 27, 2021

> Thanks for pointing this issue out. It's an interesting edge case; I'm curious how the dataset maintainer ended up with JSON with duplicate keys in the first place, as most JSON implementations won't allow this.

It was a case of copy-pasting an asset entry, which I guess is pretty common in manually edited metadata.

> The duplicate_object_names_report_builder looks like it does a fair bit of work, and would be a performance hit in cases where the user is more certain there aren't duplicate keys - so I'd prefer not to have this enabled by default.

For sure. It's also an application-specific handler. That said, it could easily be generalised to just throw an exception when detecting a duplicate.

Looking forward to seeing the final version of #309; let me know if you want a review of it.

lossyrob (Member) commented May 1, 2021

@l0b0 this is implemented in #309, particularly d0c6cc5. I still need to add a test case (or accept a PR into that branch with a test case :-)), but you should be able to do this now:

import pystac as ps
from pystac.stac_io import DefaultStacIO, DuplicateKeyReportingMixin

class ReportingStacIO(DefaultStacIO, DuplicateKeyReportingMixin):
    """DefaultStacIO that raises on duplicate JSON object keys
    instead of silently keeping the last value."""
    pass

ps.StacIO.set_default(ReportingStacIO)
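
For context, a hedged usage sketch (the file path is hypothetical; current pystac raises pystac.DuplicateObjectKeyError from this mixin, though the exact exception type may have differed in that branch):

try:
    item = ps.Item.from_file("item-with-duplicate-assets.json")
except ps.DuplicateObjectKeyError as err:
    # Reading a STAC file with duplicate keys now fails loudly
    print(f"Rejected STAC file: {err}")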

Thanks for reviewing!
