Storage profiles #148

Closed
mojodna opened this issue Aug 14, 2018 · 14 comments

@mojodna
Collaborator

mojodna commented Aug 14, 2018

Static STAC Catalog entries include a provider element which isn't particularly well-specified. The current example is S3-specific:

    "provider": {
        "scheme": "s3",
        "region": "us-east-1",
        "requesterPays": "true"
    }

I propose that storage profiles be defined for each type of object store (or just plain HTTP) so that keys have more meaningful values. The above would then be:

    "provider": {
      "s3:region": "us-east-1",
      "s3:requesterPays": true
    }

However, this presumes that all Items beneath such a parent catalog share the same attributes, which may not be correct. Instead, perhaps these should be asset properties and the provider block omitted:

    "assets": [
      {
        "href": "s3://.../...",
        "type": "...",
        "name": "...",
        "s3:region": "us-east-1", // can be inferred from the bucket name, but explicit is better than implicit
        "s3:requesterPays": true
      }
    ]

Proposed keys:

  • s3:region - name of the AWS region that a bucket is in, e.g. us-west-2 (enum)
  • s3:requesterPays - whether the S3 Requester Pays feature is enabled for this item; implies that AWS credentials must be used to access it (boolean)
  • s3:public - whether the item can be retrieved without S3 credentials; this allows clients to create HTTP(S) URLs and make items downloadable directly (boolean)
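
A minimal sketch of how a client might act on these keys, assuming the asset is a plain dict shaped like the example above. The function name `resolve_asset` is hypothetical, and the HTTPS form is the standard S3 virtual-hosted-style URL; none of this is part of any spec.

```python
from urllib.parse import urlparse


def resolve_asset(asset):
    """Return (url, needs_credentials) for an asset using the proposed s3:* keys.

    Hypothetical helper: the key names come from the proposal above, the
    behaviour for missing keys is an assumption.
    """
    parsed = urlparse(asset["href"])
    if parsed.scheme != "s3":
        # Not an S3 asset: hand the href back unchanged.
        return asset["href"], False

    requester_pays = asset.get("s3:requesterPays", False)
    public = asset.get("s3:public", False)

    if public and not requester_pays:
        bucket = parsed.netloc
        key = parsed.path.lstrip("/")
        region = asset.get("s3:region")
        # Assumption: fall back to the global endpoint when no region is given.
        host = f"{bucket}.s3.{region}.amazonaws.com" if region else f"{bucket}.s3.amazonaws.com"
        return f"https://{host}/{key}", False

    # Requester Pays or non-public objects need signed requests, so keep the s3:// href.
    return asset["href"], True
```

Treating absent keys as `False` is the conservative choice here: a client only builds a plain HTTPS link when the catalog explicitly says the object is public.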
@matthewhanson
Collaborator

@mojodna a few things:

  • We discussed having object fields but originally decided to keep things simple. Currently we don't have any fields that are objects under properties, which simplifies searching. Object fields, such as links and assets, were moved out of properties to the top level for this reason. I think there are multiple fields that would make sense as object types, but we need to define how a user searches these fields.

  • This potentially duplicates a lot of info across items. In sat-api, collections don't include assets. Nothing specifically excludes parts of assets from being defined there, but we need to specify how these would end up getting merged. I will start a new ticket for this.

  • Looks like we need to specify an s3 extension that adds these additional fields, and likewise create one for google storage.
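
One possible merge rule, sketched under assumptions: shared storage fields live in a dict at the collection level, assets are a list as in the example above, and item-level values win over collection-level ones. The function name `merge_asset_defaults` is hypothetical and the rule itself is not something the spec defines.

```python
import copy


def merge_asset_defaults(collection_defaults, item):
    """Return a copy of `item` with collection-level fields filled into each asset.

    Assumed rule: a key already present on the asset is never overwritten.
    """
    merged = copy.deepcopy(item)  # leave the caller's item untouched
    for asset in merged.get("assets", []):
        for key, value in collection_defaults.items():
            asset.setdefault(key, value)
    return merged


defaults = {"s3:region": "us-east-1", "s3:requesterPays": True}
item = {
    "assets": [
        {"href": "s3://.../a", "s3:region": "us-west-2"},  # overrides the region
        {"href": "s3://.../b"},                            # inherits everything
    ]
}
merged = merge_asset_defaults(defaults, item)
```

With this rule the duplication stays in the collection, and an item only repeats a field when it differs from the default.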

@jeffnaus
Contributor

Can this be extended to include the access control that private datasets come with? For example, DG can catalog assets like our images, but we will not be making these publicly accessible. Can I make my own custom "DG" provider? If so, what would something like this look like?

@m-mohr
Collaborator

m-mohr commented Aug 19, 2018

I like the idea of profiles for storage. We were thinking about storage for the datasets, too, but we had defined a fixed set, which I am not so happy with. What do you think @simonff? Would that be a more flexible way to go for datasets, too?

@simonff

simonff commented Aug 20, 2018

Can we first list the top possibilities for storage? S3, GCS (Google Cloud Storage), MS Azure Storage, FTP/HTTP/HTTPS, DG's GBDX, Earth Engine, ... what else?

@jeffnaus: how would DG's clients expect to see storage information?

Note that the dataset subteam seemed to agree that it's better to separate the notion of Provider (the author/producer of the dataset) from the notion of Host/Storage (the physical location where the bytes are stored). E.g., in the case of HTTP storage, the Provider section would point at the human-readable homepage of the dataset, while the Host section would contain download links.
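
To make the Provider/Host split concrete, here is a rough sketch. All class and field names are hypothetical and chosen only for illustration; they are not the fields the dataset subteam defined.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Provider:
    """Who authored or produced the dataset (hypothetical fields)."""
    name: str
    homepage: str  # human-readable landing page, not a download link


@dataclass
class Host:
    """Where the bytes physically live (hypothetical fields)."""
    name: str
    storage: str  # e.g. "http", "s3", "gcs"
    links: List[str] = field(default_factory=list)


@dataclass
class Dataset:
    provider: Provider
    hosts: List[Host] = field(default_factory=list)

    def download_links(self) -> List[str]:
        """Collect links from hosts only; the provider homepage is never included."""
        return [link for host in self.hosts for link in host.links]


dataset = Dataset(
    provider=Provider(name="Example Agency", homepage="https://example.com/dataset"),
    hosts=[Host(name="Example mirror", storage="http",
                links=["https://example.com/data/scene.tif"])],
)
```

Keeping the two apart means the same dataset can list several hosts without repeating or changing who produced it.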

@m-mohr
Collaborator

m-mohr commented Aug 20, 2018

I'm not sure whether we need much information for hosts such as DG's GBDX, Earth Engine and openEO as they don't offer data for direct download, but only for processing purposes within their own infrastructure. I don't really know what I would expect there for openEO. What would you expect there for GEE @simonff?

#164 has information on how the dataset subteam defined Hosts and Provider in the datasets as of now.

@simonff

simonff commented Aug 20, 2018

While EE etc. don't offer data for direct download, it still seems valuable to support listing of assets in some standard format to make future catalog viewers more broadly useful and provide compliance with open standards. (Note that EE can already export any assets into GeoTIFFs for download. Right now there is no foolproof way to export the exact original bytes with the exact original extent and projection - one would need to tune the export parameters just so. But we keep the original bytes and metadata internally, so reconstructing an exact copy of the input asset is not impossible.)

@m-mohr
Collaborator

m-mohr commented Aug 20, 2018

So you would have stored a STAC catalog and would point to its location, but the STAC items would not refer to any downloadable assets? One asset would need to be referenced though (required by the spec). These catalogs would probably be stored on S3, GCS, HTTP and would not need a native GEE profile for storage, right? Or did I not get your point?

@simonff

simonff commented Aug 20, 2018

Right, a static STAC catalog is a file or a family of files that can be browsed over HTTP, AFAIU. EE would not know about them, though EE might provide a STAC API exposing the same information as the static files.

Sure, it makes sense for each item to reference at least one asset, but why do these assets have to be downloadable? E.g., one could imagine some future tool comparing listings of Landsat collections in EE and in DG to see who has the most recent data.

@matthewhanson
Collaborator

Same thing with DG: you can't directly download initially, but after you order the data you can download it. I've implemented this for DG as a STAC item that gets updated locally with the download information once the user orders it.

But I don't see why a STAC item should be required to include any assets; perhaps all the assets are only retrievable via some function call. Granted, providers should be strongly encouraged to provide a thumbnail, but for some datasets even thumbnails aren't all that useful.

@m-mohr
Collaborator

m-mohr commented Aug 20, 2018

When I once asked about the requirement of at least one asset, I got these responses: https://gitter.im/SpatioTemporal-Asset-Catalog/Lobby?at=5ae2f6b61130fe3d36219434
Also, the catalog spec claims:

the point of the SpatioTemporal Asset Catalog is to be link to actual data, not to just reference metadata

@matthewhanson
Collaborator

My own thinking on this has changed after working with GBDX. There are likely to be even more platforms in the future where assets aren't directly downloadable but require API calls through a RESTful API or an API library.

@m-mohr
Collaborator

m-mohr commented Aug 20, 2018

Sure, there are GEE, openEO, GBDX and more to come. So should we open an issue to remove this restriction?

Edit: Did so: #187

@matthewhanson
Collaborator

+1

This was referenced Aug 20, 2018
@cholmes cholmes added the prio: should-have would be very good to have in the release label Aug 23, 2018
@cholmes cholmes added this to the 0.6.0-RC1 milestone Aug 24, 2018
m-mohr added a commit that referenced this issue Oct 5, 2018
@m-mohr m-mohr modified the milestones: 0.6.0-RC1, future Oct 9, 2018
@m-mohr m-mohr added new extension and removed stac-sprint-3-discuss prio: should-have would be very good to have in the release labels Jul 18, 2019
@m-mohr m-mohr linked a pull request Feb 19, 2021 that will close this issue
4 tasks
@cholmes cholmes modified the milestones: future, new extensions Feb 26, 2021
@matthewhanson
Collaborator

This functionality is implemented in the storage extension:
https://github.com/stac-extensions/storage
