Move the ISISDATA and ISISTESTDATA store from a distributed download to a centralized data store with an HTTPS endpoint.
Motivation
To comply with efforts to transition USGS/NASA data to the cloud, we moved to Amazon S3 to host a subset of ISISDATA. S3 has many benefits for hosting static content in the cloud, but Amazon charges every time data is downloaded from it. Therefore, we decided to leverage data already hosted publicly by NAIF, JAXA, and ESA, minimizing costs to the USGS by not hosting redundant information. This came with some setbacks:
User download sizes increased dramatically because the public sources contain extra files.
Redundant files now exist because the ISISDATA structure was originally planned around symlinks (missions often have their own copies of kernels), and S3 does not support symlinks.
Having ISISDATA be merged from multiple sources has been very confusing for both users and developers.
Files can disappear from public data stores, causing runtime errors for users.
There have been attempts to rectify some of these problems:
Using filters to prevent unnecessary files from downloading. This is cumbersome to maintain, as the ignore list will only grow, and a filter meant for one mission can unintentionally affect others.
Using redirects in place of symlinks. These run counter to how S3 works and require more custom code in downloadIsisData.py.
Training new devs to maintain the ISISDATA store. Even after training, there still seems to be a lot of confusion.
Addressing missing files by hosting archived versions ourselves in S3.
We can more permanently rectify these issues by moving to a more centralized solution using what we learned about AWS in the previous implementation to both support easier downloads from a centralized source and keep costs down for USGS.
Proposed Solution / Explanation
Terms used in this explanation:
AWS Simple Storage Service (S3)
AWS solution for storing key/value pairs. Although it looks like a directory system, it is not a full-featured filesystem. It is useful for storing files in a publicly accessible way using a structure similar to a filesystem. Amazon charges every time data is moved from S3 to some endpoint, such as when a user downloads data.
S3 objects are stored in groups called buckets (e.g., ISISDATA is stored in a bucket called isis_data).
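To illustrate why S3 only looks like a directory tree, here is a minimal sketch in plain Python (no AWS calls). It models a bucket as a flat mapping of keys to bytes and emulates a "folder" listing by matching on a prefix and a delimiter, which is conceptually how S3 listings work. The keys and the `list_prefix` helper are made up for illustration.

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Emulate an S3-style listing over a flat key space.

    Returns (objects, common_prefixes): keys directly under `prefix`,
    and the "sub-folders" implied by keys that go deeper.
    """
    objects, common = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Deeper key: report only the next path segment as a prefix.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common)


# Hypothetical bucket contents: there are no directories, only full keys.
bucket = {
    "base/kernels/lsk/naif0012.tls": b"...",
    "base/kernels/pck/pck00009.tpc": b"...",
    "lro/kernels/lsk/naif0012.tls": b"...",  # a second full copy, not a link
}

print(list_prefix(bucket, "base/kernels/"))
```

Because every key is just a string, there is nothing for a symlink to point at, so a kernel shared by two missions has to be stored twice or handled with extra logic.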
AWS CloudFront
AWS solution for creating a Content Delivery Network (CDN). A common use case is to cache an S3 bucket. This allows for a fast HTTPS connection to the bucket's contents without paying for an S3 download on every request, only when the bucket is updated (e.g., the ISISDATA public URL is updated to have new LRO kernels).
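The caching idea can be sketched with a toy pull-through cache. This is not the CloudFront API and it does not model pricing, TTL expiry, or multiple edge locations. It only shows that the origin (a dict standing in for the S3 bucket) is read on the first request for a key and again after that key is invalidated. The class and method names are hypothetical.

```python
class PullThroughCache:
    """Toy model of a CDN sitting in front of an origin store."""

    def __init__(self, origin):
        self.origin = origin      # dict standing in for the S3 bucket
        self.cache = {}           # what the CDN currently holds
        self.origin_reads = 0     # how often the origin was actually hit

    def get(self, key):
        if key not in self.cache:
            # Cache miss: fetch once from the origin and remember it.
            self.origin_reads += 1
            self.cache[key] = self.origin[key]
        return self.cache[key]

    def invalidate(self, key):
        # Called after the origin object changes, e.g. new kernels uploaded.
        self.cache.pop(key, None)


origin = {"lro/kernels/ck/example.bc": b"v1"}  # illustrative key
cdn = PullThroughCache(origin)
for _ in range(3):
    cdn.get("lro/kernels/ck/example.bc")
print(cdn.origin_reads)
```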
AWS Elastic File System (EFS)
AWS solution for hosting a shareable drive that has no maximum size, as it grows elastically with the size of your data. Unlike S3 buckets, there is no easy way to expose it publicly. This is useful for mounting internally to live services that need fast access to the data (e.g., SpiceServer).
How these components solve the problem.
I propose we update the process to do the following in order:
We use the existing code in downloadIsisData.py to generate a superset of data from NAIF, ESA, JAXA, etc., into an S3 bucket.
We filter this S3 bucket using existing software in ISIS or SpiceQL (both generate a kernel database) to determine which kernels are used in USGS software. The reduced dataset is stored in another bucket location.
A new CloudFront distribution caches the reduced dataset and provides a URL endpoint (possibly astrogeology.usgs.gov/isisdata/ and astrogeology.usgs.gov/isistestdata/).
Users then download from this source using rclone, rsync, downloadIsisData.py, etc.
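Since the last step only requires plain HTTPS, any client that can fetch a URL would work. Below is a minimal sketch using only the Python standard library. The base URL is the one the proposal lists as a possibility, not a live endpoint, and the relative path layout and both helper names are assumptions for illustration.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import quote

# Assumed endpoint; the proposal only says "possibly" this URL.
BASE_URL = "https://astrogeology.usgs.gov/isisdata/"


def build_url(relative_path, base_url=BASE_URL):
    """Join the endpoint and a path relative to the ISISDATA root."""
    return base_url.rstrip("/") + "/" + quote(relative_path.lstrip("/"))


def download(url, dest_root, relative_path):
    """Stream one file into dest_root, recreating the relative layout."""
    target = Path(dest_root) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Illustrative path only; nothing is fetched until download() is called.
print(build_url("base/kernels/lsk/naif0012.tls"))
```

For bulk transfers, tools such as rclone or rsync would likely be the better fit; the point here is only that no S3-specific client is needed.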
How this will impact ISIS users
Data Reduction
Leverage the existing USGS metadata used to search for kernels when running spiceinit or ALE's isd_generate to determine which kernels the software uses. Kernels that the software never accesses to generate camera models or ISDs are not included in the public bucket. If we use SpiceQL's inventory system for this, it would also eliminate the need to duplicate kernels in different mission folders, since SpiceQL's database is agnostic to filesystem structure.
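As a rough sketch of the reduction step, the function below keeps only files whose names appear in a set of "used" kernels and stores each distinct file content once. It does not call ISIS or SpiceQL: `used_names` is a stand-in for whatever the kernel database reports, and the keys and byte contents are made up.

```python
import hashlib
from pathlib import PurePosixPath


def reduce_dataset(superset, used_names):
    """Return the subset of `superset` (key -> bytes) worth publishing.

    A file is kept if its name is in `used_names` and no file with
    identical content has already been kept under another key.
    """
    reduced = {}
    seen_digests = set()
    for key in sorted(superset):
        if PurePosixPath(key).name not in used_names:
            continue  # never referenced by the kernel database
        digest = hashlib.sha256(superset[key]).hexdigest()
        if digest in seen_digests:
            continue  # same content already kept elsewhere
        seen_digests.add(digest)
        reduced[key] = superset[key]
    return reduced


# Hypothetical superset mirrored from public sources.
superset = {
    "base/kernels/lsk/naif0012.tls": b"LSK",
    "lro/kernels/lsk/naif0012.tls": b"LSK",    # duplicate of the base copy
    "lro/kernels/misc/notes.txt": b"unused",   # not in the kernel database
}
print(sorted(reduce_dataset(superset, {"naif0012.tls"})))
```

Dropping the duplicate only works if lookups go through a database that does not depend on the folder layout, which is the role SpiceQL's inventory would play.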
Option to not use downloadIsisData.py
Other clients will work because you will no longer download directly from an S3 bucket. The script will be simplified but still distributed for users who have grown accustomed to using it.
Less likely for desyncs between ISISDATA and public sources
Missing-file problems should occur less often.
Downloading ISISTESTDATA will have the same process as ISISDATA since they can be hosted in the same place.
To have kernels included in the system, an update to SpiceQL will be necessary.
For the kernel database to be updated, we would need a change to SpiceQL's configs.
Drawbacks
We will have to maintain files that are also maintained by NAIF and others to ensure they match what USGS software needs. If a new mission needs to be added, we will have to update our software to detect the new kernels before they can be distributed.
Need new SOPs for adding support for missions not maintained by USGS Astrogeology.
Alternatives
Maintain the current system.
Unresolved Questions
Can we host only the latest versions of kernels? ISISDATA contains multiple versions of the same kernel; can we leave the older ones out of the downloads?
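If hosting only the latest versions turns out to be acceptable, the selection could look something like the sketch below. It assumes, purely for illustration, that a version is a run of digits at the end of the file name just before the extension (as in naif0011.tls and naif0012.tls). Real kernel naming varies by mission, so an actual implementation would need per-mission rules or the kernel database itself.

```python
import re

# Assumed naming scheme: <stem><digits><.ext>, e.g. naif0012.tls
_VERSIONED = re.compile(r"^(?P<stem>.*?)(?P<version>\d+)(?P<ext>\.[^.]+)$")


def latest_versions(names):
    """Keep the highest-numbered file per (stem, extension) group."""
    best = {}
    for name in names:
        match = _VERSIONED.match(name)
        if match is None:
            # No trailing version number: treat the file as its own group.
            best[(name, "")] = (0, name)
            continue
        group = (match["stem"], match["ext"])
        version = int(match["version"])
        if group not in best or version > best[group][0]:
            best[group] = (version, name)
    return sorted(name for _, name in best.values())


names = ["naif0011.tls", "naif0012.tls", "pck00009.tpc", "pck00010.tpc", "readme.txt"]
print(latest_versions(names))
```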
Future Possibilities
Versioned ISISDATA: do we want to update our release SOPs to have versioned data?