From fd2609f4195687bd792efd9170679e2f3d271264 Mon Sep 17 00:00:00 2001 From: Default user Date: Fri, 16 Aug 2024 03:12:34 +0000 Subject: [PATCH 01/11] Modify positioning again and some minor text changes --- .../cloud-computing/01-cloud-computing.ipynb | 14 ++++++++++---- .../cloud-computing/02-cloud-data-access.ipynb | 10 ++++++---- .../03-cloud-optimized-data-access.ipynb | 12 +++++------- 3 files changed, 21 insertions(+), 15 deletions(-) diff --git a/book/tutorials/cloud-computing/01-cloud-computing.ipynb b/book/tutorials/cloud-computing/01-cloud-computing.ipynb index 43ed7d4..2ab47e3 100644 --- a/book/tutorials/cloud-computing/01-cloud-computing.ipynb +++ b/book/tutorials/cloud-computing/01-cloud-computing.ipynb @@ -9,7 +9,7 @@ "\n", "
\n", "\n", - "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n", + "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers (CSPs) such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n", "\n", "```{image} ./images/AWS_OurDataCenters_Background.jpg\n", ":width: 600px\n", @@ -20,7 +20,7 @@ "\n", ">Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS). ([source](https://aws.amazon.com/what-is-cloud-computing/))\n", "\n", - "This tutorial will focus on AWS services and terminology, but Google Cloud and Microsoft Azure offer the same services.\n", + "This tutorial will focus on AWS services and terminology, as AWS is the cloud service provider used for NASA's Earthdata Cloud (more on that later). But Google Cloud and Microsoft Azure offer the same services. If you're interested, you can read more about the history of AWS on Wikipedia.\n", "\n", ":::{dropdown} 🏋️ Exercise: How many CPUs and how much memory does your laptop have? 
And how does that compare with CryoCloud?\n", ":open:\n", @@ -40,9 +40,15 @@ "Tip: When logged into cryocloud, you can click the ![kernel usage icon](./images/tachometer-alt_1.png) icon on the far-right toolbar.\n", ":::\n", "\n", - "**What did you find?** It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n", "\n", - ":::{dropdown} So why would we want to use the cloud and not our personal computers?\n", + ":::{dropdown} What did you notice?\n", + " It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n", + "\n", + " On CryoCloud, you may have also noticed different numbers for available memory from what you see at the bottom of the jupyterhub interface, where you might see something like `Mem: 455.30 / 3798.00 MB`.\n", + ":::\n", + "\n", + "\n", + ":::{dropdown} Why would we want to use the cloud and not our personal computers?\n", " 1. Because cryocloud has all the dependencies you need.\n", " 2. Because cryocloud is \"close\" to the data (more on this later).\n", " 3. Because you can use larger and bigger machines in the cloud (more on this later).\n", diff --git a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb index f5a8b3f..46b9701 100644 --- a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb +++ b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb @@ -44,7 +44,7 @@ ":align: center\n", "```\n", "\n", - ":::{dropdown} 🏋️ Exercise: Datasets on Earthdata Cloud\n", + ":::{dropdown} 🏋️ Exercise: Which ICESat-2 Datasets are on Earthdata Cloud?\n", ":open:\n", "\n", "Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n", @@ -59,8 +59,8 @@ "Here are a likely few:\n", "1. Download data from a DAAC to your local machine.\n", "2. Download data from cloud storage to your local machine.\n", - "3. 
Login to a virtual machine in the cloud and download data from a DAAC (when would you do this?).\n", - "4. Login to a virtual machine in the cloud, like CryoCloud, and access data directly.\n", + "3. Login to a virtual machine in the cloud, like CryoCloud, and download data from a DAAC.\n", + "4. Login to a virtual machine in the cloud and access data directly using a cloud object protocol, like s3.\n", "\n", "```{image} ./images/different-modes-of-access.png\n", ":width: 1000px\n", @@ -70,7 +70,7 @@ ":::{dropdown} Which should you chose and why?\n", " You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n", " **You can only directly access (both partial reading and download) files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", - " The good news is that cryointhecloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", + " The good news is that CryoCloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", "\n", " Of course, you may still need to access datasets from on-prem servers as well.\n", "\n", @@ -78,6 +78,8 @@ " \n", ":::\n", diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 526aa22..bb6c5df 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -34,29 +34,27 @@ "\n", "## Anatomy of a structured data file\n", "\n", - "Below are 2 similar images showing the structure of an HDF5 file.\n", - "\n", "```{image} ./images/hdf5-structure-2.png\n", ":width: 450px\n", - ":align: left\n", + ":align: center\n", "```\n", "\n", "

A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.

\n", "\n", "```{image} ./images/hdf5-structure-1.jpg\n", ":width: 600px\n", - ":align: left\n", + ":align: center\n", "```\n", "\n", - "

Image source: https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5

\n", + "

Image source: https://www.neonscience.org/resources/learning-hub/tutorials/about-hdf5

\n", "\n", "We can optimize this structure for reading from cloud storage.\n", "\n", "
\n", "\n", - "## How do we accomplish cloud-optimization?\n", + "## When optimizing for the cloud, what structure should be used?\n", "\n", - "### An analogy - Moving away from home\n", + "### Brainstorming by way of a \"moving away from home\" analogy\n", "\n", "Imagine when you lived at home with your parents. Everything was right there when you needed it (like local file storage). Let's say you're about to move away to college (the cloud), but you have decided to backpack there and so you can't bring any of your belongings with you. You put everything in your parent's (infinitely large) garage (cloud object storage). Given you would need to have things shipped to you, would it be better to leave everything unpacked? To put everything all in one box? A few different boxes? And what would be the most efficient way for your parents to know where things were when you asked for them?\n", "\n", From d30b80a01f63226b67a1779a0306249d03defd7d Mon Sep 17 00:00:00 2001 From: Default user Date: Fri, 16 Aug 2024 03:34:02 +0000 Subject: [PATCH 02/11] A few more text changes --- .../03-cloud-optimized-data-access.ipynb | 4 ++-- .../cloud-computing/04-cloud-optimized-icesat2.ipynb | 11 ++++++++++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 09f7de9..9772556 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -52,7 +52,7 @@ "\n", "
\n", "\n", - "## When optimizing for the cloud, what structure should be used?\n", + "## How should we structure files for the cloud?\n", "\n", "### A \"moving away from home\" analogy\n", "\n", @@ -66,7 +66,7 @@ "\n", "You can actually make any common geospatial data formats (HDF5/NetCDF, GeoTIFF, LAS (LIDAR Aerial Survey)) \"cloud-optimized\" by:\n", "\n", - "1. Separate metadata from data and store it contiguously so it can be read with one request.\n", + "1. Separate metadata from data and store metadata contiguously so it can be read with one request.\n", "2. Store data in chunks, so the whole file doesn't have to be read to access a portion of the data, and it can be compressed.\n", "3. Make sure the chunks of data are not too small, so more data is fetched with each request.\n", "4. Make sure the chunks are not too large, which means more data has to be transferred and decompression takes longer.\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 995c139..de6d542 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -31,7 +31,7 @@ "\n", "## Geoparquet\n", "\n", - "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", + "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for a subset of the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", "\n", ":::{seealso} 
Resource on Geoparquet\n", "* https://guide.cloudnativegeo.org/geoparquet/\n", @@ -118,6 +118,15 @@ "m.set_view_state(zoom=2)\n", "m" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The End!\n", + "\n", + "What did you think? Have more questions? Come find me in slack or by email at aimee@ds.io." + ] } ], "metadata": { From ba86c37f077f57d3c5c4630de95094268f8b4ddc Mon Sep 17 00:00:00 2001 From: Default user Date: Fri, 16 Aug 2024 18:44:14 +0000 Subject: [PATCH 03/11] Add an image, some emojis, some definitions --- .../cloud-computing/01-cloud-computing.ipynb | 6 +- .../02-cloud-data-access.ipynb | 70 ++++++++++++++++-- .../03-cloud-optimized-data-access.ipynb | 25 +++++-- .../04-cloud-optimized-icesat2.ipynb | 9 ++- .../images/s3-bucket-with-objects.png | Bin 0 -> 18804 bytes 5 files changed, 89 insertions(+), 21 deletions(-) create mode 100644 book/tutorials/cloud-computing/images/s3-bucket-with-objects.png diff --git a/book/tutorials/cloud-computing/01-cloud-computing.ipynb b/book/tutorials/cloud-computing/01-cloud-computing.ipynb index 2ab47e3..31a1ccf 100644 --- a/book/tutorials/cloud-computing/01-cloud-computing.ipynb +++ b/book/tutorials/cloud-computing/01-cloud-computing.ipynb @@ -5,7 +5,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# What is cloud computing?\n", + "# ☁️ What is cloud computing?\n", "\n", "
\n", "\n", @@ -41,14 +41,14 @@ ":::\n", "\n", "\n", - ":::{dropdown} What did you notice?\n", + ":::{dropdown} 🤔 What did you notice?\n", " It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n", "\n", " On CryoCloud, you may have also noticed different numbers for available memory from what you see at the bottom of the jupyterhub interface, where you might see something like `Mem: 455.30 / 3798.00 MB`.\n", ":::\n", "\n", "\n", - ":::{dropdown} Why would we want to use the cloud and not our personal computers?\n", + ":::{dropdown} 🤔 Why do we want to use the cloud instead of our personal computers?\n", " 1. Because cryocloud has all the dependencies you need.\n", " 2. Because cryocloud is \"close\" to the data (more on this later).\n", " 3. Because you can use larger and bigger machines in the cloud (more on this later).\n", diff --git a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb index 46b9701..3f1c983 100644 --- a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb +++ b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb @@ -6,7 +6,7 @@ "id": "9310f818-bbfe-4cb3-8e84-f21beaca334e", "metadata": {}, "source": [ - "# Cloud Data Access\n", + "# 🪣 Cloud Data Access\n", "\n", "
\n", "\n", @@ -35,16 +35,23 @@ "\n", "## What is cloud object storage?\n", "\n", - "Cloud object storage stores and manages unstructured data in a flat structure (as opposed to a hierarchy as with file storage). Object storage is distinguished from a database, which requires software (a database management system) to store data and often has connection limits. Object storage is distinct from local file storage, because you access cloud object storage over a network connection, whereas local file storage is accessed by the central processing unit (CPU) of whatever server you are using.\n", + "Cloud object storage stores unstructured data in a flat structure, called a bucket in AWS, where each object is identified with a unique key. The simple design of cloud object storage enables near infinite scalability. Object storage is distinguished from a database which requires database management system software to store data and often has connection limits. Object storage is distinct from file storage because files are stored in a hierarchical format and a network is not always required. Read more about cloud object storage and how it is different from other types of storage [in the AWS docs](https://aws.amazon.com/what-is/object-storage/).\n", "\n", - "Cloud object storage is accessible using HTTP or a cloud-object storage protocol, such as AWS' Simple Storage Service (S3). Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", + "```{image} ./images/s3-bucket-with-objects.png\n", + ":width: 150px\n", + ":align: left\n", + "```\n", + "\n", + "Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link, but more typically you will use the cloud object storage protocol, such as `s3://path/to/file.text` along with some credentials to access the data. 
Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", + "\n", + "Popular libraries to access data on S3 are [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and [s3fs](https://s3fs.readthedocs.io/).\n", "\n", "```{image} ./images/cloud-and-local.png\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", - ":::{dropdown} 🏋️ Exercise: Which ICESat-2 Datasets are on Earthdata Cloud?\n", + ":::{dropdown} 🏋️‍♀️ Exercise: Which ICESat-2 Datasets are on Earthdata Cloud?\n", ":open:\n", "\n", "Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n", @@ -67,9 +74,9 @@ ":align: center\n", "```\n", "\n", - ":::{dropdown} Which should you chose and why?\n", + ":::{dropdown} 🤔 Which should you chose and why?\n", " You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n", - " **You can only directly access (both partial reading and download) files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", + " **For data on Earthdata Cloud, you can only directly access files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", " The good news is that CryoCloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", "\n", " Of course, you may still need to access datasets from on-prem servers as well.\n", @@ -77,13 +84,62 @@ "

Caveats

\n", "
    \n", "
  • Direct S3 access could refer to either copying a whole file using the S3 protocol OR using lazy loading and reading just a portion of the file and the latter usually only performs well for cloud-optimized files.
  • \n", - "
  • Having local file system access will always be faster than reading all or parts of a file over a network, even in region (although S3 access is getting blazingly fast!) You can move data files onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data ever time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).
  • \n", + "
  • Having local file system access will always be faster than reading all or part of a file over a network, even in region (although S3 access is getting blazingly fast!). But you have to download the data first, which is slow. You can also download objects from object storage onto a file system mounted on a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data every time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).
  • \n", "\n", " When might you use option 3?\n", " \n", "
\n", ":::\n", "\n", + ":::{dropdown} 🏋️‍♀️ Bonus Exercise: Comparing time to copy data from S3 to CryoCloud with time to download over HTTP to your personal machine\n", + "\n", + "Note: You will need URS credentials handy to do this exercise. You will need to store them in a local ~/.netrc file as instructed [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget)\n", + "\n", + "```python\n", + "earthaccess.login()\n", + "\n", + "aws_creds = earthaccess.get_s3_credentials(daac='NSIDC')\n", + "\n", + "s3 = s3fs.S3FileSystem(\n", + " key=aws_creds['accessKeyId'],\n", + " secret=aws_creds['secretAccessKey'],\n", + " token=aws_creds['sessionToken'],\n", + ")\n", + "\n", + "results = earthaccess.search_data(\n", + " short_name=\"ATL03\",\n", + " cloud_hosted=True,\n", + " count=1\n", + ")\n", + "\n", + "direct_link = results[0].data_links(access=\"direct\")[0]\n", + "direct_link\n", + "```\n", + "\n", + "Now time the download:\n", + "```python\n", + "%%time\n", + "s3.download(direct_link, lpath=direct_link.split('/')[-1])\n", + "```\n", + "\n", + "Compare this with the time to download from HTTPS to your personal machine.\n", + "\n", + "First, get the equivalent HTTPS URL:\n", + "\n", + "```python\n", + "http_link = results[0].data_links()[0]\n", + "http_link\n", + "```\n", + "\n", + "Then, copy and paste the following into a shell prompt, replacing `http_link` with the string from the last command. You will need to follow the instructions [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget) for this to work!\n", + "\n", + "```bash\n", + "$ time curl -O -b ~/.urs_cookies -c ~/.urs_cookies -L -n {http_link}\n", + "```\n", + "\n", + "For me, the first option, direct, in-region access took 11.1 seconds and HTTPS to personal machine took 1 minute and 48 seconds. 
The second value will depend on your wifi network.\n", + ":::\n", + "\n", "## Cloud vs Local Storage\n", "\n", ":::{list-table}\n", diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 9772556..effd579 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -4,16 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Cloud-Optimized Data Access\n", + "# 🔧 Cloud-Optimized Data Access\n", "\n", "
\n", "\n", "Recall from the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb) that cloud object storage is accessed over the network. Local file storage access will always be faster but there are limitations. This is why the design of file formats in the cloud requires more consideration than local file storage.\n", "\n", - "## 🏋️ Exercise\n", - "\n", - ":::{dropdown} What are some limitations of local file storage?\n", - "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb).\n", + ":::{dropdown} 🤔 What are some limitations of local file storage?\n", + "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb#cloud-vs-local-storage).\n", ":::\n", "\n", "## Why you should care\n", @@ -22,7 +20,7 @@ "\n", "## What are we optimizing for and why?\n", "\n", - "The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** by:\n", + "The \"optimize\" in cloud-optimized is to **minimize data [latency](<#latency>)** and **maximize [throughput](<#throughput>)** by:\n", "\n", "* Making as few requests as possible;\n", "* Making even less for metadata, preferably only one; and\n", @@ -72,9 +70,20 @@ "4. Make sure the chunks are not too large, which means more data has to be transferred and decompression takes longer.\n", "5. Compress these chunks so there is less data to transfer over the network.\n", "\n", - ":::{note} Lazy loading\n", "\n", - "**Separating metadata from data supports lazy loading, which is key to working quickly when data is in the cloud.** Libraries, such as xarray, first read the metadata. They defer actually reading data until it's needed for analysis. When a computation of the data is called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the chunks required. This is also called \"lazy loading\" data. 
See also [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", + "## Glossary\n", + "\n", + "### latency\n", + "The time between when data is sent to when it is received. [Read more](https://aws.amazon.com/what-is/latency/).\n", + "\n", + "### throughput\n", + "The amount of data that can be transferred over a given time. [Read more](https://en.wikipedia.org/wiki/Network\\_throughput).\n", + "\n", + "
\n", + "\n", + ":::{note} 🛋️ 🥔 Lazy loading\n", + "\n", + "**Lazy loading** is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", "\n", ":::\n", "\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index de6d542..3b997f2 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -10,7 +10,7 @@ "tags": [] }, "source": [ - "# Cloud-Optimized ICESat-2\n", + "# 🧊 Cloud-Optimized ICESat-2\n", "\n", "## Cloud-Optimized vs Cloud-Native\n", "\n", @@ -31,11 +31,14 @@ "\n", "## Geoparquet\n", "\n", + ">Apache Parquet is a powerful column-oriented data format, built from the ground up to as a modern alternative to CSV files. 
GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.\n", + "\n", + "From [https://geoparquet.org/](https://geoparquet.org/)\n", + "\n", "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for a subset of the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", "\n", - ":::{seealso} Resource on Geoparquet\n", + ":::{seealso} Resources on Geoparquet\n", "* https://guide.cloudnativegeo.org/geoparquet/\n", - "* https://geoparquet.org/\n", ":::\n", "\n", "## Demo" diff --git a/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png new file mode 100644 index 0000000000000000000000000000000000000000..91a02291f1affa2954190469e72a837b65f8ef36 Binary files /dev/null and b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png differ
zcn8$cU|_A{b&QffgT&0i@={;XA=#G=_*9Bl4#SvXu0%(&mnJ+{gXiSnVa!OFfyZ2p z=4WbibfuCGmcqy3$XXnBeWM~BAv5V%mo>Q zVzukLI}C1uzCsd}m)EN?(IP%-pQPs44b0#XCwf>?kmd}0b9)9?HCdQ`Ypf0j3e2x# zx3$AL*qq;K(#XG+r3M`zlNf#y>Be;YSC>7_8u$CmXp6|QxL&P_cG|T)ziM7pz5wBJ zD{G;?Y2UnC)1CFy__H3}EyNnu9QxJb>r<+|vddhOmLfxMo9AAFSw&wB%3Luacb1B3 z{^O?h-t;R`igo|VOqF*|J%S`XxNhKmvd;HdZt|VlyC0lauQ%pVf-loWea)@)*uL}i z8kUQZ>p3Y!^bg7Efz;Iz{})dleP{4<;%ADqzUH*D)a^X>3k%23<#LUh1z5O~ghSsM z#w0z;5fL%zLJHq`)X!8_D^COssJ3e7jbd>BKT<+2u;k3pn;6U9nU5zhuam>YS&}b; z{YpRj1NTCMI-$+i9FQ?d%L!hTP>DvqQn;q)xL>~-_9p*se)~({&)4TqYUjtBMqX#y7G@GA`+6u4tKoTh1}=A_nkV=`xAd}U+p~g= z4$W{?C8Y%-1Ps9zyEj8y7#jr0%y{NQXV8XPR)d`R9$?dY>6msA0b^i4 z-Q%{i(j<_e*`ojAg7hSt41rJG;p!*N+qDM?RWLKyqLhKZ?1`p+L%5lr@YmnBcSr$0Py z^%kb$zi$hl=jx+~jV0=E4ev=zvpuXOd`;HN4~EFVfA$WHPs|N&Hc8(N-Sl>%1I`@1 zGlY3q#DN0z^zb&FN$L`n*m){1 z$<-mtOISn&_qV^LN>}_B{ZGDE+y73_LX{jk()zKHvQksMm@ck{yp-)vbWblM^F4h-lPMkL!+w|U!E<}#!Wt%Tjnc- zo^@i2yFc0Io4fM+9gHA#Jg?e&Zbu)qLkSjnR}NBXaal$X(Gf`{VbKYaN_KL1p`5&1 zzd6GP$wfQAkeqR9R}>{$@f~FcvCZkm13cO9&Qq;0A%;)r{XTezL($5Yk`^85Eu!bDC{-V-i+VCO9cjMzkJ3 zY8gF5pI}~d>Ck?&uqGn?x)hsZwUByXWIdhxl? 
zEp7EiCyRy4_eZ+in96nKD{T7Y13^?BE~RbH5fNdp4YVaf5T4n(pgtKpEW(38)j47c zFBLr(x9{4#1w|6e<18k>kdvN4-qRn=Ru?kYW}iEE*ZUv?&u>ggUBuL)r8lAZFA#8M z)3q=kd2=z-kDI(0>b8+mQUrIG15VVf)%-8gU9%lZUyO7m`Z~lb1l^ zpt(C(ls1OcM2 z)t_CfJBg8~4!Dn@-9yxN|)()3&INb$B5L&KDV|sG8b=O+0K5s zITX!#x#4$ouFWmT|FMNm!fh1P)JWbO?fR&KpbK|SqoHITf5#kOwIBOO-pJX|vX2sY z$-s-Y7!NzzoRh^3s24CgSK}>|`^iAh(kulyyYVHHkDsAy8 z8DvnMm>ay#B@_eQuGG(V{>eV}Y=9R{cj8A;=zozd9TeH(Ge#ntASiG^=;&I0Bk$(a z(=>hAx6-$bTJ?SAl5O`f;r9}c5^`M^cxvBjV~!wPvhmqs7T+-T@TRSRDV-Z zVaNOa9YADJ>4> zr96};E~D$`FuhH}grG5!7@CII<0qkyj`tgdTx{k}sjGzjK=}~8_G9U)-NE>d%uned z{2Kkm%(O-bThh>-=nuNoblyM^rrn#Rd`BZel4yrBvYs_cN~2D(PeIMCqSBW@j%4;g zw@ZcttmKPoa&6oC?V_;(jRMZSeac73WnBG(LR!R??|a={o|g%lD8WDEB^xqh2~vn1 zE2HM(i#2Ed7_Y9Dg|ULc0c-ZNf{Tjn{A{e)4{f8O*HEtnGy35v6)eIhsaimL$$z-Z z@_)EXtlXTkJFxJ!CFm%tc8MwG0X@F7ly!aTZAtyYqBAUCTE21L4I@Oakn;}| zcFNh9ETP8}(Xqj|NxUj(N?g&=ADvt@sD2xR`kq$+)h!QydtLbAPPgM1OC}WtTEOY3 ztibvr$$tT{luK9qV@V1$#tji?Yv9j|@9#l6aSo<8?A)^CGyv!x!mY`CsJd`oiJFA1`wxT7Ud-Gq0Ldq1sB)`CM3*-Y0|TV|`+otjErY<_9eFT_2M5%> ztRooIu)atgT)e@{Uqi))2wqQaxJ#wNZa$d}Q$b~&wLcBc|u!Tg%${6EV^6J7A7wGCzHdww~%Q6&*bH*#4$_1fac=F(#3 z?=D^rPxQ3V{RL1{S2s-7`xxb#3W@5)3^@}Tmw3R9EZEZPU99_5O7CAw9g9=R9c8RgTM`bF z9qIU<4_0ueE_Y;heu$;nQOB(H)yN)Ug8vu5(9@Tb&fbgjDc+xFL4O4_?#K0C_zxXS z#ns>{*C!<8P6ux+=*G*#IV|xm9Ko8|MA~Am072}(?k#kViia<6qy7O@YM4r6$V;JK z!`29bT{WVWxGoF6axhBJg7-=$!y1fv~nGWD>)7U5rDu6<_L&Y~+31ZKJVc zl5qczS;Myu$6fa;E}pTI(&sD{d7gxxJ|HK0EO>Zk=IdF-hDpZr(OBrRu3Al7p+~y+ zPo!v@d}t~O_`k7lPk?&$W}e^{_DuuSh`u@@%1M0|ffF@vh>6**$~RI#I9NeUX4}8v zJ}6?CTfk#V-J!v#iZneRD0-6qqbM9XvtGIWB+$r5sBLW3snZH`C71UI=DE;l$jSQy*Q6%H~+sV`Xdj@tqHqTYzjL@3V4Kyii*rFJ{=tyWpxRBy;BWN#`EMAzh`b{8f7eQ52J z9=S3<2VqL>j-0^dv0&b&*qbFGM8CB_%sTYb+YLAbhwn0A6C$Xy30SuV$w0DeM5S-R0oprK5& z2{y%~-Hq8z3T!45tcpqd%+QWWTSsOl6GW|hbCTBv6HnO!;eRth6@FGR9SmYBsA5_T zhK>V4?P=VU_Vd(u($9<;V!m-K;QQ%yJ1~>LOD-X<(^9ZzNjI)=mXgBXWy0mN#r_g9 z3JG#53L08Tf!Rbbi%&&?LVyC&mS~t1f+>7F`*Xctw-h%uVtzPmuYe#| zgeLf+l)C+_0^!CEeLb!FtaXaPzW7g+Z(BGsL(Ig!**9IAs0_B*1tB 
z!2hs{KQ(M??nz)IhH@BI1gFU^lE488hp@id4Z;h^$J!?xzvC9(q( zkb*UjE2OdBaVi++062??4xWbzc}ltxU~UC)pQ@_=ni#?6>snjG`^u*)km&N3f)yRF z5A|-UUtCL!kE@`wiEuJN_63@hr>77B?Jf$m1o=xQ3)^)F=#7T-3pLbLC~9RZhQUmSK&8m$jKI$6s$Om z*e$NQx`0Z{p>vD2DahHe`ti-WDpy!)b)gEeA zO^X3u%tVX8Jf=lfmgZB`4mk=moYFoS2 z_fC=yeOiYt1uMmeUp&PGlMt_5Fc&hA$3(ectVwwS#r&KP6DG=OPho?&1bB#tr7pa~*Ofic3wBg7aXV3@^!D?FkwMnE(W5I>ti zPddM6m*SVQmYx)>bQoowbG)vuE+eBzv1O`?buu_viOX5<jHJv%5=a9A zQn1oskY$RJK&lXsf|V*GDf5s7(tv;ztTY&8nW7|+Dg>lprOHUkJS2fMARq-R4F*}J pC<&wr0V!CiGLkY6Ngxdf{D04)%cN~aB>VsX002ovPDHLkV1lCa=Ij6f literal 0 HcmV?d00001 From b729298ecf97ea56feaf58ec753a4f14cf477a88 Mon Sep 17 00:00:00 2001 From: Default user Date: Fri, 16 Aug 2024 18:50:41 +0000 Subject: [PATCH 04/11] Fix links --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index effd579..0f35b54 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -11,7 +11,7 @@ "Recall from the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb) that cloud object storage is accessed over the network. Local file storage access will always be faster but there are limitations. 
This is why the design of file formats in the cloud requires more consideration than local file storage.\n", "\n", ":::{dropdown} 🤔 What are some limitations of local file storage?\n", - "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb#cloud-vs-local-storage).\n", + "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb).\n", ":::\n", "\n", "## Why you should care\n", @@ -20,7 +20,7 @@ "\n", "## What are we optimizing for and why?\n", "\n", - "The \"optimize\" in cloud-optimized is to **minimize data [latency](<#latency>)** and **maximize [throughput](<#throughput>)** by:\n", + "The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** (see glossary below) by:\n", "\n", "* Making as few requests as possible;\n", "* Making even less for metadata, preferably only one; and\n", From a6e354cb02c3d4a1e4476bca4bdceda3f699c355 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:33:54 +0000 Subject: [PATCH 05/11] A few other minor changes --- .../cloud-computing/01-cloud-computing.ipynb | 7 +++--- .../02-cloud-data-access.ipynb | 25 ++++++------------- .../03-cloud-optimized-data-access.ipynb | 2 +- .../04-cloud-optimized-icesat2.ipynb | 2 +- 4 files changed, 13 insertions(+), 23 deletions(-) diff --git a/book/tutorials/cloud-computing/01-cloud-computing.ipynb b/book/tutorials/cloud-computing/01-cloud-computing.ipynb index 31a1ccf..196897f 100644 --- a/book/tutorials/cloud-computing/01-cloud-computing.ipynb +++ b/book/tutorials/cloud-computing/01-cloud-computing.ipynb @@ -26,7 +26,7 @@ ":open:\n", "If you have your laptop available, open the terminal app and use the appropriate commands to determine CPU and memory.\n", "\n", - "
\n", + "
\n", "\n", "| Operating System (OS) | CPU command | Memory Command |\n", "|-----------------------|-----------------------------------------------------------------------------------|----------------------------|\n", @@ -48,12 +48,11 @@ ":::\n", "\n", "\n", - ":::{dropdown} 🤔 Why do we want to use the cloud instead of our personal computers?\n", + "## 🤔 So why do we want to use the cloud instead of our personal computers?\n", " 1. Because cryocloud has all the dependencies you need.\n", " 2. Because cryocloud is \"close\" to the data (more on this later).\n", - " 3. Because you can use larger and bigger machines in the cloud (more on this later).\n", + " 3. Because you can leverage more resources (storage, compute, memory) in the cloud.\n", " 4. **Having the dependencies, data, and runtime environment in the cloud can simplify reproducible science.**\n", - ":::\n", "\n", ":::{admonition} Takeaways\n", "\n", diff --git a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb index 3f1c983..9570342 100644 --- a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb +++ b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb @@ -42,7 +42,7 @@ ":align: left\n", "```\n", "\n", - "Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link, but more typically you will use the cloud object storage protocol, such as `s3://path/to/file.text` along with some credentials to access the data. Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", + "Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link to access data in cloud object storage, but more typically you will use the cloud object storage protocol, such as `s3://path/to/file.text` along with some credentials to access the data. 
Using the s3 protocol to access the data is commonly referred to as **direct access**. Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", "\n", "Popular libraries to access data on S3 are [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and [s3fs](https://s3fs.readthedocs.io/).\n", "\n", @@ -56,16 +56,15 @@ "\n", "Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n", "\n", - "* Which DAAC hosts ICESat-2 datasets?\n", "* How many ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?\n", + "* Which DAAC hosts ICESat-2 datasets?\n", ":::\n", "\n", "\n", "## There are different access patterns, it can be confusing! 🤯\n", "\n", - "Here are a likely few:\n", - "1. Download data from a DAAC to your local machine.\n", - "2. Download data from cloud storage to your local machine.\n", + "1. Download data from a DAAC to your personal machine.\n", + "2. Download data from cloud storage, say using HTTP, to your personal machine.\n", "3. Login to a virtual machine in the cloud, like CryoCloud, and download data from a DAAC.\n", "4. Login to a virtual machine in the cloud and access data directly using a cloud object protocol, like s3.\n", "\n", @@ -75,20 +74,12 @@ "```\n", "\n", ":::{dropdown} 🤔 Which should you chose and why?\n", - " You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n", - " **For data on Earthdata Cloud, you can only directly access files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", + " You should use option 4: direct access. This is both because it is fastest overall and because of $$. 
You can download files stored in an S3 bucket using HTTPS, but this is not recommended. It is slow and, more importantly, egress - files being downloaded outside of AWS services - is not free. **For data on Earthdata Cloud, you can use S3 direct access if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", " The good news is that CryoCloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", "\n", - " Of course, you may still need to access datasets from on-prem servers as well.\n", - "\n", - "

Caveats

\n", - "
    \n", - "
  • Direct S3 access could refer to either copying a whole file using the S3 protocol OR using lazy loading and reading just a portion of the file and the latter usually only performs well for cloud-optimized files.
  • \n", - "
  • Having local file system access will always be faster than reading all or part of a file over a network, even in region (although S3 access is getting blazingly fast!) But you have to download the data, which is slow. You can also download objects from object storage onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data ever time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).
  • \n", - "\n", - " When might you use option 3?\n", - " \n", - "
\n", + " Caveats:\n", + " - Not all datasets are on Earthdata Cloud, so you may still need to access datasets from on-prem servers as well.\n", + " - Having local file system access will always be faster than reading all or part of a file over a network, even in region (although S3 access is getting blazingly fast!) But you have to download the data, which is slow. You can also download objects from object storage onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data every time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).\n", ":::\n", "\n", ":::{dropdown} 🏋️‍♀️ Bonus Exercise: Comparing time to copy data from S3 to CryoCloud with time to download over HTTP to your personal machine\n", diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 0f35b54..20424db 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -10,7 +10,7 @@ "\n", "Recall from the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb) that cloud object storage is accessed over the network. Local file storage access will always be faster but there are limitations. 
This is why the design of file formats in the cloud requires more consideration than local file storage.\n", "\n", - ":::{dropdown} 🤔 What are some limitations of local file storage?\n", + ":::{dropdown} 🤔 What is one limitation of local file storage?\n", "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb).\n", ":::\n", "\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 3b997f2..935debe 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -17,7 +17,7 @@ "Recall from [03-cloud-optimized-data-access.ipynb](./03-cloud-optimized-data-access.ipynb) that we can make any HDF5 file cloud-optimized by restructuring the file so that all the metadata is in one place and chunks are \"not too big\" and \"not too small\". However, as users of the data, not archivers, we don't control how the file is generated and distributed, so if we're restructuring the data we might want to go with something even better - a **\"cloud-native\"** format.\n", "\n", ":::{important} Cloud-Native Formats\n", - "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access across multiple files. In other words, it is fast to open a large dataset and access just the parts of it that you need.\n", + "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access. Data and metadata are not always stored in the same object or file in order to maximize the amount of data that can be lazily loaded and queried. 
Some examples of cloud-native formats are [Zarr](https://zarr.dev/) and GeoParquet, which is discussed below.\n", ":::\n", "\n", ":::{warning}\n", From d4bdb61c5b0ae59d0b03dab3e5e7e93b35e2dc73 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:40:02 +0000 Subject: [PATCH 06/11] Bold attention admonition title --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 20424db..94f6023 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -26,8 +26,8 @@ "* Making even less for metadata, preferably only one; and\n", "* Using a file layout that simplifies accessing data for parallel reads.\n", "\n", - ":::{attention} A future without file formats\n", - "I like to imagine a day when we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", + ":::{attention}\n", + "**A future without file formats:** I imagine someday we won't have to think about file formats. 
The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", ":::\n", "\n", "## Anatomy of a structured data file\n", From 25537e392b1a5d5d29b093e24cc7d08f87f821cc Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:48:03 +0000 Subject: [PATCH 07/11] Change formatting --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 8 +++++--- .../cloud-computing/04-cloud-optimized-icesat2.ipynb | 3 +-- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 94f6023..6871579 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -26,8 +26,9 @@ "* Making even less for metadata, preferably only one; and\n", "* Using a file layout that simplifies accessing data for parallel reads.\n", "\n", - ":::{attention}\n", - "**A future without file formats:** I imagine someday we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", + ":::{important}\n", + "### A future without file formats\n", + "I imagine someday we won't have to think about file formats. 
The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", ":::\n", "\n", "## Anatomy of a structured data file\n", @@ -88,7 +89,8 @@ ":::\n", "\n", "\n", - ":::{attention} Opening Arguments\n", + ":::{important}\n", + "### Opening Arguments\n", "A few arguments used to open the dataset also make a huge difference, namely with how libraries, such as s3fs and h5py, cache chunks.\n", "\n", "For s3fs, use [`cache_type` and `block_size`](https://s3fs.readthedocs.io/en/latest/api.html?highlight=cache_type#s3fs.core.S3File).\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 935debe..4af441b 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -16,9 +16,8 @@ "\n", "Recall from [03-cloud-optimized-data-access.ipynb](./03-cloud-optimized-data-access.ipynb) that we can make any HDF5 file cloud-optimized by restructuring the file so that all the metadata is in one place and chunks are \"not too big\" and \"not too small\". However, as users of the data, not archivers, we don't control how the file is generated and distributed, so if we're restructuring the data we might want to go with something even better - a **\"cloud-native\"** format.\n", "\n", - ":::{important} Cloud-Native Formats\n", + "### Cloud-Native Formats\n", "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access. Data and metadata are not always stored in the same object or file in order to maximize the amount of data that can be lazily loaded and queried. 
Some examples of cloud-native formats are [Zarr](https://zarr.dev/) and GeoParquet, which is discussed below.\n", - ":::\n", "\n", ":::{warning}\n", "Generating cloud-native formats is non-trivial.\n", From fc1afcf10f3f122616da1619779456b7a71e5f21 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:53:05 +0000 Subject: [PATCH 08/11] change a few more admonitions --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 6871579..a8703b6 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -26,8 +26,7 @@ "* Making even less for metadata, preferably only one; and\n", "* Using a file layout that simplifies accessing data for parallel reads.\n", "\n", - ":::{important}\n", - "### A future without file formats\n", + ":::{admonition} A future without file formats\n", "I imagine someday we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", ":::\n", "\n", @@ -82,15 +81,14 @@ "\n", "
\n", "\n", - ":::{note} 🛋️ 🥔 Lazy loading\n", + ":::{admonition} 🛋️ 🥔 Lazy loading\n", "\n", "**Lazy loading** is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", "\n", ":::\n", "\n", "\n", - ":::{important}\n", - "### Opening Arguments\n", + ":::{admonition} Opening Arguments\n", "A few arguments used to open the dataset also make a huge difference, namely with how libraries, such as s3fs and h5py, cache chunks.\n", "\n", "For s3fs, use [`cache_type` and `block_size`](https://s3fs.readthedocs.io/en/latest/api.html?highlight=cache_type#s3fs.core.S3File).\n", From 3a961c545998173d15a2aa673fcbf5a4d36d9850 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:56:32 +0000 Subject: [PATCH 09/11] Reword --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index a8703b6..9d6acff 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -27,7 +27,7 @@ "* Using a file layout that simplifies accessing data for parallel reads.\n", "\n", ":::{admonition} A future without file formats\n", - 
"I imagine someday we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", + "I like to imagine a day when we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", ":::\n", "\n", "## Anatomy of a structured data file\n", From e377cbb54d57c86e71299ad3f924cff77a917666 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 03:59:48 +0000 Subject: [PATCH 10/11] Restyle lazy loading to be part of glossary --- .../cloud-computing/03-cloud-optimized-data-access.ipynb | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 9d6acff..5cc87c4 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -37,7 +37,7 @@ ":align: center\n", "```\n", "\n", - "

A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.

\n", + "

A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. It is also crucial to **lazy loading** (see glossary below). Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.

\n", "\n", "```{image} ./images/hdf5-structure-1.jpg\n", ":width: 600px\n", @@ -79,13 +79,10 @@ "### throughput\n", "The amount of data that can be transferred over a given time. [Read more](https://en.wikipedia.org/wiki/Network\\_throughput).\n", "\n", - "
\n", - "\n", - ":::{admonition} 🛋️ 🥔 Lazy loading\n", + "### lazy loading\n", "\n", - "**Lazy loading** is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", + "🛋️ 🥔 Lazy loading is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. 
See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", "\n", - ":::\n", "\n", "\n", ":::{admonition} Opening Arguments\n", From b8d8517bf01c574577df918a5d499c1de20de361 Mon Sep 17 00:00:00 2001 From: Default user Date: Sat, 17 Aug 2024 04:01:02 +0000 Subject: [PATCH 11/11] Add name --- book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 4af441b..37dcc71 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -127,7 +127,7 @@ "source": [ "# The End!\n", "\n", - "What did you think? Have more questions? Come find me in slack or by email at aimee@ds.io." + "What did you think? Have more questions? Come find me in slack (Aimee Barciauskas) or by email at aimee@ds.io." ] } ],