diff --git a/book/tutorials/cloud-computing/01-cloud-computing.ipynb b/book/tutorials/cloud-computing/01-cloud-computing.ipynb index 43ed7d4..196897f 100644 --- a/book/tutorials/cloud-computing/01-cloud-computing.ipynb +++ b/book/tutorials/cloud-computing/01-cloud-computing.ipynb @@ -5,11 +5,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# What is cloud computing?\n", + "# ☁️ What is cloud computing?\n", "\n", "
\n", "\n", - "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n", + "**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers (CSPs) such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n", "\n", "```{image} ./images/AWS_OurDataCenters_Background.jpg\n", ":width: 600px\n", @@ -20,13 +20,13 @@ "\n", ">Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS). ([source](https://aws.amazon.com/what-is-cloud-computing/))\n", "\n", - "This tutorial will focus on AWS services and terminology, but Google Cloud and Microsoft Azure offer the same services.\n", + "This tutorial will focus on AWS services and terminology, as AWS is the cloud service provider used for NASA's Earthdata Cloud (more on that later). But Google Cloud and Microsoft Azure offer the same services. If you're interested, you can read more about the history of AWS on Wikipedia.\n", "\n", ":::{dropdown} 🏋️ Exercise: How many CPUs and how much memory does your laptop have? And how does that compare with CryoCloud?\n", ":open:\n", "If you have your laptop available, open the terminal app and use the appropriate commands to determine CPU and memory.\n", "\n", - "
\n", + "
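If you would rather check from Python (for example, in a notebook cell on CryoCloud), here is a minimal sketch. It assumes the optional `psutil` package is available in your environment, which is not guaranteed; the equivalent terminal commands are in the table below.

```python
# Rough, cross-platform check of CPU count and total memory.
# Assumes psutil is installed (e.g. `pip install psutil`); os.cpu_count() is stdlib.
import os

import psutil

print(f"CPU cores: {os.cpu_count()}")
print(f"Total memory: {psutil.virtual_memory().total / 1e9:.1f} GB")
```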
\n", "\n", "| Operating System (OS) | CPU command | Memory Command |\n", "|-----------------------|-----------------------------------------------------------------------------------|----------------------------|\n", @@ -40,14 +40,19 @@ "Tip: When logged into cryocloud, you can click the ![kernel usage icon](./images/tachometer-alt_1.png) icon on the far-right toolbar.\n", ":::\n", "\n", - "**What did you find?** It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n", "\n", - ":::{dropdown} So why would we want to use the cloud and not our personal computers?\n", + ":::{dropdown} 🤔 What did you notice?\n", + " It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n", + "\n", + " On CryoCloud, you may have also noticed different numbers for available memory from what you see at the bottom of the jupyterhub interface, where you might see something like `Mem: 455.30 / 3798.00 MB`.\n", + ":::\n", + "\n", + "\n", + "## 🤔 So why do we want to use the cloud instead of our personal computers?\n", " 1. Because cryocloud has all the dependencies you need.\n", " 2. Because cryocloud is \"close\" to the data (more on this later).\n", - " 3. Because you can use larger and bigger machines in the cloud (more on this later).\n", + " 3. Because you can leverage more resources (storage, compute, memory) in the cloud.\n", " 4. **Having the dependencies, data, and runtime environment in the cloud can simplify reproducible science.**\n", - ":::\n", "\n", ":::{admonition} Takeaways\n", "\n", diff --git a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb index f5a8b3f..9570342 100644 --- a/book/tutorials/cloud-computing/02-cloud-data-access.ipynb +++ b/book/tutorials/cloud-computing/02-cloud-data-access.ipynb @@ -6,7 +6,7 @@ "id": "9310f818-bbfe-4cb3-8e84-f21beaca334e", "metadata": {}, "source": [ - "# Cloud Data Access\n", + "# 🪣 Cloud Data Access\n", "\n", "
\n", "\n", @@ -35,51 +35,100 @@ "\n", "## What is cloud object storage?\n", "\n", - "Cloud object storage stores and manages unstructured data in a flat structure (as opposed to a hierarchy as with file storage). Object storage is distinguished from a database, which requires software (a database management system) to store data and often has connection limits. Object storage is distinct from local file storage, because you access cloud object storage over a network connection, whereas local file storage is accessed by the central processing unit (CPU) of whatever server you are using.\n", + "Cloud object storage stores unstructured data in a flat structure, called a bucket in AWS, where each object is identified with a unique key. The simple design of cloud object storage enables near infinite scalability. Object storage is distinguished from a database which requires database management system software to store data and often has connection limits. Object storage is distinct from file storage because files are stored in a hierarchical format and a network is not always required. Read more about cloud object storage and how it is different from other types of storage [in the AWS docs](https://aws.amazon.com/what-is/object-storage/).\n", "\n", - "Cloud object storage is accessible using HTTP or a cloud-object storage protocol, such as AWS' Simple Storage Service (S3). Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", + "```{image} ./images/s3-bucket-with-objects.png\n", + ":width: 150px\n", + ":align: left\n", + "```\n", + "\n", + "Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link to access data in cloud object storage, but more typically you will use the cloud object storage protocol, such as `s3://path/to/file.text` along with some credentials to access the data. Using the s3 protocol to access the data is commonly referred to as **direct access**. Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n", + "\n", + "Popular libraries to access data on S3 are [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and [s3fs](https://s3fs.readthedocs.io/).\n", "\n", "```{image} ./images/cloud-and-local.png\n", ":width: 500px\n", ":align: center\n", "```\n", "\n", - ":::{dropdown} 🏋️ Exercise: Datasets on Earthdata Cloud\n", + ":::{dropdown} 🏋️‍♀️ Exercise: Which ICESat-2 Datasets are on Earthdata Cloud?\n", ":open:\n", "\n", "Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n", "\n", - "* Which DAAC hosts ICESat-2 datasets?\n", "* How many ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?\n", + "* Which DAAC hosts ICESat-2 datasets?\n", ":::\n", "\n", "\n", "## There are different access patterns, it can be confusing! 🤯\n", "\n", - "Here are a likely few:\n", - "1. Download data from a DAAC to your local machine.\n", - "2. Download data from cloud storage to your local machine.\n", - "3. Login to a virtual machine in the cloud and download data from a DAAC (when would you do this?).\n", - "4. Login to a virtual machine in the cloud, like CryoCloud, and access data directly.\n", + "1. Download data from a DAAC to your personal machine.\n", + "2. 
Download data from cloud storage, say using HTTP, to your personal machine.\n", + "3. Login to a virtual machine in the cloud, like CryoCloud, and download data from a DAAC.\n", + "4. Login to a virtual machine in the cloud and access data directly using a cloud object protocol, like s3.\n", "\n", "```{image} ./images/different-modes-of-access.png\n", ":width: 1000px\n", ":align: center\n", "```\n", "\n", - ":::{dropdown} Which should you chose and why?\n", " You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n", " **You can only directly access (both partial reading and download) files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", " The good news is that cryointhecloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", + ":::{dropdown} 🤔 Which should you choose, and why?\n", " You should use option 4: direct access. This is both because it is fastest overall and because of $$. You can download files stored in an S3 bucket using HTTPS, but this is not recommended. It is slow and, more importantly, egress (files being downloaded outside of AWS services) is not free. **For data on Earthdata Cloud, you can use S3 direct access if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n", " The good news is that CryoCloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n", "\n", " Caveats:\n", " - Not all datasets are on Earthdata Cloud, so you may still need to access datasets from on-prem servers as well.\n", " - Having local file system access will always be faster than reading all or part of a file over a network, even in region (although S3 access is getting blazingly fast!). But you have to download the data first, which is slow. You can also download objects from object storage onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data every time), cost (e.g. storage volumes can be more expensive than object storage), and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/), which is even more pricey!).\n", ":::\n", "\n", ":::{dropdown} 🏋️‍♀️ Bonus Exercise: Comparing time to copy data from S3 to CryoCloud with time to download over HTTP to your personal machine\n", "\n", "Note: You will need URS credentials handy to do this exercise.
You will need to store them in a local ~/.netrc file as instructed [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget)\n", + "\n", + "```python\n", + "earthaccess.login()\n", + "\n", + "aws_creds = earthaccess.get_s3_credentials(daac='NSIDC')\n", + "\n", + "s3 = s3fs.S3FileSystem(\n", + " key=aws_creds['accessKeyId'],\n", + " secret=aws_creds['secretAccessKey'],\n", + " token=aws_creds['sessionToken'],\n", + ")\n", "\n", - " Of course, you may still need to access datasets from on-prem servers as well.\n", + "results = earthaccess.search_data(\n", + " short_name=\"ATL03\",\n", + " cloud_hosted=True,\n", + " count=1\n", + ")\n", + "\n", + "direct_link = results[0].data_links(access=\"direct\")[0]\n", + "direct_link\n", + "```\n", + "\n", + "Now time the download:\n", + "```python\n", + "%%time\n", + "s3.download(direct_link, lpath=direct_link.split('/')[-1])\n", + "```\n", + "\n", + "Compare this with the time to download from HTTPS to your personal machine.\n", + "\n", + "First, get the equivalent HTTPS URL:\n", + "\n", + "```python\n", + "http_link = results[0].data_links()[0]\n", + "http_link\n", + "```\n", + "\n", + "Then, copy and paste the following into a shell prompt, replacing `http_link` with the string from the last command. You will need to follow the instructions [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget) for this to work!\n", + "\n", + "```bash\n", + "$ time curl -O -b ~/.urs_cookies -c ~/.urs_cookies -L -n {http_link}\n", + "```\n", "\n", - "
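If you prefer to stay in Python for the HTTPS side of the comparison, here is a rough, illustrative alternative (not part of the original exercise) using `earthaccess.download`, which handles Earthdata Login for you. It assumes `earthaccess` is installed on your personal machine; run it there, not on CryoCloud, or you won't be measuring your home network.

```python
# Illustrative Python-only timing of the HTTPS download path.
# Assumes earthaccess is installed on your personal machine.
import time

import earthaccess

earthaccess.login()
results = earthaccess.search_data(short_name="ATL03", cloud_hosted=True, count=1)

start = time.perf_counter()
earthaccess.download(results, local_path=".")
print(f"HTTPS download took {time.perf_counter() - start:.1f} s")
```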

Caveats

\n", - " \n", + "For me, the first option, direct, in-region access took 11.1 seconds and HTTPS to personal machine took 1 minute and 48 seconds. The second value will depend on your wifi network.\n", ":::\n", "\n", "## Cloud vs Local Storage\n", diff --git a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb index 09f7de9..5cc87c4 100644 --- a/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb +++ b/book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb @@ -4,15 +4,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Cloud-Optimized Data Access\n", + "# 🔧 Cloud-Optimized Data Access\n", "\n", "
\n", "\n", "Recall from the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb) that cloud object storage is accessed over the network. Local file storage access will always be faster but there are limitations. This is why the design of file formats in the cloud requires more consideration than local file storage.\n", "\n", - "## 🏋️ Exercise\n", - "\n", - ":::{dropdown} What are some limitations of local file storage?\n", + ":::{dropdown} 🤔 What is one limitation of local file storage?\n", "See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb).\n", ":::\n", "\n", @@ -22,13 +20,13 @@ "\n", "## What are we optimizing for and why?\n", "\n", - "The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** by:\n", + "The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** (see glossary below) by:\n", "\n", "* Making as few requests as possible;\n", "* Making even less for metadata, preferably only one; and\n", "* Using a file layout that simplifies accessing data for parallel reads.\n", "\n", - ":::{attention} A future without file formats\n", + ":::{admonition} A future without file formats\n", "I like to imagine a day when we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n", ":::\n", "\n", @@ -39,7 +37,7 @@ ":align: center\n", "```\n", "\n", - "

A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.

\n", + "

A structured data file is composed of two parts: metadata and the raw data. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. It is also crucial to **lazy loading** (see glossary below). Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.
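To make the split between metadata and data concrete, here is an illustrative `h5py` snippet; the file name and dataset path are placeholders, not a real ICESat-2 product layout.

```python
# Inspect metadata (shape, dtype, chunking, compression) without reading the data.
import h5py

with h5py.File("example.h5", "r") as f:      # placeholder file name
    dset = f["group/variable"]               # placeholder path; this reads metadata only
    print(dset.shape, dset.dtype)            # what the data looks like
    print(dset.chunks, dset.compression)     # how the data is stored
    values = dset[:100]                      # raw data values are read only here
```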

\n", "\n", "```{image} ./images/hdf5-structure-1.jpg\n", ":width: 600px\n", @@ -52,7 +50,7 @@ "\n", "
\n", "\n", - "## When optimizing for the cloud, what structure should be used?\n", + "## How should we structure files for the cloud?\n", "\n", "### A \"moving away from home\" analogy\n", "\n", @@ -66,20 +64,28 @@ "\n", "You can actually make any common geospatial data formats (HDF5/NetCDF, GeoTIFF, LAS (LIDAR Aerial Survey)) \"cloud-optimized\" by:\n", "\n", - "1. Separate metadata from data and store it contiguously so it can be read with one request.\n", + "1. Separate metadata from data and store metadata contiguously so it can be read with one request.\n", "2. Store data in chunks, so the whole file doesn't have to be read to access a portion of the data, and it can be compressed.\n", "3. Make sure the chunks of data are not too small, so more data is fetched with each request.\n", "4. Make sure the chunks are not too large, which means more data has to be transferred and decompression takes longer.\n", "5. Compress these chunks so there is less data to transfer over the network.\n", "\n", - ":::{note} Lazy loading\n", "\n", - "**Separating metadata from data supports lazy loading, which is key to working quickly when data is in the cloud.** Libraries, such as xarray, first read the metadata. They defer actually reading data until it's needed for analysis. When a computation of the data is called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the chunks required. This is also called \"lazy loading\" data. See also [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", + "## Glossary\n", + "\n", + "### latency\n", + "The time between when data is sent to when it is received. [Read more](https://aws.amazon.com/what-is/latency/).\n", + "\n", + "### throughput\n", + "The amount of data that can be transferred over a given time. [Read more](https://en.wikipedia.org/wiki/Network\\_throughput).\n", + "\n", + "### lazy loading\n", + "\n", + "🛋️ 🥔 Lazy loading is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. 
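As a rough sketch of what this looks like in practice (the bucket, file name, variable, and dimension below are placeholders, and it assumes `s3fs` and the `h5netcdf` engine are installed):

```python
# Lazy loading sketch: open_dataset reads metadata only; chunk byte ranges
# are fetched over the network when values are actually needed.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
with fs.open("s3://some-bucket/some-granule.h5", "rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")   # metadata only
    subset = ds["some_variable"].isel(time=0)    # still lazy, nothing read yet
    values = subset.values                       # byte-range requests happen here
```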
See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n", "\n", - ":::\n", "\n", "\n", - ":::{attention} Opening Arguments\n", + ":::{admonition} Opening Arguments\n", "A few arguments used to open the dataset also make a huge difference, namely with how libraries, such as s3fs and h5py, cache chunks.\n", "\n", "For s3fs, use [`cache_type` and `block_size`](https://s3fs.readthedocs.io/en/latest/api.html?highlight=cache_type#s3fs.core.S3File).\n", diff --git a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb index 995c139..37dcc71 100644 --- a/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb +++ b/book/tutorials/cloud-computing/04-cloud-optimized-icesat2.ipynb @@ -10,15 +10,14 @@ "tags": [] }, "source": [ - "# Cloud-Optimized ICESat-2\n", + "# 🧊 Cloud-Optimized ICESat-2\n", "\n", "## Cloud-Optimized vs Cloud-Native\n", "\n", "Recall from [03-cloud-optimized-data-access.ipynb](./03-cloud-optimized-data-access.ipynb) that we can make any HDF5 file cloud-optimized by restructuring the file so that all the metadata is in one place and chunks are \"not too big\" and \"not too small\". However, as users of the data, not archivers, we don't control how the file is generated and distributed, so if we're restructuring the data we might want to go with something even better - a **\"cloud-native\"** format.\n", "\n", - ":::{important} Cloud-Native Formats\n", - "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access across multiple files. In other words, it is fast to open a large dataset and access just the parts of it that you need.\n", - ":::\n", + "### Cloud-Native Formats\n", + "Cloud-native formats are formats that were designed specifically to be used in a cloud environment. This usually means that metadata and indexes for data is separated from the data itself in a way that allows for logical dataset access. Data and metadata are not always stored in the same object or file in order to maximize the amount of data that can be lazily loaded and queried. Some examples of cloud-native formats are [Zarr](https://zarr.dev/) and GeoParquet, which is discussed below.\n", "\n", ":::{warning}\n", "Generating cloud-native formats is non-trivial.\n", @@ -31,11 +30,14 @@ "\n", "## Geoparquet\n", "\n", - "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", + ">Apache Parquet is a powerful column-oriented data format, built from the ground up to as a modern alternative to CSV files. 
GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.\n", + "\n", + "From [https://geoparquet.org/](https://geoparquet.org/)\n", + "\n", + "To demonstrate one such cloud-native format, geoparquet, we have generated a geoparquet store (see [atl08_parquet.ipynb](./atl08_parquet_files/atl08_parquet.ipynb)) for a subset of the ATL08 dataset and will visualize it using a very performant geospatial vector visualization library, [`lonboard`](https://developmentseed.org/lonboard/latest/).\n", "\n", - ":::{seealso} Resource on Geoparquet\n", + ":::{seealso} Resources on Geoparquet\n", "* https://guide.cloudnativegeo.org/geoparquet/\n", - "* https://geoparquet.org/\n", ":::\n", "\n", "## Demo" @@ -118,6 +120,15 @@ "m.set_view_state(zoom=2)\n", "m" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# The End!\n", + "\n", + "What did you think? Have more questions? Come find me in slack (Aimee Barciauskas) or by email at aimee@ds.io." + ] } ], "metadata": { diff --git a/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png new file mode 100644 index 0000000..91a0229 Binary files /dev/null and b/book/tutorials/cloud-computing/images/s3-bucket-with-objects.png differ