Merge pull request #25 from ICESAT-2HackWeek/cloud-computing-fixes
Cloud computing fixes
scottyhq authored Aug 18, 2024
2 parents 2aa88b8 + b8d8517 commit d447375
Showing 5 changed files with 120 additions and 49 deletions.
21 changes: 13 additions & 8 deletions book/tutorials/cloud-computing/01-cloud-computing.ipynb
@@ -5,11 +5,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is cloud computing?\n",
"# ☁️ What is cloud computing?\n",
"\n",
"<br />\n",
"\n",
"**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n",
"**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers (CSPs) such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n",
"\n",
"```{image} ./images/AWS_OurDataCenters_Background.jpg\n",
":width: 600px\n",
@@ -20,13 +20,13 @@
"\n",
">Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS). ([source](https://aws.amazon.com/what-is-cloud-computing/))\n",
"\n",
"This tutorial will focus on AWS services and terminology, but Google Cloud and Microsoft Azure offer the same services.\n",
"This tutorial will focus on AWS services and terminology, as AWS is the cloud service provider used for NASA's Earthdata Cloud (more on that later). But Google Cloud and Microsoft Azure offer the same services. If you're interested, you can read more about the <a href=\"https://en.wikipedia.org/wiki/Amazon_Web_Services\">history of AWS on Wikipedia</a>.\n",
"\n",
":::{dropdown} 🏋️ Exercise: How many CPUs and how much memory does your laptop have? And how does that compare with CryoCloud?</h3>\n",
":open:\n",
"If you have your laptop available, open the terminal app and use the appropriate commands to determine CPU and memory.\n",
"\n",
"<div style=\"width:60%; padding: 30px;\">\n",
"<div style=\"padding: 30px;\">\n",
"\n",
"| Operating System (OS) | CPU command | Memory Command |\n",
"|-----------------------|-----------------------------------------------------------------------------------|----------------------------|\n",
@@ -40,14 +40,19 @@
"Tip: When logged into cryocloud, you can click the ![kernel usage icon](./images/tachometer-alt_1.png) icon on the far-right toolbar.\n",
":::\n",
"\n",
"**What did you find?** It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n",
"\n",
":::{dropdown} So why would we want to use the cloud and not our personal computers?\n",
":::{dropdown} 🤔 What did you notice?\n",
" It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n",
"\n",
" On CryoCloud, you may have also noticed different numbers for available memory from what you see at the bottom of the jupyterhub interface, where you might see something like `Mem: 455.30 / 3798.00 MB`.\n",
":::\n",
"\n",
"\n",
"## 🤔 So why do we want to use the cloud instead of our personal computers?\n",
" 1. Because cryocloud has all the dependencies you need.\n",
" 2. Because cryocloud is \"close\" to the data (more on this later).\n",
" 3. Because you can use larger and bigger machines in the cloud (more on this later).\n",
" 3. Because you can leverage more resources (storage, compute, memory) in the cloud.\n",
" 4. **Having the dependencies, data, and runtime environment in the cloud can simplify reproducible science.**\n",
":::\n",
"\n",
":::{admonition} Takeaways\n",
"\n",
91 changes: 70 additions & 21 deletions book/tutorials/cloud-computing/02-cloud-data-access.ipynb
@@ -6,7 +6,7 @@
"id": "9310f818-bbfe-4cb3-8e84-f21beaca334e",
"metadata": {},
"source": [
"# Cloud Data Access\n",
"# 🪣 Cloud Data Access\n",
"\n",
"<br />\n",
"\n",
@@ -35,51 +35,100 @@
"\n",
"## What is cloud object storage?\n",
"\n",
"Cloud object storage stores and manages unstructured data in a flat structure (as opposed to a hierarchy as with file storage). Object storage is distinguished from a database, which requires software (a database management system) to store data and often has connection limits. Object storage is distinct from local file storage, because you access cloud object storage over a network connection, whereas local file storage is accessed by the central processing unit (CPU) of whatever server you are using.\n",
"Cloud object storage stores unstructured data in a flat structure, called a bucket in AWS, where each object is identified with a unique key. The simple design of cloud object storage enables near infinite scalability. Object storage is distinguished from a database which requires database management system software to store data and often has connection limits. Object storage is distinct from file storage because files are stored in a hierarchical format and a network is not always required. Read more about cloud object storage and how it is different from other types of storage [in the AWS docs](https://aws.amazon.com/what-is/object-storage/).\n",
"\n",
"Cloud object storage is accessible using HTTP or a cloud-object storage protocol, such as AWS' Simple Storage Service (S3). Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n",
"```{image} ./images/s3-bucket-with-objects.png\n",
":width: 150px\n",
":align: left\n",
"```\n",
"\n",
"Cloud object storage is accessible over the internet. If the data is public, you can use an HTTP link to access data in cloud object storage, but more typically you will use the cloud object storage protocol, such as `s3://path/to/file.text` along with some credentials to access the data. Using the s3 protocol to access the data is commonly referred to as **direct access**. Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n",
"\n",
"Popular libraries to access data on S3 are [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and [s3fs](https://s3fs.readthedocs.io/).\n",
"\n",
"```{image} ./images/cloud-and-local.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
":::{dropdown} 🏋️ Exercise: Datasets on Earthdata Cloud\n",
":::{dropdown} 🏋️‍♀️ Exercise: Which ICESat-2 Datasets are on Earthdata Cloud?\n",
":open:\n",
"\n",
"Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n",
"\n",
"* Which DAAC hosts ICESat-2 datasets?\n",
"* How many ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?\n",
"* Which DAAC hosts ICESat-2 datasets?\n",
":::\n",
"\n",
"\n",
"## There are different access patterns, it can be confusing! 🤯\n",
"\n",
"Here are a likely few:\n",
"1. Download data from a DAAC to your local machine.\n",
"2. Download data from cloud storage to your local machine.\n",
"3. Login to a virtual machine in the cloud and download data from a DAAC (when would you do this?).\n",
"4. Login to a virtual machine in the cloud, like CryoCloud, and access data directly.\n",
"1. Download data from a DAAC to your personal machine.\n",
"2. Download data from cloud storage, say using HTTP, to your personal machine.\n",
"3. Login to a virtual machine in the cloud, like CryoCloud, and download data from a DAAC.\n",
"4. Login to a virtual machine in the cloud and access data directly using a cloud object protocol, like s3.\n",
"\n",
"```{image} ./images/different-modes-of-access.png\n",
":width: 1000px\n",
":align: center\n",
"```\n",
"\n",
":::{dropdown} Which should you chose and why?\n",
" You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n",
" **You can only directly access (both partial reading and download) files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n",
" The good news is that cryointhecloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n",
":::{dropdown} 🤔 Which should you chose and why?\n",
" You should use option 4: direct access. This is both because it is fastest overall and because of $$. You can download files stored in an S3 bucket using HTTPS, but this is not recommended. It is slow and, more importantly, egress - files being download outside of AWS services - is not free. **For data on Earthdata Cloud, you can use S3 direct access if you are in the same AWS region as the data. This is so NASA can avoid egress fees 💸 but it also benefits you because this style of access is much faster.**\n",
" The good news is that CryoCloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n",
"\n",
" Caveats:\n",
" - Not all datasets are on Earthdata cloud, so you may still need to access datasets from on-prem servers as well.\n",
" - Having local file system access will always be faster than reading all or part of a file over a network, even in region (although S3 access is getting blazingly fast!) But you have to download the data, which is slow. You can also download objects from object storage onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data ever time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).</li>\n",
":::\n",
"\n",
":::{dropdown} 🏋️‍♀️ Bonus Exercise: Comparing time to copy data from S3 to CryoCloud with time to download over HTTP to your personal machine\n",
"\n",
"Note: You will need URS credentials handy to do this exercise. You will need to store them in a local ~/.netrc file as instructed [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget)\n",
"\n",
"```python\n",
"earthaccess.login()\n",
"\n",
"aws_creds = earthaccess.get_s3_credentials(daac='NSIDC')\n",
"\n",
"s3 = s3fs.S3FileSystem(\n",
" key=aws_creds['accessKeyId'],\n",
" secret=aws_creds['secretAccessKey'],\n",
" token=aws_creds['sessionToken'],\n",
")\n",
"\n",
" Of course, you may still need to access datasets from on-prem servers as well.\n",
"results = earthaccess.search_data(\n",
" short_name=\"ATL03\",\n",
" cloud_hosted=True,\n",
" count=1\n",
")\n",
"\n",
"direct_link = results[0].data_links(access=\"direct\")[0]\n",
"direct_link\n",
"```\n",
"\n",
"Now time the download:\n",
"```python\n",
"%%time\n",
"s3.download(direct_link, lpath=direct_link.split('/')[-1])\n",
"```\n",
"\n",
"Compare this with the time to download from HTTPS to your personal machine.\n",
"\n",
"First, get the equivalent HTTPS URL:\n",
"\n",
"```python\n",
"http_link = results[0].data_links()[0]\n",
"http_link\n",
"```\n",
"\n",
"Then, copy and paste the following into a shell prompt, replacing `http_link` with the string from the last command. You will need to follow the instructions [here](https://urs.earthdata.nasa.gov/documentation/for_users/data_access/curl_and_wget) for this to work!\n",
"\n",
"```bash\n",
"$ time curl -O -b ~/.urs_cookies -c ~/.urs_cookies -L -n {http_link}\n",
"```\n",
"\n",
" <h3>Caveats</h3>\n",
" <ul>\n",
" <li>Direct S3 access could refer to either copying a whole file using the S3 protocol OR using lazy loading and reading just a portion of the file and the latter usually only performs well for cloud-optimized files.</li>\n",
" <li>Having local file system access will always be faster than reading all or parts of a file over a network, even in region (although S3 access is getting blazingly fast!) You can move data files onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data ever time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).</li>\n",
" \n",
" </ul>\n",
"For me, the first option, direct, in-region access took 11.1 seconds and HTTPS to personal machine took 1 minute and 48 seconds. The second value will depend on your wifi network.\n",
":::\n",
"\n",
"## Cloud vs Local Storage\n",
32 changes: 19 additions & 13 deletions book/tutorials/cloud-computing/03-cloud-optimized-data-access.ipynb
@@ -4,15 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cloud-Optimized Data Access\n",
"# 🔧 Cloud-Optimized Data Access\n",
"\n",
"<br />\n",
"\n",
"Recall from the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb) that cloud object storage is accessed over the network. Local file storage access will always be faster but there are limitations. This is why the design of file formats in the cloud requires more consideration than local file storage.\n",
"\n",
"## 🏋️ Exercise\n",
"\n",
":::{dropdown} What are some limitations of local file storage?\n",
":::{dropdown} 🤔 What is one limitation of local file storage?\n",
"See the table **Cloud vs Local Storage** in the [Cloud Data Access Notebook](./02-cloud-data-access.ipynb).\n",
":::\n",
"\n",
@@ -22,13 +20,13 @@
"\n",
"## What are we optimizing for and why?\n",
"\n",
"The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** by:\n",
"The \"optimize\" in cloud-optimized is to **minimize data latency** and **maximize throughput** (see glossary below) by:\n",
"\n",
"* Making as few requests as possible;\n",
"* Making even less for metadata, preferably only one; and\n",
"* Using a file layout that simplifies accessing data for parallel reads.\n",
"\n",
":::{attention} A future without file formats\n",
":::{admonition} A future without file formats\n",
"I like to imagine a day when we won't have to think about file formats. The geospatial software community is working on ways to make all collections appear as logical datasets, so you can query them without having to think about files.\n",
":::\n",
"\n",
@@ -39,7 +37,7 @@
":align: center\n",
"```\n",
"\n",
"<p style=\"margin-top:50px; float: left;\">A structured data file is composed of two parts: <b>metadata</b> and the <b>raw data</b>. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.</p>\n",
"<p style=\"margin-top:50px; float: left;\">A structured data file is composed of two parts: <b>metadata</b> and the <b>raw data</b>. Metadata is information about the data, such as the data shape, data type, the data variables, the data's coordinate system, and how the data is stored, such as chunk shape and compression. It is also crucial to **lazy loading** (see glossary belo). Data is the actual data that you want to analyze. Many geospatial file formats, such as GeoTIFF, are composed of metadata and data.</p>\n",
"\n",
"```{image} ./images/hdf5-structure-1.jpg\n",
":width: 600px\n",
@@ -52,7 +50,7 @@
"\n",
"<div style=\"clear: both;\"></div>\n",
"\n",
"## When optimizing for the cloud, what structure should be used?\n",
"## How should we structure files for the cloud?\n",
"\n",
"### A \"moving away from home\" analogy\n",
"\n",
@@ -66,20 +64,28 @@
"\n",
"You can actually make any common geospatial data formats (HDF5/NetCDF, GeoTIFF, LAS (LIDAR Aerial Survey)) \"cloud-optimized\" by:\n",
"\n",
"1. Separate metadata from data and store it contiguously so it can be read with one request.\n",
"1. Separate metadata from data and store metadata contiguously so it can be read with one request.\n",
"2. Store data in chunks, so the whole file doesn't have to be read to access a portion of the data, and it can be compressed.\n",
"3. Make sure the chunks of data are not too small, so more data is fetched with each request.\n",
"4. Make sure the chunks are not too large, which means more data has to be transferred and decompression takes longer.\n",
"5. Compress these chunks so there is less data to transfer over the network.\n",
"\n",
":::{note} Lazy loading\n",
"\n",
"**Separating metadata from data supports lazy loading, which is key to working quickly when data is in the cloud.** Libraries, such as xarray, first read the metadata. They defer actually reading data until it's needed for analysis. When a computation of the data is called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the chunks required. This is also called \"lazy loading\" data. See also [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n",
"## Glossary\n",
"\n",
"### latency\n",
"The time between when data is sent to when it is received. [Read more](https://aws.amazon.com/what-is/latency/).\n",
"\n",
"### throughput\n",
"The amount of data that can be transferred over a given time. [Read more](https://en.wikipedia.org/wiki/Network\\_throughput).\n",
"\n",
"### lazy loading\n",
"\n",
"🛋️ 🥔 Lazy loading is deferring loading any data until required. Here's how it works: Metadata stores a mapping of chunk indices to byte ranges in files. Libraries, such as xarray, read only the metadata when opening a dataset. Libraries defer requesting any data until values are required for computation. When a computation of the data is finally called, libraries use [HTTP range requests](https://http.dev/range-request) to request only the byte ranges associated with the data chunks required. See [the s3fs `cat_ranges` function](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.cat_ranges) and [xarray's documentation on lazy indexing](https://docs.xarray.dev/en/latest/internals/internal-design.html#lazy-indexing).\n",
"\n",
":::\n",
"\n",
"\n",
":::{attention} Opening Arguments\n",
":::{admonition} Opening Arguments\n",
"A few arguments used to open the dataset also make a huge difference, namely with how libraries, such as s3fs and h5py, cache chunks.\n",
"\n",
"For s3fs, use [`cache_type` and `block_size`](https://s3fs.readthedocs.io/en/latest/api.html?highlight=cache_type#s3fs.core.S3File).\n",