Merge pull request #21 from ICESAT-2HackWeek/cloud-computing-tutorial
Cloud computing tutorial
abarciauskas-bgse authored Aug 15, 2024
2 parents a456984 + 7443f26 commit ff33f45
Showing 25 changed files with 1,227 additions and 7 deletions.
14 changes: 7 additions & 7 deletions .github/workflows/ensure_clean_notebooks.py
@@ -30,13 +30,13 @@

 results = []
 for notebook in ipynbs:
-    #if not notebook in exclude_notebooks:
-    print(f'Checking {notebook}...')
-    nb = nbformat.read(notebook, as_version=nbformat.NO_CONVERT)
-    result = nbc.check_notebook(nb,
-                                remove_empty_cells=False,
-                                preserve_cell_metadata=True)
-    results.append(result)
+    if not notebook in exclude_notebooks:
+        print(f'Checking {notebook}...')
+        nb = nbformat.read(notebook, as_version=nbformat.NO_CONVERT)
+        result = nbc.check_notebook(nb,
+                                    remove_empty_cells=False,
+                                    preserve_cell_metadata=True)
+        results.append(result)
 
 if False in results:
     sys.exit(1)
2 changes: 2 additions & 0 deletions book/_config.yml
@@ -37,6 +37,8 @@ execute:
execute_notebooks: 'force'
exclude_patterns:
- "**/geospatial-advanced.ipynb"
- "cloud-computing/04-cloud-optimized-icesat2.ipynb"
- "cloud-computing/atl08_parquet_files/atl08_parquet.ipynb"
allow_errors: false
# Per-cell notebook execution limit (seconds)
timeout: 300
10 changes: 10 additions & 0 deletions book/_toc.yml
@@ -16,11 +16,21 @@ parts:
- file: preliminary/checklist
- file: preliminary/git
- caption: Tutorials
maxdepth: 1
chapters:
- file: tutorials/index
sections:
- file: tutorials/example/tutorial-notebook
- file: tutorials/nb-to-package/index.md
- file: tutorials/cloud-computing/00-goals-and-outline
sections:
- file: tutorials/cloud-computing/01-cloud-computing
- file: tutorials/cloud-computing/02-cloud-data-access
- file: tutorials/cloud-computing/03-cloud-optimized-data-access
- file: tutorials/cloud-computing/04-cloud-optimized-icesat2
- file: tutorials/cloud-computing/atl08_parquet_files/atl08_parquet
options:
- titlesonly: true
- caption: Projects
chapters:
- file: projects/index
76 changes: 76 additions & 0 deletions book/tutorials/cloud-computing/00-goals-and-outline.ipynb
@@ -0,0 +1,76 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a888b10c-1f9f-406e-8d14-9648be234d44",
"metadata": {},
"source": [
"# Cloud Computing Tutorial\n",
"\n",
"<br />\n",
"\n",
"```{image} ./images/cloud.gif\n",
":width: 200px\n",
":align: center\n",
"```\n",
"\n",
"**Welcome to the Cloud Computing Tutorial!**\n",
"\n",
"This tutorial is just the tip of the ice[SAT-2]berg (😬) of cloud computing. It focuses on accessing data stored in the cloud. An understanding of the difference between the \"download to local\" and \"direct from cloud\" methods of data access will explain how and why the cloud facilitates the scaling and reproducibility of your science.\n",
"\n",
":::{admonition} Learning Goals\n",
"\n",
"**At the conclusion of this tutorial, you should be able to answer:**\n",
"1. What is cloud computing?\n",
"2. What is cloud object storage and the difference between data stored in the cloud, data on a local file system and data stored in \"on-premise\" data centers.\n",
"3. How to optimize data for reading from cloud object storage.\n",
"\n",
":::\n",
"\n",
"## Outline\n",
"\n",
"1. [What is cloud computing?](./01-cloud-computing.ipynb)\n",
" 1. Definition of cloud computing\n",
" 2. Exercise: Difference between resources on your local machine and resources in the cloud\n",
" 3. Why you might use cloud computing\n",
"2. [Accessing data in the cloud](./02-cloud-data-access.ipynb)\n",
" 1. Definition of cloud object storage\n",
" 2. Exercise: How many NASA datasets (aka collections) are in the cloud? How many ICESat-2 datasets are in the cloud? Which DAAC manages ICESast-2 data?\n",
" 3. Difference between data stored in the cloud, data on a local file system and data stored in \"on-premise\" data centers\n",
" 4. Why you might use cloud object storage\n",
"3. [Cloud-Optimized Data](./03-cloud-optimized-data-access.ipynb)\n",
" 1. What are we optimizing for and why?\n",
" 2. Anatomy of a structured data file\n",
" 3. Thought Exercise: Garage analogy\n",
" 4. How do we optimize data for reading from cloud object storage?\n",
"4. [Cloud-Optimized ICESat-2 Demo](./04-cloud-optimized-icesat2.ipynb)\n",
" 1. Cloud-Optimized vs Cloud-Native \n",
" 1. Creating an ICESat-2 GeoParquet\n",
" 3. Plot the data with lonboard\n",
"\n",
"Or simply: Cloud -> Cloud data access -> Optimized cloud data access -> Demo with ICESat-2"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
81 changes: 81 additions & 0 deletions book/tutorials/cloud-computing/01-cloud-computing.ipynb
@@ -0,0 +1,81 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# What is cloud computing?\n",
"\n",
"<br />\n",
"\n",
"**Cloud computing is compute and storage as a service.** The term \"cloud computing\" is typically used to refer to commercial cloud service providers such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure (Azure). These cloud service providers all offer a wide range of computing services, only a few of which we will cover today, via a pay-as-you-go payment structure.\n",
"\n",
"```{image} ./images/AWS_OurDataCenters_Background.jpg\n",
":width: 600px\n",
":align: center\n",
"```\n",
"\n",
"<p style=\"font-size: 10px;\">image src: https://aws.amazon.com/compliance/data-center/data-centers/</p>\n",
"\n",
">Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS). ([source](https://aws.amazon.com/what-is-cloud-computing/))\n",
"\n",
"This tutorial will focus on AWS services and terminology, but Google Cloud and Microsoft Azure offer the same services.\n",
"\n",
":::{dropdown} πŸ‹οΈ Exercise: How many CPUs and how much memory does your laptop have? And how does that compare with CryoCloud?</h3>\n",
":open:\n",
"If you have your laptop available, open the terminal app and use the appropriate commands to determine CPU and memory.\n",
"\n",
"<div style=\"width:60%; padding: 30px;\">\n",
"\n",
"| Operating System (OS) | CPU command | Memory Command |\n",
"|-----------------------|-----------------------------------------------------------------------------------|----------------------------|\n",
"| MacOS | `sysctl -a \\| grep hw.ncpu` | `top -l 1 \\| grep PhysMem` |\n",
"| Linux (cryocloud) | `lscpu \\| grep \"^CPU\\(s\\):\"` | `free -h` | \n",
"| Windows | https://www.top-password.com/blog/find-number-of-cores-in-your-cpu-on-windows-10/ | |\n",
"</div>\n",
"\n",
"Now do the same but on hub.cryointhecloud.com.\n",
"\n",
"Tip: When logged into cryocloud, you can click the ![kernel usage icon](./images/tachometer-alt_1.png) icon on the far-right toolbar.\n",
":::\n",
"\n",
"**What did you find?** It's possible you found that your machine has **more** CPU and/or memory than cryocloud!\n",
"\n",
":::{dropdown} So why would we want to use the cloud and not our personal computers?\n",
" 1. Because cryocloud has all the dependencies you need.\n",
" 2. Because cryocloud is \"close\" to the data (more on this later).\n",
" 3. Because you can use larger and bigger machines in the cloud (more on this later).\n",
" 4. **Having the dependencies, data, and runtime environment in the cloud can simplify reproducible science.**\n",
":::\n",
"\n",
":::{admonition} Takeaways\n",
"\n",
"* The cloud allows you to access many computing and storage services over the internet. Most cloud services are offered via a \"pay as you go\" model.\n",
"* Hubs like CryoCloud provide a virtual environment which simplifies reproducible science. You should use them whenever you can!\n",
":::"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
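
The CPU and memory exercise in the notebook above relies on OS-specific shell commands. As a rough cross-platform alternative, the sketch below reads the same two numbers from Python's standard library. The function name `describe_machine` is hypothetical, and the memory lookup assumes a POSIX system that exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES` through `os.sysconf` (typical on Linux, and usually on macOS); where that is not the case, such as on Windows, it returns `None` for memory instead of guessing.

```python
import os


def describe_machine():
    """Return (logical CPU count, total physical memory in GiB or None)."""
    # os.cpu_count() reports logical CPUs and may return None if undetermined.
    cpus = os.cpu_count()

    mem_gib = None
    try:
        # Assumption: POSIX sysconf names; unavailable on Windows.
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        if page_size > 0 and page_count > 0:
            mem_gib = page_size * page_count / 1024**3
    except (AttributeError, ValueError, OSError):
        # Leave mem_gib as None rather than guessing.
        pass
    return cpus, mem_gib


if __name__ == "__main__":
    cpus, mem_gib = describe_machine()
    print(f"Logical CPUs: {cpus}")
    if mem_gib is None:
        print("Total memory: not available through os.sysconf on this platform")
    else:
        print(f"Total memory: {mem_gib:.1f} GiB")
```

Running the same script on a laptop and in a hub session gives a like-for-like comparison. Note that inside a containerized hub these values may describe the underlying host rather than the limits applied to your session.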
140 changes: 140 additions & 0 deletions book/tutorials/cloud-computing/02-cloud-data-access.ipynb
@@ -0,0 +1,140 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "9310f818-bbfe-4cb3-8e84-f21beaca334e",
"metadata": {},
"source": [
"# Cloud Data Access\n",
"\n",
"<br />\n",
"\n",
"## NASA's migration from \"on-premise\" to cloud\n",
"\n",
"```{image} ./images/DAAC_map_new.jpg\n",
":width: 700px\n",
":align: center\n",
"```\n",
"<p style=\"font-size: 10px\">image src: https://asf.alaska.edu/about-asf-daac/</p>\n",
"\n",
"NASA has 12 Distributed Active Archive Centers (DAACs). Each DAAC is associated with a few sub-disciplines of Earth science, and those specialties correspond to which missions and data products those DAACs are in charge of. For example, LPDAAC is the land processes DAAC and is in charge of the Harmonized Landsat Sentinel (HLS) Product which is often used for land classification. Up until about 6 years ago (which is about when I started working with NASA), all NASA Earth Observation archives resided \"on-premise\" at the data center's physical locations in data centers they manage.\n",
"\n",
"NASA, anticipating the exponential growth in their Earth Observation data archives, started the [Earthdata Cloud](https://www.earthdata.nasa.gov/eosdis/cloud-evolution) initiative. Now, NASA DAACs are in the process of migrating their collections to cloud storage. Existing missions are growing their collections as well, but new missions such as NISAR and SWOT are or will be the most significant contributors to NASA's archival volume growth.\n",
"\n",
"\n",
"```{image} ./images/archive-growth-FY22.jpg\n",
":width: 900px\n",
":align: center\n",
"```\n",
"<p style=\"font-size: 10px\">image src: https://www.earthdata.nasa.gov/esds/esds-highlights/2022-esds-highlights</p>\n",
"\n",
"Now, high priority and new datasets are being stored on **cloud object storage**.\n",
"\n",
"<br />\n",
"\n",
"## What is cloud object storage?\n",
"\n",
"Cloud object storage stores and manages unstructured data in a flat structure (as opposed to a hierarchy as with file storage). Object storage is distinguished from a database, which requires software (a database management system) to store data and often has connection limits. Object storage is distinct from local file storage, because you access cloud object storage over a network connection, whereas local file storage is accessed by the central processing unit (CPU) of whatever server you are using.\n",
"\n",
"Cloud object storage is accessible using HTTP or a cloud-object storage protocol, such as AWS' Simple Storage Service (S3). Access over the network is critical because it means many servers can access data in parallel and these storage systems are designed to be infinitely scalable and always available.\n",
"\n",
"```{image} ./images/cloud-and-local.png\n",
":width: 500px\n",
":align: center\n",
"```\n",
"\n",
":::{dropdown} πŸ‹οΈ Exercise: Datasets on Earthdata Cloud\n",
":open:\n",
"\n",
"Navigate [https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov), search for ICESat-2 and answer the following questions:\n",
"\n",
"* Which DAAC hosts ICESat-2 datasets?\n",
"* How many ICESat-2 datasets are hosted on the AWS Cloud and how can you tell?\n",
":::\n",
"\n",
"\n",
"## There are different access patterns, it can be confusing! 🀯\n",
"\n",
"Here are a likely few:\n",
"1. Download data from a DAAC to your local machine.\n",
"2. Download data from cloud storage to your local machine.\n",
"3. Login to a virtual machine in the cloud and download data from a DAAC (when would you do this?).\n",
"4. Login to a virtual machine in the cloud, like CryoCloud, and access data directly.\n",
"\n",
"```{image} ./images/different-modes-of-access.png\n",
":width: 1000px\n",
":align: center\n",
"```\n",
"\n",
":::{dropdown} Which should you chose and why?\n",
" You should use option 4 - direct access. Because S3 is a cloud service, egress (files being download outside of AWS services) is not free.\n",
" **You can only directly access (both partial reading and download) files on S3 if you are in the same AWS region as the data. This is so NASA can avoid egress fees πŸ’Έ but it also benefits you because this style of access is much faster.**\n",
" The good news is that cryointhecloud is located in AWS us-west-2, the same region as NASA's Earthdata Cloud datasets!\n",
"\n",
" Of course, you may still need to access datasets from on-prem servers as well.\n",
"\n",
" <h3>Caveats</h3>\n",
" <ul>\n",
" <li>Direct S3 access could refer to either copying a whole file using the S3 protocol OR using lazy loading and reading just a portion of the file and the latter usually only performs well for cloud-optimized files.</li>\n",
" <li>Having local file system access will always be faster than reading all or parts of a file over a network, even in region (although S3 access is getting blazingly fast!) You can move data files onto a file system mounted onto a virtual machine, which would result in the fastest access and computation. But before architecting your applications this way, consider the tradeoffs of reproducibility (e.g. you'll have to move the data ever time), cost (e.g. storage volumes can be more expensive than object storage) and scale (e.g. there is usually a volume size limit, except in the case of [AWS Elastic File System](https://aws.amazon.com/efs/) which is even more pricey!).</li>\n",
" \n",
" </ul>\n",
":::\n",
"\n",
"## Cloud vs Local Storage\n",
"\n",
":::{list-table}\n",
":header-rows: 1\n",
"\n",
"* - Feature\n",
" - Local\n",
" - Cloud\n",
"* - Scalability\n",
" - ❌ limited by physical hardware\n",
" - βœ… highly scalable\n",
"* - Accessibility\n",
" - ❌ access is limited to local network or complex setup for remote access\n",
" - βœ… accessible from anywhere with an internet connection\n",
"* - Collaboration\n",
" - ❌ sharing is hard\n",
" - βœ… sharing is possible with tools for access control\n",
"* - Data backup\n",
" - ❌ risk of data loss due to hardware failure or human error\n",
" - βœ… typically includes redundancy ([read more](https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html))\n",
"* - Performance\n",
" - βœ… faster since it does not depend on any network\n",
" - ❌ performance depends on internet speed or proximity to the data\n",
":::\n",
"\n",
"\n",
":::{admonition} Takeaways\n",
"\n",
"1. NASA datasets are still managed by DAACs, even though many datasets are moving to the cloud.\n",
"2. Users are encouraged to access the data directly in the cloud through AWS services (like cryocloud!)\n",
":::"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
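
The notebook above describes object storage as a flat structure and distinguishes copying a whole file from reading only part of it. The sketch below illustrates both ideas with an in-memory stand-in. It is not the S3 API: `ToyObjectStore`, its methods, and the granule-style key names are all hypothetical, and real access would go through a client library and a network connection.

```python
from typing import Dict, List, Optional


class ToyObjectStore:
    """In-memory stand-in for a bucket: a flat mapping of key -> bytes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def list_keys(self, prefix: str = "") -> List[str]:
        # There are no real directories: a "folder" is just a shared key prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def get(self, key: str, start: int = 0, end: Optional[int] = None) -> bytes:
        # Mimics a ranged read: only bytes[start:end] are returned.
        # With the defaults, the whole object is returned (a full download).
        return self._objects[key][start:end]


# Hypothetical keys and payloads, for illustration only.
store = ToyObjectStore()
store.put("ATL08/2024/granule_a.h5", b"HEADER" + bytes(1000))
store.put("ATL08/2024/granule_b.h5", b"HEADER" + bytes(2000))
store.put("ATL03/2024/granule_c.h5", b"HEADER" + bytes(3000))

keys = store.list_keys(prefix="ATL08/")                         # prefix filter, not a directory listing
header = store.get("ATL08/2024/granule_a.h5", start=0, end=6)   # partial read of 6 bytes
whole = store.get("ATL08/2024/granule_a.h5")                    # full read of the object

print(keys)
print(header, len(whole))
```

The partial read transfers 6 bytes where the full read transfers 1006. The same asymmetry is why lazy loading pays off mainly when a file's layout lets a reader find the bytes it needs without scanning the whole object.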