
Add PySpark with RasterFrames Support profile #54

Draft · wants to merge 4 commits into base: main

Conversation

@pomadchin commented Jan 24, 2022

This PR adjusts the Terraform deployment as well as the Helm charts to support the PySpark profile.

Some significant improvements were achieved along the way:

  • Enables RBAC on the cluster (a breaking change that will require a careful migration path for the running cluster)
  • Each Jupyter notebook is spawned in a user-specific namespace, which allows restricting user access in a stricter fashion (see the sketch after this list)
  • Adds proper hub and autohttps roles to give services cross-namespace API access
  • Adds a new extra profile with JDK 11, PySpark, and RasterFrames installed
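
For context, a minimal sketch (not the exact code from this PR) of how per-user namespaces can be enabled with KubeSpawner; it assumes a kubespawner version that ships the enable_user_namespaces and user_namespace_template traits, and the template value is illustrative:

    # Minimal sketch: spawn each user's notebook pod in its own namespace.
    # Assumes a kubespawner version with these traits; the template below
    # is an illustrative choice, not necessarily the value used in this PR.
    c.KubeSpawner.enable_user_namespaces = True
    c.KubeSpawner.user_namespace_template = "{hubnamespace}-{username}"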

TODO:

  • Clean up Spark CPU pools
  • Expose the Spark UI in the spawned notebooks; add more details to the PySpark profile description
  • Add the Spark env init script
  • Add a Spark executors template to support node tolerations
  • Allocate notebooks per namespace
  • Add hub and proxy ClusterRoles
  • Make AKS RBAC enabled
  • Rename the ensure_service_account function
  • Change the Spark executor image
  • Expose the Spark UI in the pyspark profile case
  • Clean up hub and autohttps ClusterRoles
@ghost commented Jan 24, 2022

CLA assistant check
All CLA requirements met.

@@ -85,10 +85,17 @@ daskhub:
c.KubeSpawner.extra_labels = {}
kubespawner: |
c.KubeSpawner.start_timeout = 15 * 60 # 15 minutes
# pass the parent namespace through, needed for pre_spawn_hook to copy resources
c.KubeSpawner.environment['NAMESPACE_PARENT'] = c.KubeSpawner.namespace
@pomadchin (Author) commented:

TODO: try the template replacement instead of setting the parent namespace variable.
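
For illustration, a hypothetical sketch of how a pre_spawn_hook could use NAMESPACE_PARENT to copy a resource from the hub namespace into the per-user namespace; the secret name is an assumption, not taken from this PR:

    # Hypothetical sketch: copy a pull secret from the parent (hub)
    # namespace into the per-user namespace before the pod spawns.
    # The secret name is an assumption; the real hook in this PR may
    # copy different resources.
    from kubernetes import client, config

    def pre_spawn_hook(spawner):
        config.load_incluster_config()
        v1 = client.CoreV1Api()
        parent = spawner.environment.get('NAMESPACE_PARENT', '')
        secret = v1.read_namespaced_secret('acr-pull-secret', parent)
        # reset metadata so the copy is valid in the target namespace
        secret.metadata = client.V1ObjectMeta(name=secret.metadata.name,
                                              namespace=spawner.namespace)
        v1.create_namespaced_secret(spawner.namespace, secret)

    c.KubeSpawner.pre_spawn_hook = pre_spawn_hook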

@@ -147,13 +243,31 @@ daskhub:

c.KubeSpawner.pre_spawn_hook = pre_spawn_hook

# it is the spawner post stop hook, not related to the notebook lifecycle
# we don't need it
post_stop_hook: |
@pomadchin (Author) commented:

We don't need it here; I will remove it unless we want to keep it here for fun. cc @TomAugspurger
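
For reference, a spawner post_stop_hook fires after the single-user pod stops, so it is unrelated to the notebook lifecycle; a no-op sketch:

    # No-op sketch of a spawner post_stop_hook; it runs after the user's
    # pod has stopped, so per-notebook cleanup does not belong here.
    def post_stop_hook(spawner):
        spawner.log.info("Pod for %s stopped", spawner.user.name)

    c.KubeSpawner.post_stop_hook = post_stop_hook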

@@ -23,6 +23,7 @@ module "resources" {
jupyterhub_singleuser_image_name = "pcccr.azurecr.io/public/planetary-computer/python"
jupyterhub_singleuser_image_tag = "2022.01.17.0"
python_image = "pcccr.azurecr.io/public/planetary-computer/python:2022.01.17.0"
pyspark_image = "daunnc/planetary-computer-pyspark:2021.11.29.0-gdal3.4-3.1-rf"
@pomadchin (Author) commented:

TODO: make PRs against MSFTPC repos with containers.

'spark.driver.memory': '1g',
'spark.executor.cores': '3',
'spark.kubernetes.namespace': namespace_user,
'spark.kubernetes.container.image': 'quay.io/daunnc/spark-k8s-py-3.8.8-gdal32-msftpc:3.1.2',
@pomadchin (Author) commented:

Should be replaced by the MSFTPC containers.

namespace_user = os.environ.get('NAMESPACE_USER', '')
spark_config = {
'spark.master': 'k8s://https://kubernetes.default.svc.cluster.local',
'spark.app.name': 'STAC API with RF in K8S',
@pomadchin (Author) commented Jan 24, 2022:

The Spark app name should probably be picked up from the notebook name, or a default one should be used (more Microsofty).
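
Putting the config fragments above together, a minimal sketch of how the profile might start a Spark-on-Kubernetes session from inside the notebook; it assumes pyspark is installed and RBAC allows the driver to create executor pods:

    # Minimal sketch assembled from the spark_config fragments above;
    # not the exact code in this PR.
    import os
    from pyspark.sql import SparkSession

    namespace_user = os.environ.get('NAMESPACE_USER', '')
    spark = (
        SparkSession.builder
        .master('k8s://https://kubernetes.default.svc.cluster.local')
        .appName('STAC API with RF in K8S')
        .config('spark.driver.memory', '1g')
        .config('spark.executor.cores', '3')
        .config('spark.kubernetes.namespace', namespace_user)
        .config('spark.kubernetes.container.image',
                'quay.io/daunnc/spark-k8s-py-3.8.8-gdal32-msftpc:3.1.2')
        .getOrCreate()
    )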

default: "True"
description: '4 cores, 32 GiB of memory. <a href="https://github.com/pangeo-data/pangeo-docker-images" target="_blank">Pangeo Notebook</a> environment powered by <a href="https://rasterframes.io/">Raster Frames</a>, <a href="http://geotrellis.io/">GeoTrellis</a> and <a href="https://spark.apache.org/">Apache Spark</a>.'
kubespawner_override:
image: "${pyspark_image}"
@pomadchin (Author) commented:

Note: the only difference between the PySpark and Python images is the set of underlying dependencies.
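
As a rough Python equivalent of the YAML profile fragment above, expressed as a KubeSpawner profile_list entry; the display name and the pyspark_image variable are illustrative assumptions:

    # Hypothetical Python rendering of the YAML profile above; field
    # names follow KubeSpawner.profile_list, values are illustrative.
    pyspark_image = "daunnc/planetary-computer-pyspark:2021.11.29.0-gdal3.4-3.1-rf"

    c.KubeSpawner.profile_list = [
        {
            "display_name": "PySpark with RasterFrames",  # assumed name
            "default": True,
            "description": ("4 cores, 32 GiB of memory. Pangeo Notebook "
                            "environment powered by RasterFrames, "
                            "GeoTrellis and Apache Spark."),
            "kubespawner_override": {"image": pyspark_image},
        },
    ]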
