
New Hub: Carbon Plan #291

Closed
6 of 7 tasks
choldgraf opened this issue Mar 6, 2021 · 21 comments

Comments

@choldgraf
Member

choldgraf commented Mar 6, 2021

Background

CarbonPlan is a non-profit that works at the intersection of data science, climate modeling, and advocacy. @jhamman has been running a few JupyterHubs for CarbonPlan for a while now, and he'd like to transfer operational duties to 2i2c.

This is likely a bit more complex than the "standard Pangeo hubs" we have set up. I believe @jhamman runs a couple of hubs (perhaps he can provide context below).

Setup Information

  • Hub auth type: GitHub / Auth0
  • Hub administrators: @jhamman / @freeman-lab
  • Hub url: https://{cloud-region}.hub.carbonplan.org/
  • Hub logo:
  • Hub type: Dask-hub/Pangeo
  • Hub cluster: Existing k8s clusters on GKE and AKS (AKS needs to move to a new account / region)
  • Hub image: Multiple, probably curated by CarbonPlan

Important Information

  • Link to leads issue: 2i2c-org/leads#14
  • Hub config name: carbonplan/hub (currently here: https://github.com/carbonplan/hub)
  • Community champion: Joe Hamman (@jhamman)
  • Hub start date: ASAP
  • Hub end date: Once we've solved climate change 🌎
  • Hub important dates: None

Deploy To Do

  • Understand what hubs we need to deploy
  • Initial Hub deployments
  • Administrators able to log on
  • Community Champion satisfied with hub environment
  • Hub now in steady-state

Follow up issues

@choldgraf changed the title from "[Hub] - [Hub name]" to "Carbon Plan Hubs" on Mar 6, 2021
@jhamman

jhamman commented Mar 6, 2021

Hi All! Here's some info to fill in the gaps above. What we have right now looks pretty similar to the Pangeo-Cloud-Federation hubs. In fact, https://github.com/carbonplan/hub is a fork of that project with two hubs in it: one on GCP, one on Azure. The Azure hub needs to be moved to a new account, so it will need a rebuild. The Google hub is in more of a maintenance mode and doesn't need much work beyond updating to the new dask-hub chart. A lot of our devops falls into two main areas: 1) environment management -- we are working on multiple projects that require bespoke environments, and 2) custom resources -- sometimes this means GPUs, sometimes large VMs, and other times it's Dask-related. We're likely to need more than one image per hub (more like one per project we work on), so that's a consideration worth discussing.
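One common way to handle the "one image per project" point on Zero to JupyterHub-based hubs is a kubespawner profile_list, with one entry per project. A minimal sketch, assuming hypothetical project names and image tags (none of these are CarbonPlan's real images):

# Hedged sketch of per-project server options via kubespawner's profile_list,
# e.g. placed in a z2jh hub.extraConfig block. Image names are placeholders.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Base environment",
        "default": True,
        "kubespawner_override": {"image": "carbonplan/base-notebook:latest"},
    },
    {
        "display_name": "CMIP6 downscaling project",
        "kubespawner_override": {
            "image": "carbonplan/cmip6-notebook:latest",
            "mem_guarantee": "16G",
        },
    },
]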

Setup Information

  • Hub auth type: GitHub / Auth0
  • Hub administrators: @jhamman / @freeman-lab
  • Hub url: https://{cloud-region}.hub.carbonplan.org/
  • Hub logo:
    01-CarbonPlan-Wordmark-Carbon-RGB-Small
  • Hub type: Dask-hub/Pangeo
  • Hub cluster: Existing k8s clusters on GKE and AKS (AKS needs to move to a new account / region)
  • Hub image: Multiple, probably curated by CarbonPlan

Important Information

  • Link to leads issue: 2i2c-org/leads#14
  • Hub config name: carbonplan/hub (currently here: https://github.com/carbonplan/hub)
  • Community champion: Joe Hamman (@jhamman)
  • Hub start date: ASAP
  • Hub end date: Once we've solved climate change 🌎
  • Hub important dates: None

@yuvipanda
Member

@jhamman can you create AWS users with full access for me (yuvipanda@2i2c.org) and @damianavila (damianavila@2i2c.org)?
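For reference, a rough boto3 sketch of what creating those users could look like; the AdministratorAccess policy stands in for "full access", and the exact credential handling (console password, MFA) is left to CarbonPlan's account conventions:

# Hedged sketch only: create the two requested IAM users and grant broad access.
import boto3

iam = boto3.client("iam")
for username in ("yuvipanda", "damianavila"):
    iam.create_user(UserName=username)
    iam.attach_user_policy(
        UserName=username,
        PolicyArn="arn:aws:iam::aws:policy/AdministratorAccess",
    )
    # Programmatic credentials to hand over out-of-band.
    access_key = iam.create_access_key(UserName=username)["AccessKey"]
    print(username, access_key["AccessKeyId"])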

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 10, 2021
- staging and prod clusters that are exactly the same,
  with just domain differences
- Uses traditional autohttps + LoadBalancer to get traffic
  into the cluster. Could be nginx-ingress later on if necessary.
- Manual DNS entries for staging.carbonplan.2i2c.cloud and
  carbonplan.2i2c.cloud. Initial manual deploy with
  `proxy.https.enabled` set to false to complete deployment,
  fetch externalIP of `proxy-public` service, setup DNS,
  then re-deploy with `proxy.https.enabled` set to true.

Ref 2i2c-org#291
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 11, 2021
- staging and prod clusters that are exactly the same,
  with just domain differences
- Uses traditional autohttps + LoadBalancer to get traffic
  into the cluster. Could be nginx-ingress later on if necessary.
- Manual DNS entries for staging.carbonplan.2i2c.cloud and
  carbonplan.2i2c.cloud. Initial manual deploy with
  `proxy.https.enabled` set to false to complete deployment,
  fetch externalIP of `proxy-public` service, setup DNS,
  then re-deploy with `proxy.https.enabled` set to true.

Ref 2i2c-org#291
@yuvipanda
Member

I'm making dask worker instances be spot instances now. Still need to figure out how users can effectively select instance size for dask workers.

@consideRatio
Member

consideRatio commented May 15, 2021

Should users really select instance size rather than pod CPU/memory requests, though? I have two suggestions based on experience with learning-2-learn/l2lhub-deployment on AWS; I don't know whether they're relevant, but here goes.

  1. One node type per instance group.
    Make sure to have only one instance group of a certain CPU size so the cluster-autoscaler can choose nodes correctly based on a pending pod's resource requests. If you put node types of different sizes in an instance group, it may fail to scale up a suitable node.

  2. A simple worker resource requests UX
    When letting the user choose resource requests for a worker in the cluster options, I think it makes sense to provide choices that fit well on the available nodes, in discrete steps. In other words, instead of free-choice CPU / memory, just provide a list of allowed choices such as: 1 CPU 4 GB, 2 CPU 8 GB, 4 CPU 16 GB, 8 CPU 32 GB, 16 CPU 64 GB. This configuration needs some slack to ensure pods fit even though a node also has some daemonsets running on it.

Example suggestion 1 - Configuration of AWS instance groups

This is from infra/eksctl-cluster-config.yaml.

  # Important about spot nodes!
  #
  # "Due to the Cluster Autoscaler’s limitations (more on that in the next
  # section) on which Instance type to expand, it’s important to choose
  # instances of the same size (vCPU and memory) for each InstanceGroup."
  #
  # ref: https://medium.com/riskified-technology/run-kubernetes-on-aws-ec2-spot-instances-with-zero-downtime-f7327a95dea
  #
  - name: worker-xlarge
    availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    volumeSize: 80
    labels:
      worker: "true"
    taints:
      worker: "true:NoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/worker: "true"
      k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
    iam:
      withAddonPolicies:
        autoScaler: true
    # Spot instance configuration
    instancesDistribution:  # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
      instanceTypes:
        - m5.xlarge    # 57 pods, 4 cpu, 16 GB
        - m5a.xlarge    # 57 pods, 4 cpu, 16 GB
        - m5n.xlarge    # 57 pods, 4 cpu, 16 GB
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"  # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

  - name: worker-2xlarge
    availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    volumeSize: 80
    labels:
      worker: "true"
    taints:
      worker: "true:NoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/worker: "true"
      k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
    iam:
      withAddonPolicies:
        autoScaler: true
    # Spot instance configuration
    instancesDistribution:  # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
      instanceTypes:
        - m5.2xlarge   # 57 pods, 8 cpu, 32 GB
        - m5a.2xlarge   # 57 pods, 8 cpu, 32 GB
        - m5n.2xlarge   # 57 pods, 8 cpu, 32 GB
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"  # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

  - name: worker-4xlarge
    availabilityZones: [us-west-2d, us-west-2b, us-west-2a]
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    volumeSize: 80
    labels:
      worker: "true"
    taints:
      worker: "true:NoSchedule"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/worker: "true"
      k8s.io/cluster-autoscaler/node-template/taint/worker: "true:NoSchedule"
    iam:
      withAddonPolicies:
        autoScaler: true
    # Spot instance configuration
    instancesDistribution:  # ref: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-autoscaling-autoscalinggroup-instancesdistribution.html
      instanceTypes:
        - m5.4xlarge   # 233 pods, 16 cpu, 64 GB
        - m5a.4xlarge   # 233 pods, 16 cpu, 64 GB
        - m5n.4xlarge   # 233 pods, 16 cpu, 64 GB
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"  # ref: https://aws.amazon.com/blogs/compute/introducing-the-capacity-optimized-allocation-strategy-for-amazon-ec2-spot-instances/

Example suggestion 2 - Configuration of dask worker requests

This is Helm values configuring a daskhub Helm chart deployment.

daskhub:
  jupyterhub:
    singleuser:
      extraEnv:
        # The default worker image matches the singleuser image.
        DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
        DASK_DISTRIBUTED__DASHBOARD_LINK: '/user/{JUPYTERHUB_USER}/proxy/{port}/status'
        DASK_LABEXTENSION__FACTORY__MODULE: 'dask_gateway'
        DASK_LABEXTENSION__FACTORY__CLASS: 'GatewayCluster'

  # Reference on the configuration options:
  # https://github.com/dask/dask-gateway/blob/master/resources/helm/dask-gateway/values.yaml
  dask-gateway:
    gateway:
      prefix: "/services/dask-gateway"  # Connect to Dask-Gateway through a JupyterHub service.
      auth:
        type: jupyterhub  # Use JupyterHub to authenticate with Dask-Gateway
      extraConfig:
        # This configuration represents options that can be presented to users
        # that want to create a Dask cluster using dask-gateway. For more
        # details, see https://gateway.dask.org/cluster-options.html
        #
        # The goal is to provide a simple configuration that allows the user some
        # flexibility while also fitting well on AWS nodes that all have a 1:4
        # ratio between CPU and GB of memory. By providing the username label,
        # we help administrators track user pods.
        option_handler: |
          from dask_gateway_server.options import Options, Select, String, Mapping
          def cluster_options(user):
              def option_handler(options):
                  if ":" not in options.image:
                      raise ValueError("When specifying an image you must also provide a tag")

                  extra_labels = {
                      "hub.jupyter.org/username": user.name,
                  }
                  chosen_worker_cpu = int(options.worker_specification.split("CPU")[0])
                  chosen_worker_memory = 4 * chosen_worker_cpu

                  # We multiply the requests by a fraction to ensure that the
                  # workers fit well within a node that needs some resources
                  # reserved for system pods.
                  return {
                      "image": options.image,
                      "worker_cores": 0.80 * chosen_worker_cpu,
                      "worker_cores_limit": chosen_worker_cpu,
                      "worker_memory": "%fG" % (0.90 * chosen_worker_memory),
                      "worker_memory_limit": "%fG" % chosen_worker_memory,
                      "scheduler_extra_pod_labels": extra_labels,
                      "worker_extra_pod_labels": extra_labels,
                      "environment": options.environment,
                  }
              return Options(
                  Select(
                      "worker_specification",
                      ["1CPU, 4GB", "2CPU, 8GB", "4CPU, 16GB", "8CPU, 32GB", "16CPU, 64GB"],
                      default="1CPU, 4GB",
                      label="Worker specification",
                  ),
                  String("image", default="my-custom-image:latest", label="Image"),
                  Mapping("environment", {}, label="Environment variables"),
                  handler=option_handler,
              )
          c.Backend.cluster_options = cluster_options
        idle: |
          # timeout after 30 minutes of inactivity
          c.KubeClusterConfig.idle_timeout = 1800

@yuvipanda
Member

@consideRatio yeah, agree that Select is the way to go. We shouldn't expose users directly to instance names.

We perhaps need to figure out a way to keep this config in sync with kubespawner's profiles.
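One possible shape for that, sketched here purely as an illustration: keep a single table of worker sizes and generate both the kubespawner profiles and the dask-gateway Select choices from it, so the two can't drift apart (names like WORKER_SIZES are made up for the sketch, not existing config):

# Illustrative sketch: one source of truth for worker sizes that both
# kubespawner's profile_list and dask-gateway's cluster options are built from.
WORKER_SIZES = [
    {"cpu": 1, "mem_gb": 4},
    {"cpu": 2, "mem_gb": 8},
    {"cpu": 4, "mem_gb": 16},
    {"cpu": 8, "mem_gb": 32},
    {"cpu": 16, "mem_gb": 64},
]

def kubespawner_profiles(sizes):
    """Build kubespawner profile_list entries from the shared size table."""
    return [
        {
            "display_name": f"{s['cpu']} CPU, {s['mem_gb']} GB",
            "kubespawner_override": {
                "cpu_limit": s["cpu"],
                "mem_limit": f"{s['mem_gb']}G",
            },
        }
        for s in sizes
    ]

def dask_gateway_choices(sizes):
    """Build the strings offered by the dask-gateway Select option."""
    return [f"{s['cpu']}CPU, {s['mem_gb']}GB" for s in sizes]

# e.g.: c.KubeSpawner.profile_list = kubespawner_profiles(WORKER_SIZES)
#       Select("worker_specification", dask_gateway_choices(WORKER_SIZES), ...)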

@damianavila
Contributor

We perhaps need to figure out a way to keep this config in sync with kubespawner's profiles.

Can you expand on these thoughts, @yuvipanda? Thx!

@choldgraf
Member Author

Hey all - so the Carbon Plan hub exists now right? What else do we need to do before resolving this issue?

@yuvipanda
Member

I think we need to:

  • Make the dask workers be spot instances
  • Figure out dask worker resource sizes

@choldgraf
Member Author

Cool - I've updated the top comment with these new steps (feel free to update it yourself in general if you like!)

@jhamman

jhamman commented May 25, 2021

In addition to the worker spot-instance todos, I think we may need a bigger VM in the cluster config. Selecting our "Huge: r5.8xlarge ~32 CPU, ~256G RAM" option results in a no-scale-up timeout:
(screenshot of the cluster-autoscaler "no scale up" event)

@yuvipanda
Member

@jhamman try again? I'll have a PR up shortly, but a deploy seems to help.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue May 25, 2021
They apparently have less than 250G total allocatable space,
despite having 256G total RAM

Ref 2i2c-org#291
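The takeaway is that memory guarantees have to fit the node's allocatable memory, not its nominal RAM, or the autoscaler will never find a node that satisfies the request. A hedged sketch of a profile entry sized that way (the 240G and 246G figures are assumed for illustration, not the values actually deployed):

# Sketch only: keep the "huge" profile's guarantee below the r5.8xlarge's
# allocatable memory, since kubelet/system reservations eat into the 256G.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Huge: r5.8xlarge ~32 CPU, ~256G RAM",
        "kubespawner_override": {
            "cpu_guarantee": 28,
            "mem_guarantee": "240G",  # assumed; must stay under allocatable
            "mem_limit": "246G",
        },
    },
]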
@yuvipanda
Member

(#430)

@jhamman

jhamman commented May 25, 2021

Seems to be working now. Thanks @yuvipanda!

@jhamman

jhamman commented Jun 9, 2021

Hey folks! A few updates / requests / questions after using the hub for a few weeks:

Things are generally working great

@orianac, @tcchiao, and I have been using the hubs daily. The hub has been quite stable and feature complete, so nice work on the initial rollout!

Config update requests

A few things we'd like to change in the configuration:

  1. Default user environment -> jupyterlab
  2. Increase the maximum memory per worker in dask-gateway clusters, currently set to 32 GB; asking for a change to 64 GB

Questions

  1. Did spot instances get implemented for dask clusters?
  2. What is the instance type being used for the dask workers? We'll be running workloads that are pretty memory-hungry, so if it's easy to do, a node pool of high-memory instances (X1?) may be in order.
  3. I'm not sure if this is on 2i2c's radar just yet, but I'm curious what the story around managing storage permissions via service accounts is. The Pangeo hubs on AWS and GCP have managed to dial this in reasonably well, so I'm just surfacing it for future conversation.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 14, 2021
Currently, users have to set these manually, so limiting it
to 32 prevents provisioning instances larger than that.

Ref 2i2c-org#291 (comment)
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Jun 14, 2021
- Requested by Joe in 2i2c-org#291 (comment)
- Refresh auth credentials, they had expired. Fixed in
  2i2c-org#381
@yuvipanda
Member

Default user environment -> jupyterlab

Done

Increase the maximum memory per worker in dask-gateway clusters, currently set to 32 GB; asking for a change to 64 GB

Done, I removed the upper limit

What is the instance type being used for the dask workers? We'll be running workloads that are pretty memory-hungry, so if it's easy to do, a node pool of high-memory instances (X1?) may be in order.

Same set of instances you see when you create a new user server. Dask and notebook nodes mirror config.

I'm not sure if this is on 2i2c's radar just yet, but I'm curious what the story around managing storage permissions via service accounts is. The Pangeo hubs on AWS and GCP have managed to dial this in reasonably well, so I'm just surfacing it for future conversation.

A good chunk of fiddling is happening on the GCP setup, not so much on AWS yet. PANGEO_SCRATCH should hopefully make its way over soon.
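For context, the way the Pangeo hubs typically wire this up is a per-user prefix in a scratch bucket exposed via an environment variable; a minimal sketch, with a hypothetical bucket name (roughly equivalent to setting singleuser.extraEnv in the hub's helm values):

# Sketch of the usual PANGEO_SCRATCH pattern; the bucket name is a placeholder.
# $(JUPYTERHUB_USER) is expanded by Kubernetes when the pod starts, so each
# user gets their own prefix inside the shared scratch bucket.
c.KubeSpawner.environment = {
    "PANGEO_SCRATCH": "s3://carbonplan-scratch/$(JUPYTERHUB_USER)",
}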

@choldgraf
Member Author

@jhamman in a recent meeting we decided to close out this issue and consider it "finished" since it is just the "first deployment" issue, and the CP hub has been running for a couple months now. We've got the two extra issues about spot instances and dask workers on our deliverables board and can track those improvements there. I'll close this, but if you've got a strong objection to that feel free to speak up and we can discuss!
