
Use dask/daskhub helm chart #697

Merged 9 commits into pangeo-data:staging on Aug 31, 2020

Conversation

TomAugspurger (Member)

No description provided.

@TomAugspurger (Member Author)

I'm having some trouble connecting to the gateway from within the hub:

ClientConnectorSSLError: Cannot connect to host proxy-public:443 ssl:default [[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1076)]

That's connecting to http://proxy-public/services/dask-gateway/api/v1/clusters/. https://proxy-public/services/dask-gateway also doesn't work. Looking into it now.

@TomAugspurger (Member Author)

I think this demonstrates the issue: from a singleuser pod in the Kubernetes cluster, I want to make a request to http://proxy-public/services/dask-gateway/. When HTTPS is not enabled, things work fine; I'm running into issues when it is enabled:

(notebook) jovyan@jupyter-tomaugspurger:~$ curl -LI http://proxy-public/services/dask-gateway/ -vv 
*   Trying 10.39.254.203:80...
* Connected to proxy-public (10.39.254.203) port 80 (#0)
> HEAD /services/dask-gateway/ HTTP/1.1
> Host: proxy-public
> User-Agent: curl/7.69.1
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 307 Temporary Redirect
HTTP/1.1 307 Temporary Redirect
< Location: https://proxy-public/services/dask-gateway/
Location: https://proxy-public/services/dask-gateway/
< Date: Wed, 26 Aug 2020 18:53:59 GMT
Date: Wed, 26 Aug 2020 18:53:59 GMT
< Content-Length: 18
Content-Length: 18
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8

< 
* Connection #0 to host proxy-public left intact
* Issue another request to this URL: 'https://proxy-public/services/dask-gateway/'
*   Trying 10.39.254.203:443...
* Connected to proxy-public (10.39.254.203) port 443 (#1)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /srv/conda/envs/notebook/ssl/cacert.pem
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS alert, internal error (592):
* error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
* Closing connection 1
curl: (35) error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
(notebook) jovyan@jupyter-tomaugspurger:~$ 

10.39.254.203 is the CLUSTER-IP for proxy-public.

$ kubectl -n staging get svc proxy-public
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
proxy-public   LoadBalancer   10.39.254.203   34.69.173.244   443:30970/TCP,80:30477/TCP   55d

@consideRatio do you have any guesses here (just off the top of your head, I'm happy to dig into this myself)? This is being used for the address passed to dask_gateway.Gateway().

@consideRatio (Member) commented Aug 27, 2020

Ah, so proxy-public is a service that switches its target to the autohttps pod if automatic certificate acquisition and TLS termination is used. The flow is then: proxy-public svc into the autohttps pod, into the proxy-http service, into the proxy pod. So, if you have enabled the automatic HTTPS stuff, you want to send traffic to proxy-http instead.

Alternatively, send HTTPS traffic to the proxy-public svc for a detour through the autohttps pod.
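
A minimal sketch of the two options from a client inside the cluster (the service names and port come from this thread; the auth choice is an assumption, not something specified here):

from dask_gateway import Gateway

# Option 1: skip the autohttps pod and talk to proxy-http directly over plain HTTP
gateway = Gateway("http://proxy-http:8000/services/dask-gateway", auth="jupyterhub")

# Option 2: take the detour through the autohttps pod by using HTTPS against proxy-public
gateway = Gateway("https://proxy-public/services/dask-gateway", auth="jupyterhub")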

@TomAugspurger (Member Author)

Ohh thanks! I think when I tried proxy-http earlier I used http://proxy-http/services/dask-gateway:8000 instead of http://proxy-http:8000/services/dask-gateway. 😬

OK, so I'll need to figure out a semi-reliable way of detecting whether HTTPS is enabled, and if it is, we'll set the gateway address to http://proxy-http:8000/services/dask-gateway. Thanks!

@consideRatio (Member)

If you are a pod in the namespace where the proxy-http service exists, you will have an env var named something like PROXY_HTTP_SVC, I think. It is one of various env vars set by the k8s kubelet to help containers find the IPs etc. of the k8s services in the namespace they run in.

So, looking for these env vars indicates service availability, which tells you whether or not you should go there.

@TomAugspurger (Member Author)

Thanks. I'm going to use PROXY_HTTP_SERVICE_HOST and PROXY_HTTP_SERVICE_PORT. Testing those out now but I think it'll work just fine.
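
A rough sketch of that detection (the variable names follow the standard Kubernetes service-discovery convention for a service called proxy-http; the fallback address is the one this deployment used before the change; this is not the final implementation):

import os

if "PROXY_HTTP_SERVICE_HOST" in os.environ:
    # autohttps / TLS termination is enabled, so talk to proxy-http directly
    host = os.environ["PROXY_HTTP_SERVICE_HOST"]
    port = os.environ["PROXY_HTTP_SERVICE_PORT"]
    address = f"http://{host}:{port}/services/dask-gateway"
else:
    # plain-HTTP deployment: proxy-public routes straight to the proxy pod
    address = "http://proxy-public/services/dask-gateway"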

TomAugspurger added a commit to TomAugspurger/helm-chart-1 that referenced this pull request Aug 27, 2020
As discovered in
pangeo-data/pangeo-cloud-federation#697, the
current use of `proxy-public` doesn't work when https is enabled. We
detect this and use the appropriate service now.
TomAugspurger added a commit to dask/helm-chart that referenced this pull request Aug 27, 2020
* Handle https-enabled JupyterHub deployments

As discovered in
pangeo-data/pangeo-cloud-federation#697, the
current use of `proxy-public` doesn't work when https is enabled. We
detect this and use the appropriate service now.
TomAugspurger marked this pull request as ready for review August 27, 2020 15:41
@TomAugspurger (Member Author)

OK this should mostly be good to go.

@tjcrone could you update the ooi secrets files to change the top key from pangeo to daskhub?

diff --git a/deployments/gcp-uscentral1b/secrets/staging.yaml b/deployments/gcp-uscentral1b/secrets/staging.yaml
index 3b79dda..257cefe 100644
--- a/deployments/gcp-uscentral1b/secrets/staging.yaml
+++ b/deployments/gcp-uscentral1b/secrets/staging.yaml
@@ -1,4 +1,4 @@
-pangeo:
+daskhub:

Both ooi/secrets/staging.yaml and ooi/secrets.prod.yaml.

@TomAugspurger (Member Author)

cc @scottyhq as well if you have any questions / concerns. In theory there shouldn't really be any changes, other than a few of the environment variables being set for us automatically now.

@scottyhq (Member)

Awesome, thanks for making this happen! Linking to pangeo-data/helm-chart#129 for future reference. But merge away!

mem_guarantee: 25G
environment: {'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility'}
tolerations: [{'key': 'nvidia.com/gpu','operator': 'Equal','value': 'present','effect': 'NoSchedule'}]
extra_resource_limits: {"nvidia.com/gpu": "1"}
Member

I think applying this resource request will make the toleration be applied automatically by some controller in k8s, but it won't hurt to also manually apply the toleration.

Member Author

These are from a merge conflict, but @scottyhq might want to take a look at the comment :)

Member

If I understand correctly, specifying extra_resource_limits: {"nvidia.com/gpu": "1"} automatically sets tolerations: [{'key': 'nvidia.com/gpu','operator': 'Equal','value': 'present','effect': 'NoSchedule'}]?

Member

Honestly, these settings were copied over from GCP and we never did too much experimentation to see what was necessary or not. Also, things may have changed with more recent AMI versions and CUDA setups. This issue has some additional details: jupyterhub/zero-to-jupyterhub-k8s#994 (comment)
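
For reference, a hedged sketch of how these overrides might sit in a KubeSpawner profile_list entry (the profile name and image tag are illustrative, not taken from this deployment):

c.KubeSpawner.profile_list = [
    {
        "display_name": "ML Notebook (GPU)",
        "description": "https://github.com/pangeo-data/pangeo-docker-images/tree/master/ml-notebook",
        "kubespawner_override": {
            "image": "pangeo/ml-notebook:master",  # assumed image for this profile
            "mem_guarantee": "25G",
            "environment": {"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"},
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
            # per the discussion above, a k8s controller may add this toleration
            # automatically for GPU requests, but setting it explicitly is harmless
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Equal",
                 "value": "present", "effect": "NoSchedule"},
            ],
        },
    },
]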

    kubespawner_override:
      image: pangeo/base-notebook:master
  - display_name: "Staging ML-notebook"
    description: "https://github.com/pangeo-data/pangeo-docker-images/tree/master/ml-notebook"
Member

I think the title or description should indicate that you get a GPU machine, as that influences how the user may want to manually shut down the pod to save some money.

Member

Good call. Maybe "ML Notebook with GPU, please only use if you need it ;)"

TomAugspurger mentioned this pull request Aug 31, 2020
TomAugspurger merged commit 941e17b into pangeo-data:staging Aug 31, 2020