Use dask/daskhub helm chart #697
Conversation
I'm having some trouble connecting to the gateway from within the hub:

ClientConnectorSSLError: Cannot connect to host proxy-public:443 ssl:default [[SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:1076)]

That's when connecting to the proxy-public service from inside the cluster.
I think this demonstrates the issue. On a singleuser pod in the Kubernetes cluster, I want to make a request to http://proxy-public/services/dask-gateway/:

(notebook) jovyan@jupyter-tomaugspurger:~$ curl -LI http://proxy-public/services/dask-gateway/ -vv
* Trying 10.39.254.203:80...
* Connected to proxy-public (10.39.254.203) port 80 (#0)
> HEAD /services/dask-gateway/ HTTP/1.1
> Host: proxy-public
> User-Agent: curl/7.69.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 307 Temporary Redirect
HTTP/1.1 307 Temporary Redirect
< Location: https://proxy-public/services/dask-gateway/
Location: https://proxy-public/services/dask-gateway/
< Date: Wed, 26 Aug 2020 18:53:59 GMT
Date: Wed, 26 Aug 2020 18:53:59 GMT
< Content-Length: 18
Content-Length: 18
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host proxy-public left intact
* Issue another request to this URL: 'https://proxy-public/services/dask-gateway/'
* Trying 10.39.254.203:443...
* Connected to proxy-public (10.39.254.203) port 443 (#1)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /srv/conda/envs/notebook/ssl/cacert.pem
CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS alert, internal error (592):
* error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
* Closing connection 1
curl: (35) error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error
(notebook) jovyan@jupyter-tomaugspurger:~$

10.39.254.203 is the CLUSTER-IP for the proxy-public service.
@consideRatio do you have any guesses here (just off the top of your head, I'm happy to dig into this myself)? This is being used for the dask/daskhub helm chart in this PR.
Ah, so proxy-public is a service that switches its target to the autohttps pod when automatic cert acquisition and TLS termination are used. The flow is then: proxy-public svc into autohttps pod into proxy-http service into proxy pod. So, you want to send traffic to proxy-http instead if you have enabled the automatic https stuff. Alternatively, use https traffic against the proxy-public svc for a detour through the autohttps pod.
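To illustrate that detour, a request that targets the proxy-http service directly should stay on plain HTTP and never reach the TLS-terminating autohttps pod. This is only a sketch: the service name proxy-http comes from the comment above, but the port (8000 here) is an assumption that depends on the chart version, so check `kubectl get svc proxy-http` in your namespace.

```python
import requests

# Sketch only: hit the configurable-http-proxy via the proxy-http service instead
# of proxy-public, so the request is served over plain HTTP and never reaches the
# autohttps pod that issued the 307 redirect in the curl output above.
# NOTE: port 8000 is an assumption; verify with `kubectl get svc proxy-http`.
resp = requests.head(
    "http://proxy-http:8000/services/dask-gateway/",
    allow_redirects=True,
    timeout=10,
)
print(resp.status_code, resp.url)
```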
Ohh, thanks! OK, so I'll need to figure out a semi-reliable way of detecting whether https is enabled, and if it is then we'll set the gateway address to the proxy-http service instead.
If you are a pod in the namespace where the proxy-http service exists, you will have an env var named PROXY_HTTP_SVC or something like that, I think; it's one of various env vars set by the k8s kubelet to help containers find the IPs etc. of the k8s services in the namespace they run in. So, looking for these env vars indicates service availability, which indicates whether or not you should go there.
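For reference, a quick way to see those kubelet-injected variables from the singleuser pod (a sketch; the exact names follow the Kubernetes convention `<SERVICE_NAME>_SERVICE_HOST` / `<SERVICE_NAME>_SERVICE_PORT` with dashes mapped to underscores, so the proxy-http service shows up as `PROXY_HTTP_SERVICE_HOST`):

```python
import os

# List the Service discovery env vars kubelet injects for Services in this
# namespace; on a z2jh deployment with automatic HTTPS enabled you should see
# PROXY_HTTP_SERVICE_HOST / PROXY_HTTP_SERVICE_PORT among them.
for key in sorted(os.environ):
    if key.endswith("_SERVICE_HOST") or key.endswith("_SERVICE_PORT"):
        print(f"{key}={os.environ[key]}")
```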
Thanks. I'm going to use that to detect whether https is enabled.
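Something along these lines, presumably. This is a hedged sketch of the detection, not necessarily the code that ended up in dask-gateway: if the proxy-http service is discoverable through its injected env vars, TLS termination is in play and the gateway address should point at it; otherwise fall back to proxy-public. The port values are assumptions.

```python
import os

from dask_gateway import Gateway

# Sketch of the detection described above (not necessarily the merged implementation).
# The proxy-http Service only exists when z2jh's automatic HTTPS is enabled, so the
# presence of its kubelet-injected env var is the signal.
if "PROXY_HTTP_SERVICE_HOST" in os.environ:
    host = os.environ["PROXY_HTTP_SERVICE_HOST"]
    port = os.environ.get("PROXY_HTTP_SERVICE_PORT", "8000")  # port is an assumption
else:
    host, port = "proxy-public", "80"

address = f"http://{host}:{port}/services/dask-gateway/"

# auth="jupyterhub" assumes the gateway authenticates with JupyterHub API tokens.
gateway = Gateway(address=address, auth="jupyterhub")
print(address, gateway.list_clusters())
```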
* Handle https-enabled JupyterHub deployments

As discovered in pangeo-data/pangeo-cloud-federation#697, the current use of `proxy-public` doesn't work when https is enabled. We detect this and use the appropriate service now.
OK, this should mostly be good to go. @tjcrone could you update the ooi secrets files to change the top-level key from `pangeo` to `daskhub`? Something like:

diff --git a/deployments/gcp-uscentral1b/secrets/staging.yaml b/deployments/gcp-uscentral1b/secrets/staging.yaml
index 3b79dda..257cefe 100644
--- a/deployments/gcp-uscentral1b/secrets/staging.yaml
+++ b/deployments/gcp-uscentral1b/secrets/staging.yaml
@@ -1,4 +1,4 @@
-pangeo:
+daskhub:

Both the staging and prod secrets files need the change.
cc @scottyhq as well if you have any questions / concerns. In theory there shouldn't really be any changes, other than a few of the environment variables being set for us automatically now.
Awesome, thanks for making this happen! Linking to pangeo-data/helm-chart#129 for future reference. But merge away!
deployments/icesat2/config/prod.yaml
Outdated
mem_guarantee: 25G
environment: {'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility'}
tolerations: [{'key': 'nvidia.com/gpu','operator': 'Equal','value': 'present','effect': 'NoSchedule'}]
extra_resource_limits: {"nvidia.com/gpu": "1"}
I think applying this resource request will make the toleration automatically be applied by some controller in k8s, but it won't hurt to also apply the toleration manually.
These are from a merge conflict, but @scottyhq might want to take a look at the comment :)
If I understand correctly, specifying extra_resource_limits: {"nvidia.com/gpu": "1"} automatically sets tolerations: [{'key': 'nvidia.com/gpu','operator': 'Equal','value': 'present','effect': 'NoSchedule'}]?
Honestly, these settings were copied over from GCP and we never did too much experimentation to see what was necessary or not. Also, things may have changed with more recent AMI versions and CUDA setups. This issue has some additional details: jupyterhub/zero-to-jupyterhub-k8s#994 (comment)
kubespawner_override:
  image: pangeo/base-notebook:master
- display_name: "Staging ML-notebook"
  description: "https://github.com/pangeo-data/pangeo-docker-images/tree/master/ml-notebook"
I think the title or description should indicate that you get a GPU machine, as that influences whether the user may want to manually shut down the pod to save some money.
Good call. Maybe "ML Notebook with GPU, please only use if you need it ;)"
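Putting the two review comments together, here is a hedged sketch of what the profile entry could look like, written as a jupyterhub_config.py / kubespawner profile_list snippet rather than the chart's YAML. The image name and wording are illustrative, not the exact icesat2 config.

```python
# Illustrative kubespawner profile_list entry combining the settings discussed above.
# The display name makes it obvious a GPU is attached, and the explicit toleration
# is kept even if an admission controller would add it automatically.
c.KubeSpawner.profile_list = [
    {
        "display_name": "ML notebook (GPU attached, please stop your server when done)",
        "description": "https://github.com/pangeo-data/pangeo-docker-images/tree/master/ml-notebook",
        "kubespawner_override": {
            "image": "pangeo/ml-notebook:master",
            "mem_guarantee": "25G",
            "environment": {"NVIDIA_DRIVER_CAPABILITIES": "compute,utility"},
            "tolerations": [
                {
                    "key": "nvidia.com/gpu",
                    "operator": "Equal",
                    "value": "present",
                    "effect": "NoSchedule",
                }
            ],
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
        },
    },
]
```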