Injecting S3 Credentials into tfserving Container #749

Closed
devstein opened this issue Aug 2, 2019 · 18 comments

@devstein

devstein commented Aug 2, 2019

Given the SeldonDeployment below, I want to inject AWS credentials into the tfserving container. Is there a way to do this?

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: test
spec:
  name: test
  predictors:
    - graph:
        children: []
        implementation: TENSORFLOW_SERVER
        modelUri: s3://bucket/path/to/model
        name: test
        parameters:
          - name: signature_name
            type: STRING
            value: predict
          - name: model_name
            type: STRING
            value: model
      name: default
      replicas: 1
@ukclivecox
Contributor

The solution is discussed in #748. @ryandawsonuk can expand on where we are with it.

@ryandawsonuk
Contributor

In the latest snapshot version the code is now in place for a SeldonDeployment to work like https://github.com/kubeflow/kfserving/tree/9b14e08038405348ac21f1acbbf8b26e4c26631d/docs/samples/s3

So you'd create the Secret and ServiceAccount as in that example, and in the SeldonDeployment specify a serviceAccountName alongside the modelUri, like:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: test
spec:
  name: test
  predictors:
    - graph:
        children: []
        implementation: TENSORFLOW_SERVER
        modelUri: s3://bucket/path/to/model
        serviceAccountName: NAMEOFSERVICEACCOUNT
        name: test
        parameters:
          - name: signature_name
            type: STRING
            value: predict
          - name: model_name
            type: STRING
            value: model
      name: default
      replicas: 1

I've checked that secrets do get applied to the pod like this. I haven't fully checked that a model is downloaded from a private bucket, though as far as I can see it should work. @devstein, if you'd like to try it on the latest snapshot version, it would certainly be appreciated.
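
For reference, a minimal Secret and ServiceAccount along the lines of that kfserving sample might look like the sketch below (the names, annotation keys and data keys follow the linked sample and are only illustrative; the data values are base64-encoded):

apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  annotations:
    serving.kubeflow.org/s3-endpoint: s3.amazonaws.com
    serving.kubeflow.org/s3-usehttps: "1"
type: Opaque
data:
  awsAccessKeyID: <base64-encoded access key>
  awsSecretAccessKey: <base64-encoded secret key>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: NAMEOFSERVICEACCOUNT
secrets:
- name: s3-credentials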

@devstein
Author

devstein commented Aug 8, 2019

@ryandawsonuk Thanks for the fix. After updating my deployment, I'm getting the following error from the tfserving-model-initializer container:

Traceback (most recent call last):
  File "/model-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kfserving.Storage.download(src_uri, dest_path)
  File "/kfserving/kfserving/storage.py", line 41, in download
    Storage._download_s3(uri, out_dir)
  File "/kfserving/kfserving/storage.py", line 52, in _download_s3
    client = Storage._create_minio_client()
  File "/kfserving/kfserving/storage.py", line 105, in _create_minio_client
    secure=True)
  File "/usr/local/lib/python3.7/site-packages/minio/api.py", line 149, in __init__
    is_valid_endpoint(endpoint)
  File "/usr/local/lib/python3.7/site-packages/minio/helpers.py", line 301, in is_valid_endpoint
    if hostname[-1] == '.':
IndexError: string index out of range

@ryandawsonuk
Contributor

Could you do a kubectl get pods and then kubectl get pod <pod_name> -o yaml? It would be good if you could paste the yaml here.

@lennon310
Contributor

I saw the same error trace with the yaml file defined here

@phsiao

phsiao commented Aug 9, 2019

Can we support the injection of credential environment variables via secretRef instead of via serviceAccount?

This is the hack we patched our operator with to keep it compatible with how it used to work.

--- a/pkg/controller/seldondeployment/model_initializer_injector.go
+++ b/pkg/controller/seldondeployment/model_initializer_injector.go
@@ -156,6 +156,15 @@ func InjectModelInitializer(deployment *appsv1.Deployment, containerName string,
                        srcURI,
                        DefaultModelLocalMountPath,
                },
+               EnvFrom: []corev1.EnvFromSource{
+                       {
+                               SecretRef: &corev1.SecretEnvSource{
+                                       LocalObjectReference: corev1.LocalObjectReference{
+                                               Name: "s3-secret",
+                                       },
+                               },
+                       },
+               },
                VolumeMounts:             modelInitializerMounts,
                TerminationMessagePath:   "/dev/termination-log",
                TerminationMessagePolicy: corev1.TerminationMessageReadFile,
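
The s3-secret referenced in the patch is an ordinary Secret whose keys are injected as environment variables by EnvFrom; a sketch (key names here are illustrative and should match whatever the storage initializer reads):

apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: <access key>
  AWS_SECRET_ACCESS_KEY: <secret key>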

@devstein
Author

devstein commented Aug 9, 2019

@ryandawsonuk

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/path: prometheus
    prometheus.io/port: "8000"
    prometheus.io/scrape: "true"
  creationTimestamp: "2019-08-09T17:34:07Z"
  generateName: tf-model-default-6b6f7df5c7-
  labels:
    app: tf-model-default
    fluentd: "true"
    pod-template-hash: "2629389173"
    seldon-app: tf-model-tf-model-default
    seldon-app-tf-model: tf-model-default-tf-model-seldonio-tfserving-proxy-rest-0-3
    seldon-deployment-id: tf-model-tf-model
    version: default
  name: tf-model-default-6b6f7df5c7-9mp7g
  namespace: seldon-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: tf-model-default-6b6f7df5c7
    uid: e3890a0e-bacb-11e9-9b1e-0a666f37ba34
  resourceVersion: "20280551"
  selfLink: /api/v1/namespaces/seldon-system/pods/tf-model-default-6b6f7df5c7-9mp7g
  uid: e3918f1d-bacb-11e9-9b1e-0a666f37ba34
spec:
  containers:
  - env:
    - name: PREDICTIVE_UNIT_PARAMETERS
      value: '[{"name":"signature_name","value":"predict","type":"STRING"},{"name":"model_name","value":"tf-model","type":"STRING"},{"name":"rest_endpoint","value":"http://0.0.0.0:2001","type":"STRING"}]'
    - name: PREDICTIVE_UNIT_SERVICE_PORT
      value: "9000"
    - name: PREDICTIVE_UNIT_ID
      value: tf-model
    - name: PREDICTOR_ID
      value: default
    - name: SELDON_DEPLOYMENT_ID
      value: tf-model
    image: seldonio/tfserving-proxy_rest:0.3
    imagePullPolicy: Always
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - /bin/sleep 10
    livenessProbe:
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 5
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 1
    name: tf-model
    ports:
    - containerPort: 9000
      name: http
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      initialDelaySeconds: 20
      periodSeconds: 5
      successThreshold: 1
      tcpSocket:
        port: http
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-dnnrf
      readOnly: true
  - args:
    - /usr/bin/tensorflow_model_server
    - --port=2000
    - --rest_api_port=2001
    - --model_name=tf-model
    - --model_base_path=/mnt/models
    image: tensorflow/serving:latest
    imagePullPolicy: IfNotPresent
    name: tfserving
    ports:
    - containerPort: 2000
      protocol: TCP
    - containerPort: 2001
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/models
      name: tfserving-provision-location
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-dnnrf
      readOnly: true
  - env:
    - name: ENGINE_PREDICTOR
      value: eyJuYW1lIjoiZGVmYXVsdCIsImdyYXBoIjp7Im5hbWUiOiJ0Zi1tb2RlbCIsInR5cGUiOiJNT0RFTCIsImltcGxlbWVudGF0aW9uIjoiVEVOU09SRkxPV19TRVJWRVIiLCJlbmRwb2ludCI6eyJzZXJ2aWNlX2hvc3QiOiJsb2NhbGhvc3QiLCJzZXJ2aWNlX3BvcnQiOjkwMDAsInR5cGUiOiJSRVNUIn0sInBhcmFtZXRlcnMiOlt7Im5hbWUiOiJzaWduYXR1cmVfbmFtZSIsInZhbHVlIjoicHJlZGljdCIsInR5cGUiOiJTVFJJTkcifSx7Im5hbWUiOiJtb2RlbF9uYW1lIiwidmFsdWUiOiJ0Zi1tb2RlbCIsInR5cGUiOiJTVFJJTkcifV0sIm1vZGVsVXJpIjoiczM6Ly9tb2RlbHMudmlhZHVjdC5haS9kZXZzdGVpbi9zYXZlLWxvY2F0aW9uLW1vZGVsLWFzLXRmL2xvY2F0aW9uLW1vZGVsLXRmLXNlcnZpbmctLW5pZ2h0bHktLTItYjRkd2QvbW9kZWwiLCJzZXJ2aWNlQWNjb3VudE5hbWUiOiJzZWxkb24tY29yZS1vcGVyYXRvciJ9LCJyZXBsaWNhcyI6MSwiZW5naW5lUmVzb3VyY2VzIjp7fSwibGFiZWxzIjp7InZlcnNpb24iOiJkZWZhdWx0In0sInN2Y09yY2hTcGVjIjp7fSwiZXhwbGFpbmVyIjp7ImNvbnRhaW5lclNwZWMiOnsibmFtZSI6IiIsInJlc291cmNlcyI6e319fX0=
    - name: DEPLOYMENT_NAME
      value: tf-model
    - name: DEPLOYMENT_NAMESPACE
      value: seldon-system
    - name: ENGINE_SERVER_PORT
      value: "8000"
    - name: ENGINE_SERVER_GRPC_PORT
      value: "5001"
    - name: JAVA_OPTS
      value: -Dcom.sun.management.jmxremote.rmi.port=9090 -Dcom.sun.management.jmxremote
        -Dcom.sun.management.jmxremote.port=9090 -Dcom.sun.management.jmxremote.ssl=false
        -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false
        -Djava.rmi.server.hostname=127.0.0.1
    - name: SELDON_LOG_MESSAGES_EXTERNALLY
      value: "false"
    image: docker.io/seldonio/engine:0.3.2-SNAPSHOT
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - curl 127.0.0.1:8000/pause; /bin/sleep 10
    livenessProbe:
      failureThreshold: 7
      httpGet:
        path: /live
        port: admin
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 2
    name: seldon-container-engine
    ports:
    - containerPort: 8000
      protocol: TCP
    - containerPort: 5001
      protocol: TCP
    - containerPort: 8082
      name: admin
      protocol: TCP
    - containerPort: 9090
      name: jmx
      protocol: TCP
    readinessProbe:
      failureThreshold: 1
      httpGet:
        path: /ready
        port: admin
        scheme: HTTP
      initialDelaySeconds: 20
      periodSeconds: 1
      successThreshold: 1
      timeoutSeconds: 2
    resources:
      requests:
        cpu: 100m
    securityContext:
      runAsUser: 8888
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/podinfo
      name: podinfo
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-dnnrf
      readOnly: true
  dnsPolicy: ClusterFirst
  initContainers:
  - args:
    - s3://path/to/model
    - /mnt/models
    env:
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          key: awsAccessKeyID
          name: seldon-aws-creds
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          key: awsSecretAccessKey
          name: seldon-aws-creds
    image: gcr.io/kfserving/model-initializer:latest
    imagePullPolicy: IfNotPresent
    name: tfserving-model-initializer
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /mnt/models
      name: tfserving-provision-location
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-dnnrf
      readOnly: true
  nodeName: ip-123-123.us-west-2.compute.internal
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 20
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: tfserving-provision-location
  - downwardAPI:
      defaultMode: 420
      items:
      - fieldRef:
          apiVersion: v1
          fieldPath: metadata.annotations
        path: annotations
    name: podinfo
  - name: default-token-dnnrf
    secret:
      defaultMode: 420
      secretName: default-token-dnnrf
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-08-09T17:34:07Z"
    message: 'containers with incomplete status: [tfserving-model-initializer]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-08-09T17:34:07Z"
    message: 'containers with unready status: [tf-model tfserving seldon-container-engine]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    message: 'containers with unready status: [tf-model tfserving seldon-container-engine]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-08-09T17:34:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: docker.io/seldonio/engine:0.3.2-SNAPSHOT
    imageID: ""
    lastState: {}
    name: seldon-container-engine
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing
  - image: seldonio/tfserving-proxy_rest:0.3
    imageID: ""
    lastState: {}
    name: tf-model
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing
  - image: tensorflow/serving:latest
    imageID: ""
    lastState: {}
    name: tfserving
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.20.38.82
  initContainerStatuses:
  - containerID: docker://b613cb57827e9941c4bc25da0674714f4039a5d2c9b2112ad7e76d83fb5aef37
    image: gcr.io/kfserving/model-initializer:latest
    imageID: docker-pullable://gcr.io/kfserving/model-initializer@sha256:9aebf5116f2186eae53d9b9e73697e6ca340b9d4a65ea2ca66fbbcdc76d030bb
    lastState:
      terminated:
        containerID: docker://b613cb57827e9941c4bc25da0674714f4039a5d2c9b2112ad7e76d83fb5aef37
        exitCode: 1
        finishedAt: "2019-08-09T17:34:57Z"
        reason: Error
        startedAt: "2019-08-09T17:34:57Z"
    name: tfserving-model-initializer
    ready: false
    restartCount: 3
    state:
      waiting:
        message: Back-off 40s restarting failed container=tfserving-model-initializer
          pod=tf-model-default-6b6f7df5c7-9mp7g_seldon-system(e3918f1d-bacb-11e9-9b1e-0a666f37ba34)
        reason: CrashLoopBackOff
  phase: Pending
  podIP: 100.96.2.26
  qosClass: Burstable
  startTime: "2019-08-09T17:34:07Z"

@ukclivecox
Contributor

I believe the PRs for this are merged now, so I will close. Please reopen if mistaken.

@arunbenoyv

arunbenoyv commented May 27, 2020

Hi, I'm getting this error with SKLEARN_SERVER:

Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kfserving.Storage.download(src_uri, dest_path)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 50, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 65, in _download_s3
    client = Storage._create_minio_client()
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 217, in _create_minio_client
    secure=use_ssl)
  File "/usr/local/lib/python3.7/site-packages/minio/api.py", line 150, in __init__
    is_valid_endpoint(endpoint)
  File "/usr/local/lib/python3.7/site-packages/minio/helpers.py", line 301, in is_valid_endpoint
    if hostname[-1] == '.':
IndexError: string index out of range

Secret created:
kubectl create secret generic seldon-init-container-secret --from-literal=AWS_ENDPOINT_URL='s3.amazonaws.com' --from-literal=AWS_ACCESS_KEY_ID='XXXXX' --from-literal=AWS_SECRET_ACCESS_KEY='XXXX' --from-literal=USE_SSL=false

Seldon installed with:
helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --set predictiveUnit.defaultEnvSecretRefName=seldon-core-init-container-secret

The YAML is similar to the iris example, with the following parameters:

  modelUri: s3://path/to/model
  envSecretRefName: seldon-init-container-secret
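
For context, those parameters sit at the graph level of the SeldonDeployment; a sketch modelled on the iris example (resource names here are illustrative):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: sklearn
spec:
  name: iris
  predictors:
    - graph:
        children: []
        implementation: SKLEARN_SERVER
        modelUri: s3://path/to/model
        envSecretRefName: seldon-init-container-secret
        name: classifier
      name: default
      replicas: 1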

@arunbenoyv

arunbenoyv commented May 28, 2020

Secondly, what I wanted to understand: if an IAM role is attached to the EC2 instance, is there a provision to fetch the model from S3 without any of these configurations?

@ukclivecox
Contributor

@arunbenoyv Have you run through the docs?: https://docs.seldon.io/projects/seldon-core/en/latest/servers/overview.html#handling-credentials

Which version of Seldon are you running? If not the latest, could it be the issue fixed by #885?

@ryandawsonuk
Contributor

ryandawsonuk commented May 28, 2020

Secondly, what I wanted to understand: if an IAM role is attached to the EC2 instance, is there a provision to fetch the model from S3 without any of these configurations?

This is a cool idea but I think the answer is no right now. The download code expects to be hooked up with credentials in order to make the download. I suspect that even if the EC2 instance were given the role, that role wouldn't be used automatically by the code running in the kubernetes node. The code being used borrows from minio and there's a similar discussion on the minio github - minio/minio#6124. There's also a related idea being floated in #1865

@arunbenoyv

@cliveseldon: As I provided the repo in the helm install, my understanding was that it would pull the latest:

helm install seldon-core seldon-core-operator --repo https://storage.googleapis.com/seldon-charts --set usageMetrics.enabled=true --set predictiveUnit.defaultEnvSecretRefName=seldon-core-init-container-secret

From describing the pod, the image tag is docker.io/seldonio/seldon-core-operator:1.1.0.

Is the error expected with this version?

@DFuller134

DFuller134 commented Jun 11, 2020

Secondly, what I wanted to understand: if an IAM role is attached to the EC2 instance, is there a provision to fetch the model from S3 without any of these configurations?

This is a cool idea but I think the answer is no right now. The download code expects to be hooked up with credentials in order to make the download. I suspect that even if the EC2 instance were given the role, that role wouldn't be used automatically by the code running in the kubernetes node. The code being used borrows from minio and there's a similar discussion on the minio github - minio/minio#6124. There's also a related idea being floated in #1865

Please consider increasing the priority of developing IAM role capability. This is a critically important security feature for enterprises, as distributing keys is a security concern. After lessons learned from other companies' high-profile data breaches, we now only use IAM roles and temporary keys that expire after a specified number of hours.

@axsaucedo
Contributor

@DFuller134 we have production use cases of Seldon integrated with IAM role capability. The way it is set up at the moment is using boto3 to download the artifacts, so I can confirm this can be done today. The two ways you can approach this are either extending/creating your own initContainer that uses boto3, or adding functionality to your Python model server wrapper's init function to load the model with boto3 (a sketch of the latter is below). @adriangonz can provide more details, as he has seen this use case running in production clusters, and it should be enough to address your needs. This is the main reason why we made initContainers disjoint: they are very easy to extend (and alternatively you can add it to the wrapper itself).
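
As a rough illustration of the wrapper approach, the sketch below downloads the artifacts with boto3 inside the wrapper's constructor; boto3 resolves credentials through its default chain, so an attached IAM role is picked up without injecting any keys. The bucket, prefix and local path are illustrative, and the class just follows the usual Seldon Python wrapper shape.

import os

import boto3


class MyModel:
    """Sketch: a Seldon Python wrapper that pulls its own artifacts via boto3/IAM."""

    def __init__(self, bucket="my-model-bucket", prefix="sklearn-model/"):
        # Credentials come from the pod/instance IAM role via boto3's default chain.
        s3 = boto3.client("s3")
        local_dir = "/mnt/models"
        os.makedirs(local_dir, exist_ok=True)
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                if key.endswith("/"):
                    continue  # skip "directory" placeholder keys
                dest = os.path.join(local_dir, os.path.relpath(key, prefix))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                s3.download_file(bucket, key, dest)
        # ...load the downloaded model from local_dir here (e.g. with joblib)...

    def predict(self, X, features_names=None):
        # ...run inference with the loaded model...
        return X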

@jincongho

@arunbenoyv looking at https://github.com/SeldonIO/seldon-core/blob/master/python/seldon_core/storage.py, your AWS_ENDPOINT_URL should be 'http://s3.amazonaws.com'. Does this work for you?

@dping1

dping1 commented Mar 14, 2021

Hello,
Was anyone able to get this to work? I am having a similar issue with MLFLOW_SERVER.

Secret:
kubectl create secret generic seldon-init-container-secret --from-literal=AWS_ENDPOINT_URL='s3.amazonaws.com' --from-literal=AWS_ACCESS_KEY_ID='XXXXX' --from-literal=AWS_SECRET_ACCESS_KEY='XXXX' --from-literal=USE_SSL=false

I tried multiple variations of AWS_ENDPOINT_URL.

MLFlow Server serving yaml:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: model-deploy
spec:
  name: model-deploy
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: s3:///sklearn-model/
        name: classifier
        envSecretRefName: seldon-init-container-secret
      name: default
      replicas: 1

Error logs from the init-model container:

Traceback (most recent call last):
  File "/storage-initializer/scripts/initializer-entrypoint", line 14, in <module>
    kfserving.Storage.download(src_uri, dest_path)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 50, in download
    Storage._download_s3(uri, out_dir)
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 65, in _download_s3
    client = Storage._create_minio_client()
  File "/usr/local/lib/python3.7/site-packages/kfserving/storage.py", line 218, in _create_minio_client
    secure=use_ssl)
  File "/usr/local/lib/python3.7/site-packages/minio/api.py", line 167, in __init__
    is_valid_endpoint(endpoint)
  File "/usr/local/lib/python3.7/site-packages/minio/helpers.py", line 322, in is_valid_endpoint
    raise InvalidEndpointError('Hostname cannot be empty.')
minio.error.InvalidEndpointError: InvalidEndpointError: message: Hostname cannot be empty.

@kuzm1ch

kuzm1ch commented Apr 21, 2021

Hello, I had a similar issue as @dping1 with AWS S3.
As I understand it, AWS_ENDPOINT_URL should contain the scheme (http:// or https://) and the region, in case you use a non-default one (the default is us-east-1).

When you use an endpoint with no Region, AWS routes the Amazon EC2 request to US East (N. Virginia) (us-east-1), which is the default Region for API calls.

AWS_ENDPOINT_URL = https://s3.eu-central-1.amazonaws.com (based on my bucket location, EU Central)

kubectl create secret generic seldon-init-container-secret --from-literal=AWS_ACCESS_KEY_ID='****' --from-literal=AWS_SECRET_ACCESS_KEY='****' --from-literal=AWS_ENDPOINT_URL='https://s3.eu-central-1.amazonaws.com' --from-literal=USE_SSL=true
