
Added the python model download methods for S3 and GCS. #116

Merged 7 commits into kserve:master on May 29, 2019

Conversation

@Tomcli Tomcli commented May 24, 2019

What this PR does / why we need it:
This PR implements the python model download methods for S3 and GCS. Since KFServing can provide S3 and GCS credentials as environment variables, we will be using those environment variables for user authentication.

I also added gsutil to the Docker images since we need it to access public GCS data artifacts.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #87

Special notes for your reviewer:

  1. It would be nice if we could upload the current sklearn and xgboost sample models to the gs://kfserving-samples bucket, so we can update the example to use GCS and don't have to include the sample models in the Docker image.

Release note:




@k8s-ci-robot
Contributor

Hi @Tomcli. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Tomcli (Member Author) commented May 24, 2019

/assign @ellis-bigelow
/assign @animeshsingh

RUN apt-get update && apt-get install -y curl lsb-release gnupg
RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update && apt-get install -y google-cloud-sdk
Member

Isn't there a Python version of the Google Cloud SDK available with pip install?

@Tomcli (Member Author) May 24, 2019

There is; I used it when the user provides GOOGLE_APPLICATION_CREDENTIALS for their own bucket. However, for publicly accessible data, I can only pull it using either gsutil or the REST API. See https://cloud.google.com/storage/docs/access-public-data for details.

If anyone finds a solution for accessing GCS public data with the Python SDK, I would be happy to implement that.

Member Author

Also, the pip-install version of gsutil only works on Python 2.7, and our image uses Python 3.7.

Member Author

Just to be clear: the Google Cloud Python client is available for Python 3.7. However, we need the gsutil command-line client to access public data, and the pip install for gsutil is Python 2.7 only: https://cloud.google.com/storage/docs/gsutil_install#alt-install.

Therefore, I went with installing the whole Google Cloud command-line SDK, which doesn't depend on the local machine's Python version.

Member

@ellis-bigelow is that the normal user experience on GCS? Two different SDKs depending on the scenario?

Member

Let's just stick to gsutil then. Less error-prone.

@Tomcli (Member Author) May 24, 2019

I took a look at the google.cloud.storage source code; it looks like an anonymous client with .bucket() will be able to get the object without credentials. I think the Google Cloud docs are somewhat misleading in pointing users to .get_bucket(), which requires access credentials.
https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-code_sample
Also, the access-public-data docs don't show any client SDK example except gsutil.

Let me investigate a bit more and see whether I can do this with the pure Python SDK. If not, we can fall back to gsutil.

Member Author

@ellis-bigelow So I did find a way to access public data with the Python client using the anonymous client. I have reimplemented the GCS function with the pure GCS Python SDK now.
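The anonymous-client approach can be sketched roughly like this (a minimal sketch, not the PR's actual code: the function name and the injectable `client` parameter are illustrative; the key call is `google.cloud.storage`'s `create_anonymous_client()`, combined with `client.bucket()` rather than `get_bucket()`, which fetches bucket metadata and therefore needs credentials):

```python
import os

def download_public_gcs(bucket_name, prefix, dest_dir, client=None):
    """Download every object under `prefix` from a publicly readable bucket.

    If no client is supplied, fall back to an anonymous GCS client, which
    can read public buckets without credentials.
    """
    if client is None:
        from google.cloud import storage
        client = storage.Client.create_anonymous_client()
    # bucket() builds a reference without a metadata fetch, so no credentials
    # are required, unlike get_bucket().
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix):
        # Keep any subdirectory structure below the prefix.
        rel_key = blob.name.replace(prefix, "", 1).strip("/")
        dest_path = os.path.join(dest_dir, rel_key)
        os.makedirs(os.path.dirname(dest_path), exist_ok=True)
        blob.download_to_filename(dest_path)
```

The injectable client also makes the function easy to unit-test with a mock, which is where this PR ended up going.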

S3_PATH = 's3://bucket/path'
S3_ENDPOINT = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

Member

Is prepending with 'AWS_' required by Minio, even if it's not an AWS backend?

Member Author

These environment variable names are based on PR #94. If we want different environment variable names, we have to change the Go controller as well.

Member

I think the AWS_ envs work for other S3-compatible storages too.

object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
except Exception as e:
raise Exception(e)
Member

What's the point of catching a generic exception and rethrowing the same one?

Member Author

Thanks for the review; I updated the code to catch a specific exception now.

S3_PATH = 's3://bucket/path'
S3_ENDPOINT = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''

Member

I think the AWS_ envs work for other S3-compatible storages too.

url = re.compile(r"https?://")
minioClient = Minio(url.sub("", os.getenv("S3_ENDPOINT", "")),
access_key=os.getenv("AWS_ACCESS_KEY_ID", ""),
secret_key=os.getenv("AWS_SECRET_ACCESS_KEY", ""),
Member

Hmm, I thought once the envs are set you no longer need to pass these to the minio client. This is definitely not needed for boto, but I am not sure about the minio client.

Member Author

I think for minio you need to pass them as parameters; otherwise it will just create an anonymous client, according to the docs: https://docs.min.io/docs/python-client-api-reference
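The endpoint handling quoted above can be sketched with a small stdlib-only helper (the helper name is illustrative): the Minio constructor expects a bare host[:port], so any http:// or https:// scheme in S3_ENDPOINT has to be stripped first.

```python
import os
import re

def s3_endpoint_host(raw=None):
    """Return the S3 endpoint as a bare host[:port], as the Minio client expects.

    Reads S3_ENDPOINT from the environment when no value is passed, and strips
    a leading http:// or https:// scheme if one is present.
    """
    if raw is None:
        raw = os.getenv("S3_ENDPOINT", "")
    return re.sub(r"^https?://", "", raw)
```

For example, `s3_endpoint_host("http://minio-service:9000")` yields `"minio-service:9000"`, which can then be passed as the first argument to `Minio(...)` along with the access and secret keys.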

except IndexError as err:
raise Exception('Error: Incorrect syntax with the S3 endpoint or credentials. \n' + err)
except Exception as e:
raise Exception(e)
Member

same question as above

Member Author

I updated the code to catch a specific exception now.



def test_private_gcs():
if GOOGLE_APPLICATION_CREDENTIALS:
Member

Maybe I am missing something, but it looks like GOOGLE_APPLICATION_CREDENTIALS is set to an empty string; how does this branch get tested?

Member Author

Since I don't have access to KFServing's testing infrastructure, this test won't run unless someone sets the global value GOOGLE_APPLICATION_CREDENTIALS.

Does the KFServing CI/CD expose GOOGLE_APPLICATION_CREDENTIALS and S3 credentials in the testing environment? If so, I can update these tests accordingly.

Member

Here we are doing unit tests, not integration tests, so you can set the env in the tests.
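A minimal sketch of that suggestion, with hypothetical helper names: `mock.patch.dict` injects the variable for the duration of the `with` block and restores the original environment afterwards, so other tests are unaffected.

```python
import os
from unittest import mock

def gcs_credentials_present():
    # Stand-in for the storage code's check: credentials are read from the
    # environment at call time, so tests can inject them temporarily.
    return bool(os.getenv("GOOGLE_APPLICATION_CREDENTIALS", ""))

def test_private_gcs_branch_runs():
    fake_env = {"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/fake-key.json"}
    with mock.patch.dict(os.environ, fake_env):
        # Inside the block the branch under test sees the variable as set.
        assert gcs_credentials_present()
```

This keeps the credentialed branch exercised in every CI run instead of silently skipping when the real environment variable is absent.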

Member Author

I updated the unit test to run when the environment variables are available.



def test_private_s3():
if S3_ENDPOINT and AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
Member

same question as above

Member Author

same as above

def test_no_permission_buckets():
bad_s3_path = "s3://random/path"
bad_gcs_path = "gs://random/path"
with pytest.raises(Exception):
Member

Is there a specific exception for no permission error?

Member Author

I updated the exceptions; they should now catch errors from minio and google.cloud.

object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
except error.NoSuchBucket as e:
raise error.NoSuchBucket("Bucket is not found, please double check the bucket name.")
Contributor

Can we just pass the error directly?

Member Author

I updated the PR to pass errors through unless it's necessary to catch them.

@animeshsingh
Member

/ok-to-test

bucket_path = bucket_args[1] if len(bucket_args) > 1 else ""
objects = client.list_objects(bucket_name, prefix=bucket_path, recursive=True)
for obj in objects:
object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
Member Author

Maybe the name here is confusing; I renamed the variable to subdir_object_key. What we want to do here is strip the S3/GCS prefix and place the remaining object key under the temp_dir path on the local file system. I didn't use basename since we want to keep all the subdirectories after the prefix and append them to temp_dir.
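A minimal sketch of the mapping being described (the helper name is illustrative, not the PR's actual code):

```python
import os

def local_path_for(object_key, bucket_path, temp_dir):
    """Map a bucket object key to a local path under temp_dir, keeping the
    subdirectories that follow the prefix (basename would flatten them)."""
    subdir_object_key = object_key.replace(bucket_path, "", 1).strip("/")
    return os.path.join(temp_dir, subdir_object_key)
```

So an object key `models/v1/variables/part-0` with prefix `models/v1` maps to `<temp_dir>/variables/part-0`, preserving the `variables/` subdirectory.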

Contributor

Makes sense.

objects = client.list_objects(bucket_name, prefix=bucket_path, recursive=True)
for obj in objects:
object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
Member Author

Done

@ellistarn (Contributor) left a comment

Thanks for the hard work Tommy! I think it's almost there.

A couple things I missed before:

  1. Our tests shouldn't interact with the outside world.
  2. We should use Python file utilities in some of these cases. I think that might squash a few lurking bugs.

blobs = bucket.list_blobs(prefix=bucket_path)
for blob in blobs:
object_file_name = blob.name.replace(bucket_path, "", 1).strip("/")
# Create necessary subdirectory
Member Author

done


def test_public_gcs():
gcs_path = 'gs://kfserving-samples/models/tensorflow/flowers'
assert kfserving.Storage.download(gcs_path)
Contributor

It's typically bad practice for unit tests to reach out into the real world like this. Can we mock the minio and GCS clients instead? Otherwise we're writing non-hermetic tests, which can be extremely brittle to changes -- like if I delete this file.

Member

I think we can use Python's MagicMock to mock the responses and test different exceptions.
https://docs.python.org/3/library/unittest.mock.html

Member Author

In my recent commits, I used MagicMock and mock.patch to simulate some of the outputs for the unit tests.
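A rough, stdlib-only sketch of the MagicMock/side_effect pattern being discussed: `AccessDenied` here is a stand-in for minio's `error.AccessDenied`, and `download()` is a stand-in for `kfserving.Storage.download`, so the example stays self-contained.

```python
from unittest import mock

class AccessDenied(Exception):
    """Stand-in for a storage client's permission error."""

def download(path, client):
    # Simplified stand-in for the storage download path: delegates the
    # actual fetch to the injected client.
    return client.fget_object("bucket", path, "/tmp/model")

def test_download_raises_on_access_denied():
    client = mock.MagicMock()
    # side_effect makes the mocked call raise, simulating a bucket that
    # rejects anonymous access without any network traffic.
    client.fget_object.side_effect = AccessDenied("no credentials")
    try:
        download("model.bin", client)
    except AccessDenied:
        return True
    raise AssertionError("expected AccessDenied")
```

In a pytest suite the try/except would normally be written as `with pytest.raises(AccessDenied):` instead.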

def test_private_gcs():
if os.getenv("GOOGLE_APPLICATION_CREDENTIALS", "") and os.getenv("GCS_PRIVATE_PATH", ""):
assert kfserving.Storage.download(os.getenv("GCS_PRIVATE_PATH", ""))
else:
Contributor

Same comment about mocking: our tests should not turn off when environment variables aren't set.

@ellistarn
Contributor

Nice work!
/lgtm
Given the scope, I'll wait for another approver to give you a thumbs up as well.

@animeshsingh
Member

/lgtm

@animeshsingh
Member

@ellis-bigelow can we move the sklearn and xgboost models to the default GCS account you are using for the TF flowers sample?

bad_s3_path = "s3://random/path"
bad_gcs_path = "gs://random/path"
# Access private buckets without credentials
mock_minio.return_value = Minio("s3.us.cloud-object-storage.appdomain.cloud", secure=True)
@yuzisun (Member) May 29, 2019

I think here we need to patch urllib3.PoolManager as well to mock the exception response:
mock_connection.side_effect = error.AccessDenied(None)

Member Author

Done, thanks for the advice.

mock_minio.return_value = Minio("s3.us.cloud-object-storage.appdomain.cloud", secure=True)
with pytest.raises(error.AccessDenied):
kfserving.Storage.download(bad_s3_path)
with pytest.raises(exceptions.Forbidden):
Member

same here

Member Author

done

@k8s-ci-robot k8s-ci-robot removed the lgtm label May 29, 2019
@yuzisun
Member

yuzisun commented May 29, 2019

Looks good now, thanks!

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuzisun

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 69cc3b4 into kserve:master May 29, 2019
ellistarn pushed a commit to ellistarn/kfserving that referenced this pull request Jul 28, 2020
* update kfserving storage to handle s3 and gcs

* update dockerfile

* update exception catches

* updated gcs with anonymous client trick, passing original errors instead of exception raise

* update readme related to the gcs changes

* update unit tests to use mock client

* add mock sideeffect for exception tests
Surya-smart619 pushed a commit to Surya-smart619/kserve that referenced this pull request May 11, 2022
Signed-off-by: Laurent Grangeau <laurent.grangeau@gmail.com>
terrytangyuan pushed a commit to terrytangyuan/kserve that referenced this pull request May 13, 2024
…erve-28

Red Hat Konflux update kserve-28
Successfully merging this pull request may close these issues.

Python Model download for GCS and S3