Added the python model download methods for S3 and GCS. #116
Conversation
Hi @Tomcli. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @ellis-bigelow
python/sklearn.Dockerfile
Outdated
RUN apt-get update && apt-get install -y curl lsb-release gnupg
RUN export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)" && echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
RUN apt-get update && apt-get install -y google-cloud-sdk
Is there no Python version of the Google Cloud SDK available via pip install?
There is; I used it when the user provides GOOGLE_APPLICATION_CREDENTIALS for their own bucket. However, for publicly accessible data, I can only pull it using either gsutil or the REST API. You can see the details at https://cloud.google.com/storage/docs/access-public-data.
If anyone finds a solution to access GCS public data using the Python SDK, I would be happy to implement that.
Also, the pip-installed version of gsutil only works on Python 2.7, and our image uses Python 3.7.
Just to be clear: the Google Cloud Python client is available on Python 3.7. However, we need the gsutil command-line client to access public data, and the pip install for gsutil is Python 2.7 only (https://cloud.google.com/storage/docs/gsutil_install#alt-install).
Therefore, I went with installing the whole Google Cloud command-line SDK, which doesn't depend on the local machine's Python version.
@ellis-bigelow is that the normal user experience on GCS? Two different SDKs based on scenarios?
Let's just stick to gsutil then. Less error-prone.
I took a look at the google.cloud.storage source code; it looks like an anonymous client with .bucket()
will be able to get the object without credentials. I think the Google Cloud docs somewhat mislead users into using .get_bucket(),
which requires access credentials.
https://cloud.google.com/storage/docs/downloading-objects#storage-download-object-code_sample
Also, the access-public-data docs don't show any client SDK example except gsutil.
Let me investigate a bit more and see whether I can do this in the pure Python SDK. If not, we can stick with gsutil.
@ellis-bigelow So I did find a way to access public data with the Python client using the anonymous client. I have reimplemented the GCS function with the pure GCS Python SDK now.
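For context, the anonymous-client pattern described here looks roughly like the sketch below. This is not the PR's actual code; it assumes the google-cloud-storage package, and the object name is illustrative:

```python
# Sketch: download a publicly readable GCS object without credentials.
# Assumes the google-cloud-storage package; the object name is made up.
from google.cloud import storage

client = storage.Client.create_anonymous_client()
# Use client.bucket() (no API call) rather than client.get_bucket(),
# which issues a buckets.get request that anonymous callers may be
# denied even when the objects themselves are public.
bucket = client.bucket("kfserving-samples")
blob = bucket.blob("models/tensorflow/flowers/saved_model.pb")
blob.download_to_filename("/tmp/saved_model.pb")
```

The key point is .bucket() versus .get_bucket(): the former only builds a local reference, so public objects stay reachable without credentials.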
S3_PATH = 's3://bucket/path'
S3_ENDPOINT = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
Is prepending with 'AWS_' required by Minio, even if it's not an AWS backend?
These environment variable names are based on PR #94. If we want different environment variable names, then we have to change the Go controller as well.
I think AWS_ envs work for other S3-compatible storages too.
object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
except Exception as e:
    raise Exception(e)
What's the point of catching a generic exception and rethrowing the same one?
Thanks for the review; I updated the catch to use a specific exception now.
S3_PATH = 's3://bucket/path'
S3_ENDPOINT = ''
AWS_ACCESS_KEY_ID = ''
AWS_SECRET_ACCESS_KEY = ''
I think AWS_ envs work for other S3-compatible storages too.
url = re.compile(r"https?://")
minioClient = Minio(url.sub("", os.getenv("S3_ENDPOINT", "")),
                    access_key=os.getenv("AWS_ACCESS_KEY_ID", ""),
                    secret_key=os.getenv("AWS_SECRET_ACCESS_KEY", ""),
Hmm, I thought once the envs are set you no longer need to pass these to the Minio client. This is definitely not needed for boto, but I am not sure about the Minio client.
I think for Minio you will need to pass them as parameters; otherwise it will just create an anonymous client, according to the docs: https://docs.min.io/docs/python-client-api-reference
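To illustrate the point being made, a minimal sketch of the constructor call (assuming the minio package's classic signature; the env variable names follow the diff above):

```python
# Sketch: minio-py does not read AWS_* env vars on its own, so the
# credentials must be passed explicitly; omitting them yields an
# anonymous client that can only reach public buckets.
import os
import re

from minio import Minio

# The endpoint must not include the URL scheme.
endpoint = re.sub(r"https?://", "", os.getenv("S3_ENDPOINT", ""), count=1)
client = Minio(
    endpoint,
    access_key=os.getenv("AWS_ACCESS_KEY_ID", ""),
    secret_key=os.getenv("AWS_SECRET_ACCESS_KEY", ""),
)
```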
except IndexError as err:
    raise Exception('Error: Incorrect syntax with the S3 endpoint or credentials. \n' + err)
except Exception as e:
    raise Exception(e)
same question as above
I updated the catching with a specific exception now.
def test_private_gcs():
    if GOOGLE_APPLICATION_CREDENTIALS:
Maybe I am missing something, but it looks like GOOGLE_APPLICATION_CREDENTIALS is set to an empty string, so how does this branch get tested?
Since I don't have access to kfserving's testing infrastructure, this test won't run unless someone sets the global value GOOGLE_APPLICATION_CREDENTIALS.
Does the kfserving CI/CD expose GOOGLE_APPLICATION_CREDENTIALS and S3 credentials in the testing environment? If so, I can update these tests accordingly.
Here we are doing unit tests, not integration tests, so you can set the env in the tests.
I updated the unit test to run it when the environment variables are available.
def test_private_s3():
    if S3_ENDPOINT and AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
same question as above
same as above
def test_no_permission_buckets():
    bad_s3_path = "s3://random/path"
    bad_gcs_path = "gs://random/path"
    with pytest.raises(Exception):
Is there a specific exception for a no-permission error?
I updated the exceptions, and now they should be catching errors from minio and google.cloud.
object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
except error.NoSuchBucket as e:
    raise error.NoSuchBucket("Bucket is not found, please double check the bucket name.")
Can we just pass the error directly?
I updated the PR to pass errors through unless it's necessary to catch them.
/ok-to-test
bucket_path = bucket_args[1] if len(bucket_args) > 1 else ""
objects = client.list_objects(bucket_name, prefix=bucket_path, recursive=True)
for obj in objects:
    object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
How about basename? https://docs.python.org/2/library/os.path.html
Maybe the name here is confusing; I renamed the variable to subdir_object_key. What we want to do here is replace the prefix from s3 or gcs with the temp_dir path on the local file system. I didn't use basename since we want to keep the object key with all the subdirectories after the prefix and append it to temp_dir.
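A small stdlib illustration of the difference (the object key and prefix below are made up, not from the PR):

```python
# Sketch: stripping the bucket prefix preserves nested subdirectories,
# while os.path.basename would flatten them and collide same-named files.
import os

temp_dir = "/tmp/model"
bucket_path = "models/flowers"  # prefix inside the bucket
object_name = "models/flowers/variables/variables.index"

# Keep everything after the prefix, including subdirectories.
subdir_object_key = object_name.replace(bucket_path, "", 1).strip("/")
local_path = os.path.join(temp_dir, subdir_object_key)
# local_path -> "/tmp/model/variables/variables.index"

# basename would lose the "variables/" subdirectory:
flattened = os.path.basename(object_name)  # "variables.index"
```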
Makes sense.
objects = client.list_objects(bucket_name, prefix=bucket_path, recursive=True)
for obj in objects:
    object_file_name = obj.object_name.replace(bucket_path, "", 1).strip("/")
    client.fget_object(bucket_name, obj.object_name, temp_dir + "/" + object_file_name)
How about join? https://docs.python.org/2/library/os.path.html
Done
Thanks for the hard work Tommy! I think it's almost there.
A couple things I missed before:
- Our tests shouldn't interact with the outside world.
- We should use python file utilities in some of these cases. I think that might squash a few lurking bugs.
blobs = bucket.list_blobs(prefix=bucket_path)
for blob in blobs:
    object_file_name = blob.name.replace(bucket_path, "", 1).strip("/")
    # Create necessary subdirectory
Can we use exist_ok=True? https://docs.python.org/3/library/os.html#os.makedirs
done
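The suggested pattern, as a minimal stdlib sketch (paths are illustrative):

```python
# Sketch: create nested subdirectories idempotently before writing a
# downloaded object. With exist_ok=True, repeated calls for the same
# path do not raise FileExistsError.
import os
import tempfile

temp_dir = tempfile.mkdtemp()
local_path = os.path.join(temp_dir, "variables", "variables.index")

os.makedirs(os.path.dirname(local_path), exist_ok=True)
os.makedirs(os.path.dirname(local_path), exist_ok=True)  # safe to repeat
```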
def test_public_gcs():
    gcs_path = 'gs://kfserving-samples/models/tensorflow/flowers'
    assert kfserving.Storage.download(gcs_path)
It's typically bad practice for unit tests to reach out into the real world like this. Can we mock the minio and gcs clients instead? Otherwise we're implementing non-hermetic tests, which can be extremely brittle to changes (for example, if I delete this file).
I think we can use Python's MagicMock to mock the response and test different exceptions.
https://docs.python.org/3/library/unittest.mock.html
In my recent commits, I used MagicMock and mock.patch to simulate some of the outputs for the unit tests.
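A sketch of the mocking approach for anyone following along (the download helper and object names here are hypothetical stand-ins, not the PR's actual code):

```python
# Sketch: replace the storage client with a MagicMock so the unit test
# never touches the network. The helper and names below are made up.
from unittest import mock

def download(bucket_name, client):
    # Stand-in for the real download helper under test.
    return [obj.object_name for obj in client.list_objects(bucket_name)]

fake_obj = mock.MagicMock()
fake_obj.object_name = "models/flowers/saved_model.pb"

fake_client = mock.MagicMock()
fake_client.list_objects.return_value = [fake_obj]

result = download("kfserving-samples", fake_client)
fake_client.list_objects.assert_called_once_with("kfserving-samples")
```

In the real tests, mock.patch would swap the client class inside the storage module instead of passing the mock in by hand.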
def test_private_gcs():
    if os.getenv("GOOGLE_APPLICATION_CREDENTIALS", "") and os.getenv("GCS_PRIVATE_PATH", ""):
        assert kfserving.Storage.download(os.getenv("GCS_PRIVATE_PATH", ""))
    else:
Same comment re: mocking — our tests should not turn off if environment variables aren't set.
Nice work!
/lgtm
@ellis-bigelow can we move the sklearn and xgboost models to the default GCS account you are using for the TF flowers sample?
bad_s3_path = "s3://random/path"
bad_gcs_path = "gs://random/path"
# Access private buckets without credentials
mock_minio.return_value = Minio("s3.us.cloud-object-storage.appdomain.cloud", secure=True)
I think here we need to patch urllib3.PoolManager as well to mock the exception response:
mock_connection.side_effect = error.AccessDenied(None)
Done, thanks for the advice.
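The side_effect pattern discussed above, as a stdlib-only sketch (AccessDenied here is a local stand-in class, not minio's real error type):

```python
# Sketch: make a mocked client raise an access-denied error so the
# caller's error-handling path can be asserted without any network I/O.
from unittest import mock

class AccessDenied(Exception):
    # Stand-in for minio.error.AccessDenied.
    pass

fake_client = mock.MagicMock()
fake_client.list_objects.side_effect = AccessDenied("no credentials")

try:
    fake_client.list_objects("private-bucket")
    raised = False
except AccessDenied:
    raised = True
```

With pytest, the try/except would be written as `with pytest.raises(AccessDenied):` instead.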
mock_minio.return_value = Minio("s3.us.cloud-object-storage.appdomain.cloud", secure=True)
with pytest.raises(error.AccessDenied):
    kfserving.Storage.download(bad_s3_path)
with pytest.raises(exceptions.Forbidden):
same here
done
Looks good now, thanks! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: yuzisun. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
- update kfserving storage to handle s3 and gcs
- update dockerfile
- update exception catches
- updated gcs with anonymous client trick, passing original errors instead of exception raise
- update readme related to the gcs changes
- update unit tests to use mock client
- add mock sideeffect for exception tests
What this PR does / why we need it:
This PR implements the Python model download methods for S3 and GCS. Since KFServing can provide S3 and GCS credentials as environment variables, we will be using those environment variables for user authentication.
I also added gsutil to the docker images, since we need it for accessing public GCS data artifacts.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #87
Special notes for your reviewer:
The sample models are uploaded to the gs://kfserving-samples bucket, so we can update the example to use GCS and don't have to include the sample models in the docker image.
Release note:
This change is