-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AIRFLOW-4255] Replace Discovery based api with client based for GCS #5054
Conversation
6786d39
to
707e6ba
Compare
707e6ba
to
dc16fc5
Compare
Does this change require a note in file |
airflow/contrib/hooks/gcs_hook.py
Outdated
'storage', 'v1', http=http_authorized, cache_discovery=False) | ||
if not self._conn: | ||
self._conn = storage.Client(credentials=self._get_credentials(), | ||
project=self.project_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other hooks, project_id
is a method parameter. In this implementation, user can only pass project_id
as a connection configuration. This introduces inconsistencies. What steps should we take to unify these situations for all GCP operator?
We have a 3 options:
- Specifying
project_id
in connection configuration. - Specifying
project_id
in a method parameter with fallback to connection configuration - Specifying
project_id
in a hook constructor parameter with fallback to connection configuration.
The third variant does not appear anywhere, but it seems to me most expected. Initalizing parameters are not mixed with execution time parameters. project_id
is a parameter that initialize client library. It don't execute a API call.
Probably the wrong place for this discussion, but we should take steps to use each operator and hook for GCP to be identical.
CC: @potiuk @antonimaciej
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, let's discuss this and decide on this on the mailing list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we discuss this? region_name
on various AWS hooks/operators have the same pattern (some take them as kwargs, some just from the connection)
pageToken=pageToken, | ||
blobs = bucket.list_blobs( | ||
max_results=maxResults, | ||
page_token=pageToken, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter is deprecated. Could you use a new way?
page_token (str) – (Optional) If present, return the next batch of blobs, using the value, which must correspond to the
nextPageToken
value returned in the previous response. Deprecated: use the pages property of the returned iterator instead of manually passing the token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that is in my todo list. As I wrote in the description of this PR, I want to keep this PR as backwards-compatible as possible, hence there is no note in Updating.md. I will add the note however as I think even though the function input and output are same, it still adds an extra dependency of google-cloud-storage
, so I will do that.
I will take care of the Deprecated nextPageToken
in the upcoming PR.
airflow/contrib/hooks/gcs_hook.py
Outdated
raise ValueError('Object Not Found') | ||
client = self.get_conn() | ||
bucket = client.get_bucket(bucket) | ||
blob = bucket.blob(blob_name=object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code looks like the get_blob
method code.
Is this duplication intentional?
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/bucket.py#L691-L706
But in this case in not effective. It's do 2 calls to external API.
I write a sample script:
client = storage.Client()
bucket = client.get_bucket("instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)
On the screen a have a message:
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1?projection=noAcl HTTP/1.1" 200 447
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1/o/file-1.bin HTTP/1.1" 200 753
Blob size: 104960000
It's confirm that you implementation do a two API calls (plus one call for authorization).
I am proposing that you use the code:
client = storage.Client()
bucket = storage.Bucket(client, "instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)
It's do one API call:
DEBUG:urllib3.util.retry:Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None)
DEBUG:google.auth.transport.requests:Making request: POST https://oauth2.googleapis.com/token
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG:urllib3.connectionpool:https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.googleapis.com:443
DEBUG:urllib3.connectionpool:https://www.googleapis.com:443 "GET /storage/v1/b/instance-mb-test-1/o/file-1.bin HTTP/1.1" 200 753
Blob size: 104960000
It is important to optimize this methos, because it is often used in a loop, and therefore the number of queries is significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is inspired from Google's code and examples:
I don't get why there are 2 calls for one and not the other, may be I am missing something. Because it can either first get bucket object and then create a blob or create a bucket object and then get blob, looks same to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed it in few places, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get_bucket
method executes reload
method, so it does additional API request.
- First API call:
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/client.py#L227 - Second API call:
https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/bucket.py#L702
Implementation of reload
is available: https://github.com/googleapis/google-cloud-python/blob/master/storage/google/cloud/storage/_helpers.py#L110-L132
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an additional advantage in using both get_bucket
and get_blob
. The get_bucket
method raises an error if the Bucket does not exists. The get_blob
just returns None if the object doesn't exist.
I like this:
client = storage.Client()
bucket = client.get_bucket("instance-mb-test-1")
blob = bucket.get_blob('file-1.bin')
print("Blob size: ", blob.size)
Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, get_blob()
doesn't contain blob.reload
in the latest stable google-cloud-storage
release (1.14.0)
The link you pasted is for master branch and has not yet made through in release :) Hopefully they release it soon and we can remove blob.reload
from our code
blob.reload() | ||
blob_crc32c = blob.crc32c | ||
self.log.info('The crc32c checksum of %s is %s', object, blob_crc32c) | ||
return blob_crc32c |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same comment as
https://github.com/apache/airflow/pull/5054/files#r272844679
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
@@ -193,16 +187,7 @@ def upload(self, bucket, object, filename, | |||
:type mime_type: str | |||
:param gzip: Option to compress file for upload | |||
:type gzip: bool | |||
:param multipart: If True, the upload will be split into multiple HTTP requests. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that multipart support is gone?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko No, Google client library handles the multipart for your now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add that as well in Updating.md
if you think other users might think the same as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a change in API of the operator so should go in UPDATING, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko Added a comment on Multipart
in Updating.md.
@@ -82,7 +83,7 @@ def execute(self, context): | |||
object=self.object, | |||
filename=self.filename) | |||
if self.store_to_xcom_key: | |||
if sys.getsizeof(file_bytes) < 48000: | |||
if sys.getsizeof(file_bytes) < MAX_XCOM_SIZE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one
Codecov Report
@@ Coverage Diff @@
## master #5054 +/- ##
=========================================
- Coverage 76.98% 76.9% -0.08%
=========================================
Files 463 455 -8
Lines 29806 29667 -139
=========================================
- Hits 22945 22816 -129
+ Misses 6861 6851 -10
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small comments
airflow/contrib/hooks/gcs_hook.py
Outdated
if blob_update_time > ts: | ||
return True | ||
else: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'return False' is missing for if blob_udate_time is None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call.
return True | ||
if gzip: | ||
os.remove(filename) | ||
self.log.info('File %s uploaded to %s in %s bucket', filename, object, bucket) | ||
|
||
# pylint:disable=redefined-builtin | ||
def exists(self, bucket, object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: maybe we can change the "object" name in the signature of functions. Since we are introducing backwards-incompatible changes anyway, that might be good time to get rid of the "object" redefinition and remove the pylint disable warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, I have that PR ready. I am trying to keep the changes in this PR to be more on a backwards-compatible side.
The next PR will contain some breaking changes which will contain these name changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what are the intentions of sharing one refactorization for a few PR's? This makes changes much more difficult to review. I see a reason if this change was backwards compatible, but it is not. We have a note in fleUpdating.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The maintain intention is so that we can cherry-pick this one in 1.10.4.
If you look at this PR and check for "breaking-changes" - the one's that are there are not widely used (also in Updating.md).
I wouldn't want to change the name of something like the object
parameter (or even bucket
) and just put a note in Updating.md. We wont cherry-pick the 2nd PR for 1.10.4 and would target 2.0 instead.
They are fundamentally 2 separate pieces: This PR focuses on "Replacing discovery api with client api" and not on "updating parameter name". Also more readable in Changelog.
None of the changes in this PR remove or change any required parameter of any method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes perfect sense @kaxil 👍 . Thanks for explanation.
else: | ||
return False | ||
|
||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you approve this PR if you are ok with it :) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! done. I like it :)
@kaxil We probably shouldn't pull this in to 1.10.4 since it changes the function sig, should we? |
Make sure you have checked all steps below.
Jira
Description
https://cloud.google.com/apis/docs/client-libraries-explained
Google Cloud Client Libraries use our latest client library model and are our recommended option for accessing Cloud APIs programmatically, where available.
https://pypi.org/project/google-cloud-storage/ library is available and we should be using that.
This is Part 1 of probably 3 parts. I am trying to not break any changes in this PR and keep it backward compatible so that we could include it in a patch or minor version release.
The 2nd & 3rd PR would contain some breaking changes and will contain notes in Updating.md
Tests
The current tests already cover some and will add few more tests
Commits
Documentation
Code Quality
flake8
cc @fenglu-g