-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve resilience of extension downloads #1463
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the worst scenario (when the download sequence fails for all uris, after all retries), we would raise an exception with the message "Failed to download extension". However, this complete failure can be because:
- we actually fail to download from all uris, or
- we download a corrupt zip every time (the publisher messed up) and we fail to expand.
Currently, we are not explicitly addressing 2). Would it be worth it to add another type of exception for this case?
Also, should we be testing if we've properly cleaned up invalid downloads/expansions?
file_downloaded = self.protocol.download_ext_handler_pkg(uri.uri, destination) | ||
if not package_exists: | ||
downloaded = False | ||
i = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: a better name for this variable might be "retry_count"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The loop is short and very simple; I think in this case the longer variable name wouldn't add much.
self.download_failures = 0 | ||
|
||
def download_ext_handler_pkg(_uri, destination): | ||
if self.download_failures < 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use NUMBER_OF_DOWNLOAD_RETRIES - 2 instead of 3? It seems like an arbitrary number otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or is this related to the number of uris you initialize in the test case set up method? In that case, I would also refer to len(self.pkg.uris).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is an arbitrary number :) - Read it as "fail a few times, then succeed".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added comment to code
self.download_failures = 0 | ||
|
||
def download_ext_handler_pkg(_uri, destination): | ||
if self.download_failures < 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
self.download_failures = 0 | ||
|
||
def download_ext_handler_pkg(_uri, destination): | ||
if self.download_failures < 3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
We are addressing both cases (with the same exception). The log will contain the specific errors. There is a different work item to add telemetry for download failures; over there it may be useful to differentiate both cases. |
test_it_should_ignore_existing_extension_package_when_it_is_invalid verifies that, if for any reason we don't cleanup an invalid zip, we re-download. I'll add a couple of asserts to test_it_should_raise_an_exception_when_all_downloads_fail to check that the zip & extension dir are gone |
Thanks, Norberto. LGTM. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Minor comments. Not blocker, but can be done as a part of later releases.
return False | ||
return True | ||
|
||
def _unzip_extension_package(self, source_file, target_directory): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a clarification - Would the issue of corrupt zip arise for Agent as well? and we need similar logic in azurelinuxagent.ga.update.GuestAgent#_unpack
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, and I intend to fix that as well. However, this is a hotfix release, so I do not want to add any other changes to it. Thanks for pointing it out.
except Exception as exception: | ||
self.logger.info("Error downloading extension package: {0}", ustr(exception)) | ||
if os.path.exists(target_file): | ||
os.remove(target_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fileutil.clean_ioerror
already does this partially. We should change (overload) this as util and use that in both the place (756, 766 and 768).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function seems to be meant to handle very specific errors and cannot handle this one in particular - it has a check for the exception type. Since that function is used in many places in the code, I decided to leave it alone for the moment.
if not file_downloaded: | ||
raise ExtensionDownloadError("Failed to download extension", code=1001) | ||
self.logger.info("Failed to download the extension package from all uris, will retry after a minute") | ||
time.sleep(60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call! 👍
fileutil.clean_ioerror(e, paths=[self.get_base_dir(), self.pkg_file]) | ||
raise ExtensionError(u"Failed to unzip extension package", e, code=1001) | ||
if not downloaded: | ||
raise ExtensionDownloadError("Failed to download extension", code=1001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open a task. We need to document these magical numbers. (such as 1001).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
protocol = WireProtocol("http://Microsoft.CPlat.Core.RunCommandLinux/foo-bar") | ||
|
||
self.pkg = ExtHandlerPackage() | ||
self.pkg.uris = [ ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri() ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change ExtHandlerVersionUri
to the following, It would make both the class as well as the test a little cleaner:
class ExtHandlerVersionUri(DataContract):
def __init__(self, uri=None):
self.uri = uri
self.pkg.uris = [
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr00a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'),
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr01a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'),
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'),
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'),
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0')
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ExtHandlerVersionUri is meant to be used to deserialize from xml, etc. The class has other members other than uri. I think it would not be worth it to add all of those parameters to init just for the sake of this test.
Description
PR information
Quality of Code and Contribution Guidelines