Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resilience of extension downloads #1463

Merged
merged 3 commits into from
Feb 14, 2019

Conversation

narrieta
Copy link
Member

Description

  • Expanding the extension zip should be done inside the loop that tries the different uris
  • Added 5 tries at 1 min interval
  • Unzipping a bad file throws Exception, not IoError
  • Fixed cleanup logic when a failure occurs

PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made and Travis.CI is passing.

Quality of Code and Contribution Guidelines

Copy link
Contributor

@pgombar pgombar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the worst scenario (when the download sequence fails for all uris, after all retries), we would raise an exception with the message "Failed to download extension". However, this complete failure can be because:

  1. we actually fail to download from all uris, or
  2. we download a corrupt zip every time (the publisher messed up) and we fail to expand.
    Currently, we are not explicitly addressing 2). Would it be worth it to add another type of exception for this case?

Also, should we be testing if we've properly cleaned up invalid downloads/expansions?

file_downloaded = self.protocol.download_ext_handler_pkg(uri.uri, destination)
if not package_exists:
downloaded = False
i = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a better name for this variable might be "retry_count"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. The loop is short and very simple; I think in this case the longer variable name wouldn't add much.

self.download_failures = 0

def download_ext_handler_pkg(_uri, destination):
if self.download_failures < 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use NUMBER_OF_DOWNLOAD_RETRIES - 2 instead of 3? It seems like an arbitrary number otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this related to the number of uris you initialize in the test case set up method? In that case, I would also refer to len(self.pkg.uris).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an arbitrary number :) - Read it as "fail a few times, then succeed".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added comment to code

self.download_failures = 0

def download_ext_handler_pkg(_uri, destination):
if self.download_failures < 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

self.download_failures = 0

def download_ext_handler_pkg(_uri, destination):
if self.download_failures < 3:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

@narrieta
Copy link
Member Author

@pgombar

2. we download a corrupt zip every time (the publisher messed up) and we fail to expand.
Currently, we are not explicitly addressing 2). Would it be worth it to add another type of exception for this case?

We are addressing both cases (with the same exception). The log will contain the specific errors. There is a different work item to add telemetry for download failures; over there it may be useful to differentiate both cases.

@narrieta
Copy link
Member Author

@pgombar

Also, should we be testing if we've properly cleaned up invalid downloads/expansions?

test_it_should_ignore_existing_extension_package_when_it_is_invalid verifies that, if for any reason we don't cleanup an invalid zip, we re-download.

I'll add a couple of asserts to test_it_should_raise_an_exception_when_all_downloads_fail to check that the zip & extension dir are gone

@pgombar
Copy link
Contributor

pgombar commented Feb 13, 2019

Thanks, Norberto. LGTM.

Copy link
Member

@vrdmr vrdmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor comments. Not blocker, but can be done as a part of later releases.

return False
return True

def _unzip_extension_package(self, source_file, target_directory):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a clarification - Would the issue of corrupt zip arise for Agent as well? and we need similar logic in azurelinuxagent.ga.update.GuestAgent#_unpack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, and I intend to fix that as well. However, this is a hotfix release, so I do not want to add any other changes to it. Thanks for pointing it out.

except Exception as exception:
self.logger.info("Error downloading extension package: {0}", ustr(exception))
if os.path.exists(target_file):
os.remove(target_file)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fileutil.clean_ioerror already does this partially. We should change (overload) this as util and use that in both the place (756, 766 and 768).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function seems to be meant to handle very specific errors and cannot handle this one in particular - it has a check for the exception type. Since that function is used in many places in the code, I decided to leave it alone for the moment.

if not file_downloaded:
raise ExtensionDownloadError("Failed to download extension", code=1001)
self.logger.info("Failed to download the extension package from all uris, will retry after a minute")
time.sleep(60)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! 👍

fileutil.clean_ioerror(e, paths=[self.get_base_dir(), self.pkg_file])
raise ExtensionError(u"Failed to unzip extension package", e, code=1001)
if not downloaded:
raise ExtensionDownloadError("Failed to download extension", code=1001)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a task. We need to document these magical numbers. (such as 1001).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

protocol = WireProtocol("http://Microsoft.CPlat.Core.RunCommandLinux/foo-bar")

self.pkg = ExtHandlerPackage()
self.pkg.uris = [ ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri(), ExtHandlerVersionUri() ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change ExtHandlerVersionUri to the following, It would make both the class as well as the test a little cleaner:

class ExtHandlerVersionUri(DataContract):
    def __init__(self, uri=None):
        self.uri = uri
self.pkg.uris = [ 
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr00a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'), 
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr01a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'), 
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'), 
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0'), 
ExtHandlerVersionUri('https://zrdfepirv2cy4prdstr02a.blob.core.windows.net/f72653efd9e349ed9842c8b99e4c1712-foobar/Microsoft.CPlat.Core__RunCommandLinux__1.0.0')
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtHandlerVersionUri is meant to be used to deserialize from xml, etc. The class has other members other than uri. I think it would not be worth it to add all of those parameters to init just for the sake of this test.

@narrieta narrieta merged commit 0f7f375 into Azure:release-2.2.37 Feb 14, 2019
@narrieta narrieta deleted the download-error branch February 14, 2019 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants