-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Health store integration #1196
Health store integration #1196
Conversation
@@ -62,6 +61,10 @@ | |||
httpclient.ACCEPTED | |||
] | |||
|
|||
UPSTREAM_FAILURE_CODES = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HOSTPLUGIN_UPSTREAM_FAILURE_CODES ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -1117,7 +1113,7 @@ def set_handler_status(self, status="NotReady", message="", code=0): | |||
except (IOError, ValueError, ProtocolError) as e: | |||
fileutil.clean_ioerror(e, | |||
paths=[status_file]) | |||
self.logger.error("Failed to save handler status: {0}", traceback.format_exc()) | |||
self.logger.error("Failed to save handler status: {0}", ustr(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ustr(e) [](start = 67, length = 8)
The traceback was valuable in locating the cause. I would not remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this because I found that it was not populated, due to the way the threads work. Are you sure this specific one was useful? If so I can add it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, crap. Now you're fighting my memory, and I only half trust it. I thought this was a code path where the traceback was used to diagnose a failure. Your mind is fresh. If you think it is useless, please remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no harm in having both - I will add it back along with the error message, in case I missed something.
azurelinuxagent/ga/monitor.py
Outdated
|
||
if datetime.datetime.utcnow() < (last_host_plugin_heartbeat + MonitorHandler.HOST_PLUGIN_HEARTBEAT_PERIOD): | ||
return last_host_plugin_heartbeat | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This construct is used over and over again. This should be refactored into something for better re-use. Please open an issue if you cannot address it for this commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made some changes here, which I will push. Let me know if you think it still requires further refactoring.
or other_errors > 0: | ||
|
||
msg = "hostplugin:{0};protocol:{1};other:{2}"\ | ||
.format(hostplugin_errors, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you just re-format the file? It's hard to see real changes when the formatting gets adjusted this much. Keep the format changes because I want them, but consider doing them in a distinct phase to make code review easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually keep reformats/whitespace to separate commits, but in a review this size I understand it becomes too difficult to look at them separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, hell, this stomps all over work I've been doing for cgroups. Again. I see another two hours of rebasing ahead of me.
@@ -540,7 +542,7 @@ def _purge_agents(self): | |||
|
|||
known_versions = [agent.version for agent in self.agents] | |||
if CURRENT_VERSION not in known_versions: | |||
logger.info( | |||
logger.verbose( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
self._validate_hostplugin_args( | ||
patch_http.call_args_list[0], | ||
test_goal_state, | ||
exp_method, exp_url, exp_data) | ||
|
||
# second call is to health service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
second call is to health service [](start = 16, length = 34)
Thank you, this makes it so much easier to read.
jstr = patch_http_post.call_args_list[1][0][1] | ||
obj = json.loads(jstr) | ||
self.assertEqual(1, len(obj['Observations'])) | ||
self.assertFalse(obj['Observations'][0]['IsHealthy']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.assertFalse(obj['Observations'][0]['IsHealthy']) [](start = 7, length = 54)
I think this is really what you want the test to assert, but it takes a lot of code to actually get there. I have found this to be true for other parts of the agent. 😢
|
||
|
||
@patch("azurelinuxagent.common.utils.restutil.http_get") | ||
def test_health(self, patch_http_get): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_health [](start = 8, length = 11)
I think there should be a test case for HTTP 502. I think there should be a test case of /extensionArtifact and /vmAgentLog. What else am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are quite a few tests still coming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🕐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments, but otherwise LGTM.
if last_timestamp is None: | ||
last_timestamp = datetime.datetime.utcnow() - period | ||
|
||
return datetime.datetime.utcnow() >= (last_timestamp + period) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I see it. Would you please add a docstring to the method to make that clearer.
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -1117,7 +1113,7 @@ def set_handler_status(self, status="NotReady", message="", code=0): | |||
except (IOError, ValueError, ProtocolError) as e: | |||
fileutil.clean_ioerror(e, | |||
paths=[status_file]) | |||
self.logger.error("Failed to save handler status: {0}", traceback.format_exc()) | |||
self.logger.error("Failed to save handler status: {0}", ustr(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, crap. Now you're fighting my memory, and I only half trust it. I thought this was a code path where the traceback was used to diagnose a failure. Your mind is fresh. If you think it is useless, please remove it.
tests/ga/test_monitor.py
Outdated
self.assertEqual(None, monitor_handler.last_telemetry_heartbeat) | ||
|
||
monitor_handler.start() | ||
time.sleep(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this to milliseconds instead, or patch time to avoid sleeping in a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made these shorter. Since I want to test how the thread sleep functions, I felt patching time would negate the value of the test.
self.assertEqual(200, len(o.value)) | ||
self.assertEqual(200, len(o.description)) | ||
|
||
self.assertEqual(64, len(o.as_obj['ObservationName'])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you asserting length? What do the values 64 and 128 represent? Why don't you just assert the contents? I think I am missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is to assert the maximum string length for these fields; longer values cause 400s. The other tests do assert value. Specifically this ensures that even though we set a 200 character string, the json representation is 64/128.
self.assertEqual(1, patch_should_report.call_count) | ||
self.assertEqual(1, patch_report_status.call_count) | ||
|
||
def test_should_report(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is test clarifies a lot of things for me - nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome.
tests/ga/test_monitor.py
Outdated
self.assertNotEqual(0, patch_hostplugin_heartbeat.call_count) | ||
self.assertNotEqual(0, patch_send_events.call_count) | ||
self.assertNotEqual(0, patch_telemetry_heartbeat.call_count) | ||
|
||
monitor_handler.should_run = False | ||
monitor_handler.stop() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change!
Merging, since this is approved, but if there is any more feedback I will be happy to address it in the next PR for integrating IMDS health signals. |
The intent of this work is to add health signals for the host channel, which we can use to monitor PF rollouts. The following signals are in place:
GuestAgentPluginHeartbeat
/health
GuestAgentPluginVersions
/versions
GuestAgentPluginArtifact
/extensionArtifact
api5xx
, except502
, which is an upstream failureDescription
fieldGuestAgentPluginStatus
/status
api5xx
, except502
, which is an upstream failureObservations are reported to
http://{wireserver_ip}:80/HealthService
Additional changes
ResourceGoneError
; this is legacy host plugin behavior which has since been fixedfetch
, to simplify number of changesExtHandler
, for brevity