Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Health store integration #1196

Merged
merged 62 commits into from
Jun 5, 2018
Merged

Health store integration #1196

merged 62 commits into from
Jun 5, 2018

Conversation

hglkrijger
Copy link
Member

@hglkrijger hglkrijger commented May 31, 2018

The intent of this work is to add health signals for the host channel, which we can use to monitor PF rollouts. The following signals are in place:

GuestAgentPluginHeartbeat

  • monitors the host plugin process is alive by calling /health
  • check every 1m, send every 1m, failure > 5m unhealthy
{
  "Version": "1.0",
  "Observations": [
    {
      "IsHealthy": true,
      "Description": "",
      "Value": "",
      "ObservationName": "GuestAgentPluginHeartbeat"
    }
  ],
  "Api": "reporttargethealth",
  "Source": "WALinuxAgent"
}

GuestAgentPluginVersions

  • monitors the host plugin channel is available by calling /versions
  • determine health on usage (process start), send immediately, no failure rollup
{
  "Version": "1.0",
  "Observations": [
    {
      "Description": "",
      "Value": "",
      "IsHealthy": true,
      "ObservationName": "GuestAgentPluginVersions"
    }
  ],
  "Api": "reporttargethealth",
  "Source": "WALinuxAgent"
}

GuestAgentPluginArtifact

  • monitors the /extensionArtifact api
  • failure is determined by status codes 5xx, except 502, which is an upstream failure
  • determine health on usage (extension/manifest/agent download), send every 1m, failure > 5m unhealthy
  • we fetch for the purpose of extension handling (WireClient) and agent updates (GuestAgent), and may want to distinguish between these signals, so the source is listed in the Description field
  • WireClient:
{
  "Version": "1.0",
  "Observations": [
    {
      "IsHealthy": true,
      "Description": "WireClient",
      "Value": "",
      "ObservationName": "GuestAgentPluginArtifact"
    }
  ],
  "Api": "reporttargethealth",
  "Source": "WALinuxAgent"
}
  • GuestAgent:
{
  "Version": "1.0",
  "Observations": [
    {
      "IsHealthy": true,
      "Description": "GuestAgent",
      "Value": "",
      "ObservationName": "GuestAgentPluginArtifact"
    }
  ],
  "Api": "reporttargethealth",
  "Source": "WALinuxAgent"
}

GuestAgentPluginStatus

  • monitors the /status api
  • failure is determined by status codes 5xx, except 502, which is an upstream failure
  • check every goal_state_interval (3s), send every 1m, failure > 5m unhealthy
{
  "Api": "reporttargethealth",
  "Observations": [
    {
      "Value": "",
      "ObservationName": "GuestAgentPluginStatus",
      "Description": "",
      "IsHealthy": true
    }
  ],
  "Version": "1.0",
  "Source": "WALinuxAgent"
}

Observations are reported to http://{wireserver_ip}:80/HealthService

Additional changes

  • removed 400s from the ResourceGoneError; this is legacy host plugin behavior which has since been fixed
  • switched extension package downloads from the base handler to client fetch, to simplify number of changes
  • updated the logging prefix to ExtHandler, for brevity
  • removed enable events from the logs, for clarity
  • updated tests accordingly

@@ -62,6 +61,10 @@
httpclient.ACCEPTED
]

UPSTREAM_FAILURE_CODES = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HOSTPLUGIN_UPSTREAM_FAILURE_CODES ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -1117,7 +1113,7 @@ def set_handler_status(self, status="NotReady", message="", code=0):
except (IOError, ValueError, ProtocolError) as e:
fileutil.clean_ioerror(e,
paths=[status_file])
self.logger.error("Failed to save handler status: {0}", traceback.format_exc())
self.logger.error("Failed to save handler status: {0}", ustr(e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ustr(e) [](start = 67, length = 8)

The traceback was valuable in locating the cause. I would not remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this because I found that it was not populated, due to the way the threads work. Are you sure this specific one was useful? If so I can add it back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, crap. Now you're fighting my memory, and I only half trust it. I thought this was a code path where the traceback was used to diagnose a failure. Your mind is fresh. If you think it is useless, please remove it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no harm in having both - I will add it back along with the error message, in case I missed something.


if datetime.datetime.utcnow() < (last_host_plugin_heartbeat + MonitorHandler.HOST_PLUGIN_HEARTBEAT_PERIOD):
return last_host_plugin_heartbeat

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This construct is used over and over again. This should be refactored into something for better re-use. Please open an issue if you cannot address it for this commit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made some changes here, which I will push. Let me know if you think it still requires further refactoring.

or other_errors > 0:

msg = "hostplugin:{0};protocol:{1};other:{2}"\
.format(hostplugin_errors,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you just re-format the file? It's hard to see real changes when the formatting gets adjusted this much. Keep the format changes because I want them, but consider doing them in a distinct phase to make code review easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually keep reformats/whitespace to separate commits, but in a review this size I understand it becomes too difficult to look at them separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, hell, this stomps all over work I've been doing for cgroups. Again. I see another two hours of rebasing ahead of me.

@@ -540,7 +542,7 @@ def _purge_agents(self):

known_versions = [agent.version for agent in self.agents]
if CURRENT_VERSION not in known_versions:
logger.info(
logger.verbose(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

self._validate_hostplugin_args(
patch_http.call_args_list[0],
test_goal_state,
exp_method, exp_url, exp_data)

# second call is to health service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second call is to health service [](start = 16, length = 34)

Thank you, this makes it so much easier to read.

jstr = patch_http_post.call_args_list[1][0][1]
obj = json.loads(jstr)
self.assertEqual(1, len(obj['Observations']))
self.assertFalse(obj['Observations'][0]['IsHealthy'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.assertFalse(obj['Observations'][0]['IsHealthy']) [](start = 7, length = 54)

I think this is really what you want the test to assert, but it takes a lot of code to actually get there. I have found this to be true for other parts of the agent. 😢



@patch("azurelinuxagent.common.utils.restutil.http_get")
def test_health(self, patch_http_get):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_health [](start = 8, length = 11)

I think there should be a test case for HTTP 502. I think there should be a test case of /extensionArtifact and /vmAgentLog. What else am I missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few tests still coming.

Copy link
Member

@boumenot boumenot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

Copy link
Member

@boumenot boumenot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, but otherwise LGTM.

if last_timestamp is None:
last_timestamp = datetime.datetime.utcnow() - period

return datetime.datetime.utcnow() >= (last_timestamp + period)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I see it. Would you please add a docstring to the method to make that clearer.

@@ -1117,7 +1113,7 @@ def set_handler_status(self, status="NotReady", message="", code=0):
except (IOError, ValueError, ProtocolError) as e:
fileutil.clean_ioerror(e,
paths=[status_file])
self.logger.error("Failed to save handler status: {0}", traceback.format_exc())
self.logger.error("Failed to save handler status: {0}", ustr(e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, crap. Now you're fighting my memory, and I only half trust it. I thought this was a code path where the traceback was used to diagnose a failure. Your mind is fresh. If you think it is useless, please remove it.

self.assertEqual(None, monitor_handler.last_telemetry_heartbeat)

monitor_handler.start()
time.sleep(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change this to milliseconds instead, or patch time to avoid sleeping in a unit test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made these shorter. Since I want to test how the thread sleep functions, I felt patching time would negate the value of the test.

self.assertEqual(200, len(o.value))
self.assertEqual(200, len(o.description))

self.assertEqual(64, len(o.as_obj['ObservationName']))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you asserting length? What do the values 64 and 128 represent? Why don't you just assert the contents? I think I am missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is to assert the maximum string length for these fields; longer values cause 400s. The other tests do assert value. Specifically this ensures that even though we set a 200 character string, the json representation is 64/128.

self.assertEqual(1, patch_should_report.call_count)
self.assertEqual(1, patch_report_status.call_count)

def test_should_report(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is test clarifies a lot of things for me - nice!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome.

self.assertNotEqual(0, patch_hostplugin_heartbeat.call_count)
self.assertNotEqual(0, patch_send_events.call_count)
self.assertNotEqual(0, patch_telemetry_heartbeat.call_count)

monitor_handler.should_run = False
monitor_handler.stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change!

@hglkrijger
Copy link
Member Author

Merging, since this is approved, but if there is any more feedback I will be happy to address it in the next PR for integrating IMDS health signals.

@hglkrijger hglkrijger merged commit 1cf3b74 into Azure:master Jun 5, 2018
@hglkrijger hglkrijger deleted the health-store branch June 5, 2018 21:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants