
add fluentd logging driver config check #4592

Conversation

@juanvallejo (Contributor) commented Jun 26, 2017:

Related trello card: https://trello.com/c/t6H0D8Ox

Checks whether the fluentd pod is configured to aggregate logs from journald or json-file, and verifies that the host's Docker configuration matches the expected logging driver.

TODO

  • add tests

cc @sosiouxme @rhcarvalho @brenton
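
For orientation, the core of the check is a comparison like the following sketch (function and variable names here are illustrative, not the PR's actual code):

    # Sketch: compare Fluentd's expected log source with Docker's driver.
    def check_logging_driver_matches(use_journald, docker_logging_driver):
        """Return an error string if Fluentd's log source disagrees with
        Docker's logging driver, or None if they match."""
        expected_driver = "journald" if use_journald else "json-file"
        if docker_logging_driver != expected_driver:
            return ("Fluentd expects the '{}' Docker logging driver, but this "
                    "host is using '{}'.").format(expected_driver,
                                                  docker_logging_driver)
        return None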

Returns an error string if the above condition is not met, or None otherwise."""

pod_name = fluentd_pods[0]["metadata"]["name"]
use_journald = self._exec_oc("exec {} /bin/printenv USE_JOURNAL".format(pod_name), [], task_vars)
Contributor Author:

@sosiouxme I decided to read the value of the env var rather than the ansible variable (openshift_logging_fluentd_use_journal) since I believe the ansible variable is not set by default (https://github.com/openshift/openshift-ansible/blob/master/inventory/byo/hosts.origin.example).

Member:

We have one extra problem here that just occurred to me: this check only runs against one master. It will correctly check whether the Docker daemon there has a matching journald driver, but it won't check any of the other hosts. Covering them all may be tricky to accomplish.

Contributor Author:

Hmm, we may need a separate fluentd check that does not inherit from LoggingCheck.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 5b998dd to 7412780 Compare June 26, 2017 21:50
@juanvallejo:

aos-ci-test

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 2 times, most recently from 75f7a0f to c34475b Compare June 27, 2017 18:53
@juanvallejo:

@sosiouxme went ahead and moved the fluentd config check to its own file. PTAL

"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
fluentd_pods, error = self.get_pods_for_component(
Member:

This and other oc calls will only work on master, though. That's what makes this tricky. We could say to just look at the logging var to see what fluentd is supposed to have and not check what it actually has. Otherwise we're gonna have to do some really weird delegation to run oc against master while we're looking at everything else.

Contributor Author:

I was hesitant to rely on the logging var since its value is not set by default, but I guess for now it might be the easiest starting point for this. Will update

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from c34475b to 6b3bc00 Compare June 27, 2017 20:59
then the value of the configured Docker logging driver should be "journald".
Otherwise, the value of the Docker logging driver should be "json-file".
Returns an error string if the above condition is not met, or None otherwise."""
use_journald = get_var(task_vars, "openshift_logging_fluentd_use_journal", default=True)
Contributor Author:

@sosiouxme removed use of ocutil in favor of using the value of the Ansible variable. Defaulted the value to True, making the check assume journald as the default logging driver unless a user explicitly sets it to False. Not entirely sure about this, however.

Member:

This is correct

Contributor:

May or may not have any consequence to this PR, but in any case, FWIW I see this variable was deprecated 14 days ago:

Member:

It's not just deprecated, it's obviated... with 3.6 fluentd looks at docker config to use the same driver. That makes this whole check kind of pointless for 3.6. And I'm not sure what we plan to tell customers about running checks against earlier versions?

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 6b3bc00 to 0d803ad Compare June 27, 2017 21:21
@juanvallejo:

aos-ci-test

@openshift-bot:

error: aos-ci-jenkins/OS_3.5_NOT_containerized for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 0d803ad (logs)

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 0d803ad to 7c0b63f Compare June 29, 2017 18:11
@juanvallejo:

@sosiouxme switched back to using ocutil if check is running on a master. PTAL

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from a58bbba to 9bbc959 Compare July 6, 2017 19:50
@sosiouxme (Member):

Ran into this while testing:

     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "fluentd_config":
               Invalid value received from command exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL: Unexpected error using `oc` to validate the logging stack components.
               Error executing `oc exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL`:
               [rc 1] /usr/local/bin/oc --config /etc/origin/master/admin.kubeconfig -n logging exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL

Unfortunately there doesn't seem to be any more helpful debug output.

@sosiouxme (Member):

But speaking of the above, why try to exec into the pod? You should have the pod definition right there to look at what the env vars are.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 2 times, most recently from 148e381 to efd978b Compare July 7, 2017 22:57
@juanvallejo:

@sosiouxme

> But speaking of the above, why try to exec into the pod? You should have the pod definition right there to look at what the env vars are.

Thanks, updated check / tests to do this instead
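
For illustration, reading the env var from the pod definition might look like the following hedged sketch (the helper name is hypothetical; the nested structure is standard Kubernetes pod JSON, and fluentd_pods is the list the check already fetched):

    # Sketch: pull USE_JOURNAL out of the pod spec returned by
    # `oc get pods -o json`, with no need to exec into the pod.
    def pod_env_value(pod, var_name):
        """Return an env var's value from the pod's container specs, or None."""
        for container in pod["spec"]["containers"]:
            for env_var in container.get("env", []):
                if env_var.get("name") == var_name:
                    return env_var.get("value")
        return None

    use_journald = pod_env_value(fluentd_pods[0], "USE_JOURNAL")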

@sosiouxme (Member) left a comment:

Tiny nits; but this is going to take me a bit longer to review properly

@@ -0,0 +1,190 @@
"""
Module for performing checks on an Fluentd logging deployment configuration
Member:

s/an/a/


running_fluentd_pods = [pod for pod in fluentd_pods if pod['status']['phase'] == 'Running']
if not len(running_fluentd_pods):
msg = ('No Fluentd pods were found to be in the "Running" state.'
Member:

needs a space after the period

@rhcarvalho (Contributor) left a comment:

Hmm, I just found some old comments I forgot to submit, maybe something is still applicable?

def run(self, tmp, task_vars):
"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
Contributor:

I'm noticing this is perhaps unnecessarily different from the approach in https://github.com/openshift/openshift-ansible/pull/4682/files#diff-a2a36e6105b8e46e1282819c771393beR22

Better to choose a single way of doing it.

Contributor Author:

Thanks, will stick to a default class field

config_error = self.check_logging_config(task_vars)
if config_error:
msg = ("The following Fluentd logging configuration problem was found:"
"\n-------\n"
Contributor:

Why a separator? We don't seem to include it in any other error message.

Contributor Author:

Right now, we seem to use it for similarly-worded error messages in the logging stack checks.

Member:

I think that's a legacy from when the logging check was monolithic and you could get a bunch of different msgs for the different components under one check. Dividers probably don't make sense with them all split out in separate checks.

Contributor Author:

> I think that's a legacy from when the logging check was monolithic and you could get a bunch of different msgs for the different components under one check. Dividers probably don't make sense with them all split out in separate checks.

ack, will go ahead and remove them from existing checks as well

"{}".format(config_error))
return {"failed": True, "changed": False, "msg": msg}

return {"failed": False, "changed": False, "msg": 'No problems found with Fluentd logging configuration.'}
Contributor:

How is the "msg" used when "failed": False?

Member:

It's not, except for showing up in -v output.

Contributor:

I suspect in '-v' output it will appear in the middle of a big JSON object? I'd rather have less noise; failed: False conveys the same info.

I also have mixed feelings about the "changed" key. I don't feel confident we use it consistently and correctly, and we don't really use it in any meaningful way. I remember I wanted to "be prepared" for supporting check mode, but thus far I feel we're only carrying extra code.

Member:

You're right, I don't see a need for a msg here. No one will look at it. There was some place where it just made the logic cleaner to set one but there's no need most of the time.

"changed" response doesn't really have anything to do with check-mode. It's just a record that something was changed during this task, which gets summed up in the output at the end. Idempotency seems to be a big deal for Ansible users so reporting whether you actually changed something (e.g. installed a dependency) or not may matter to some. It does not seem critical to me but I'm not an admin. If we want a better way to track this we could set an instance variable and have the action plugin look at them to determine whether anything changed (even if an exception was thrown) and just keep all the "changed" response logic out of the actual checks.

Contributor Author:

Will remove "changed" and "msg" keys

'from your Docker containers.\n').format(driver=logging_driver)

if error:
error += ('To resolve this issue, add the following variable to your Ansible inventory file:\n\n'
Contributor:

Suggestion: add a newline \n at the beginning of this string here, and remove it from the end of both error messages above.
From a maintenance point of view, this makes it less likely that we introduce a new error string and have it concatenated with no space or newline in between.
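
Concretely, the suggested pattern looks like this (message text abbreviated):

    # The appended hint carries its own leading newline, so the error
    # messages built above it need no trailing one.
    error = "Example error message."
    if error:
        error += ('\nTo resolve this issue, add the following variable to your '
                  'Ansible inventory file: ...')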

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 3 times, most recently from af33581 to f422f41 Compare July 14, 2017 21:59
return check


def canned_fluentd_pod(containers=[]):
Contributor:

You have a flake8 error here. Not a problem in this case, but sometimes mutable containers in arguments cause hairy problems...
It seems we never call the function without an argument, so why not simply make it required? (s/=[]//)
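
The hazard alluded to here is the classic Python mutable-default pitfall: the default list is created once, at function definition time, and shared across calls.

    def append_to(item, bucket=[]):
        bucket.append(item)
        return bucket

    print(append_to(1))  # [1]
    print(append_to(2))  # [1, 2] -- same list object as the first call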


def run(self, tmp, task_vars):
"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

Contributor:

extraneous empty line

@juanvallejo:

@rhcarvalho thanks for the review, comments addressed

@@ -68,7 +65,7 @@ def assert_error(error, expect_error):
),
])
def test_check_kibana(pods, expect_error):
check = canned_kibana()
check = kibana()
Contributor:

Similar to another review today, maybe this is just Kibana() and no need for a fixture.

components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
return tuple(components)
Contributor:

If you want to return a tuple, no need to create a list first...

components = tuple(int(x) for x in components[:2])
return components

if components[0] in openshift_major_release_version:
components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
Contributor:

Can't openshift_image_tag carry a non-numeric version component, like rcN or beta or alpha, etc?

@juanvallejo (Jul 27, 2017):

not sure
cc @sosiouxme

EDIT: we currently use the value of openshift_image_tag in the package_version check

Member:

Yeah it can pretty much be anything. It's an opaque string.

@juanvallejo (Jul 27, 2017):

How should we handle non-numeric version values: try/except? Or maybe fail the check altogether?

@sosiouxme (Jul 27, 2017):

I spoke too soon. Actually openshift_version asserts that it looks like a version string. The requirements for origin vs. enterprise are a little different, but we should be confident that it has an optional v and two numeric components at the beginning. There can be any kind of garbage after the base version in origin, though.

(Sorry; I spent too long looking at this at some point in the past, figured it all out, then forgot it. That said, should possibly revisit to see if there's something better to base it on.)

Contributor Author:

Sounds good, I am only focusing on the first two numeric components of the string as it is. Not sure if there is a better var to base this on, but will take a look at sample inventory files

Contributor:

For origin the expression includes arbitrary letters, but the first two components should be numeric.

openshift_image_tag|match('(^v?\\d+\\.\\d+\\.\\d+(-[\\w\\-\\.]*)?$)')

We should be all set, thanks.

@@ -105,6 +105,28 @@ def get_var(self, *keys, **kwargs):
raise OpenShiftCheckException("'{}' is undefined".format(".".join(map(str, keys))))
return value

@staticmethod
def get_openshift_version(openshift_image_tag):
Contributor:

Maybe this should have a more specific name, since it strips out version information? It is returning MAJOR.MINOR and throwing away the rest.
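
A sketch of what a more specifically named helper might look like (name and regex are illustrative; the real checks raise OpenShiftCheckException, as noted later in this thread, rather than ValueError):

    import re

    def get_major_minor_version(openshift_image_tag):
        """Parse tags like "v3.6.0-alpha.1" or "3.6" into (3, 6), discarding
        anything after the first two numeric components."""
        match = re.match(r"v?(\d+)\.(\d+)", openshift_image_tag)
        if match is None:
            raise ValueError("cannot parse a version from openshift_image_tag "
                             "{!r}".format(openshift_image_tag))
        return int(match.group(1)), int(match.group(2))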

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 3 times, most recently from 9e03112 to c0591d3 Compare July 27, 2017 18:16
@sosiouxme (Member):

Some refactoring on the way that parallels some of this: #4913

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from c0591d3 to 153fbaa Compare July 27, 2017 21:04
@juanvallejo:

aos-ci-test

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 153fbaa (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 153fbaa (logs)

logging_deployed = self.get_var("openshift_hosted_logging_deploy", default=False)

try:
version = self.get_major_minor_version(self.get_var("openshift_image_tag"))
Contributor:

Note that this can raise OpenShiftCheckException. We should protect the action plugin call to is_active with a try... except block for cases like this, to prevent the action plugin run from blowing up. I can include that in my current work branch.

Member:

Yeah, good idea. While you're at it, it's probably best to add an except Exception to catch any dumb bug we leave in one check, so the others still run.
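
A sketch of the guarded loop in the action plugin (structure and names are illustrative; OpenShiftCheckException is the checks' own exception class mentioned above):

    for check in known_checks:
        try:
            if not check.is_active(task_vars):
                continue
            result = check.run(tmp, task_vars)
        except OpenShiftCheckException as exc:
            result = {"failed": True, "msg": str(exc)}
        except Exception as exc:  # a bug in one check must not stop the others
            result = {"failed": True, "msg": "unexpected error: {}".format(exc)}
        check_results[check.name] = result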

@juanvallejo:

@rhcarvalho tagging this for [merge] per scrum discussion

@openshift-bot:

[test]ing while waiting on the merge queue

@juanvallejo:

re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 153fbaa to 04154a2 Compare August 1, 2017 21:25
@openshift-bot:

Evaluated for openshift ansible merge up to 04154a2

@openshift-bot:

Evaluated for openshift ansible test up to 04154a2

@sosiouxme (Member):

[ERROR] Status 'fedora/25/atomic' is in the 'failure' state. This status must succeed for a merge.

Wait - what? That's new.

@openshift-bot:

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/391/) (Base Commit: 01e4893) (PR Branch Commit: 04154a2)

@sdodson (Member) commented Aug 2, 2017:

aos-ci-test

@sdodson (Member) commented Aug 2, 2017:

bot, retest this

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 04154a2 (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 04154a2 (logs)

@openshift-bot commented Aug 2, 2017:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/780/) (Base Commit: beb5dea) (PR Branch Commit: 04154a2)

@openshift-bot openshift-bot merged commit b22ebdc into openshift:master Aug 2, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-fluentd-logging-config branch August 2, 2017 13:49