
add fluentd logging driver config check #4592

Conversation

@juanvallejo (Contributor) commented Jun 26, 2017:

Related trello card: https://trello.com/c/t6H0D8Ox

Checks whether the fluentd pod is configured to aggregate logs from journald or json-file, and verifies that the host's Docker configuration matches the expected logging driver.

TODO

  • add tests

cc @sosiouxme @rhcarvalho @brenton
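
For orientation, the core of the check is a comparison like the following sketch (function and variable names here are illustrative, not the PR's actual code):

    # Sketch: compare Fluentd's expected log source with Docker's driver.
    def check_logging_driver_matches(use_journald, docker_logging_driver):
        """Return an error string if Fluentd's log source disagrees with
        Docker's logging driver, or None if they match."""
        expected_driver = "journald" if use_journald else "json-file"
        if docker_logging_driver != expected_driver:
            return ("Fluentd expects the '{}' Docker logging driver, but this "
                    "host is using '{}'.").format(expected_driver,
                                                  docker_logging_driver)
        return None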

Returns an error string if the above condition is not met, or None otherwise."""

pod_name = fluentd_pods[0]["metadata"]["name"]
use_journald = self._exec_oc("exec {} /bin/printenv USE_JOURNAL".format(pod_name), [], task_vars)
Contributor Author:

@sosiouxme I decided to read the value of the env var rather than the ansible variable (openshift_logging_fluentd_use_journal) since I believe the ansible variable is not set by default (https://github.com/openshift/openshift-ansible/blob/master/inventory/byo/hosts.origin.example).

Member:

We have one extra problem here that just occurred to me: this check only runs against one master. It will correctly check whether the Docker daemon there has a matching journald driver, but it won't check any of the other hosts. Covering them all may be tricky to accomplish.

Contributor Author:

Hmm, we may need a separate fluentd check that does not inherit from LoggingCheck.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 5b998dd to 7412780 Compare June 26, 2017 21:50
@juanvallejo:

aos-ci-test

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 2 times, most recently from 75f7a0f to c34475b Compare June 27, 2017 18:53
@juanvallejo:

@sosiouxme went ahead and moved the fluentd config check to its own file. PTAL

"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
fluentd_pods, error = self.get_pods_for_component(
Member:

This and other oc calls will only work on master, though. That's what makes this tricky. We could say to just look at the logging var to see what fluentd is supposed to have and not check what it actually has. Otherwise we're gonna have to do some really weird delegation to run oc against master while we're looking at everything else.

Contributor Author:

I was hesitant to rely on the logging var since its value is not set by default, but I guess for now it might be the easiest starting point for this. Will update

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from c34475b to 6b3bc00 Compare June 27, 2017 20:59
then the value of the configured Docker logging driver should be "journald".
Otherwise, the value of the Docker logging driver should be "json-file".
Returns an error string if the above condition is not met, or None otherwise."""
use_journald = get_var(task_vars, "openshift_logging_fluentd_use_journal", default=True)
Contributor Author:

@sosiouxme removed use of ocutil in favor of using the value of the Ansible variable. Defaulted the value to True, making the check assume journald as the default logging driver unless a user explicitly sets it to False. Not entirely sure about this, however.

Member:

This is correct

Contributor:

May or may not have any consequence to this PR, but in any case, FWIW I see this variable was deprecated 14 days ago:

Member:

It's not just deprecated, it's obviated... with 3.6 fluentd looks at docker config to use the same driver. That makes this whole check kind of pointless for 3.6. And I'm not sure what we plan to tell customers about running checks against earlier versions?

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 6b3bc00 to 0d803ad Compare June 27, 2017 21:21
@juanvallejo:

aos-ci-test

@openshift-bot:

error: aos-ci-jenkins/OS_3.5_NOT_containerized for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 0d803ad (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 0d803ad (logs)

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 0d803ad to 7c0b63f Compare June 29, 2017 18:11
@juanvallejo:

@sosiouxme switched back to using ocutil if check is running on a master. PTAL

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from a58bbba to 9bbc959 Compare July 6, 2017 19:50
@sosiouxme (Member):

Ran into this while testing:

     Task:     openshift_health_check
     Message:  One or more checks failed
     Details:  check "fluentd_config":
               Invalid value received from command exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL: Unexpected error using `oc` to validate the logging stack components.
               Error executing `oc exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL`:
               [rc 1] /usr/local/bin/oc --config /etc/origin/master/admin.kubeconfig -n logging exec logging-fluentd-nfwxp /bin/printenv USE_JOURNAL

Unfortunately there doesn't seem to be any more helpful debug output.

@sosiouxme (Member):

But speaking of the above, why try to exec into the pod? You should have the pod definition right there to look at what the env vars are.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 2 times, most recently from 148e381 to efd978b Compare July 7, 2017 22:57
@juanvallejo:

@sosiouxme

> But speaking of the above, why try to exec into the pod? You should have the pod definition right there to look at what the env vars are.

Thanks, updated check / tests to do this instead
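
For illustration, reading the env var from the pod definition might look like the following hedged sketch (the helper name is hypothetical; the nested structure is standard Kubernetes pod JSON, and fluentd_pods is the list the check already fetched):

    # Sketch: pull USE_JOURNAL out of the pod spec returned by
    # `oc get pods -o json`, with no need to exec into the pod.
    def pod_env_value(pod, var_name):
        """Return an env var's value from the pod's container specs, or None."""
        for container in pod["spec"]["containers"]:
            for env_var in container.get("env", []):
                if env_var.get("name") == var_name:
                    return env_var.get("value")
        return None

    use_journald = pod_env_value(fluentd_pods[0], "USE_JOURNAL")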

@sosiouxme (Member) left a comment:

Tiny nits; but this is going to take me a bit longer to review properly

@@ -0,0 +1,190 @@
"""
Module for performing checks on an Fluentd logging deployment configuration
Member:

s/an/a/


running_fluentd_pods = [pod for pod in fluentd_pods if pod['status']['phase'] == 'Running']
if not len(running_fluentd_pods):
msg = ('No Fluentd pods were found to be in the "Running" state.'
Member:

needs a space after the period

@rhcarvalho (Contributor) left a comment:

Hmm, I just found some old comments I forgot to submit, maybe something is still applicable?

def run(self, tmp, task_vars):
"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default="logging")
Contributor:

I'm noticing this is perhaps unnecessarily different from the approach in https://github.com/openshift/openshift-ansible/pull/4682/files#diff-a2a36e6105b8e46e1282819c771393beR22

Better to choose a single way of doing it.

Contributor Author:

Thanks, will stick to a default class field

config_error = self.check_logging_config(task_vars)
if config_error:
msg = ("The following Fluentd logging configuration problem was found:"
"\n-------\n"
Contributor:

Why a separator? We don't seem to include it in any other error message.

Contributor Author:

Right now, we seem to use it for similarly-worded error messages in the logging stack checks.

Member:

I think that's a legacy from when the logging check was monolithic and you could get a bunch of different msgs for the different components under one check. Dividers probably don't make sense with them all split out in separate checks.

Contributor Author:

> I think that's a legacy from when the logging check was monolithic and you could get a bunch of different msgs for the different components under one check. Dividers probably don't make sense with them all split out in separate checks.

ack, will go ahead and remove them from existing checks as well

"{}".format(config_error))
return {"failed": True, "changed": False, "msg": msg}

return {"failed": False, "changed": False, "msg": 'No problems found with Fluentd logging configuration.'}
Contributor:

How is the "msg" used when "failed": False?

Member:

It's not, except for showing up in -v output.

Contributor:

I suspect in '-v' output it will appear in the middle of a big JSON object? I'd rather have less noise; failed: False conveys the same info.

I also have mixed feelings about the "changed" key. I don't feel confident we use it consistently and correctly, and we don't really use it in any meaningful way. I remember I wanted to "be prepared" for supporting check mode, but thus far I feel we're only carrying extra code.

Member:

You're right, I don't see a need for a msg here. No one will look at it. There was some place where it just made the logic cleaner to set one but there's no need most of the time.

"changed" response doesn't really have anything to do with check-mode. It's just a record that something was changed during this task, which gets summed up in the output at the end. Idempotency seems to be a big deal for Ansible users so reporting whether you actually changed something (e.g. installed a dependency) or not may matter to some. It does not seem critical to me but I'm not an admin. If we want a better way to track this we could set an instance variable and have the action plugin look at them to determine whether anything changed (even if an exception was thrown) and just keep all the "changed" response logic out of the actual checks.

Contributor Author:

Will remove "changed" and "msg" keys

'from your Docker containers.\n').format(driver=logging_driver)

if error:
error += ('To resolve this issue, add the following variable to your Ansible inventory file:\n\n'
Contributor:

Suggestion: add a newline \n at the beginning of this string here, and remove it from the end of both error messages above.
From a maintenance point of view, this makes it less likely that we introduce a new error string and have it concatenated with no space or newline in between.
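
Concretely, the suggested pattern looks like this (message text abbreviated):

    # The appended hint carries its own leading newline, so the error
    # messages built above it need no trailing one.
    error = "Example error message."
    if error:
        error += ('\nTo resolve this issue, add the following variable to your '
                  'Ansible inventory file: ...')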

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 3 times, most recently from af33581 to f422f41 Compare July 14, 2017 21:59
return check


def canned_fluentd_pod(containers=[]):
Contributor:

You have a flake8 error here. Not a problem in this case, but sometimes mutable containers in arguments cause hairy problems...
It seems we never call the function without an argument, so why not simply make it required? (s/=[]//)
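
The hazard alluded to here is the classic Python mutable-default pitfall: the default list is created once, at function definition time, and shared across calls.

    def append_to(item, bucket=[]):
        bucket.append(item)
        return bucket

    print(append_to(1))  # [1]
    print(append_to(2))  # [1, 2] -- same list object as the first call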


def run(self, tmp, task_vars):
"""Check that Fluentd has running pods, and that its logging config matches Docker's logging config."""

Contributor:

extraneous empty line

@juanvallejo:

@rhcarvalho thanks for the review, comments addressed

@@ -68,7 +65,7 @@ def assert_error(error, expect_error):
),
])
def test_check_kibana(pods, expect_error):
check = canned_kibana()
check = kibana()
Contributor:

Similar to another review today, maybe this is just Kibana() and no need for a fixture.

components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
return tuple(components)
Contributor:

If you want to return a tuple, no need to create a list first...

components = tuple(int(x) for x in components[:2])
return components

if components[0] in openshift_major_release_version:
components[0] = openshift_major_release_version[components[0]]

components = [int(x) for x in components[:2]]
Contributor:

Can't openshift_image_tag carry a non-numeric version component, like rcN or beta or alpha, etc?

@juanvallejo (Jul 27, 2017):

not sure
cc @sosiouxme

EDIT: we currently use the value of openshift_image_tag in the package_version check

Member:

Yeah it can pretty much be anything. It's an opaque string.

@juanvallejo (Jul 27, 2017):

How should we handle non-numeric version values: try/except? Or maybe fail the check altogether?

@sosiouxme (Jul 27, 2017):

I spoke too soon. Actually openshift_version asserts that it looks like a version string. The requirements for origin vs. enterprise are a little different, but we should be confident that it has an optional v and two numeric components at the beginning. There can be any kind of garbage after the base version in origin, though.

(Sorry; I spent too long looking at this at some point in the past, figured it all out, then forgot it. That said, should possibly revisit to see if there's something better to base it on.)

Contributor Author:

Sounds good, I am only focusing on the first two numeric components of the string as it is. Not sure if there is a better var to base this on, but will take a look at sample inventory files

Contributor:

For origin the expression includes arbitrary letters, but the first two components should be numeric.

openshift_image_tag|match('(^v?\\d+\\.\\d+\\.\\d+(-[\\w\\-\\.]*)?$)')

We should be all set, thanks.

@@ -105,6 +105,28 @@ def get_var(self, *keys, **kwargs):
raise OpenShiftCheckException("'{}' is undefined".format(".".join(map(str, keys))))
return value

@staticmethod
def get_openshift_version(openshift_image_tag):
Contributor:

Maybe this should have a more specific name, since it strips out version information? It is returning MAJOR.MINOR and throwing away the rest.
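
A sketch of what a more specifically named helper might look like (name and regex are illustrative; the real checks raise OpenShiftCheckException, as noted later in this thread, rather than ValueError):

    import re

    def get_major_minor_version(openshift_image_tag):
        """Parse tags like "v3.6.0-alpha.1" or "3.6" into (3, 6), discarding
        anything after the first two numeric components."""
        match = re.match(r"v?(\d+)\.(\d+)", openshift_image_tag)
        if match is None:
            raise ValueError("cannot parse a version from openshift_image_tag "
                             "{!r}".format(openshift_image_tag))
        return int(match.group(1)), int(match.group(2))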

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch 3 times, most recently from 9e03112 to c0591d3 Compare July 27, 2017 18:16
@sosiouxme (Member):

Some refactoring on the way that parallels some of this: #4913

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from c0591d3 to 153fbaa Compare July 27, 2017 21:04
@juanvallejo:

aos-ci-test

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 153fbaa (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 153fbaa (logs)

logging_deployed = self.get_var("openshift_hosted_logging_deploy", default=False)

try:
version = self.get_major_minor_version(self.get_var("openshift_image_tag"))
Contributor:

Note that this can raise OpenShiftCheckException. We should protect the action plugin call to is_active with a try... except block for cases like this, to prevent the action plugin run from blowing up. I can include that in my current work branch.

Member:

Yeah, good idea. While you're at it, it's probably best to add an except Exception to catch any dumb bug we leave in one check, so the others still run.
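
A sketch of the guarded loop in the action plugin (structure and names are illustrative; OpenShiftCheckException is the checks' own exception class mentioned above):

    for check in known_checks:
        try:
            if not check.is_active(task_vars):
                continue
            result = check.run(tmp, task_vars)
        except OpenShiftCheckException as exc:
            result = {"failed": True, "msg": str(exc)}
        except Exception as exc:  # a bug in one check must not stop the others
            result = {"failed": True, "msg": "unexpected error: {}".format(exc)}
        check_results[check.name] = result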

@juanvallejo:

@rhcarvalho tagging this for [merge] per scrum discussion

@openshift-bot:

[test]ing while waiting on the merge queue

@juanvallejo:

re[test]

@juanvallejo juanvallejo force-pushed the jvallejo/verify-fluentd-logging-config branch from 153fbaa to 04154a2 Compare August 1, 2017 21:25
@openshift-bot:

Evaluated for openshift ansible merge up to 04154a2

@openshift-bot:

Evaluated for openshift ansible test up to 04154a2

@sosiouxme (Member):

[ERROR] Status 'fedora/25/atomic' is in the 'failure' state. This status must succeed for a merge.

Wait - what? That's new.

@openshift-bot:

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/391/) (Base Commit: 01e4893) (PR Branch Commit: 04154a2)

@sdodson (Member) commented Aug 2, 2017:

aos-ci-test

@sdodson (Member) commented Aug 2, 2017:

bot, retest this

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 04154a2 (logs)

@openshift-bot:

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 04154a2 (logs)

@openshift-bot commented Aug 2, 2017:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/780/) (Base Commit: beb5dea) (PR Branch Commit: 04154a2)

@openshift-bot openshift-bot merged commit b22ebdc into openshift:master Aug 2, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-fluentd-logging-config branch August 2, 2017 13:49