
verify sane log times in logging stack #4682

Merged

Conversation

juanvallejo
Contributor

Related trello card: https://trello.com/c/7uOluOTW
Based on @richm 's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.

TODO

  • add tests
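
For orientation, a rough outline of the flow this check implements, pieced together from the method names discussed in the review below (curl_kibana_with_uuid, wait_until_cmd_or_err); the exact signatures and return values shown here are assumptions, not the PR's code:

def run_logging_index_time_check(check, kibana_pod, es_pod, timeout_secs, task_vars):
    # log a unique marker through Kibana, then poll Elasticsearch until it shows up
    uuid = check.curl_kibana_with_uuid(kibana_pod, task_vars)  # assumed to return the marker
    check.wait_until_cmd_or_err(es_pod, uuid, timeout_secs, task_vars)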

@juanvallejo
Contributor Author

@rhcarvalho thanks for the feedback. Review comments from #4519 addressed

@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
@staticmethod
def is_first_master(task_vars):
    """Run only on first master and only when logging is configured. Returns: bool"""
-    logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
+    logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
Contributor

👍

except ValueError:
    return {
        "failed": True,
        "msg": ('Invalid value provided in your Ansible inventory file for '
Contributor

We can't know where the value is coming from with certainty, s/in your Ansible inventory file//

"msg": ('Invalid value provided in your Ansible inventory file for '
'"openshift_check_logging_index_timeout_seconds". Value must '
'be an integer representing an amount in seconds.'),
}
Contributor

@juanvallejo @sosiouxme what if instead of failing the check we would use the default value in this case?

Pros:

  • Less annoyance from a failed check.

Cons:

  • May be hard to realize the config is invalid / not being used.

Contributor Author

I think it might be worth the price of the added annoyance to have the check fail. That way, a user is at least made aware that the value supplied for openshift_check_logging_index_timeout_seconds was invalid, and why.

Member

I agree: if they went to the trouble of setting this, they should be alerted when it isn't valid. At least until we get some kind of "warning" mechanism.

Contributor

Well, one thing we need to work on is how users discover they can set a variable openshift_check_logging_index_timeout_seconds, and then that this value must be an integer (and not a float). This is a general problem for all checks, their usage and their customization -- unrelated to this PR.

Member

@sosiouxme sosiouxme Jul 12, 2017

One of the things I intend to do is introduce a kwarg on get_var so you can specify what type a var should have. Then it will either come back as that type, or an exception will be thrown that hopefully informs the user what is expected (in a unified fashion, not per check).

That doesn't help them discover the value in the first place... it would also be good if each check had a spec for which values are configurable and the expected semantics, which we could show in either a check "help" playbook or when the checks fail (as an addendum to the "how to disable" advice). In fact maybe get_var could look at that too :)

Of course this kinda goes against the r_role_variable concept.
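
A minimal sketch of what such a type-aware lookup could look like, assuming nothing beyond plain Python; the helper names and error wording here are illustrative, not part of this PR or the existing openshift_checks API:

def get_var(task_vars, *keys, **kwargs):
    """Toy stand-in for the real get_var helper used by the checks."""
    value = task_vars
    for key in keys:
        try:
            value = value[key]
        except (KeyError, TypeError):
            return kwargs.get("default")
    return value

def get_var_as(task_vars, *keys, **kwargs):
    """Fetch a variable and coerce it to the `wanted` type, or raise a uniform error."""
    wanted = kwargs.pop("wanted", str)
    value = get_var(task_vars, *keys, **kwargs)
    try:
        return wanted(value)
    except (TypeError, ValueError):
        raise ValueError("Variable {} must be a valid {}; got: {!r}".format(
            ".".join(keys), wanted.__name__, value))

# e.g.:
# timeout = get_var_as(task_vars, "openshift_check_logging_index_timeout_seconds",
#                      default=30, wanted=int)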

if not len(running_es_pods):
    msg = ('No Elasticsearch pods were found to be in the "Running" state.'
           'At least one Elasticsearch pod is required in order to perform this check.')
    return {"failed": True, "changed": False, "msg": msg}
Contributor

We can avoid duplication and ensure consistency:

self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)

for component, name in (['kibana', 'Kibana'], ['es', 'Elasticsearch']):
    pods, error = self.get_pods_for_component(
        self.execute_module, self.logging_namespace, component, task_vars,
    )

    if error:
        msg = 'Unable to retrieve pods for the {} logging component: {}'
        return {"failed": True, "changed": False, "msg": msg.format(name, error)}

    running_pods = self.running_pods(pods)

    if not running_pods:
        msg = ('No {} pods in the "Running" state were found.'
               'At least one pod is required in order to perform this check.')
        return {"failed": True, "changed": False, "msg": msg.format(name)}

Contributor Author

Thanks, will update

return {"failed": True, "changed": False, "msg": msg.format(error)}

running_es_pods = self.running_pods(pods)
if not len(running_es_pods):
Contributor

s/len(...)/...

def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
    """Retry an Elasticsearch query every second until query success, or a defined
    length of time has passed."""
    now = time.time()
Contributor

Original comment: #4519 (comment)

Let me try to explain again.

"Retry ... until a defined length of time has passed" sounds plausible, but is hard to implement without access to a monotonic time source. time.time() is not monotonic.

Instead of adding hacks to try to compensate for problems with time.time(), what I am proposing is an algorithm that doesn't depend on it: "retry N times with S seconds interval" (time.sleep uses a monotonic clock internally).

In code:

interval = 1

for i in range(max_tries):
    if self.query_es_from_es(es_pod, uuid, task_vars):
        return
    if i < max_tries - 1:
        time.sleep(interval)  # note: we actually don't need to sleep after the last try...

msg = ...
raise ...

Member

@rhcarvalho What I don't like about that is that it doesn't account for the time spent on the query, which involves execing into the pod (could take an unbounded amount of time) and running curl (could take up to the limit we set). Frequently the sleep(1) would be the briefest part of the loop. So you could easily have 30 times through the loop varying from ~30s to ~1000s, especially under adverse conditions (exactly the kind you are looking for in the first place). And that's the tradeoff for avoiding a large-clock-reset edge case that should be very rare. I'd rather detect the clock reset and restart the timer in that rare event than warp the accuracy of the check continuously.
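
A rough sketch of the restart-the-timer approach described above (not code from this PR; the helper name and structure are illustrative):

import time

def wait_with_clock_reset_guard(query_once, timeout_secs, interval=1):
    """Poll query_once() until it succeeds or timeout_secs elapse; restart the
    window if the wall clock jumps back before the recorded start time."""
    time_start = time.time()
    time_end = time_start + timeout_secs
    while True:
        if query_once():
            return True
        now = time.time()
        if now < time_start:  # large backwards clock reset: restart the window
            time_start = now
            time_end = now + timeout_secs
        if now >= time_end:
            return False
        time.sleep(interval)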


def oc_cmd(self, cmd_str, extra_args, task_vars):
    """Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
    return super(LoggingIndexTime, self).exec_oc(
Contributor

Why super(LoggingIndexTime, self).exec_oc? super is used when you want to access the parent's implementation, skipping the current class's, which is useful when overriding a method. Here it should probably be self.exec_oc.


# get all Kibana pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Contributor

s/super(LoggingIndexTime, self)/self ?

Contributor Author

@juanvallejo juanvallejo Jul 7, 2017

Although, looking at this a bit more, wouldn't it be better to re-use get_pods_for_component from the parent class (LoggingCheck)?

EDIT: just saw #4682 (comment), thanks

Contributor

When an instance inst of a class C doesn't implement / override a given method m, calling inst.m() is the same as super(C, self).m(). However, as soon as you override the method, the two calls produce a different result.

Semantically, we mean inst.m(), and the Python runtime will look up which implementation of m to run. The super(C, self).m() form is useful when you need to bypass that lookup, normally in a case where you would otherwise introduce infinite recursion while looking up a method implementation. E.g., this doesn't work:

>>> class Foo:
...     def m(self):
...         return self.m()
... 
>>> Foo().m()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in m
  File "<stdin>", line 3, in m
  ...
RecursionError: maximum recursion depth exceeded
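
For contrast, a small hypothetical example of the case where super is appropriate, namely when a subclass overrides a method and still wants to reach the parent's implementation:

class Parent(object):
    def exec_oc(self, cmd):
        return "oc " + cmd

class Child(Parent):
    def exec_oc(self, cmd):
        # self.exec_oc(cmd) here would recurse forever; super reaches the parent's version
        return super(Child, self).exec_oc(cmd) + " --loglevel=2"

print(Child().exec_oc("get pods"))  # prints: oc get pods --loglevel=2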

@juanvallejo
Contributor Author

@rhcarvalho thanks for the feedback, review comments addressed


def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
    """Retry an Elasticsearch query every second until query success, or a defined
    length of time has passed."""
Contributor

Hmm, now this is out of sync with the implementation?

Re-reading the previous conversation in #4682 (comment) and reconsidering, I get that the intention of this check is to look at the elapsed time between writing something to Kibana and being able to read it from Elasticsearch.

Sorry for the back-and-forth, but I'll go with what @sosiouxme said and what you had originally, that we want the whole check to run within a certain timeout, and we try to query ES as many times as possible within that time, always spaced by some interval.

"""Wrap uuid generator. Allows for testing with expected values."""
return str(uuid4())

def oc_cmd(self, cmd_str, extra_args, task_vars):
Contributor

Perhaps this belongs in the parent class? It seems to have nothing specific to the logging index time check.

@juanvallejo
Contributor Author

@rhcarvalho @sosiouxme thanks for the feedback, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 201e8b9 to 15cc5cd Compare July 14, 2017 21:54
@juanvallejo
Contributor Author

@rhcarvalho @sosiouxme Will go ahead and squash if no further comments on this. Thanks

    if self.query_es_from_es(es_pod, uuid, task_vars):
        return
    time.sleep(interval)
    now = int(time.time())
Contributor

Why int? I think we don't need to truncate the time.

"""Retry an Elasticsearch query every second until query success, or a defined
length of time has passed."""
now = time.time()
time_start = now
Contributor

Seems that this is not used for anything, dead code.

Member

It's used later for detecting if the clock reset far back (before the start time).

If the clock only resets a little, nothing special happens. Same if it resets ahead... the loop will not have as long to run.

while now < time_end:
    if self.query_es_from_es(es_pod, uuid, task_vars):
        return
    time.sleep(interval)
Contributor

We can save 1s in run time by not sleeping in the last loop iteration.

Suggestion:

def wait_until_doc_in_es(self, es_pod, uuid, timeout_secs, task_vars):
    deadline = time.time() + timeout_secs
    interval = 1
    while not self.query_es_from_es(es_pod, uuid, task_vars):
        if time.time() + interval > deadline:
            msg = "Message with UUID={} not found in Elasticsearch after {:.1f}s."
            raise OpenShiftCheckException(msg.format(uuid, timeout_secs))
        time.sleep(interval)

The above ignores clock resets, but guarantees that we query Elasticsearch at least once.
It also exits immediately once the deadline has expired or the document was found, without the extra sleep in the last loop iteration.

Member

It would only sleep on the last loop iteration if the query were not passing.

Err... so do you want to ignore clock resets now? It certainly simplifies the code... or was this just to insure against time resets forward?

Contributor

Err... so do you want to ignore clock resets now? It certainly simplifies the code...

It seems better to have it simple and to guarantee it runs at least once. Trying to implement monotonic time on top of time.time() is not going to work, plus it adds bloat to confuse us later.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch 2 times, most recently from 27f9074 to 9cda0ad Compare July 19, 2017 02:00
@juanvallejo
Contributor Author

aos-ci-test

@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for 9cda0ad (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 9cda0ad (logs)

Contributor

@rhcarvalho rhcarvalho left a comment

This is almost ready.

@@ -20,7 +21,7 @@ def is_active(cls, task_vars):
@staticmethod
def is_first_master(task_vars):
Contributor

This is a bad name; it makes the is_active expression look like we only care about being the first master, when in fact it is a bit more than that.

Contributor Author

Thanks, updated the method to only check whether this is the first master. Moved the logging_deployed check to is_active.

check.curl_kibana_with_uuid(plain_running_kibana_pod, None)

for word in extra_words:
        assert word in str(error)
Contributor

Too much indentation

    with pytest.raises(expect_exception) as error:
        check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
else:
    check.wait_until_cmd_or_err(plain_running_elasticsearch_pod, uuid, timeout, None)
Contributor

Please split into two tests and avoid conditional assertions.
The parameters above mix positive and negative cases, making it harder to understand what is being tested.
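
A minimal, self-contained illustration of the pattern being requested, using a toy function rather than the check's real test fixtures (all names here are illustrative):

import pytest

def parse_timeout(value):
    """Toy stand-in for the behavior being exercised by the real tests."""
    return int(value)

def test_parse_timeout_valid():
    # positive case: assert directly on the result, no pytest.raises needed
    assert parse_timeout("30") == 30

def test_parse_timeout_invalid():
    # negative case: assert only that the expected exception is raised
    with pytest.raises(ValueError):
        parse_timeout("not-a-number")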

@rhcarvalho
Contributor

Also please squash the commits before we run aos-ci-test.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 9cda0ad to 2969965 Compare July 20, 2017 15:09
@juanvallejo
Contributor Author

@rhcarvalho thanks, review comments addressed

Contributor

@rhcarvalho rhcarvalho left a comment

One nit and we should hopefully merge this today.

# Note: It would be nice to use membership in oo_first_master group, however for now it
# seems best to avoid requiring that setup and just check this is the first master.
hostname = get_var(task_vars, "ansible_ssh_host") or [None]
masters = get_var(task_vars, "groups", "masters", default=None) or [None]
-return logging_deployed and masters[0] == hostname
+return masters[0] == hostname
Contributor

We may assume there is always at least one master. For robustness and to avoid an IndexError, we could do

return masters and masters[0] == hostname

@@ -94,3 +95,13 @@ def exec_oc(execute_module=None, namespace="logging", cmd_str="", extra_args=Non
        raise OpenShiftCheckException(msg)

    return result.get("result", "")

def oc_cmd(self, cmd_str, extra_args, task_vars):
    """Wrap parent exec_oc method. Allows for testing without actually invoking other modules."""
Contributor

This docstring is out-of-sync. exec_oc is not a parent method, it is defined right here in LoggingCheck, and not in the parent class, OpenShiftCheck.

Tip: document what it does, why someone would want to call this method, and what to expect from the return value. No need to make references to the class hierarchy nor to "testing without actually invoking other modules" (typically that is not interesting; you can always monkey-patch things in Python... one could monkey-patch exec_oc itself and this method wouldn't even be needed). It seems that its existence is solely to let you pass two fewer arguments.

Suggestion:

    def oc_cmd(self, cmd_str, extra_args, task_vars):
        """Same as exec_oc with a predefined execute_module and namespace."""

I almost wrote that this simplifies calls to exec_oc, but then I noted that exec_oc has all optional arguments while oc_cmd has 3 required args, so maybe it is not all that much more convenient to call?! All calls I saw pass an empty [] to extra_args.

Contributor Author

Thanks, will just remove oc_cmd

@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from 186f3d4 to c1aafde Compare July 20, 2017 18:47
This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-logging-index-time branch from c1aafde to ede0374 Compare July 20, 2017 19:17
"--cacert /etc/elasticsearch/secret/admin-ca "
"--cert /etc/elasticsearch/secret/admin-cert "
"--key /etc/elasticsearch/secret/admin-key "
"https://logging-es:9200/project.{namespace}*/_count?q=message:{uuid}"
Member

Technically it should probably retrieve the project's UUID and be more specific about the ES index it's querying, to reduce query time and load on ES. However, I think it's a small enough concern that we can save this for when it becomes a problem or we're revisiting the code for some reason.
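
A hedged sketch of what "being more specific about the index" might look like; the project.<name>.<uuid>.* index naming is an assumption about the 3.x logging stack, not something verified in this PR:

def build_es_count_url(namespace, project_uuid, message_uuid):
    """Query only the indices belonging to one project instead of project.<namespace>*."""
    return (
        "https://logging-es:9200/project.{namespace}.{project_uuid}.*/_count"
        "?q=message:{message_uuid}"
    ).format(namespace=namespace, project_uuid=project_uuid, message_uuid=message_uuid)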

@sosiouxme
Member

@rhcarvalho do you approve the changes?

Contributor

@rhcarvalho rhcarvalho left a comment

LGTM

@rhcarvalho
Contributor

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for ede0374 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for ede0374 (logs)

@rhcarvalho
Contributor

[merge]

@rhcarvalho
Contributor

Flake openshift/origin#15356

And I don't understand https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible_extended_conformance_install_update/649/, says Stage VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] failed. -- Looks like it comes from https://github.com/openshift/aos-cd-jobs/blob/master/sjb/config/test_cases/test_branch_origin_extended_conformance_install_update.yml#L136-L143, and according to the console output:

+ cd /data/src/github.com/openshift/openshift-ansible
++ git describe --tags --abbrev=0 --exact-match HEAD
+ last_tag=openshift-ansible-3.6.154-1
++ git log -n 1 --pretty=%h
+ last_commit=4513926
+ rpm -V openshift-ansible-3.6.154-1.git.0.4513926.el7
package openshift-ansible-3.6.154-1.git.0.4513926.el7 is not installed
++ export status=FAILURE
++ status=FAILURE
+ set +o xtrace
########## FINISHED STAGE: FAILURE: VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] ##########

[merge]

@openshift-bot

Evaluated for openshift ansible merge up to ede0374

@openshift-bot

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/728/) (Base Commit: 2a706ad) (PR Branch Commit: ede0374)

@rhcarvalho
Contributor

Again flake openshift/origin#15356 & Stage VERIFY THE INSTALLED OPENSHIFT-ANSIBLE RELEASE [00h 00m 01s] failed.

Merging manually as per https://github.com/openshift/openshift-ansible/blob/master/docs/pull_requests.md#manual-merges.

@rhcarvalho rhcarvalho merged commit c5e782d into openshift:master Jul 24, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-logging-index-time branch July 24, 2017 14:11
@rhcarvalho
Contributor

For the record, the other failure seems to be a CI script error: openshift/origin#15186

@sdodson
Member

sdodson commented Jul 25, 2017

Thanks for tracking that down, sorry I didn't get to respond to you on IRC.
