
verify sane log times in logging stack #4519

Conversation

@juanvallejo (Contributor) commented Jun 20, 2017:

Related trello card: https://trello.com/c/7uOluOTW
Based on @richm's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.

TODO

  • add tests

cc @brenton @sosiouxme @rhcarvalho

@rhcarvalho (Contributor) left a comment:

Sorry, too many comments

pod for pod in pods
if any(
condition['type'] == 'Ready' and condition['status'] == 'True'
for condition in pod['status']['conditions']
Contributor:

I didn't recall the conditions field... is Ready==True equivalent to Phase==Running?

vendor/k8s.io/kubernetes/pkg/api/v1/types.go#L2453-L2461

Contributor Author:

I believe that it can be in the Running state but its endpoint may not be reachable unless its condition.status is true: kubernetes/kubernetes#30785 (comment)

We do the same sort of check for not_running_pods

but thinking about this a bit more, I think for this method, it should be enough to just check if the pod's Phase == "Running"

uid = str(uuid4())
pod_name = kibanna_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -c kibana -- curl --connect-timeout 1 -s "
"http://localhost:5601/{uid}")
Contributor:

Poor indentation style; maybe we could fit all in one longer line?

Contributor:

I suspect the port is not user-configurable in the logging stack installation?

Contributor:

We may need to fine-tune more flags for curl, e.g. to cover failures (-f).

Contributor:

I'm surprised that apparently we create an entry with a GET request?!

Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?

@juanvallejo (Contributor Author) commented Jun 21, 2017:

I'm surprised that apparently we create an entry with a GET request?!

Yes, this is how the script I am basing this off works: https://bugzilla.redhat.com/attachment.cgi?id=1258225

Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?

I believe this will happen: the 404 request is logged in Kibana and later aggregated into the Elasticsearch logs. @sosiouxme, wondering your thoughts on this as well

Contributor Author:

@rhcarvalho

I suspect the port is not user-configurable in the logging stack installation?

I'm not sure, but I did not see any config for them in the sample inventories.

Contributor:

I believe this will happen: the 404 request is logged in Kibana and later aggregated into the Elasticsearch logs.

Ingenious trick!

Contributor:

We use the same trick in logging CI https://github.com/openshift/origin-aggregated-logging/blob/master/deployer/scripts/util.sh#L459

# $1 - unique value to search for in es
function add_test_message() {
    local kib_pod=`get_running_pod kibana`
    oc exec $kib_pod -c kibana -- curl --connect-timeout 1 -s \
       http://localhost:5601/$1 > /dev/null 2>&1
}

Member:

Right, it just causes Kibana itself to create a log entry. So it's only logs for Kibana that get polluted (only admins will see).

result = self._exec_oc(exec_cmd, [], task_vars)
return json.loads(result)["count"]

def _exec_oc(self, cmd_str, extra_args, task_vars):
Contributor:

Why the underscore?

return {"failed": True, "changed": False, "msg": ('No Kibanna pods were found to be'
'in the "Running" state.'
'At least one Kibana pod is required'
'in order to perform this check.')}
Contributor:

Poor indentation style.

return {"failed": True, "changed": False, "msg": ('No Elasticsearch pods were found to be'
'in the "Running" state.'
'At least one Kibana pod is required'
'in order to perform this check.')}
Contributor:

PIS ;)

from openshift_checks.logging.logging import LoggingCheck


ES_CMD_TIMEOUT_SECONDS = 30
Contributor:

This is the source of uncertainty and flakiness of this check :-/

Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.

This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.

Contributor Author:

This is the source of uncertainty and flakiness of this check :-/

Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.

While I agree, how would we report this information without failing the check? If I understand correctly, messages from a check do not always show up at the end of a successful check.

This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.

I think at least for now we could try this, as well as set the default time to a much higher value.
I don't think this will become a problem of flakes, as this check is not meant to be run as pre-install or pre-upgrade; I was thinking it'd be more along the lines of the openshift image data check

Contributor:

Sorry, I was just thinking aloud how the stuff we do need not necessarily fit in the "check" format.

Let's start with this and iterate as we get feedback.

class Podlogs(LoggingCheck):
"""Check that logs from ElasticSearch pods can be queried within a reasonable amount of time."""

name = "podlogs"
Contributor:

Maybe pod_logs (separate words with underscore as we do for other check names)? Or even something more descriptive/verbose.

Then make the class name a CamelCase version of the name.

raise OpenShiftCheckException(msg.format(uid, timeout_secs))

def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
"""curl Kibanna with a unique uuid to search for in Elasticsearch."""
Contributor:

Don't mix the matters in the docstring. This is probably unrelated to Elasticsearch, even though in a greater context the latter will be involved.


def query_es_from_es(self, es_pod, es_host, es_index, uid, task_vars):
"""curl the Elasticsearch pod and look for the unique uuid that was set
as part of the error log from the curl request made in the Kibanna pod."""
Contributor:

The knowledge of how ES and Kib. mix together should be in the caller that coordinates the use of these methods, not in the methods themselves. This method is only about ES. A caller does whatever it wants with it.

as part of the error log from the curl request made in the Kibanna pod."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl "
"--connect-timeout 1 -s -k "
Contributor:

Why do we use -k and pass --cert and --key? Are those somehow contradictory?

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 4 times, most recently from 0959b35 to 6cb7b3c on June 21, 2017 16:23
@juanvallejo (Contributor Author):

@rhcarvalho thanks for the feedback, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 3 times, most recently from 801df81 to 5f606a7 on June 21, 2017 19:41
if error:
return {"failed": True, "changed": False, "msg": error}

kibanna_pods = super(PodLogQueryTime, self).running_pods(pods)
Contributor:

Why not self.running_pods?!

Contributor Author:

Sure, will update

uuid from earlier, there should only be a single match.
Raise an OpenShiftCheckException if the timeout is reached before a result is found."""

interval = 1 # amount of time to wait between retries
Contributor:

s/amount of time/seconds ?

if total == 1:
return

msg = "expecting match in Elasticsearch for message with uuid {}, but no matches were found after {}s."
Contributor:

Eventually we may have slept for more than orig_timeout, but that's probably irrelevant, agree?

Member:

Queries should be pretty quick but in times of trouble they may not be... probably better if the loop condition was how much time had passed, not how many times through the loop. Would be even better if there were an actual timer set so that if something got stuck it wouldn't just sit there, but that may be too complicated for now.

Contributor:

how much time had passed

The detail of doing that right is that we need a monotonic clock/timer.

The total time per loop iteration should be approximately the time slept (at least interval) plus the time spent running curl.
We can limit curl a bit more by setting --max-time.

I'm okay if this function can run for (a bit) longer than orig_timeout, especially if we lower that value as @sosiouxme proposes.

Contributor Author:

@rhcarvalho @sosiouxme sounds good, updated curl commands to have a connect-timeout of 30s and a max-time of 60s. Also updated the total time per loop iteration to take total curl time into account as well: 29149c2
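For illustration, a minimal sketch of a time-bounded retry loop along these lines; this is not the PR's code, it assumes Python 3's time.monotonic() (a Python 2 caller would fall back to time.time()), and query_once stands in for a call such as query_es_from_es:

import time

INTERVAL_SECONDS = 1  # seconds to wait between retries


def wait_for_match_within(query_once, timeout_secs=30):
    """Retry query_once() until it reports exactly one match or timeout_secs elapse."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if query_once() == 1:
            return True
        time.sleep(INTERVAL_SECONDS)
    return False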

def query_es_from_es(self, es_pod, uid, task_vars):
"""curl the Elasticsearch pod and look for a unique uuid in its logs."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl --connect-timeout 1 -s -k -f "
Contributor:

My question remains: why do we use -k and --cert and --key together?

Contributor:

Because es requires client cert auth for admin access. So you can either use -k or --cacert /etc/elasticsearch/secret/admin-ca, your choice.

Contributor:

Thanks @richm. I see no reason not to use --cacert /etc/elasticsearch/secret/admin-ca then.

As for the --connect-timeout, 1 second seems too aggressive.
If we're willing to wait up to 10 minutes retrying until this succeeds, I'd suggest being more tolerant with slow connections.
--connect-timeout 60?
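A sketch of what the Elasticsearch exec command could look like with --cacert in place of -k and an overall --max-time bound. Only the admin-ca path comes from the comments above; the cert/key paths, host, and query URL are assumptions for illustration, not the PR's final code:

def build_es_query_cmd(pod_name, uuid):
    # --cacert replaces -k per the discussion above; the cert/key paths, host,
    # and _count query below are illustrative assumptions only.
    return (
        "exec {pod_name} -- curl --max-time 30 -s -f "
        "--cacert /etc/elasticsearch/secret/admin-ca "
        "--cert /etc/elasticsearch/secret/admin-cert "
        "--key /etc/elasticsearch/secret/admin-key "
        "https://logging-es:9200/_count?q=message:{uuid}"
    ).format(pod_name=pod_name, uuid=uuid)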

Contributor Author:

I agree, will increase connect timeout

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 5f606a7 to ab2e09e on June 21, 2017 22:02
@juanvallejo (Contributor Author):

Thanks for the feedback, will begin adding tests

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 5 times, most recently from 6892b30 to 03e694a on June 22, 2017 19:47
@juanvallejo (Contributor Author):

aos-ci-test

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 03e694a to 920fbb1 on June 22, 2017 20:07
@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 03e694a (logs)

@@ -0,0 +1,174 @@
"""
Module for ensuring logs from pods can be queried in a reasonable amount of time.
Member:

It's a check, not a module

Contributor:

Funny thing is that in one sense it is a "Python module". I agree with @sosiouxme that in this context we better call it a check.


def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
Member:

I don't think there's a point in making this configurable; it will never be different.

If we ever have this check look at the ops instance then this would be logging-es-ops but I think we'd structure the check a little differently to do both.

def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
self.pod_es_index = get_var(task_vars, "openshift_check_pod_elasticsearch_index", default=ES_DEFAULT_INDEX)
Member:

This also doesn't need to be configured; it needs to be based on the actual name of the logging project (logging_namespace below, defaults to logging).


logging_namespace = "logging"
pod_es_host = ES_DEFAULT_INDEX
pod_es_index = ES_DEFAULT_HOST
Member:

It doesn't make sense to define these as properties and then pass them around as method parameters. I don't think these should be properties at all.

Contributor Author:

Originally I passed them as arguments to curl_kibana_with_uuid and query_es_from_es methods, but was running into a pylint too-many-arguments error.

return {"failed": True, "changed": False, "msg": msg}

# get all Elasticsearch pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
Member:

hasn't changed since the last time you got it...


def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
the Elasticsearch logs. Since we are querying for a message with the the unique
Member:

the the


"""curl Kibana with a unique uuid."""
uid = str(self.generate_uuid())
pod_name = kibanna_pod["metadata"]["name"]
exec_cmd = "exec {pod_name} -c kibana -- curl --connect-timeout 60 -s http://localhost:5601/{uid}"
Member:

If it takes 60 seconds to connect something is really wrong... I'd say maybe 5? For that matter, probably want to add --max-time 15 to the curl as well so queries can't get stuck.

Contributor:

In a previous iteration it was --connect-timeout 1. The discussion went towards not making it too aggressive, because if the network is slow but otherwise functioning, we'd loop and try to call curl several times, but always giving up too quick before connecting.

Any number is a wild guess.

👍 to setting a --max-time. Or replacing --connect-timeout with only --max-time (doesn't really matter how long it takes to connect, but we want to set an upper bound to how long curl runs).

Member:

It's a wild guess to some degree, but this is all intra-cluster networking. It would be a bad sign if connecting to a service IP took more than a second even if your cluster were federated around the world (although it could always be a transient network issue, I think it's much more likely to be something more significant). The actual retrieval of the request hits storage and processing so I could see a little more leeway on that, agree that it's a good idea to just use --max-time to cover the whole operation.


def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
"""curl Kibana with a unique uuid."""
uid = str(self.generate_uuid())
Member:

just curious, would this ever not already be a string?

Contributor:

In tests it is a string (when the method gets monkey patched), but otherwise it is UUID object...
For consistency, I suggest changing generate_uuid to always return a string.
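The suggested change is small; a sketch, written here as a module-level helper rather than the check's actual method:

from uuid import uuid4


def generate_uuid():
    """Return the UUID as a string so callers and monkey-patched tests see one type."""
    return str(uuid4())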

def query_es_from_es(self, es_pod, uid, task_vars):
"""curl the Elasticsearch pod and look for a unique uuid in its logs."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl --connect-timeout 60 -s -k -f "
Member:

I don't think there is a need for -k here, should be able to just specify the CA and have a valid TLS connection... remind me to check into it.

Contributor:

This got lost in a previous discussion. The conclusion was to replace -k with --cacert /etc/.....

@sosiouxme (Member):

Hah, I mostly just totally contradicted or repeated earlier feedback, didn't I? Can discuss further...

except ValueError as error:
error = "Invalid response from pod {}".format(str(error))

return error
Contributor:

I agree with @sosiouxme re: exception handling here, and add that we better not mix styles. In Go we pass around errors as values and the common idiom is to have a return value of type error. Not that we couldn't do something similar in Python, but if we are raising exceptions, let's use a single mechanism for communicating errors.



from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException


SAMPLE_UID = "3ea53ca5-f60a-40e5-8828-983240aee8a1"
Contributor:

Note that this doesn't need to be a proper UUID -- when the test fails, seeing this string can be more puzzling than enlightening.

import pytest
import json


Contributor:

Too many empty lines, pylint doesn't run on test files yet...

Suggestion: separate imports of stdlib, 3rd party and local pkgs/modules, in that order:

import json

import pytest

from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException

assert error
assert expect_error in error
else:
assert not error
Contributor:

Dead code?



@pytest.mark.parametrize('json_response, uuid, timeout, expect_exception, extra_words', [
# test with invalid json response
Contributor:

If these comments are to be useful, they should be part of the output when a test fails...
We should not need to manually match complex data in the output to identify which test case failed.

See https://docs.pytest.org/en/latest/parametrize.html#_pytest.python.Metafunc.parametrize.

I couldn't find a clean example right now, but I can type:

@pytest.mark.parametrize('name,json_response,uuid,timeout,exc,extra_words', [
    (
        'invalid json response',
        ...
     ),
    (
        'empty response',
        ...
    ),
    ...
], ids=lambda argval: argval[0])
def test_...(name, json_response, ...):
    ...

We also have one example in openshift-ansible, look for roles/openshift_certificate_expiry/test/conftest.py.
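A self-contained variant of the same idea that runs under pytest as written; the test values below are made up purely to show how the ids surface in the output:

import json

import pytest


@pytest.mark.parametrize('json_response, expect_count', [
    ('{"count": 1}', 1),
    ('{"count": 0}', 0),
], ids=['single match', 'no matches'])
def test_count_parsing(json_response, expect_count):
    # On failure, pytest reports e.g. "test_count_parsing[no matches]",
    # so the case description is visible without digging through the data.
    assert json.loads(json_response)["count"] == expect_count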

Contributor Author:

Will change, originally based this on the way we have written every other test so far, however

Member:

I agree it's better to have the comments in the output than in the code, and I think I'm the one that started with the code comments so I'll take the blame.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 3 times, most recently from 0ad2bd4 to 8d40f83 on June 23, 2017 18:31
@juanvallejo (Contributor Author):

@sosiouxme thanks for the review, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from b431957 to 0a33c22 on June 27, 2017 21:18
@juanvallejo (Contributor Author):

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 0a33c22 (logs)


# get all Kibana pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Member:

Actually if there are no Kibana pods running, this comes back with:

There are no pods in the logging namespace. Is logging deployed?

Which is sort of a universal problem with using this method; if any component is completely down it says there are no pods in the whole namespace. Originally I think it got all the pods and then sorted them out by component and that error really meant there were no pods. Now it's misleading. Could the message be adjusted or handled differently?

Contributor Author:

Thanks, updated the error returned by get_pods_for_component to mention the logging component instead, and added further clarification when handling the error in the check.

return {"failed": True, "changed": False, "msg": msg}

# get all Elasticsearch pods
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Member:

same issue here

'At least one Elasticsearch pod is required in order to perform this check.')
return {"failed": True, "changed": False, "msg": msg}

uid = self.curl_kibana_with_uuid(running_kibana_pods[0], task_vars)
Member:

while I'm at it... is there some reason not to call this uuid?

uid is user ID, uuid is universally unique ID.


def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
the Elasticsearch logs. Since we are querying for a message with the unique
Member:

alright, super nit but... "unique" is redundant when describing a uuid.

Contributor Author:

:)

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 0a33c22 to 1871536 on June 28, 2017 19:05
@juanvallejo (Contributor Author):

@sosiouxme Thanks, review comments addressed

@sosiouxme (Member):

@juanvallejo Can't believe I'm just noticing this but:

https://github.com/juanvallejo/openshift-ansible/blob/18715362c46e21b044f5edbecefd9b7f32bae85b/roles/openshift_health_checker/openshift_checks/logging/logging.py#L23

I don't think the default is to deploy logging. So all these checks would be running even if logging is not deployed. Could you change that default to false with this PR? Then I think we're good to go.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 1871536 to d72dc1d on June 28, 2017 23:26
@juanvallejo (Contributor Author):

@sosiouxme ping

@sosiouxme (Member):

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for d72dc1d (logs)

@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
@staticmethod
def is_first_master(task_vars):
"""Run only on first master and only when logging is configured. Returns: bool"""
logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
Contributor:

@sosiouxme we need to make sure this change is in 3.6, right?

@juanvallejo let's split it into a separate commit with a relevant commit message saying why we're changing it ("the default was incorrectly set, causing the check to run when it shouldn't" -- or sth else)


def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
pod_log_timeout = int(get_var(task_vars, "openshift_check_pod_logs_timeout", default=ES_CMD_TIMEOUT_SECONDS))
Contributor:

Should handle a potential ValueError exception.
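A minimal sketch of such a guard, pulled out as a standalone helper; falling back to the default is one possible choice, the check could instead report a failure:

ES_CMD_TIMEOUT_SECONDS = 30


def parse_timeout(raw_value, default=ES_CMD_TIMEOUT_SECONDS):
    """Parse an inventory-supplied timeout, falling back to the default on bad input."""
    try:
        return int(raw_value)
    except (TypeError, ValueError):
        return default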

found in the Elasticsearch logs. Since we are querying for a message
with the uuid that was set earlier, there should only be a single match.
Raise an OpenShiftCheckException if timeout is reached finding a match."""
now = int(time.time())
Contributor:

Why truncate it into an int?

return {}

def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be
Contributor:

"Wait a maximum of timeout_secs for ...": this is not what this method does.
We're actually retrying with at least 1s intervals, and the "wait time" is defined right in query_es_from_es in the call to curl.

We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time().

Contributor Author:

We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time().

Not sure I follow your thinking here. The reason why I'm using time.time() is to ensure that the delay between log creation and log aggregation does not exceed an overall time constraint.
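For comparison, the "retry N times with S seconds interval" shape described above would look roughly like this (a hypothetical helper, not the PR's code); total runtime is then bounded by about retries × (interval + curl's --max-time) without ever calling time.time():

import time


def retry_for_match(query_once, retries=30, interval=1):
    """Retry query_once() up to retries times, sleeping interval seconds between attempts."""
    for _ in range(retries):
        if query_once() == 1:
            return True
        time.sleep(interval)
    return False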


@staticmethod
def running_pods(pods):
"""Returns: list of pods in a running state"""
Contributor:

"Filter out pods that are not running" ?
"Filter pods that are running"

Or at least s/Returns:/Return a

return [
pod for pod in pods
if pod['status']['phase'] == 'Running'
]
Contributor:

Could be in one line?
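The one-liner being suggested would be along these lines (a sketch, not the committed code):

def running_pods(pods):
    """Filter the given pods down to those whose status phase is Running."""
    return [pod for pod in pods if pod['status']['phase'] == 'Running']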

SAMPLE_UUID = "unique-test-uuid"


def canned_podlogquerytime(exec_oc=None):
Contributor:

The name here is reminiscent of a recent past... isn't it outdated?

if error:
assert word in str(error)
elif result:
assert word in result
Contributor:

I'd find it easier to understand if "positive" and "negative" tests were separate.

Here, it is unclear what assertions are actually made and, more importantly, what do they mean. Maybe it's just me :/

)

if error:
msg = 'Unable to retrieve pods for the "Kibanna" logging component: {}'
Contributor:

s/Kibanna/Kibana/

In an Openshift-Ansible inventory file, logging is not set to deploy by
default. This patch updates the LoggingCheck's assumption of
logging_deployed to False, to match the inventory file's default.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from d72dc1d to 7d7c942 on July 5, 2017 15:42
@juanvallejo juanvallejo closed this Jul 5, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-pod-logs-query-time branch July 5, 2017 17:03
@juanvallejo juanvallejo restored the jvallejo/verify-pod-logs-query-time branch July 5, 2017 17:04
@juanvallejo (Contributor Author):

Did not mean to delete branch - re-opening here: #4682
