
verify sane log times in logging stack #4519

Conversation

@juanvallejo (Contributor) commented Jun 20, 2017:

Related trello card: https://trello.com/c/7uOluOTW
Based on @richm's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.

TODO

  • add tests

cc @brenton @sosiouxme @rhcarvalho

@rhcarvalho (Contributor) left a comment:

Sorry, too many comments

pod for pod in pods
if any(
condition['type'] == 'Ready' and condition['status'] == 'True'
for condition in pod['status']['conditions']
Contributor:

I didn't recall the conditions field... is Ready==True equivalent to Phase==Running?

vendor/k8s.io/kubernetes/pkg/api/v1/types.go#L2453-L2461

Contributor Author:

I believe that it can be in the Running state but its endpoint may not be reachable unless its condition.status is true: kubernetes/kubernetes#30785 (comment)

We do the same sort of check for not_running_pods

but thinking about this a bit more, I think for this method, it should be enough to just check if the pod's Phase == "Running"

uid = str(uuid4())
pod_name = kibanna_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -c kibana -- curl --connect-timeout 1 -s "
"http://localhost:5601/{uid}")
Contributor:

Poor indentation style; maybe we could fit all in one longer line?

Contributor:

I suspect the port is not user-configurable in the logging stack installation?

Contributor:

We may need to fine-tune more flags for curl, e.g. to cover failures (-f).

Contributor:

I'm surprised that apparently we create an entry with a GET request?!

Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?

@juanvallejo (Contributor Author) commented Jun 21, 2017:

I'm surprised that apparently we create an entry with a GET request?!

Yes, this is how the script I am basing this off works: https://bugzilla.redhat.com/attachment.cgi?id=1258225

Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?

I believe this will happen: the 404 request is logged in Kibana and later aggregated into the Elasticsearch logs. @sosiouxme, wondering your thoughts on this as well

Contributor Author:

@rhcarvalho

I suspect the port is not user-configurable in the logging stack installation?

I'm not sure, but I did not see any config for them in the sample inventories.

Contributor:

I believe this will happen: the 404 request is logged in Kibana and later aggregated into the Elasticsearch logs.

Ingenious trick!

Contributor:

We use the same trick in logging CI https://github.com/openshift/origin-aggregated-logging/blob/master/deployer/scripts/util.sh#L459

# $1 - unique value to search for in es
function add_test_message() {
    local kib_pod=`get_running_pod kibana`
    oc exec $kib_pod -c kibana -- curl --connect-timeout 1 -s \
       http://localhost:5601/$1 > /dev/null 2>&1
}

Member:

Right, it just causes Kibana itself to create a log entry. So it's only logs for Kibana that get polluted (only admins will see).

result = self._exec_oc(exec_cmd, [], task_vars)
return json.loads(result)["count"]

def _exec_oc(self, cmd_str, extra_args, task_vars):
Contributor:

Why the underscore?

return {"failed": True, "changed": False, "msg": ('No Kibanna pods were found to be'
'in the "Running" state.'
'At least one Kibana pod is required'
'in order to perform this check.')}
Contributor:

Poor indentation style.

return {"failed": True, "changed": False, "msg": ('No Elasticsearch pods were found to be'
'in the "Running" state.'
'At least one Kibana pod is required'
'in order to perform this check.')}
Contributor:

PIS ;)

from openshift_checks.logging.logging import LoggingCheck


ES_CMD_TIMEOUT_SECONDS = 30
Contributor:

This is the source of uncertainty and flakiness of this check :-/

Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.

This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.

Contributor Author:

This is the source of uncertainty and flakiness of this check :-/

Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.

While I agree, how would we report this information without failing the check? If I understand correctly, messages from a check do not always show up at the end of a successful check.

This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.

I think at least for now we could try this, as well as set the default time to a much higher value.
I don't think this will become a problem of flakes, as this check is not meant to be run as pre-install or pre-upgrade; I was thinking it'd be more along the lines of the openshift image data check

Contributor:

Sorry, I was just thinking aloud how the stuff we do need not necessarily fit in the "check" format.

Let's start with this and iterate as we get feedback.

class Podlogs(LoggingCheck):
"""Check that logs from ElasticSearch pods can be queried within a reasonable amount of time."""

name = "podlogs"
Contributor:

Maybe pod_logs (separate words with underscore as we do for other check names)? Or even something more descriptive/verbose.

Then make the class name a CamelCase version of the name.

raise OpenShiftCheckException(msg.format(uid, timeout_secs))

def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
"""curl Kibanna with a unique uuid to search for in Elasticsearch."""
Contributor:

Don't mix the matters in the docstring. This is probably unrelated to Elasticsearch, even though in a greater context the latter will be involved.


def query_es_from_es(self, es_pod, es_host, es_index, uid, task_vars):
"""curl the Elasticsearch pod and look for the unique uuid that was set
as part of the error log from the curl request made in the Kibanna pod."""
Contributor:

The knowledge of how ES and Kib. mix together should be in the caller that coordinates the use of these methods, not in the methods themselves. This method is only about ES. A caller does whatever it wants with it.

as part of the error log from the curl request made in the Kibanna pod."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl "
"--connect-timeout 1 -s -k "
Contributor:

Why do we use -k and pass --cert and --key? Are those somehow contradictory?

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 4 times, most recently from 0959b35 to 6cb7b3c on June 21, 2017 16:23
@juanvallejo (Contributor Author):

@rhcarvalho thanks for the feedback, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 3 times, most recently from 801df81 to 5f606a7 on June 21, 2017 19:41
if error:
return {"failed": True, "changed": False, "msg": error}

kibanna_pods = super(PodLogQueryTime, self).running_pods(pods)
Contributor:

Why not self.running_pods?!

Contributor Author:

Sure, will update

uuid from earlier, there should only be a single match.
Raise an OpenShiftCheckException if the timeout is reached before a result is found."""

interval = 1 # amount of time to wait between retries
Contributor:

s/amount of time/seconds ?

if total == 1:
return

msg = "expecting match in Elasticsearch for message with uuid {}, but no matches were found after {}s."
Contributor:

Eventually we may have slept for more than orig_timeout, but that's probably irrelevant, agree?

Member:

Queries should be pretty quick but in times of trouble they may not be... probably better if the loop condition was how much time had passed, not how many times through the loop. Would be even better if there were an actual timer set so that if something got stuck it wouldn't just sit there, but that may be too complicated for now.

Contributor:

how much time had passed

The detail of doing that right is that we need a monotonic clock/timer.

The total time per loop iteration should be approximately the time slept (at least interval) plus the time spent running curl.
We can limit curl a bit more by setting --max-time.

I'm okay if this function can run for (a bit) longer than orig_timeout, especially if we lower that value as @sosiouxme proposes.

Contributor Author:

@rhcarvalho @sosiouxme sounds good, updated curl commands to have a connect-timeout of 30s and a max-time of 60s. Also updated the total time per loop iteration to take total curl time into account as well: 29149c2
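For illustration, a minimal sketch of a time-bounded retry loop along these lines; this is not the PR's code, it assumes Python 3's time.monotonic() (a Python 2 caller would fall back to time.time()), and query_once stands in for a call such as query_es_from_es:

import time

INTERVAL_SECONDS = 1  # seconds to wait between retries


def wait_for_match_within(query_once, timeout_secs=30):
    """Retry query_once() until it reports exactly one match or timeout_secs elapse."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if query_once() == 1:
            return True
        time.sleep(INTERVAL_SECONDS)
    return False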

def query_es_from_es(self, es_pod, uid, task_vars):
"""curl the Elasticsearch pod and look for a unique uuid in its logs."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl --connect-timeout 1 -s -k -f "
Contributor:

My question remains: why do we use -k and --cert and --key together?

Contributor:

Because es requires client cert auth for admin access. So you can either use -k or --cacert /etc/elasticsearch/secret/admin-ca, your choice.

Contributor:

Thanks @richm. I see no reason not to use --cacert /etc/elasticsearch/secret/admin-ca then.

As for the --connect-timeout, 1 second seems too aggressive.
If we're willing to wait up to 10 minutes retrying until this succeeds, I'd suggest being more tolerant with slow connections.
--connect-timeout 60?
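A sketch of what the Elasticsearch exec command could look like with --cacert in place of -k and an overall --max-time bound. Only the admin-ca path comes from the comments above; the cert/key paths, host, and query URL are assumptions for illustration, not the PR's final code:

def build_es_query_cmd(pod_name, uuid):
    # --cacert replaces -k per the discussion above; the cert/key paths, host,
    # and _count query below are illustrative assumptions only.
    return (
        "exec {pod_name} -- curl --max-time 30 -s -f "
        "--cacert /etc/elasticsearch/secret/admin-ca "
        "--cert /etc/elasticsearch/secret/admin-cert "
        "--key /etc/elasticsearch/secret/admin-key "
        "https://logging-es:9200/_count?q=message:{uuid}"
    ).format(pod_name=pod_name, uuid=uuid)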

Contributor Author:

I agree, will increase connect timeout

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 5f606a7 to ab2e09e on June 21, 2017 22:02
@juanvallejo (Contributor Author):

Thanks for the feedback, will begin adding tests

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 5 times, most recently from 6892b30 to 03e694a on June 22, 2017 19:47
@juanvallejo (Contributor Author):

aos-ci-test

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 03e694a to 920fbb1 on June 22, 2017 20:07
@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 03e694a (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 03e694a (logs)

@@ -0,0 +1,174 @@
"""
Module for ensuring logs from pods can be queried in a reasonable amount of time.
Member:

It's a check, not a module

Contributor:

Funny thing is that in one sense it is a "Python module". I agree with @sosiouxme that in this context we better call it a check.


def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
Member:

I don't think there's a point in making this configurable; it will never be different.

If we ever have this check look at the ops instance then this would be logging-es-ops but I think we'd structure the check a little differently to do both.

def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
self.pod_es_index = get_var(task_vars, "openshift_check_pod_elasticsearch_index", default=ES_DEFAULT_INDEX)
Member:

This also doesn't need to be configured; it needs to be based on the actual name of the logging project (logging_namespace below, defaults to logging).


logging_namespace = "logging"
pod_es_host = ES_DEFAULT_INDEX
pod_es_index = ES_DEFAULT_HOST
Member:

It doesn't make sense to define these as properties and then pass them around as method parameters. I don't think these should be properties at all.

Contributor Author:

Originally I passed them as arguments to curl_kibana_with_uuid and query_es_from_es methods, but was running into a pylint too-many-arguments error.

return {"failed": True, "changed": False, "msg": msg}

# get all Elasticsearch pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
Member:

hasn't changed since the last time you got it...


def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
the Elasticsearch logs. Since we are querying for a message with the the unique
Member:

the the


"""curl Kibana with a unique uuid."""
uid = str(self.generate_uuid())
pod_name = kibanna_pod["metadata"]["name"]
exec_cmd = "exec {pod_name} -c kibana -- curl --connect-timeout 60 -s http://localhost:5601/{uid}"
Member:

If it takes 60 seconds to connect something is really wrong... I'd say maybe 5? For that matter, probably want to add --max-time 15 to the curl as well so queries can't get stuck.

Contributor:

In a previous iteration it was --connect-timeout 1. The discussion went towards not making it too aggressive, because if the network is slow but otherwise functioning, we'd loop and try to call curl several times, but always giving up too quick before connecting.

Any number is a wild guess.

👍 to setting a --max-time. Or replacing --connect-timeout with only --max-time (doesn't really matter how long it takes to connect, but we want to set an upper bound to how long curl runs).

Member:

It's a wild guess to some degree, but this is all intra-cluster networking. It would be a bad sign if connecting to a service IP took more than a second even if your cluster were federated around the world (although it could always be a transient network issue, I think it's much more likely to be something more significant). The actual retrieval of the request hits storage and processing so I could see a little more leeway on that, agree that it's a good idea to just use --max-time to cover the whole operation.


def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
"""curl Kibana with a unique uuid."""
uid = str(self.generate_uuid())
Member:

just curious, would this ever not already be a string?

Contributor:

In tests it is a string (when the method gets monkey patched), but otherwise it is UUID object...
For consistency, I suggest changing generate_uuid to always return a string.
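The suggested change is small; a sketch, written here as a module-level helper rather than the check's actual method:

from uuid import uuid4


def generate_uuid():
    """Return the UUID as a string so callers and monkey-patched tests see one type."""
    return str(uuid4())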

def query_es_from_es(self, es_pod, uid, task_vars):
"""curl the Elasticsearch pod and look for a unique uuid in its logs."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl --connect-timeout 60 -s -k -f "
Member:

I don't think there is a need for -k here, should be able to just specify the CA and have a valid TLS connection... remind me to check into it.

Contributor:

This got lost in a previous discussion. The conclusion was to replace -k with --cacert /etc/.....

@sosiouxme (Member):

Hah, I mostly just totally contradicted or repeated earlier feedback, didn't I? Can discuss further...

except ValueError as error:
error = "Invalid response from pod {}".format(str(error))

return error
Contributor:

I agree with @sosiouxme re: exception handling here, and add that we better not mix styles. In Go we pass around errors as values and the common idiom is to have a return value of type error. Not that we couldn't do something similar in Python, but if we are raising exceptions, let's use a single mechanism for communicating errors.



from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException


SAMPLE_UID = "3ea53ca5-f60a-40e5-8828-983240aee8a1"
Contributor:

Note that this doesn't need to be a proper UUID -- when the test fails, seeing this string can be more puzzling than enlightening.

import pytest
import json


Contributor:

Too many empty lines, pylint doesn't run on test files yet...

Suggestion: separate imports of stdlib, 3rd party and local pkgs/modules, in that order:

import json

import pytest

from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException

assert error
assert expect_error in error
else:
assert not error
Contributor:

Dead code?



@pytest.mark.parametrize('json_response, uuid, timeout, expect_exception, extra_words', [
# test with invalid json response
Contributor:

If these comments are to be useful, they should be part of the output when a test fails...
We should not need to manually match complex data in the output to identify which test case failed.

See https://docs.pytest.org/en/latest/parametrize.html#_pytest.python.Metafunc.parametrize.

I couldn't find a clean example right now, but I can type:

@pytest.mark.parametrize('name,json_response,uuid,timeout,exc,extra_words', [
    (
        'invalid json response',
        ...
     ),
    (
        'empty response',
        ...
    ),
    ...
], ids=lambda argval: argval[0])
def test_...(name, json_response, ...):
    ...

We also have one example in openshift-ansible, look for roles/openshift_certificate_expiry/test/conftest.py.
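A self-contained variant of the same idea that runs under pytest as written; the test values below are made up purely to show how the ids surface in the output:

import json

import pytest


@pytest.mark.parametrize('json_response, expect_count', [
    ('{"count": 1}', 1),
    ('{"count": 0}', 0),
], ids=['single match', 'no matches'])
def test_count_parsing(json_response, expect_count):
    # On failure, pytest reports e.g. "test_count_parsing[no matches]",
    # so the case description is visible without digging through the data.
    assert json.loads(json_response)["count"] == expect_count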

Contributor Author:

Will change, originally based this on the way we have written every other test so far, however

Member:

I agree it's better to have the comments in the output than in the code, and I think I'm the one that started with the code comments so I'll take the blame.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch 3 times, most recently from 0ad2bd4 to 8d40f83 on June 23, 2017 18:31
@juanvallejo (Contributor Author):

@sosiouxme thanks for the review, comments addressed

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from b431957 to 0a33c22 on June 27, 2017 21:18
@juanvallejo (Contributor Author):

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 0a33c22 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 0a33c22 (logs)


# get all Kibana pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Member:

Actually if there are no Kibana pods running, this comes back with:

There are no pods in the logging namespace. Is logging deployed?

Which is sort of a universal problem with using this method; if any component is completely down it says there are no pods in the whole namespace. Originally I think it got all the pods and then sorted them out by component and that error really meant there were no pods. Now it's misleading. Could the message be adjusted or handled differently?

Contributor Author:

Thanks, updated the error returned by get_pods_for_component to mention the logging component instead, and added further clarification when handling the error in the check.

return {"failed": True, "changed": False, "msg": msg}

# get all Elasticsearch pods
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Member:

same issue here

'At least one Elasticsearch pod is required in order to perform this check.')
return {"failed": True, "changed": False, "msg": msg}

uid = self.curl_kibana_with_uuid(running_kibana_pods[0], task_vars)
Member:

while I'm at it... is there some reason not to call this uuid?

uid is user ID, uuid is universally unique ID.


def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
the Elasticsearch logs. Since we are querying for a message with the unique
Member:

alright, super nit but... "unique" is redundant when describing a uuid.

Contributor Author:

:)

This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 0a33c22 to 1871536 on June 28, 2017 19:05
@juanvallejo (Contributor Author):

@sosiouxme Thanks, review comments addressed

@sosiouxme (Member):

@juanvallejo Can't believe I'm just noticing this but:

https://github.com/juanvallejo/openshift-ansible/blob/18715362c46e21b044f5edbecefd9b7f32bae85b/roles/openshift_health_checker/openshift_checks/logging/logging.py#L23

I don't think the default is to deploy logging. So all these checks would be running even if logging is not deployed. Could you change that default to false with this PR? Then I think we're good to go.

@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from 1871536 to d72dc1d on June 28, 2017 23:26
@juanvallejo (Contributor Author):

@sosiouxme ping

@sosiouxme (Member):

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for d72dc1d (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for d72dc1d (logs)

@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
@staticmethod
def is_first_master(task_vars):
"""Run only on first master and only when logging is configured. Returns: bool"""
logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
Contributor:

@sosiouxme we need to make sure this change is in 3.6, right?

@juanvallejo let's split it into a separate commit with a relevant commit message saying why we're changing it ("the default was incorrectly set, causing the check to run when it shouldn't" -- or sth else)


def run(self, tmp, task_vars):
"""Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
pod_log_timeout = int(get_var(task_vars, "openshift_check_pod_logs_timeout", default=ES_CMD_TIMEOUT_SECONDS))
Contributor:

Should handle a potential ValueError exception.
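A minimal sketch of such a guard, pulled out as a standalone helper; falling back to the default is one possible choice, the check could instead report a failure:

ES_CMD_TIMEOUT_SECONDS = 30


def parse_timeout(raw_value, default=ES_CMD_TIMEOUT_SECONDS):
    """Parse an inventory-supplied timeout, falling back to the default on bad input."""
    try:
        return int(raw_value)
    except (TypeError, ValueError):
        return default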

found in the Elasticsearch logs. Since we are querying for a message
with the uuid that was set earlier, there should only be a single match.
Raise an OpenShiftCheckException if timeout is reached finding a match."""
now = int(time.time())
Contributor:

Why truncate it into an int?

return {}

def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
"""Wait a maximum of timeout_secs for the uuid logged in Kibana to be
Contributor:

"Wait a maximum of timeout_secs for ...": this is not what this method does.
We're actually retrying with at least 1s intervals, and the "wait time" is defined right in query_es_from_es in the call to curl.

We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time().

Contributor Author:

We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time().

Not sure I follow your thinking here. The reason why I'm using time.time() is to ensure that the delay between log creation and log aggregation does not exceed an overall time constraint.
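For comparison, the "retry N times with S seconds interval" shape described above would look roughly like this (a hypothetical helper, not the PR's code); total runtime is then bounded by about retries × (interval + curl's --max-time) without ever calling time.time():

import time


def retry_for_match(query_once, retries=30, interval=1):
    """Retry query_once() up to retries times, sleeping interval seconds between attempts."""
    for _ in range(retries):
        if query_once() == 1:
            return True
        time.sleep(interval)
    return False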


@staticmethod
def running_pods(pods):
"""Returns: list of pods in a running state"""
Contributor:

"Filter out pods that are not running" ?
"Filter pods that are running"

Or at least s/Returns:/Return a

return [
pod for pod in pods
if pod['status']['phase'] == 'Running'
]
Contributor:

Could be in one line?
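The one-liner being suggested would be along these lines (a sketch, not the committed code):

def running_pods(pods):
    """Filter the given pods down to those whose status phase is Running."""
    return [pod for pod in pods if pod['status']['phase'] == 'Running']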

SAMPLE_UUID = "unique-test-uuid"


def canned_podlogquerytime(exec_oc=None):
Contributor:

The name here is reminiscent of a recent past... isn't it outdated?

if error:
assert word in str(error)
elif result:
assert word in result
Contributor:

I'd find it easier to understand if "positive" and "negative" tests were separate.

Here, it is unclear what assertions are actually made and, more importantly, what do they mean. Maybe it's just me :/

)

if error:
msg = 'Unable to retrieve pods for the "Kibanna" logging component: {}'
Contributor:

s/Kibanna/Kibana/

In an Openshift-Ansible inventory file, logging is not set to deploy by
default. This patch updates the LoggingCheck's assumption of
logging_deployed to False, to match the inventory file's default.
@juanvallejo juanvallejo force-pushed the jvallejo/verify-pod-logs-query-time branch from d72dc1d to 7d7c942 on July 5, 2017 15:42
@juanvallejo juanvallejo closed this Jul 5, 2017
@juanvallejo juanvallejo deleted the jvallejo/verify-pod-logs-query-time branch July 5, 2017 17:03
@juanvallejo juanvallejo restored the jvallejo/verify-pod-logs-query-time branch July 5, 2017 17:04
@juanvallejo (Contributor Author):

Did not mean to delete branch - re-opening here: #4682
