verify sane log times in logging stack #4519
Conversation
pod for pod in pods
if any(
    condition['type'] == 'Ready' and condition['status'] == 'True'
    for condition in pod['status']['conditions']
I didn't recall the conditions field... is Ready==True equivalent to Phase==Running?
I believe that it can be in the Running state, but its endpoint may not be reachable unless its condition.status is true: kubernetes/kubernetes#30785 (comment)
We do the same sort of check for not_running_pods, but thinking about this a bit more, I think for this method it should be enough to just check if the pod's Phase == "Running".
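For illustration, that simpler filter might look roughly like this (a sketch assuming pods are the dicts returned by oc get ... -o json, not the PR's exact code):

def running_pods(pods):
    """Keep only pods whose phase is Running (does not look at the Ready condition)."""
    return [pod for pod in pods if pod["status"]["phase"] == "Running"]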
uid = str(uuid4())
pod_name = kibanna_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -c kibana -- curl --connect-timeout 1 -s "
    "http://localhost:5601/{uid}")
Poor indentation style; maybe we could fit it all in one longer line?
I suspect the port is not user-configurable in the logging stack installation?
We may need to fine-tune more flags for curl, e.g. to cover failures (-f).
I'm surprised that apparently we create an entry with a GET request?!
Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?
I'm surprised that apparently we create an entry with a GET request?!
Yes, this is how the script I am basing this off works: https://bugzilla.redhat.com/attachment.cgi?id=1258225
Will this somehow pollute what users see, e.g. in the Web Console? Are they going to start seeing these mysterious UUIDs in their logs?
I believe this will happen: the 404 request is logged in Kibana and later aggregated into the Elasticsearch logs. @sosiouxme wondering your thoughts on this as well
I suspect the port is not user-configurable in the logging stack installation?
I'm not sure, but I did not see any config for it in the sample inventories
I believe this will happen, the 404 request is logged in kibana, and later aggregated into the elasticsearch logs.
Ingenious trick!
We use the same trick in logging CI https://github.com/openshift/origin-aggregated-logging/blob/master/deployer/scripts/util.sh#L459
# $1 - unique value to search for in es
function add_test_message() {
local kib_pod=`get_running_pod kibana`
oc exec $kib_pod -c kibana -- curl --connect-timeout 1 -s \
http://localhost:5601/$1 > /dev/null 2>&1
}
Right, it just causes Kibana itself to create a log entry. So it's only logs for Kibana that get polluted (only admins will see).
result = self._exec_oc(exec_cmd, [], task_vars)
return json.loads(result)["count"]

def _exec_oc(self, cmd_str, extra_args, task_vars):
Why the underscore?
return {"failed": True, "changed": False, "msg": ('No Kibanna pods were found to be' | ||
'in the "Running" state.' | ||
'At least one Kibana pod is required' | ||
'in order to perform this check.')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Poor indentation style.
return {"failed": True, "changed": False, "msg": ('No Elasticsearch pods were found to be' | ||
'in the "Running" state.' | ||
'At least one Kibana pod is required' | ||
'in order to perform this check.')} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PIS ;)
from openshift_checks.logging.logging import LoggingCheck


ES_CMD_TIMEOUT_SECONDS = 30
This is the source of uncertainty and flakiness of this check :-/
Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.
This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.
This is the source of uncertainty and flakiness of this check :-/
Monitoring the time it takes over time would be much better than a point-in-time check. What is "reasonable amount of time" for one cluster is not for another.
While I agree, how would we report this information without failing the check? If I understand correctly, messages from a check do not always show up at the end of a successful check.
This could be user configurable, but then I also would not like us to end up with too many flags for people to tune.
I think at least for now we could try this, as well as set the default time to a much higher value.
I don't think this will become a problem of flakes, as this check is not meant to be run as pre-install or pre-upgrade; I was thinking it'd be more along the lines of the openshift image data check.
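Making it configurable while keeping a generous default could be as small as this (a sketch; the variable name mirrors what a later revision of the PR uses):

# read an optional override from the inventory, falling back to a high default
pod_log_timeout = int(get_var(task_vars, "openshift_check_pod_logs_timeout",
                              default=ES_CMD_TIMEOUT_SECONDS))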
Sorry, I was just thinking aloud how the stuff we do need not necessarily fit in the "check" format.
Let's start with this and iterate as we get feedback.
class Podlogs(LoggingCheck):
    """Check that logs from ElasticSearch pods can be queried within a reasonable amount of time."""

    name = "podlogs"
Maybe pod_logs (separate words with underscore as we do for other check names)? Or even something more descriptive/verbose.
Then make the class name a CamelCase version of the name.
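For example, the convention being suggested would look roughly like this (purely illustrative of the naming pattern, not the final name chosen):

class PodLogs(LoggingCheck):
    """Check that pod logs show up in Elasticsearch within a reasonable time."""

    name = "pod_logs"  # snake_case check name; the class name is its CamelCase form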
raise OpenShiftCheckException(msg.format(uid, timeout_secs))

def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
    """curl Kibanna with a unique uuid to search for in Elasticsearch."""
Don't mix the matters in the docstring. This is probably unrelated to Elasticsearch, even though in a greater context the latter will be involved.

def query_es_from_es(self, es_pod, es_host, es_index, uid, task_vars):
    """curl the Elasticsearch pod and look for the unique uuid that was set
    as part of the error log from the curl request made in the Kibanna pod."""
The knowledge of how ES and Kib. mix together should be in the caller that coordinates the use of these methods, not in the methods themselves. This method is only about ES. A caller does whatever it wants with it.
    as part of the error log from the curl request made in the Kibanna pod."""
pod_name = es_pod["metadata"]["name"]
exec_cmd = ("exec {pod_name} -- curl "
    "--connect-timeout 1 -s -k "
Why do we use -k and pass --cert and --key? Are those somehow contradictory?
Force-pushed from 0959b35 to 6cb7b3c
@rhcarvalho thanks for the feedback, comments addressed
Force-pushed from 801df81 to 5f606a7
if error:
    return {"failed": True, "changed": False, "msg": error}

kibanna_pods = super(PodLogQueryTime, self).running_pods(pods)
Why not self.running_pods?!
Sure, will update
uuid from earlier, there should only be a single match.
Raise an OpenShiftCheckException if the timeout is reached before a result is found."""

interval = 1  # amount of time to wait between retries
s/amount of time/seconds ?
if total == 1:
    return

msg = "expecting match in Elasticsearch for message with uuid {}, but no matches were found after {}s."
Eventually we may have slept for more than orig_timeout, but that's probably irrelevant, agree?
Queries should be pretty quick but in times of trouble they may not be... probably better if the loop condition was how much time had passed, not how many times through the loop. Would be even better if there were an actual timer set so that if something got stuck it wouldn't just sit there, but that may be too complicated for now.
how much time had passed
The detail of doing that right is that we need a monotonic clock/timer.
The total time per loop iteration should be approx. the time slept (at least interval) + the time running curl.
We can limit curl a bit more by setting --max-time.
I'm okay if this function can run for (a bit) longer than orig_timeout, especially if we lower that value as @sosiouxme proposes.
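A rough sketch of the elapsed-time loop being discussed (the method and helper names follow the PR; everything else is illustrative, and time.monotonic() requires Python 3, the PR itself ends up using time.time()):

import time

def wait_until_match(self, es_pod, uid, timeout_secs, task_vars):
    """Retry the Elasticsearch query until the uuid shows up or roughly timeout_secs elapse."""
    interval = 1  # seconds to sleep between retries
    deadline = time.monotonic() + timeout_secs  # monotonic clock is immune to wall-clock jumps
    while time.monotonic() < deadline:
        if self.query_es_from_es(es_pod, uid, task_vars) == 1:
            return
        time.sleep(interval)
    raise OpenShiftCheckException(
        "expecting match in Elasticsearch for message with uuid {}, "
        "but no matches were found after {}s.".format(uid, timeout_secs))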
@rhcarvalho @sosiouxme sounds good, updated curl commands to have a connect-timeout of 30s and a max-time of 60s. Also, updated total time per loop iteration to take into account total curl time as well: 29149c2
def query_es_from_es(self, es_pod, uid, task_vars):
    """curl the Elasticsearch pod and look for a unique uuid in its logs."""
    pod_name = es_pod["metadata"]["name"]
    exec_cmd = ("exec {pod_name} -- curl --connect-timeout 1 -s -k -f "
My question remains: why do we use -k and --cert and --key together?
Because es requires client cert auth for admin access. So you can either use -k or --cacert /etc/elasticsearch/secret/admin-ca, your choice.
Thanks @richm. I see no reason not to use --cacert /etc/elasticsearch/secret/admin-ca then.
As for the --connect-timeout, 1 second seems too aggressive.
If we're willing to wait up to 10 minutes retrying until this succeeds, I'd suggest being more tolerant with slow connections.
--connect-timeout 60?
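Putting those two suggestions together, the Elasticsearch command might end up shaped roughly like this (the --cacert path comes from the thread; the cert/key paths are placeholders, and the URL is elided as in the check):

exec_cmd = ("exec {pod_name} -- curl -s -f "
            "--connect-timeout 60 "                          # more tolerant than the original 1s
            "--cacert /etc/elasticsearch/secret/admin-ca "   # verify TLS instead of -k
            "--cert <admin-cert> --key <admin-key> "         # client-cert auth; placeholder paths
            "https://...")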
I agree, will increase connect timeout
Force-pushed from 5f606a7 to ab2e09e
Thanks for the feedback, will begin adding tests
Force-pushed from 6892b30 to 03e694a
aos-ci-test
Force-pushed from 03e694a to 920fbb1
@@ -0,0 +1,174 @@
"""
Module for ensuring logs from pods can be queried in a reasonable amount of time.
It's a check, not a module
Funny thing is that in one sense it is a "Python module". I agree with @sosiouxme that in this context we better call it a check.

def run(self, tmp, task_vars):
    """Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
    self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
I don't think there's a point in making this configurable; it will never be different.
If we ever have this check look at the ops instance then this would be logging-es-ops, but I think we'd structure the check a little differently to do both.
def run(self, tmp, task_vars):
    """Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
    self.pod_es_host = get_var(task_vars, "openshift_check_pod_elasticsearch_host", default=ES_DEFAULT_HOST)
    self.pod_es_index = get_var(task_vars, "openshift_check_pod_elasticsearch_index", default=ES_DEFAULT_INDEX)
This also doesn't need to be configured, it needs to be based off the actual name of the logging project (logging_namespace below, defaults to logging).

logging_namespace = "logging"
pod_es_host = ES_DEFAULT_INDEX
pod_es_index = ES_DEFAULT_HOST
It doesn't make sense to define these as properties and then pass them around as method parameters. I don't think these should be properties at all.
Originally I passed them as arguments to the curl_kibana_with_uuid and query_es_from_es methods, but was running into a pylint too-many-arguments error.
return {"failed": True, "changed": False, "msg": msg} | ||
|
||
# get all Elasticsearch pods | ||
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasn't changed since the last time you got it...

def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
    """Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
    the Elasticsearch logs. Since we are querying for a message with the the unique
the the
"""curl Kibana with a unique uuid.""" | ||
uid = str(self.generate_uuid()) | ||
pod_name = kibanna_pod["metadata"]["name"] | ||
exec_cmd = "exec {pod_name} -c kibana -- curl --connect-timeout 60 -s http://localhost:5601/{uid}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it takes 60 seconds to connect something is really wrong... I'd say maybe 5? For that matter, probably want to add --max-time 15 to the curl as well so queries can't get stuck.
In a previous iteration it was --connect-timeout 1. The discussion went towards not making it too aggressive, because if the network is slow but otherwise functioning, we'd loop and try to call curl several times, but always give up too quickly before connecting.
Any number is a wild guess.
👍 to setting a --max-time. Or replacing --connect-timeout with only --max-time (it doesn't really matter how long it takes to connect, but we want to set an upper bound on how long curl runs).
It's a wild guess to some degree, but this is all intra-cluster networking. It would be a bad sign if connecting to a service IP took more than a second even if your cluster were federated around the world (although it could always be a transient network issue, I think it's much more likely to be something more significant). The actual retrieval of the request hits storage and processing, so I could see a little more leeway on that; agree that it's a good idea to just use --max-time to cover the whole operation.

def curl_kibana_with_uuid(self, kibanna_pod, task_vars):
    """curl Kibana with a unique uuid."""
    uid = str(self.generate_uuid())
just curious, would this ever not already be a string?
In tests it is a string (when the method gets monkey patched), but otherwise it is a UUID object...
For consistency, I suggest changing generate_uuid to always return a string.
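The suggested change is tiny; something like the following (a sketch of the suggestion, assuming the helper simply wraps uuid.uuid4 as the earlier code does):

from uuid import uuid4

def generate_uuid(self):
    """Return a fresh uuid as a plain string so callers never see a UUID object."""
    return str(uuid4())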
def query_es_from_es(self, es_pod, uid, task_vars):
    """curl the Elasticsearch pod and look for a unique uuid in its logs."""
    pod_name = es_pod["metadata"]["name"]
    exec_cmd = ("exec {pod_name} -- curl --connect-timeout 60 -s -k -f "
I don't think there is a need for -k here; we should be able to just specify the CA and have a valid TLS connection... remind me to check into it.
This got lost in a previous discussion. The conclusion was to replace -k with --cacert /etc/....
Hah, I mostly just totally contradicted or repeated earlier feedback didn't I? Can discuss further...
except ValueError as error:
    error = "Invalid response from pod {}".format(str(error))

return error
I agree with @sosiouxme re: exception handling here, and add that we better not mix styles. In Go we pass around errors as values and the common idiom is to have a return value of type error. Not that we couldn't do something similar in Python, but if we are raising exceptions, let's use a single mechanism for communicating errors.
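A sketch of what "a single mechanism" could look like here: raise the check's exception type instead of returning an error string (the helper name is hypothetical, only the parsing itself comes from the diff above):

import json

def parse_count(self, result):  # hypothetical helper name, for illustration
    """Parse the curl output; signal bad output by raising instead of returning an error value."""
    try:
        return json.loads(result)["count"]
    except ValueError as error:
        raise OpenShiftCheckException("Invalid response from pod: {}".format(error))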
from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException


SAMPLE_UID = "3ea53ca5-f60a-40e5-8828-983240aee8a1"
Note that this doesn't need to be a proper UUID -- when the test fails, seeing this string can be more puzzling than enlightening.
import pytest
import json

Too many empty lines, pylint doesn't run on test files yet...
Suggestion: separate imports of stdlib, 3rd party and local pkgs/modules, in that order:
import json

import pytest

from openshift_checks.logging.podlogs import PodLogQueryTime, OpenShiftCheckException
    assert error
    assert expect_error in error
else:
    assert not error
Dead code?


@pytest.mark.parametrize('json_response, uuid, timeout, expect_exception, extra_words', [
    # test with invalid json response
If these comments are to be useful, they should be part of the output when a test fails...
We should not need to manually match complex data in the output to identify which test case failed.
See https://docs.pytest.org/en/latest/parametrize.html#_pytest.python.Metafunc.parametrize.
I couldn't find a clean example right now, but I can type:
@pytest.mark.parametrize('name,json_response,uuid,timeout,exc,extra_words', [
    (
        'invalid json response',
        ...
    ),
    (
        'empty response',
        ...
    ),
    ...
], ids=lambda argval: argval[0])
def test_...(name, json_response, ...):
    ...
We also have one example in openshift-ansible, look for roles/openshift_certificate_expiry/test/conftest.py.
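A small self-contained illustration of naming parametrized cases, using an explicit ids list rather than a callable (this is a generic example, not the PR's test data):

import json

import pytest


@pytest.mark.parametrize('json_response, expect_exception', [
    ('{"count": 1}', False),
    ('this is not json', True),
], ids=['valid json response', 'invalid json response'])
def test_response_parsing(json_response, expect_exception):
    # the case id, not the raw data, shows up in the failure output
    if expect_exception:
        with pytest.raises(ValueError):
            json.loads(json_response)
    else:
        assert json.loads(json_response)["count"] == 1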
Will change, originally based this on the way we have written every other test so far, however
I agree it's better to have the comments in the output than in the code, and I think I'm the one that started with the code comments so I'll take the blame.
Force-pushed from 0ad2bd4 to 8d40f83
@sosiouxme thanks for the review, comments addressed
Force-pushed from b431957 to 0a33c22
aos-ci-test

# get all Kibana pods
self.logging_namespace = get_var(task_vars, "openshift_logging_namespace", default=self.logging_namespace)
pods, error = super(LoggingIndexTime, self).get_pods_for_component(
Actually if there are no Kibana pods running, this comes back with:
There are no pods in the logging namespace. Is logging deployed?
Which is sort of a universal problem with using this method; if any component is completely down it says there are no pods in the whole namespace. Originally I think it got all the pods and then sorted them out by component and that error really meant there were no pods. Now it's misleading. Could the message be adjusted or handled differently?
Thanks, updated the error returned by get_pods_for_component to mention the logging component instead, and added further clarification when handling the error in the check.
return {"failed": True, "changed": False, "msg": msg} | ||
|
||
# get all Elasticsearch pods | ||
pods, error = super(LoggingIndexTime, self).get_pods_for_component( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same issue here
    'At least one Elasticsearch pod is required in order to perform this check.')
return {"failed": True, "changed": False, "msg": msg}

uid = self.curl_kibana_with_uuid(running_kibana_pods[0], task_vars)
While I'm at it... is there some reason not to call this uuid? uid is user ID, uuid is universally unique ID.

def wait_until_cmd_or_err(self, es_pod, uid, timeout_secs, task_vars):
    """Wait a maximum of timeout_secs for the uuid logged in Kibana to be found in
    the Elasticsearch logs. Since we are querying for a message with the unique
alright, super nit but... "unique" is redundant when describing a uuid.
:)
This patch verifies that logs sent from logging pods can be queried on the Elasticsearch pod within a reasonable amount of time.
Force-pushed from 0a33c22 to 1871536
@sosiouxme Thanks, review comments addressed
@juanvallejo Can't believe I'm just noticing this but: I don't think the default is to deploy logging. So all these checks would be running even if logging is not deployed. Could you change that default to false with this PR? Then I think we're good to go.
Force-pushed from 1871536 to d72dc1d
@sosiouxme ping
aos-ci-test
@@ -20,7 +20,7 @@ def is_active(cls, task_vars):
    @staticmethod
    def is_first_master(task_vars):
        """Run only on first master and only when logging is configured. Returns: bool"""
-       logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=True)
+       logging_deployed = get_var(task_vars, "openshift_hosted_logging_deploy", default=False)
@sosiouxme we need to make sure this change is in 3.6, right?
@juanvallejo let's split it into a separate commit with relevant commit message why we're changing it ("the default was incorrectly set, causing the check to run when it shouldn't" -- or sth else)

def run(self, tmp, task_vars):
    """Add log entry by making unique request to Kibana. Check for unique entry in the ElasticSearch pod logs."""
    pod_log_timeout = int(get_var(task_vars, "openshift_check_pod_logs_timeout", default=ES_CMD_TIMEOUT_SECONDS))
Should handle a potential ValueError exception.
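One possible way to handle that, inside run() (a sketch; the variable names follow the PR, the error wording is made up):

try:
    pod_log_timeout = int(get_var(task_vars, "openshift_check_pod_logs_timeout",
                                  default=ES_CMD_TIMEOUT_SECONDS))
except ValueError:
    # a non-numeric inventory value would otherwise crash the check with a traceback
    return {"failed": True, "changed": False,
            "msg": "openshift_check_pod_logs_timeout must be a number of seconds."}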
found in the Elasticsearch logs. Since we are querying for a message
with the uuid that was set earlier, there should only be a single match.
Raise an OpenShiftCheckException if timeout is reached finding a match."""
now = int(time.time())
Why truncate it into an int?
return {}

def wait_until_cmd_or_err(self, es_pod, uuid, timeout_secs, task_vars):
    """Wait a maximum of timeout_secs for the uuid logged in Kibana to be
"Wait a maximum of timeout_secs for ..."
: this is not what this method does.
We're actually retrying with at least 1s intervals, and the "wait time" is defined right in query_es_from_es
in the call to curl
.
We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time()
.
We could really simplify this by simply stating a simpler algorithm: "retry N times with S seconds interval", and never call time.time().
Not sure I follow your thinking here. The reason why I'm using time.time() is to ensure that the delay between log creation and log aggregation does not exceed an overall time constraint.
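For comparison, the fixed-retry variant being described would drop the clock arithmetic entirely (a sketch; the retry count and interval here are arbitrary, not values from the PR):

import time

def wait_until_match(self, es_pod, uuid, task_vars, retries=30, interval=1):
    """Retry the query a fixed number of times with a fixed sleep; no wall-clock arithmetic."""
    for _ in range(retries):
        if self.query_es_from_es(es_pod, uuid, task_vars) == 1:
            return
        time.sleep(interval)
    raise OpenShiftCheckException(
        "no match in Elasticsearch for uuid {} after {} attempts".format(uuid, retries))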

@staticmethod
def running_pods(pods):
    """Returns: list of pods in a running state"""
"Filter out pods that are not running" ?
"Filter pods that are running"
Or at least s/Returns:/Return a
return [
    pod for pod in pods
    if pod['status']['phase'] == 'Running'
]
Could be in one line?
SAMPLE_UUID = "unique-test-uuid" | ||
|
||
|
||
def canned_podlogquerytime(exec_oc=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name here is reminiscent of a recent past... isn't it outdated?
if error:
    assert word in str(error)
elif result:
    assert word in result
I'd find it easier to understand if "positive" and "negative" tests were separate.
Here, it is unclear what assertions are actually made and, more importantly, what do they mean. Maybe it's just me :/
)

if error:
    msg = 'Unable to retrieve pods for the "Kibanna" logging component: {}'
s/Kibanna/Kibana/
In an Openshift-Ansible inventory file, logging is not set to deploy by default. This patch updates the LoggingCheck's assumption of logging_deployed to False, to match the inventory file's default.
Force-pushed from d72dc1d to 7d7c942
Did not mean to delete branch - re-opening here: #4682
Related trello card: https://trello.com/c/7uOluOTW
Based on @richm 's script: https://bugzilla.redhat.com/attachment.cgi?id=1258225
This patch verifies that logs sent from logging pods can be queried on
the Elasticsearch pod within a reasonable amount of time.
TODO
cc @brenton @sosiouxme @rhcarvalho