Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add check to detect increased etcd traffic #4316

Conversation

juanvallejo
Copy link
Contributor

@juanvallejo juanvallejo commented May 30, 2017

Related trello card: https://trello.com/c/7pJdHj4F
Related bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1415839

An update in OSE 3.4 caused an increase in etcd traffic due to increased quorum reads.
This check scans etcd logs (journalctl -ru <unit>) and recommends an upgrade to OSE 3.6 if the message etcd: sync duration of ..., expected less than 1s is found (this message is a direct symptom of the increased traffic update).

Based on oadm diagnostics

TODO

  • add unit tests

cc @brenton @rhcarvalho @sosiouxme

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch from 899d89f to 733724b Compare May 30, 2017 23:06

cmd = [
'/bin/journalctl',
'-ru', 'etcd',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sosiouxme will make this a module param. Not sure what the value should be

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit? It'll be etcd for multi-master clusters or the master unit name which depends on the deployment type

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch 2 times, most recently from c52fb2e to 6a1ff5f Compare May 31, 2017 13:55
@juanvallejo
Copy link
Contributor Author

aos-ci-test

@juanvallejo
Copy link
Contributor Author

@sosiouxme will begin adding tests

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch 2 times, most recently from 0bcd1aa to 9153160 Compare May 31, 2017 21:20
Copy link
Member

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looking pretty good!


cmd = [
'/bin/journalctl',
'-ru', 'etcd',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit? It'll be etcd for multi-master clusters or the master unit name which depends on the deployment type

cmd = [
'/bin/journalctl',
'-ru', 'etcd',
'--output', module.params["output"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no real reason to parameterize this, you'll always want json because it's machine-readable that way

@@ -0,0 +1,95 @@
# pylint: disable=missing-docstring

"""Interface to journalctl"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard to believe there isn't a module for journalctl... guess it just shows, no one is doing monitoring/diagnostics with ansible yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's actually nothing about this module that's specific to the task at hand; it could probably be renamed to search_journal.py or some such. although in a more generic implementation you'd probably search for several things at the same time in a single unit's journal. but we can cross that bridge later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although in a more generic implementation you'd probably search for several things at the same time in a single unit's journal. but we can cross that bridge later.

@sosiouxme yeah, was thinking of implementing this, but was not sure how to handle "callbacks" for different "matchers". Ideally I would send an array of dictionaries, with each dictionary containing a regexp field, and a function to be called if that regex was matched, to the remote module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just have it pass back what matched which regex and let the caller figure out how to respond.

try:
obj = json.loads(line.rstrip())
if start_matcher.match(obj["MESSAGE"]):
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# don't need to look past the most recent service restart

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(just because that may not be obvious)

@@ -0,0 +1,35 @@
# pylint: disable=missing-docstring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't do that...

def is_active(cls, task_vars):
"""Skip hosts that do not have etcd or masters in their group names."""
group_names = get_var(task_vars, "group_names", default=[])
active = "masters" in group_names or "etcd" in group_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, it's masters if there's only one master; etcd if there are multiple masters. there's a variable for figuring out what's in all the groups - maybe just groups?

it's not exactly critical to get this right in is_active but we might as well.

def run(self, tmp, task_vars):
match = self.module_executor("etcdlogs", {
"log_matcher": {
"regexp": "etcd: sync duration of \d+\.\d+s, expected less than 1s", # pylint: disable=anomalous-backslash-in-string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use r' so the backslashes aren't misinterpreted - or does that not work across the module invocation? i think it should be fine.

if match["matched"]:
msg = ("Higher than normal etcd traffic detected. "
"OpenShift 3.4 introduced an increase in etcd traffic by at least a factor of 4."
"\nUpgrading to OpenShift 3.6 is recommended in order to fix this issue.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending an upgrade there are a few tricks for mitigating this; can we mention those too? Or best, refer to a kbase if we have one.

Copy link
Contributor Author

@juanvallejo juanvallejo Jun 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@openshift-bot
Copy link

error: aos-ci-jenkins/OS_3.6_containerized for 6a1ff5f (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 6a1ff5f (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 6a1ff5f (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 6a1ff5f (logs)

@openshift-bot
Copy link

error: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 6a1ff5f (logs)

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch 10 times, most recently from 1865f63 to afb6496 Compare June 6, 2017 15:51
@sosiouxme
Copy link
Member

@juanvallejo is this still [WIP]? looks pretty done-ish. got some lint/flake8 stuff to fix though.

@juanvallejo juanvallejo changed the title [WIP] add check to detect increased etcd traffic add check to detect increased etcd traffic Jun 6, 2017
@juanvallejo
Copy link
Contributor Author

@sosiouxme thanks, removed WIP tag. Will address lint stuff

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch 3 times, most recently from 9cfdd5b to fbf31aa Compare June 6, 2017 20:48
@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch from 2b8d30b to 1372c59 Compare July 5, 2017 21:55
@juanvallejo
Copy link
Contributor Author

@rhcarvalho thanks for the feedback, comments addressed


module.exit_json(
changed=False,
failed=len(errors),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be less confusing for us if failed is a boolean. Please make it bool(errors) (note you don't need len to get what you mean, an empty sequence is false, otherwise true).

"""Iterate through a list of matchers, verify required fields exist on each one, and
retrieve log output for each matcher, appending its match pattern to a collection if
a matcher's regex pattern is found.
Return a collection of matched regular expressions, or raise a MatchErrorException if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No raise statements in this function body, is this comment true / up-to-date?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will update docstring


def get_log_input(matcher):
"""Run journalctl with the systemd unit specified in the matcher.
Return the command output as an iterator."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General suggestion for docstrings: speak about what a function does, and why somebody would want to call it. Often you can omit how it does it, leave it as an implementation detail unless the algorithm has particular importance.
That way, comments will be less prone to getting out of sync when the implementation changes, because the how tends to change more predominantly than why or what.

E.g.:

def get_log_output(matcher):
    """Return an iterator on the logs of a given matcher."""

BTW, why is the function called "log_input"? Shouldn't it be "output"?!

retrieve log output for each matcher, appending its match pattern to a collection if
a matcher's regex pattern is found.
Return a collection of matched regular expressions, or raise a MatchErrorException if
any errors occur."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(please consider this a continuation to the comment below about docstrings)

Another example of docstring:

def get_log_matches(matchers, log_count_limit, timestamp_limit_seconds):
    """Return a list of up to log_count_limit matches for each matcher.

    Log entries are only considered if older than timestamp_limit_seconds.
    """

The current doscstring seems to be spelling out a procedure, enumerating the algorithm that reflects the current implementation. It however does not explain what log_count_limit and timestamp_limit_seconds are meant for (and maybe my example above is documenting them wrong).

def find_matches(log_input, matcher, log_count_limit, timestamp_limit_seconds):
"""Receive iterable input, a matcher dictionary and scan journalctl at the
given systemd module for log messages matching a given regular expression.
Return the matched log message (or None) and an error string if any."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring is outdated. Please review the docstrings. Some are really verbose and still miss fundamental points like explaining non-obvious arguments. Better to aim for concise strings that are still helpful in understanding the purpose of the functions/methods.

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch from 1372c59 to 65758ab Compare July 13, 2017 21:39
@juanvallejo
Copy link
Contributor Author

@rhcarvalho thanks, comments addressed

@juanvallejo
Copy link
Contributor Author

aos-ci-test

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 65758ab (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.5_NOT_containerized, aos-ci-jenkins/OS_3.5_NOT_containerized_e2e_tests" for 65758ab (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.5_containerized, aos-ci-jenkins/OS_3.5_containerized_e2e_tests" for 65758ab (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 65758ab (logs)

Copy link
Contributor

@rhcarvalho rhcarvalho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, let's get this in and avoid another CI cycle.

"Please refer to https://access.redhat.com/solutions/2916381 for more information.")
return {"failed": True, "msg": msg}

if match["failed"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check this before match.get("matched")? It will probably work the same this way, but it seems logical to check for errors first, and then look at matches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for word in extra_words:
assert word in result.get("msg", "")

assert result.get("failed", False) == failed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems more logical to first check "failed" and only then check "extra_words".

@rhcarvalho
Copy link
Contributor

[merge]

@openshift-bot
Copy link

[test]ing while waiting on the merge queue

@rhcarvalho
Copy link
Contributor

Flake openshift/origin#13977

[merge]

@rhcarvalho
Copy link
Contributor

Not merging until the queue is open again.

@juanvallejo juanvallejo force-pushed the jvallejo/add-increased-etcd-traffic-check branch from 65758ab to 7895589 Compare July 19, 2017 21:39
@openshift-bot
Copy link

Evaluated for openshift ansible merge up to 7895589

@openshift-bot
Copy link

Evaluated for openshift ansible test up to 7895589

@openshift-bot
Copy link

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_openshift_ansible/349/) (Base Commit: e432f8a) (PR Branch Commit: 7895589)

@openshift-bot
Copy link

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/715/) (Base Commit: e432f8a) (PR Branch Commit: 7895589)

@rhcarvalho
Copy link
Contributor

Code was changed since last test run, need another run.

@rhcarvalho
Copy link
Contributor

aos-ci-test

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 7895589 (logs)

@openshift-bot
Copy link

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 7895589 (logs)

@rhcarvalho
Copy link
Contributor

[merge]

@openshift-bot
Copy link

The OpenShift Ansible merge job could not be run again for this pull request.

  • If the proposed changes in this pull request caused the job to fail, update this pull request with new code to fix the issue(s).
  • If flaky tests caused the job to fail, leave a comment with links to the GitHub issue(s) in the openshift/origin repository with the kind/test-flake label that are tracking the flakes. If no issue already exists for the flake you encountered, create one.
  • If something else like CI system downtime or maintenance caused the job to fail, contact a member of the Team Project Committers group to trigger the job again.

@rhcarvalho rhcarvalho merged commit e29eab4 into openshift:master Jul 20, 2017
@juanvallejo juanvallejo deleted the jvallejo/add-increased-etcd-traffic-check branch July 20, 2017 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants