Updated diagnostic_tools for openshift-ansible image #4713

adellape · 2017-07-05T21:43:01Z

Reworks #4579 and moves most of the content to the existing admin_guide/diagnostic_tools section, and adds subsections for ansible-playbook vs direct docker usage. Still adds mention of the openshift-ansible health checks to the scaling guide, but mostly links to the admin_guide for the details and usage. Still adds in the previously-undocumented checks (logging_index_time, etcd_traffic).

Preview:

http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#additional-cluster-health-checks
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/scaling_performance/optimizing_compute_resources.html#scaling-performance-debugging-using-openshift-ansible

adellape · 2017-07-05T21:46:12Z

PTAL @juanvallejo @sosiouxme @rhcarvalho

adellape · 2017-07-05T21:48:41Z

Document: [diagnostics] Document the use of the diagnostic tools in the official docs

adellape · 2017-07-05T21:50:13Z

admin_guide/diagnostics_tool.adoc

+
+----
+# docker run -u `id -u` \
+    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \


Was musing with Juan we should add a callout explaining what's going on here.

Yeah... there's a fairly detailed explanation of what's going on at https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#usage

As usual you have to weigh how much to say about all this... I have to say that the ssh key permissions are particularly fragile and quite difficult to debug when they're not right (ssh is really persnickety and not very informative unless you turn debug up to 11), so we really want to guide them down a single path, not make them confident in how to do it differently. We definitely need to mention not running this as root, that's likely to be a nasty surprise.

sosiouxme · 2017-07-05T22:39:36Z

admin_guide/diagnostics_tool.adoc

+    -v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
+    -e INVENTORY_FILE=/tmp/inventory \
+    -e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2>
+    -e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" \ <3>


pretty sure you need a second -e for the second parameter.

-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000" \

BTW are the callouts here going to interfere with copy/paste in the rendered version? Because they're really going to want to copy/paste this.

> pretty sure you need a second -e for the second parameter.

👍

Or quotes, so that everything after -e goes as a single shell argument.

sosiouxme · 2017-07-05T22:41:30Z

admin_guide/diagnostics_tool.adoc

+    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
+    -v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
+    -e INVENTORY_FILE=/tmp/inventory \
+    -e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2>


should be playbooks/byo/openshift-checks/health.yml

@sosiouxme Do the other two instances need to be /byo as well?

sosiouxme · 2017-07-05T22:53:07Z

admin_guide/diagnostics_tool.adoc

+[[openshift-ansible-health-checks]]
+== openshift-ansible Health Checks
+
+Additional diagnostic health checks are available through *openshift-ansible*,


It's probably a bit confusing that we refer to it as openshift-ansible here, but later it has a different RPM name and a different image name. I'm not sure there's anything to be done about it, though... you have to call it something and origin/OCP have different names for everything.

We could use a conditional for the name, or in this particular case we could possibly avoid giving it a name altogether:

Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage {product-title} clusters.

sosiouxme · 2017-07-05T23:03:10Z

admin_guide/diagnostics_tool.adoc

+xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a
+link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
+are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI
+method,


That's the OCP RPM name... I don't think we ship an RPM for Origin. I think we expect people to run out of a git clone.

sosiouxme · 2017-07-06T19:39:48Z

admin_guide/diagnostics_tool.adoc

+----
+# docker run -u `id -u` \
+    -v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
+    -v /etc/ansible/hosts:/tmp/inventory:ro \ <1>


turns out this needs a :Z as well (at least in my test today)

-v /etc/ansible/hosts:/tmp/inventory:Z,ro \

I believe you. Wondering why we don't have it in README_CONTAINER_IMAGE.md. The ro is also "new" here, makes sense. We may update README_CONTAINER_IMAGE.md to match.

I didn't need :Z here in my tests, and ro would be covered by standard file perms (we're running as non-root by default). Considering that :Z would relabel the file on the host, better to leave it out if not needed:

# ls -Z /etc/ansible/hosts
-rw-r--r--. root root unconfined_u:object_r:net_conf_t:s0 /etc/ansible/hosts
# sesearch -A -s svirt_lxc_net_t -t net_conf_t
Found 2 semantic av rules:
   allow svirt_sandbox_domain file_type : filesystem getattr ;
   allow svirt_sandbox_domain file_type : dir { getattr search open } ;

If we do need it let's add it... just mentioning that I think it's worth double checking that it's necessary...

Leaving it alone for now, then.

sosiouxme · 2017-07-06T22:59:51Z

admin_guide/diagnostics_tool.adoc

+A user-defined timeout may be set by passing the
+`openshift_check_logging_index_timeout_seconds` variable. For example, setting
+`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail
+if a newly-created Kibana log is not able to be queried via Elasticsearch after


"Kibana log" is confusing here and it's a detail they don't need. Suggest: "log entry"

sosiouxme · 2017-07-06T23:13:31Z

scaling_performance/optimizing_compute_resources.adoc

+method used during
+xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a
+link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
+are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI


again atomic-openshift-utils is OCP only

sosiouxme · 2017-07-06T23:13:47Z

scaling_performance/optimizing_compute_resources.adoc

+link:https://registry.access.redhat.com[Red Hat Container Registry].
+endif::[]
+ifdef::openshift-origin[]
+the *openshift/origin-ansible* container image is distirbuted via Docker Hub.


distributed

Is the content duplication intentional? This paragraph is the same in two files (typo included).

Re-did this so that I'm re-using the shared content w/ an include::.

sosiouxme · 2017-07-06T23:15:48Z

admin_guide/diagnostics_tool.adoc

+----
+# ansible-playbook -i <inventory_file> \
+    /usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \
+    -e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000"


needs a -e per param

-e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000

I tested just to be sure. We can use a single -e if the argument is quoted (because of spaces).

-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" is also correct.

And so is using single quotes (avoids shell expansion):

-e 'openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000'

Making it separate -e entries on separate lines (without quotes) for readability.
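
To make the quoting question above concrete, here is a minimal sketch of how the shell splits each form before `ansible-playbook` ever sees it. The `count_args` helper and the `a=1 b=2` values are hypothetical stand-ins for illustration only; they are not part of Ansible or openshift-ansible.

```shell
# Illustration only: count_args is a hypothetical helper, not an Ansible command.
# It prints how many arguments it received, then each argument in brackets.
count_args() {
  printf 'argc=%s\n' "$#"
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Unquoted: the shell splits on the space, so b=2 arrives as a separate
# positional argument instead of being part of the -e value.
count_args -e a=1 b=2

# Quoted: everything after -e arrives as one argument.
count_args -e "a=1 b=2"

# One -e per variable: each key=value is paired with its own -e flag.
count_args -e a=1 -e b=2
```

The three calls report 3, 2, and 4 arguments respectively, which is why the unquoted single `-e` form is the one that misbehaves.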

rhcarvalho · 2017-07-07T07:51:31Z

admin_guide/diagnostics_tool.adoc

-The following health checks belong to a diagnostic task meant to be run against
-the Ansible inventory file for a deployed {product-title} cluster. They can
-report common problems for the current {product-title} installation.
+[[openshift-ansible-health-checks]]


Changing the anchor will break existing links, is it worth it?

There was only one xref in the docs repo and it's updated in this PR. This content has only appeared so far in the Origin docs, so not too worried about links out in the wild.

rhcarvalho · 2017-07-07T07:52:33Z

admin_guide/diagnostics_tool.adoc

-the Ansible inventory file for a deployed {product-title} cluster. They can
-report common problems for the current {product-title} installation.
+[[openshift-ansible-health-checks]]
+== openshift-ansible Health Checks


Note: same comment as https://github.com/openshift/openshift-docs/pull/4713/files#r125776778

rhcarvalho · 2017-07-07T07:55:36Z

admin_guide/diagnostics_tool.adoc

+Example usage for each method are provided in subsequent sections.
+
+The following health checks are a set of diagnostic tasks that run as part of
+the *openshift_health_checker* Ansible role. They are meant to be run against


The role name is an internal implementation detail and should not be in the documentation.

rhcarvalho · 2017-07-07T07:58:33Z

admin_guide/diagnostics_tool.adoc


+|`logging_index_time`


Note: this is implemented in openshift/openshift-ansible#4682, not merged yet as of now.

Merged today

... meaning it will probably be in 3.6.1 not 3.6.0

rhcarvalho · 2017-07-07T08:00:28Z

admin_guide/diagnostics_tool.adoc

+if a newly-created Kibana log is not able to be queried via Elasticsearch after
+30 seconds.
+
+|`ovs_version`


This doesn't exist. We use a more generic package_version check to check versions of multiple packages, including Open vSwitch.

Hm, that was my fault. I believe I initially started the Open vSwitch check in a check of its own

Ack, will remove.

rhcarvalho · 2017-07-07T08:02:48Z

admin_guide/diagnostics_tool.adoc

+
+To disable specific checks, include the variable `openshift_disable_check` with
+a comma-delimited list of check names in your inventory file before running the
+playbook. For example:


FWIW it doesn't need to be in the inventory file; it can be set with a -e flag as well (among the other ways to set variables in Ansible).

Adding mention of that method as well.

rhcarvalho · 2017-07-07T08:11:37Z

admin_guide/diagnostics_tool.adoc

+openshift_disable_check=ovs_version,etcd_volume
+----
+
+To set variables that accept user-define values, include the `-e` flag with any


user-defined

Or rather:

To set variables in the command-line, include ...

adellape · 2017-07-12T20:32:52Z

@juanvallejo @sosiouxme @rhcarvalho Comments addressed. Changes pushed and preview updated.

http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#ansible-based-tooling-health-checks
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/scaling_performance/optimizing_compute_resources.html#scaling-performance-debugging-using-ansible

juanvallejo · 2017-07-24T15:04:56Z

ping @rhcarvalho or @sosiouxme wondering if there are any more comments on this?

rhcarvalho · 2017-07-24T16:56:40Z

admin_guide/diagnostics_tool.adoc

+established between the control host and the exposed Kibana URL. These checks
+will only run if the `openshift_hosted_logging_deploy` inventory variable is set
+to `true`, to ensure that they are executed in a deployment where a logging
+stack has been deployed.


Should we have some link in this paragraph tying back to the logging docs, or is it clear what "logging stack" we're referring to?

I think we could. And actually the name could use some standardization too. I see it called variously "cluster logging", "aggregated logging", "the logging stack", and "the EFK stack".

https://docs.openshift.org/latest/install_config/aggregate_logging.html is probably what we want to link to. I don't have a strong preference on standard name but "cluster logging" seems straightforward.

rhcarvalho · 2017-07-24T16:57:57Z

admin_guide/diagnostics_tool.adoc

+`openshift_check_logging_index_timeout_seconds` variable. For example, setting
+`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail
+if a newly-created log entry is not able to be queried via Elasticsearch after
+30 seconds.


More important than knowing the variable name would be explaining when one should bother to change the default. I think we're missing that.

@sosiouxme When should one bother?

Don't think we need the long variable name twice. Don't like "is not able to be". And to Rodolfo's point:

Users that either require lower-latency log aggregation or are comfortable with higher latency may adjust this timeout with an Ansible variable. For example, setting openshift_check_logging_index_timeout_seconds=45 relaxes the timeout to 45 seconds.

rhcarvalho · 2017-07-24T16:59:24Z

admin_guide/diagnostics_tool.adoc

+openshift_disable_check=etcd_traffic,etcd_volume
+----
+
+Alternatively, set any checks you want to disable as environment variables with `-e openshift_disable_check` when running the `ansible-playbook` command.


This can be misleading, it should be -e openshift_disable_check=name1,name2,....

So "an" environment variable yes?

sosiouxme · 2017-07-24T21:32:12Z

admin_guide/diagnostics_tool.adoc

+// tag::ansible-based-health-checks-intro[]
+Additional diagnostic health checks are available through the
+xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Ansible-based tooling] used to install and manage {product-title} clusters. They can report
+common deployment problems for the current {product-title} installation.


There needs to be a substantial up-front warning about using these checks. They are not without side effects; running them can make changes to the hosts. Mostly those changes would be installing dependencies so the checks can gather needed information, and so they shouldn't be of much concern. My main concern is that some of the roles from the installer are invoked as pre-requisites to gather information the checks require, and those roles don't have a concept of not changing anything important. They try to ensure the system components they deal with are configured consistently with the inventory, even though you're not running from an installer playbook. So if the admin didn't install with Ansible, or ran the install with different/extra variables specified than what's in the inventory file, or made manual config changes after installing, they could easily find that running the checks re-configures their hosts according to the inventory file. Networking and Docker (for instance) can be affected.

So the checks should only be used on clusters that have been deployed with Ansible and using the same inventory it was deployed with. I'm not sure how to say this in a way that isn't terrifying to the user. Basically, if you wouldn't run the install with your inventory (and expect it to change nothing) then don't run the checks with it either, because the run may perform (some of) the same changes.

I consider this a bug in the check runner 😢
The implications of how the checks are implemented are not the user's fault. We should fix it, since IIUC it did not start that way.

@sosiouxme How's this big ol warning:

http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#ansible-based-tooling-health-checks

@rhcarvalho Oops, didn't see your comment.

@rhcarvalho well that would involve decomposing roles into "informational" versus "transformational". With the caveat that even "informational" roles can install dependencies. It's probably worth doing but I don't think it will be quick.

@adellape That warning LGTM.

adellape · 2017-07-25T15:23:27Z

@openshift/team-documentation PTAL:

http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#additional-cluster-health-checks
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/scaling_performance/optimizing_compute_resources.html#scaling-performance-debugging-using-openshift-ansible

ahardin-rh · 2017-07-25T15:45:46Z

LGTM

sosiouxme · 2017-07-25T17:36:35Z

admin_guide/diagnostics_tool.adoc

+|`logging_index_time`
+|This check detects higher than normal time delays between log creation and log
+aggregation by Elasticsearch in a logging stack deployment. It fails if a
+user-defined timeout is reached before logs are able to be queried through


I wouldn't say "user-defined" here as most users will use the default. And it's good to mention the default. So something like:

It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds).

And below, I'd use a different configured timeout, e.g. 45 seconds.

sosiouxme · 2017-07-25T17:53:07Z

admin_guide/diagnostics_tool.adoc

+
+----
+# ansible-playbook -i <inventory_file> \
+    /usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml


This is the location for OCP... do we ship the installer RPM for Origin? I don't think we do. I don't want to complicate this but maybe for Origin we need to describe cloning the repo...

s/common/byo/

also s/common/byo/

I'll ifdef this for OCP vs Origin.

sosiouxme · 2017-07-25T18:18:31Z

admin_guide/diagnostics_tool.adoc

+<1> These options make the container run with the same UID as the current user,
+which is required for permissions so that the SSH key can be read inside the
+container (SSH private keys are expected to be readable only by their owner).
+<2> Mount SSH keys as a volume under *_/opt/app-root/src/.ssh_* under normal usage


The :Z requires some explanation too. As the repo docs explain:

Note that the ssh key is mounted with the :Z flag: this is also required so that the container can read the ssh key from its restricted SELinux context; this means that your original ssh key file will be re-labeled to something like system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z please check the docker-run(1) man page. Please keep this in mind when providing these volume mount specifications because this could have unexpected consequences: for example, if you mount (and therefore re-label) your whole $HOME/.ssh directory you will block sshd from accessing your keys. This is a reason why you might want to work on a separate copy of the ssh key, so that the original file's labels remain untouched.

ssh is exceedingly picky about what it accepts as keys, and having the wrong SELinux label on the file (which is virtually inevitable at least on the initial run) causes it to silently skip the key and blithely fail to connect without further explanation. It's a pretty infuriating user experience. This is also why it's so important to have the container user matching the user ID on the keyfile.

The mention of mounting a .ssh directory leads to yet more discussion; I don't like it but this probably has to be pretty verbose because this is what everyone will get hung up on adapting to their specific usage. You might want to mount a whole .ssh directory for various reasons, among them because you want to use an ssh config to match keys with hosts or tweak other connection parameters, or because you want to provide a known_hosts file and have ssh validate host keys (which is disabled by default config and can be re-enabled by setting the envvar -e ANSIBLE_HOST_KEY_CHECKING=True on the cmdline or a few more complicated ways).

Finally we should probably mention providing a vault password. Probably in a future update because frankly I know nothing about it yet.

sosiouxme · 2017-07-25T18:23:45Z

admin_guide/diagnostics_tool.adoc

+when running the container as a non-root user.
+<3> Change *_/etc/ansible/hosts_* to the location of your cluster's inventory file,
+if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
+used by the `INVENTORY_FILE` environment variable in the container.


Well, it's not used by it but I guess used according to it. Or as indicated by it?

The same need for :Z labeling can apply to this (or really anything that's mounted in). The label on /etc/ansible/hosts is typically suitable already for use in the container and they can get away without :Z relabeling. However if the inventory file is in a user home directory for instance, it will not be available to the user in the container and they will need to relabel it. I'm not sure there's a graceful way to explain this.

And we haven't even discussed dynamic inventory, which is a whole 'nother topic.

sosiouxme · 2017-07-25T18:45:29Z

admin_guide/diagnostics_tool.adoc

+if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
+used by the `INVENTORY_FILE` environment variable in the container.
+<3> The `PLAYBOOK_FILE` environment variable is set to the location of the
+*_health.yml_* playbook inside the container.


It's probably worth saying that this is the location relative to /usr/share/ansible/openshift-ansible inside the container. You could also fully specify e.g. PLAYBOOK_FILE=/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
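
As a rough sketch of the relative-versus-absolute behavior described above: the `resolve_playbook` function below is a hypothetical illustration, assuming only that relative values are taken relative to /usr/share/ansible/openshift-ansible. The image's actual entrypoint may resolve the path differently.

```shell
# Hypothetical helper for illustration; not the image's real entrypoint logic.
# Absolute paths are returned unchanged; relative paths are treated as
# relative to /usr/share/ansible/openshift-ansible (an assumption taken
# from the comment above).
resolve_playbook() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s\n' "/usr/share/ansible/openshift-ansible/$1" ;;
  esac
}

resolve_playbook playbooks/byo/openshift-checks/health.yml
resolve_playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml
```

Under that assumption, both calls print the same fully specified path.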

sosiouxme · 2017-07-25T18:51:33Z

admin_guide/diagnostics_tool.adoc

+used by the `INVENTORY_FILE` environment variable in the container.
+<3> The `PLAYBOOK_FILE` environment variable is set to the location of the
+*_health.yml_* playbook inside the container.
+<4> Set any desired variables that accept user-defined values in `key=value key=value` format.


All variables can be user-defined so that's kind of redundant. The example's clear enough but I'd suggest this say:

Set any variables desired for a single run with the -e key=value format.

Having two in there could be confusing.

sosiouxme · 2017-07-25T19:35:02Z

> @sosiouxme Do the other two instances need to be /byo as well?

Yes... yes, they all do. No one should be running /common playbooks directly.

sosiouxme · 2017-07-25T19:41:33Z

admin_guide/diagnostics_tool.adoc

+
+----
+# ansible-playbook -i <inventory_file> \
+    /usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \


s/common/byo/

mburke5678 · 2017-07-27T15:14:10Z

admin_guide/diagnostics_tool.adoc

+====
+Due to potential changes the health checks could make to hosts, they should only
+be used on clusters that have been deployed using Ansible and using the same
+inventory file with which it was deployed. Changes mostly involve installing


"deployed using Ansible and using the same inventory file with which it was deployed"
Is there a programmatic way to determine this?

Err... run the playbook and see if it changes anything?

Seeing what would be changed without making the changes is what the --check ansible flag is for but I honestly don't know how accurate or useful it is with as much custom stuff as openshift-ansible includes. Could be fine, could be broken, could even make changes anyway. I don't think it's a use case anyone's paying attention to.

rhcarvalho · 2017-08-01T13:55:32Z

@adellape how far do you think we are from getting it merged? :-)

sosiouxme

Sorry, still a few things to fix up.

sosiouxme · 2017-08-03T23:06:19Z

admin_guide/diagnostics_tool.adoc

+
+[WARNING]
+====
+Due to potential changes the health checks could make to hosts, they should only


Just a small quibble I missed before.

s/health checks/health check playbooks/

(the health checks themselves don't make the problem changes, the playbooks have dependencies that make them)

sosiouxme · 2017-08-04T11:19:57Z

admin_guide/diagnostics_tool.adoc

+
+These checks can be run either using the `ansible-playbook` command (the same
+method used during
+xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a


If we're going to link to github instructions in the next paragraph, it's probably worth mentioning that you can run it as a system container as well.

"using the Docker CLI or atomic..."

Or, go the other way and remain agnostic:

"... or as a containerized version of openshift-ansible." And "containerized method" below.

sosiouxme · 2017-08-04T13:04:12Z

admin_guide/diagnostics_tool.adoc

+exceeds 40 GB.
+
+|`etcd_traffic`
+|This check detects higher than normal traffic on an etcd host. It fails if a


nit: how about "higher-than-normal"?

sosiouxme · 2017-08-04T13:13:59Z

admin_guide/diagnostics_tool.adoc

-limit for total percent usage can be set with a variable in your inventory file:
-`max_thinpool_data_usage_percent=90`.
-|===
+threshold defaults to 90% of the total size available.


It also checks that docker storage is configured in a supportable way. Meaning, with a storage driver and backing storage device that are supported for usage with Docker.

In particular, the default configuration of Docker storage is not supported. I'm not sure how much to say about this; probably worth linking to any docs we have on Docker storage configuration.

sosiouxme · 2017-08-04T13:15:38Z

admin_guide/diagnostics_tool.adoc

-openshift_disable_check=ovs_version,etcd_volume
----
+|`kibana`, `curator`, `elasticsearch`, `fluentd`
+|This set of checks verifies that Elasticsearch, Fluentd, and Curator pods have


sosiouxme · 2017-08-04T13:20:25Z

admin_guide/diagnostics_tool.adoc

+xref:../install_config/aggregate_logging.adoc#install-config-aggregate-logging[cluster
+logging] has been enabled.
+
+|`logging_index_time`


This check similarly only runs if logging is enabled. Should it be grouped with the others? They're all related but the focus of this one is a little different, it's end-to-end...

sosiouxme · 2017-08-04T13:21:58Z

admin_guide/diagnostics_tool.adoc

+----
+
+To set variables in the command line, include the `-e` flag with any desired
+variables in `key=value key=value` format. For example:


would still prefer a single key=value here

sosiouxme · 2017-08-04T13:34:36Z

admin_guide/diagnostics_tool.adoc

+therefore relabel) your *_$HOME/.ssh_* directory, you will block *sshd*
+from accessing your keys. This is a reason why you might want to work on a
+separate copy of the SSH key, so that the original file's labels remain
+untouched.


I would reword this just a bit to be clearer about the problem and stronger in the suggested remedy.

Keep this in mind for these volume mount specifications because it could have unexpected consequences. For example, if you mount (and therefore relabel) your $HOME/.ssh directory, sshd will become unable to access your public keys to allow remote login. To avoid altering the original file labels, mounting a copy of the SSH key (or directory) is recommended.
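
One way the "mount a copy" recommendation might look in practice is sketched below. The `copy_key_for_container` helper and the paths in its comments are hypothetical; the point is only that the copy, not the original key, is what gets relabeled by the `:Z` mount.

```shell
# Hypothetical helper: copy an SSH private key so that only the copy is
# relabeled when it is mounted with :Z. Not part of openshift-ansible.
copy_key_for_container() {
  src="$1"    # original private key, for example $HOME/.ssh/id_rsa
  dest="$2"   # copy that will be mounted into the container
  # SSH expects private keys to be readable only by their owner.
  cp -- "$src" "$dest" && chmod 600 "$dest"
}
```

The copy would then be the host-side path in the `-v <copy>:/opt/app-root/src/.ssh/id_rsa:Z,ro` mount, leaving the labels on the original file untouched.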

sosiouxme · 2017-08-04T13:39:26Z

admin_guide/diagnostics_tool.adoc

+hosts or modify other connection parameters. It would also allow you to provide
+a *_known_hosts_* file and have SSH validate host keys, which is disabled by the
+default configuration and can be re-enabled by setting the `envvar -e
+ANSIBLE_HOST_KEY_CHECKING=True` on the command line.


This will be confusing as to whether it needs to be added on the docker command or in the OPTS var. It needs to be set on the docker command.

... and can be re-enabled with an environment variable by adding -e ANSIBLE_HOST_KEY_CHECKING=True to the docker command line.

adellape · 2017-08-11T18:06:15Z

@sosiouxme OK thanks, changes made, see latest commit.

adellape · 2017-08-15T19:48:00Z

@sosiouxme bump

sosiouxme

A few nits but LGTM

sosiouxme · 2017-08-16T00:41:39Z

scaling_performance/optimizing_compute_resources.adoc

-* Allow users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
-* Provide debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.
+* Allows users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
+* Provides debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.


s/packet/package/ here; this supplies tools that can't be installed as packages on AH. Not sure what an immutable packet tree would mean :)

sosiouxme · 2017-08-16T00:49:30Z

admin_guide/diagnostics_tool.adoc

+openshift_disable_check=etcd_traffic,etcd_volume
+----
+
+Alternatively, set any checks you want to disable as environment variables with


Technically these aren't environment variables... I think you can safely s/environment//

On a docker command line, -e sets an environment variable. On an ansible command line, -e sets an extra variable or inventory variable. They look the same but they're accessed in completely different ways...
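
The difference can be sketched with plain shell, without Docker or Ansible installed. Here `env` stands in for what docker's `-e` does (placing a value in the process environment), and the second command shows that an ansible-style `-e key=value` is only a command-line argument. The `sh -c` child processes are illustrative stand-ins, not how either tool is implemented.

```shell
# docker-style -e: the value lands in the environment of the child process.
env ANSIBLE_HOST_KEY_CHECKING=True \
  sh -c 'echo "environment: ${ANSIBLE_HOST_KEY_CHECKING:-unset}"'

# ansible-style -e: the value is just an argument the program must parse
# itself; nothing is added to the process environment.
sh -c 'echo "arguments: $*"; echo "environment: ${openshift_disable_check:-unset}"' sh \
  -e openshift_disable_check=etcd_traffic
```

The first command prints `environment: True`; the second prints the raw arguments followed by `environment: unset`.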

sosiouxme · 2017-08-16T01:02:39Z

admin_guide/diagnostics_tool.adoc

+=== Running Health Checks via Docker CLI
+
+It is possible a playbook may require dependencies that are not installed
+locally on the host running the `ansible-playbook` command. You can avoid this


Yeah, dependencies like Ansible and the playbooks :) Other dependencies aren't really much of a problem for this playbook, so I don't know that I'd put this as the motivation. The purpose of the image is to be able to run playbooks on a system with nothing more than Docker. You could even run it on a Mac or Atomic Host.

Not a big deal, but maybe something like:

"It is possible to run the openshift-ansible playbooks in a Docker container, avoiding the need for installing and configuring Ansible, on any host that can run the {ose,origin}-ansible image via the Docker CLI."

sosiouxme · 2017-08-16T01:06:27Z

It would be nice to have these changes made at least for 3.6.1 if not before...

adellape · 2017-08-16T19:56:43Z

@sosiouxme Thanks! Changes made and I'll merge after Travis finishes. It will get published w/ next Monday's release.

adellape · 2017-08-16T20:08:06Z

[rev_history]
|xref:../admin_guide/diagnostics_tool.adoc#admin-guide-diagnostics-tool[Diagnostics Tool]
|Enhanced the xref:../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks] section with information on running via ansible-playbook or Docker CLI.
%
|xref:../scaling_performance/optimizing_compute_resources.adoc#scaling-performance-compute-resources[Optimizing Compute Resources]
|Added the xref:../scaling_performance/optimizing_compute_resources.adoc#scaling-performance-debugging-using-ansible[Debugging Using Ansible-based Health Checks] section.
%

adellape force-pushed the scaling_preinstall branch from dba9b7f to b6ce32a Compare July 5, 2017 21:43

adellape mentioned this pull request Jul 5, 2017

add mention of openshift-ansible image in Scaling and Performance Guide #4579

Closed

adellape changed the title ~~Updated diagnostic_tools to include direct docker usage~~ Updated diagnostic_tools for openshift-ansible image Jul 5, 2017

adellape force-pushed the scaling_preinstall branch from b6ce32a to 69e97c0 Compare July 5, 2017 21:54

adellape added the branch/enterprise-3.6 label Jul 5, 2017

adellape added this to the Future Release milestone Jul 5, 2017

sosiouxme requested changes Jul 6, 2017

rhcarvalho reviewed Jul 7, 2017

adellape force-pushed the scaling_preinstall branch from 69e97c0 to 7544eb9 Compare July 12, 2017 20:32

juanvallejo approved these changes Jul 18, 2017

rhcarvalho reviewed Jul 24, 2017

sosiouxme reviewed Jul 24, 2017

adellape force-pushed the scaling_preinstall branch from 7544eb9 to 7d1f6c6 Compare July 25, 2017 15:22

sosiouxme reviewed Jul 25, 2017

mburke5678 reviewed Jul 27, 2017

sosiouxme reviewed Aug 4, 2017

adellape force-pushed the scaling_preinstall branch from 27e3da9 to c356e2a Compare August 11, 2017 18:05

adellape modified the milestones: Next Release, Future Release Aug 14, 2017

sosiouxme approved these changes Aug 16, 2017

Updated diagnostic_tools for openshift-ansible image

4aeec01

adellape force-pushed the scaling_preinstall branch from c356e2a to 4aeec01 Compare August 16, 2017 19:53

adellape merged commit bcdd31b into openshift:master Aug 16, 2017

bfallonf modified the milestones: Next Release, Staging, Published 25/08/2017 Aug 22, 2017

vikram-redhat modified the milestones: Published 25/08/2017, Staging Sep 24, 2017

adellape deleted the scaling_preinstall branch November 9, 2017 20:33

adellape restored the scaling_preinstall branch November 9, 2017 21:08

adellape deleted the scaling_preinstall branch November 9, 2017 21:09

adellape restored the scaling_preinstall branch November 10, 2017 13:57

adellape deleted the scaling_preinstall branch November 10, 2017 14:03

