Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated diagnostic_tools for openshift-ansible image #4713

Merged
merged 1 commit into from
Aug 16, 2017

Conversation

adellape
Copy link
Contributor

@adellape adellape commented Jul 5, 2017

Reworks #4579 and moves most of the content to the existing admin_guide/diagnostic_tools section, and adds subsections for ansible-playbook vs direct docker usage. Still adds mention of the openshift-ansible health checks to the scaling guide, but mostly links to the admin_guide for the details and usage. Still adds in the previously-undocumented checks (logging_index_time, etcd_traffic).

Preview:

http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/admin_guide/diagnostics_tool.html#additional-cluster-health-checks
http://file.rdu.redhat.com/~adellape/070517/scaling_preinstall/scaling_performance/optimizing_compute_resources.html#scaling-performance-debugging-using-openshift-ansible

@adellape adellape changed the title Updated diagnostic_tools to include direct docker usage Updated diagnostic_tools for openshift-ansible image Jul 5, 2017
@adellape
Copy link
Contributor Author

adellape commented Jul 5, 2017

PTAL @juanvallejo @sosiouxme @rhcarvalho

@adellape
Copy link
Contributor Author

adellape commented Jul 5, 2017


----
# docker run -u `id -u` \
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was musing with Juan we should add a callout explaining what's going on here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... there's a fairly detailed explanation of what's going on at https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#usage

As usual you have to weigh how much to say about all this... I have to say that the ssh key permissions are particularly fragile and quite difficult to debug when they're not right (ssh is really persnickety and not very informative unless you turn debug up to 11), so we really want to guide them down a single path, not make them confident in how to do it differently. We definitely need to mention not running this as root, that's likely to be a nasty surprise.

-v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
-e INVENTORY_FILE=/tmp/inventory \
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2>
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" \ <3>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure you need a second -e for the second parameter.

-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000" \

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW are the callouts here going to interfere with copy/paste in the rendered version? Because they're really going to want to copy/paste this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure you need a second -e for the second parameter.

👍

Or quotes, so that everything after -e goes as a single shell argument.

-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
-e INVENTORY_FILE=/tmp/inventory \
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be playbooks/byo/openshift-checks/health.yml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sosiouxme Do the other two instances need to be /byo as well?


----
# docker run -u `id -u` \
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... there's a fairly detailed explanation of what's going on at https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md#usage

As usual you have to weigh how much to say about all this... I have to say that the ssh key permissions are particularly fragile and quite difficult to debug when they're not right (ssh is really persnickety and not very informative unless you turn debug up to 11), so we really want to guide them down a single path, not make them confident in how to do it differently. We definitely need to mention not running this as root, that's likely to be a nasty surprise.

[[openshift-ansible-health-checks]]
== openshift-ansible Health Checks

Additional diagnostic health checks are available through *openshift-ansible*,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably a bit confusing that we refer to it as openshift-ansible here, but later it has a different RPM name and a different image name. I'm not sure there's anything to be done about it, though... you have to call it something and origin/OCP have different names for everything.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use a conditional for the name, or in this particular case we could possible avoid giving it a name altogether:

Additional diagnostic health checks are available through the Ansible-based tooling used to install and manage {product-title} clusters.

xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a
link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI
method,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the OCP RPM name... I don't think we ship an RPM for Origin. I think we expect people to run out of a git clone.

----
# docker run -u `id -u` \
-v $HOME/.ssh/id_rsa:/opt/app-root/src/.ssh/id_rsa:Z,ro \
-v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out this needs a :Z as well (at least in my test today)

-v /etc/ansible/hosts:/tmp/inventory:Z,ro \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you. Wondering why we don't have it in README_CONTAINER_IMAGE.md. The ro is also "new" here, makes sense. We may update README_CONTAINER_IMAGE.md to match.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't need :Z here in my tests, and ro would be covered by standard file perms (we're running as non-root by default). Considering that :Z would relabel the file on the host, better to leave it out if not needed:

# ls -Z /etc/ansible/hosts
-rw-r--r--. root root unconfined_u:object_r:net_conf_t:s0 /etc/ansible/hosts
# sesearch -A -s svirt_lxc_net_t -t net_conf_t
Found 2 semantic av rules:
   allow svirt_sandbox_domain file_type : filesystem getattr ; 
   allow svirt_sandbox_domain file_type : dir { getattr search open } ; 

If we do need it let's add it... just mentioning that I think it's worth double checking that it's necessary...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it alone for now, then.

A user-defined timeout may be set by passing the
`openshift_check_logging_index_timeout_seconds` variable. For example, setting
`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail
if a newly-created Kibana log is not able to be queried via Elasticsearch after
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Kibana log" is confusing here and it's a detail they don't need. Suggest: "log entry"

method used during
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a
link:https://github.com/openshift/openshift-ansible/blob/master/README_CONTAINER_IMAGE.md[containerized version] of *openshift-ansible*. For the `ansible-playbook` method, the checks
are provided with the *atomic-openshift-utils* RPM package. For the Docker CLI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again atomic-openshift-utils is OCP only

link:https://registry.access.redhat.com[Red Hat Container Registry].
endif::[]
ifdef::openshift-origin[]
the *openshift/origin-ansible* container image is distirbuted via Docker Hub.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distributed

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the content duplication intentional? This paragraph is the same in two files (typo included).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-did this so that I'm re-using the shared content w/ an include::.

----
# ansible-playbook -i <inventory_file> \
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a -e per param

-e openshift_check_logging_index_timeout_seconds=30 -e etcd_max_image_data_size_bytes=40000000000

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested just to be sure. We can use a single -e if the argument is quoted (because of spaces).

-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" is also correct.

And so is using single quotes (avoids shell expansion):

-e 'openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making it separate -e entries on separate lines (without quotes) for readability.

The following health checks belong to a diagnostic task meant to be run against
the Ansible inventory file for a deployed {product-title} cluster. They can
report common problems for the current {product-title} installation.
[[openshift-ansible-health-checks]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the anchor will break existing links, is it worth it?

Copy link
Contributor Author

@adellape adellape Jul 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was only one xref in the docs repo and it's updated in this PR. This content has only appeared so far in the Origin docs, so not too worried about links out in the wild.

the Ansible inventory file for a deployed {product-title} cluster. They can
report common problems for the current {product-title} installation.
[[openshift-ansible-health-checks]]
== openshift-ansible Health Checks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example usage for each method are provided in subsequent sections.

The following health checks are a set of diagnostic tasks that run as part of
the *openshift_health_checker* Ansible role. They are meant to be run against

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The role name is internal implementation detail, should not be in the documentation.


|`logging_index_time`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this is implemented in openshift/openshift-ansible#4682, not merged yet as of now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged today

Copy link
Member

@sosiouxme sosiouxme Jul 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... meaning it will probably be in 3.6.1 not 3.6.0

if a newly-created Kibana log is not able to be queried via Elasticsearch after
30 seconds.

|`ovs_version`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't exist. We use a more generic package_version check to check versions of multiple packages, including Open vSwitch.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that was my fault. I believe I initially started the Open vSwitch check in a check of its own

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, will remove.


To disable specific checks, include the variable `openshift_disable_check` with
a comma-delimited list of check names in your inventory file before running the
playbook. For example:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it doesn't need to be in the inventory file, it can be with a -e flag as well (among the other ways to set variables in Ansible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding mention of that method as well.

openshift_disable_check=ovs_version,etcd_volume
----

To set variables that accept user-define values, include the `-e` flag with any

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

user-defined

Or rather:

To set variables in the command-line, include ...

----
# ansible-playbook -i <inventory_file> \
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \
-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested just to be sure. We can use a single -e if the argument is quoted (because of spaces).

-e "openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" is also correct.

And so is using single quotes (avoids shell expansion):

-e 'openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000'

-v /etc/ansible/hosts:/tmp/inventory:ro \ <1>
-e INVENTORY_FILE=/tmp/inventory \
-e PLAYBOOK_FILE=playbooks/common/openshift-checks/health.yml \ <2>
-e OPTS="-v -e openshift_check_logging_index_timeout_seconds=30 etcd_max_image_data_size_bytes=40000000000" \ <3>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure you need a second -e for the second parameter.

👍

Or quotes, so that everything after -e goes as a single shell argument.

link:https://registry.access.redhat.com[Red Hat Container Registry].
endif::[]
ifdef::openshift-origin[]
the *openshift/origin-ansible* container image is distirbuted via Docker Hub.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the content duplication intentional? This paragraph is the same in two files (typo included).

@juanvallejo
Copy link

ping @rhcarvalho or @sosiouxme wondering if there are any more comments on this?


|`logging_index_time`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged today

established between the control host and the exposed Kibana URL. These checks
will only run if the `openshift_hosted_logging_deploy` inventory variable is set
to `true`, to ensure that they are executed in a deployment where a logging
stack has been deployed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have some link in this paragraph tying back to the logging docs, or is it clear what "logging stack" we're referring to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could. And actually the name could use some standardization too. I see it called variously "cluster logging", "aggregated logging", "the logging stack", and "the EFK stack".

https://docs.openshift.org/latest/install_config/aggregate_logging.html is probably what we want to link to. I don't have a strong preference on standard name but "cluster logging" seems straightforward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

`openshift_check_logging_index_timeout_seconds` variable. For example, setting
`openshift_check_logging_index_timeout_seconds=30` will cause the check to fail
if a newly-created log entry is not able to be queried via Elasticsearch after
30 seconds.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd consider that more important than knowing the variable name would be to explain when one should bother to change the default? I think we're missing that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sosiouxme When should one bother?

Copy link
Member

@sosiouxme sosiouxme Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we need the long variable name twice. Don't like "is not able to be". And to Rodolfo's point:

Users that either require lower-latency log aggregation or are comfortable with higher latency may adjust this timeout with an Ansible variable. For example, setting openshift_check_logging_index_timeout_seconds=45 relaxes the timeout to 45 seconds.

openshift_disable_check=etcd_traffic,etcd_volume
----

Alternatively, set any checks you want to disable as environment variables with `-e openshift_disable_check` when running the `ansible-playbook` command.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be misleading, it should be -e openshift_disable_check=name1,name2,....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So "an" environment variable yes?

// tag::ansible-based-health-checks-intro[]
Additional diagnostic health checks are available through the
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Ansible-based tooling] used to install and manage {product-title} clusters. They can report
common deployment problems for the current {product-title} installation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be a substantial up-front warning about using these checks. They are not without side effects; running them can make changes to the hosts. Mostly those changes would be installing dependencies so the checks can gather needed information, and so they shouldn't be of much concern. My main concern is that some of the roles from the installer are invoked as pre-requisites to gather information the checks require, and those roles don't have a concept of not changing anything important. They try to ensure the system components they deal with are configured consistently with the inventory, even though you're not running from an installer playbook. So if the admin didn't install with Ansible, or ran the install with different/extra variables specified than what's in the inventory file, or made manual config changes after installing, they could easily find that running the checks re-configures their hosts according to the inventory file. Networking and Docker (for instance) can be affected.

So the checks should only be used on clusters that have been deployed with Ansible and using the same inventory it was deployed with. I'm not sure how to say this in a way that isn't terrifying to the user. Basically, if you wouldn't run the install with your inventory (and expect it to change nothing) then don't run the checks with it either, because the run may perform (some of) the same changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consider this a bug in the check runner 😢
It is not the user's fault the implications of how the checks are implemented. We should fix it, since IIUC it did not start that way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhcarvalho Oops, didn't see your comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhcarvalho well that would involve decomposing roles into "informational" versus "transformational". With the caveat that even "informational" roles can install dependencies. It's probably worth doing but I don't think it will be quick.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adellape That warning LGTM.

@ahardin-rh
Copy link
Contributor

LGTM :shipit:

|`logging_index_time`
|This check detects higher than normal time delays between log creation and log
aggregation by Elasticsearch in a logging stack deployment. It fails if a
user-defined timeout is reached before logs are able to be queried through
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't say "user-defined" here as most users will use the default. And it's good to mention the default. So something like:

It fails if a new log entry cannot be queried through Elasticsearch within a timeout (by default, 30 seconds).

And below, I'd use a different configured timeout, e.g. 45 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


----
# ansible-playbook -i <inventory_file> \
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the location for OCP... do we ship the installer RPM for Origin? I don't think we do. I don't want to complicate this but maybe for Origin we need to describe cloning the repo...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/common/byo/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also s/common/byo/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll ifdef this for OCP vs Origin.

<1> These options make the container run with the same UID as the current user,
which is required for permissions so that the SSH key can be read inside the
container (SSH private keys are expected to be readable only by their owner).
<2> Mount SSH keys as a volume under *_/opt/app-root/src/.ssh_* under normal usage
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The :Z requires some explanation too. As the repo docs explain:

Note that the ssh key is mounted with the :Z flag: this is also required so that the container can read the ssh key from its restricted SELinux context; this means that your original ssh key file will be re-labeled to something like system_u:object_r:container_file_t:s0:c113,c247. For more details about :Z please check the docker-run(1) man page. Please keep this in mind when providing these volume mount specifications because this could have unexpected consequences: for example, if you mount (and therefore re-label) your whole $HOME/.ssh directory you will block sshd from accessing your keys. This is a reason why you might want to work on a separate copy of the ssh key, so that the original file's labels remain untouched.

ssh is exceedingly picky about what it accepts as keys, and having the wrong SELinux label on the file (which is virtually inevitable at least on the initial run) causes it to silently skip the key and blithely fail to connect without further explanation. It's a pretty infuriating user experience. This is also why it's so important to have the container user matching the user ID on the keyfile.

The mention of mounting a .ssh directory leads to yet more discussion; I don't like it but this probably has to be pretty verbose because this is what everyone will get hung up on adapting to their specific usage. You might want to mount a whole .ssh directory for various reasons, among them because you want to use an ssh config to match keys with hosts or tweak other connection parameters, or because you want to provide a known_hosts file and have ssh validate host keys (which is disabled by default config and can be re-enabled by setting the envvar -e ANSIBLE_HOST_KEY_CHECKING=True on the cmdline or a few more complicated ways).

Finally we should probably mention providing a vault password. Probably in a future update because frankly I know nothing about it yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

when running the container as a non-root user.
<3> Change *_/etc/ansible/hosts_* to the location of your cluster's inventory file,
if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
used by the `INVENTORY_FILE` environment variable in the container.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not used by it but I guess used according to it. Or as indicated by it?

The same need for :Z labeling can apply to this (or really anything that's mounted in). The label on /etc/ansible/hosts is typically suitable already for use in the container and they can get away without :Z relabeling. However if the inventory file is in a user home directory for instance, it will not be available to the user in the container and they will need to relabel it. I'm not sure there's a graceful way to explain this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we haven't even discussed dynamic inventory, which is a whole 'nother topic.

if different. This file will be bind-mounted to *_/tmp/inventory_*, which is
used by the `INVENTORY_FILE` environment variable in the container.
<3> The `PLAYBOOK_FILE` environment variable is set to the location of the
*_health.yml_* playbook inside the container.
Copy link
Member

@sosiouxme sosiouxme Jul 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably worth saying that this is the location relative to /usr/share/ansible/openshift-ansible inside the container. You could also fully specify e.g. PLAYBOOK_FILE=/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-checks/health.yml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

used by the `INVENTORY_FILE` environment variable in the container.
<3> The `PLAYBOOK_FILE` environment variable is set to the location of the
*_health.yml_* playbook inside the container.
<4> Set any desired variables that accept user-defined values in `key=value key=value` format.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All variables can be user-defined so that's kind of redundant. The example's clear enough but I'd suggest this say:

Set any variables desired for a single run with the -e key=value format.

Having two in there could be confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@sosiouxme
Copy link
Member

sosiouxme commented Jul 25, 2017 via email


----
# ansible-playbook -i <inventory_file> \
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-checks/health.yml \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/common/byo/

====
Due to potential changes the health checks could make to hosts, they should only
be used on clusters that have been deployed using Ansible and using the same
inventory file with which it was deployed. Changes mostly involve installing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"deployed using Ansible and using the same inventory file with which it was deployed"
Is there a programmatic way to determine this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Err... run the playbook and see if it changes anything?

Seeing what would be changed without making the changes is what the --check ansible flag is for but I honestly don't know how accurate or useful it is with as much custom stuff as openshift-ansible includes. Could be fine, could be broken, could even make changes anyway. I don't think it's a use case anyone's paying attention to.

@rhcarvalho
Copy link

@adellape how far do you think we are from getting it merged? :-)

Copy link
Member

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, still a few things to fix up.


[WARNING]
====
Due to potential changes the health checks could make to hosts, they should only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small quibble I missed before.

s/health checks/health check playbooks/

(the health checks themselves don't make the problem changes, the playbooks have dependencies that make them)


These checks can be run either using the `ansible-playbook` command (the same
method used during
xref:../install_config/install/advanced_install.adoc#install-config-install-advanced-install[Advanced Installation]) or using the Docker CLI to directly run a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to link to github instructions in the next paragraph, it's probably worth mentioning that you can run it as a system container as well.

"using the Docker CLI or atomic..."

Or, go the other way and remain agnostic:

"... or as a containerized version of openshift-ansible." And "containerized method" below.

exceeds 40 GB.

|`etcd_traffic`
|This check detects higher than normal traffic on an etcd host. It fails if a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about "higher-than-normal"?

limit for total percent usage can be set with a variable in your inventory file:
`max_thinpool_data_usage_percent=90`.
|===
threshold defaults to 90% of the total size available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also checks that docker storage is configured in a supportable way. Meaning, with a storage driver and backing storage device that are supported for usage with Docker.

In particular, the default configuration of Docker storage is not supported. I'm not sure how much to say about this; probably worth linking to any docs we have on Docker storage configuration.

openshift_disable_check=ovs_version,etcd_volume
----
|`kibana`, `curator`, `elasticsearch`, `fluentd`
|This set of checks verifies that Elasticsearch, Fluentd, and Curator pods have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and Kibana

xref:../install_config/aggregate_logging.adoc#install-config-aggregate-logging[cluster
logging] has been enabled.

|`logging_index_time`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check similarly only runs if logging is enabled. Should it be grouped with the others? They're all related but the focus of this one is a little different, it's end-to-end...

----

To set variables in the command line, include the `-e` flag with any desired
variables in `key=value key=value` format. For example:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would still prefer a single key=value here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops.

therefore relabel) your *_$HOME/.ssh_* directory, you will block *sshd*
from accessing your keys. This is a reason why you might want to work on a
separate copy of the SSH key, so that the original file's labels remain
untouched.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reword this just a bit to be clearer about the problem and stronger in the suggested remedy.

Keep this in mind for these volume mount specifications because it could have unexpected consequences. For example, if you mount (and therefore relabel) your $HOME/.ssh directory, sshd will become unable to access your public keys to allow remote login. To avoid altering the original file labels, mounting a copy of the SSH key (or directory) is recommended.

hosts or modify other connection parameters. It would also allow you to provide
a *_known_hosts_* file and have SSH validate host keys, which is disabled by the
default configuration and can be re-enabled by setting the `envvar -e
ANSIBLE_HOST_KEY_CHECKING=True` on the command line.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be confusing as to whether it needs to be added on the docker command or in the OPTS var. It needs to be set on the docker command.

... and can be re-enabled with an environment variable by adding -e ANSIBLE_HOST_KEY_CHECKING=True to the docker command line.

@adellape
Copy link
Contributor Author

@sosiouxme OK thanks, changes made, see latest commit.

@adellape adellape modified the milestones: Next Release, Future Release Aug 14, 2017
@adellape
Copy link
Contributor Author

@sosiouxme bump

Copy link
Member

@sosiouxme sosiouxme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits but LGTM

* Allow users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
* Provide debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.
* Allows users to deploy minimal footprint container hosts by moving packages out of the base distribution and into this support container.
* Provides debugging capabilities for Red Hat Enterprise Linux 7 Atomic Host, which has an immutable packet tree. *rhel-tools* includes utilities such as tcpdump, sosreport, git, gdb, perf, and many more common system administration utilities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/packet/package/ here; this supplies tools that can't be installed as packages on AH. Not sure what an immutable packet tree would mean :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😊

openshift_disable_check=etcd_traffic,etcd_volume
----

Alternatively, set any checks you want to disable as environment variables with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically these aren't environment variables... I think you can safely s/environment//

On a docker command line, -e sets an environment variable. On an ansible command line, -e sets an extra variable or inventory variable. They look the same but they're accessed in completely different ways...

=== Running Health Checks via Docker CLI

It is possible a playbook may require dependencies that are not installed
locally on the host running the `ansible-playbook` command. You can avoid this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, dependencies like Ansible and the playbooks :) Other dependencies aren't really much of a problem for this playbook, so I don't know that I'd put this as the motivation. The purpose of the image is to be able to run playbooks on a system with nothing more than Docker. You could even run it on a Mac or Atomic Host.

Not a big deal, but maybe something like:

"It is possible to run the openshift-ansible playbooks in a Docker container, avoiding the need for installing and configuring Ansible, on any host that can run the {ose,origin}-ansible image via the Docker CLI."

@sosiouxme
Copy link
Member

It would be nice to have these changes made at least for 3.6.1 if not before...

@adellape
Copy link
Contributor Author

@sosiouxme Thanks! Changes made and I'll merge after Travis finishes. It will get published w/ next Monday's release.

@adellape
Copy link
Contributor Author

adellape commented Aug 16, 2017

[rev_history]
|xref:../admin_guide/diagnostics_tool.adoc#admin-guide-diagnostics-tool[Diagnostics Tool]
|Enhanced the xref:../admin_guide/diagnostics_tool.adoc#ansible-based-tooling-health-checks[Ansible-based Health Checks] section with information on running via ansible-playbook or Docker CLI.
%
|xref:../scaling_performance/optimizing_compute_resources.adoc#scaling-performance-compute-resources[Optimizing Compute Resources]
|Added the xref:../scaling_performance/optimizing_compute_resources.adoc#scaling-performance-debugging-using-ansible[Debugging Using Ansible-based Health Checks] section.
%

@adellape adellape merged commit bcdd31b into openshift:master Aug 16, 2017
@bfallonf bfallonf modified the milestones: Next Release, Staging, Published 25/08/2017 Aug 22, 2017
@vikram-redhat vikram-redhat modified the milestones: Published 25/08/2017, Staging Sep 24, 2017
@adellape adellape deleted the scaling_preinstall branch November 9, 2017 20:33
@adellape adellape restored the scaling_preinstall branch November 9, 2017 21:08
@adellape adellape deleted the scaling_preinstall branch November 9, 2017 21:09
@adellape adellape restored the scaling_preinstall branch November 10, 2017 13:57
@adellape adellape deleted the scaling_preinstall branch November 10, 2017 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants