Nomis: DSOS-2318: add textfile monitoring role #389

Merged: 2 commits, Nov 7, 2023
@@ -59,7 +59,7 @@ while sleep "$INTERVAL"; do
if [[ "$SIDS" != "None" ]]; then
for SID in $(get_sids); do
db_connected $SID >/dev/null 2>&1
- echo "PUTVAL $HOSTNAME/exec-db_connected/bool-$SID interval=$INTERVAL N:$?"
+ echo "PUTVAL $HOSTNAME/oracle_db_connected/exitcode-$SID interval=$INTERVAL N:$?"
done
fi
done
8 changes: 7 additions & 1 deletion ansible/roles/collectd-service-metrics/README.md
@@ -18,6 +18,12 @@ Intro to Collectd networking [here](https://collectd.org/wiki/index.php/Networki

## Finding metrics in Cloudwatch

-Metrics collected by the Cloudwatch agent will appear in the 'metrics' panel under the CWAgent namespace as <cloudwatch_agent_config/metrics/metrics_collected/collectd/name_prefix>_<collectd_plugin_name>_value e.g. collectd_cpu_value, collectd_wlsadminserver_value, collectd_amazonssmagent_value etc.
+Metrics collected by the Cloudwatch agent will appear in the 'metrics' panel under the CWAgent namespace
+
+```
+metric: collectd_service_status_value
+type: exitcode
+type_instance: Name of service, e.g. amazonssmagent
+```

Cloudwatch metrics are easily filtered by instance_id so you can see all the metrics for a particular instance.
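
As a rough illustration of how the identifier in a `PUTVAL` line relates to the three fields listed above, the sketch below splits `host/plugin/type-type_instance` into its parts and builds the expected metric name. The function name `putval_to_metric` and the default prefix `collectd` are assumptions for illustration, and the `_value` suffix is assumed to apply only to types whose single data source is called `value` (such as `exitcode`).

```shell
#!/bin/sh
# Hypothetical helper, not part of the role.
# Splits a collectd identifier "host/plugin/type-type_instance" and prints
# the fields described above. Assumes the plugin name contains no "/" and
# the type name contains no "-".
putval_to_metric() {
    identifier=$1
    prefix=${2:-collectd}          # assumed name prefix, without trailing "_"

    rest=${identifier#*/}          # drop "host/"
    plugin=${rest%%/*}             # text before the next "/"
    type_part=${rest#*/}           # "type-type_instance"
    type=${type_part%%-*}          # text before the first "-"
    type_instance=${type_part#*-}  # text after the first "-"

    printf 'metric: %s_%s_value\n' "$prefix" "$plugin"
    printf 'type: %s\n' "$type"
    printf 'type_instance: %s\n' "$type_instance"
}

putval_to_metric "localhost/service_status/exitcode-amazonssmagent"
```

Running it on the sample identifier prints the same three fields as the block above, with `amazonssmagent` as the `type_instance`.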
@@ -9,10 +9,10 @@ INTERVAL="${INTERVAL:-{{ collectd_script_interval }}}"
while sleep "$INTERVAL"; do
{% for item in collectd_monitored_services_role %}
({{ item.shell_cmd }}) >/dev/null 2>&1
- echo "PUTVAL $HOSTNAME/{{ item.metric_name }}/bool interval=$INTERVAL N:$?"
+ echo "PUTVAL $HOSTNAME/service_status/exitcode-{{ item.metric_name }} interval=$INTERVAL N:$?"
{% endfor %}
{% for item in collectd_monitored_services_servertype %}
({{ item.shell_cmd }}) >/dev/null 2>&1
- echo "PUTVAL $HOSTNAME/{{ item.metric_name }}/bool interval=$INTERVAL N:$?"
+ echo "PUTVAL $HOSTNAME/service_status/exitcode-{{ item.metric_name }} interval=$INTERVAL N:$?"
{% endfor %}
done
30 changes: 30 additions & 0 deletions ansible/roles/collectd-textfile-monitoring/README.md
@@ -0,0 +1,30 @@
# Role to import metrics from text files via collectd

This is similar to the Prometheus solution, where values are imported from a text file
populated by another process. By default, the same directory is used:

```
/opt/textfile_monitoring/
```

This role does not create the directory; it is assumed that another role
will create it with the correct permissions. It needs to be readable by
`ec2-user`.

Each line of a file should contain a metric name and a numeric value separated by whitespace, e.g.

```
$ cat /opt/textfile_monitoring/nomis_batch_monitoring.prom

nomis_batch_failure_status 0
```

This will create two metrics:

```
Metric                                type      type_instance
collectd_textfile_monitoring_seconds  duration  nomis_batch_failure_status
collectd_textfile_monitoring_value    gauge     nomis_batch_failure_status
```

The `seconds` metric is the number of seconds since the file was last modified.
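
A process that populates the directory might write its file along these lines. This is a minimal sketch, not part of the role: the function name `write_metric` and the write-to-temp-then-rename approach are assumptions, chosen so that a reader of the directory never sees a half-written file. The temporary file name starts with a dot, so a `*` glob does not pick it up.

```shell
#!/bin/sh
# Hypothetical producer, not part of the role.
# Usage: write_metric DIR FILE NAME VALUE
write_metric() {
    dir=$1 file=$2 name=$3 value=$4
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    printf '%s %s\n' "$name" "$value" > "$tmp"
    chmod 0644 "$tmp"         # mktemp creates the file 0600; the reader needs access
    mv "$tmp" "$dir/$file"    # a rename within one filesystem replaces the file in one step
}

# Example using a scratch directory instead of /opt/textfile_monitoring
demo_dir=$(mktemp -d)
write_metric "$demo_dir" nomis_batch_monitoring.prom nomis_batch_failure_status 0
cat "$demo_dir/nomis_batch_monitoring.prom"
```

Because the rename also updates the modification time, the `seconds` metric restarts from zero each time the producer runs.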
@@ -0,0 +1,6 @@
---
collectd_script_path: /usr/local/bin
collectd_script_name: collectd_textfile_monitoring
collectd_script_user: ec2-user
collectd_script_interval: 60
collectd_textfile_monitoring_paths: /opt/textfile_monitoring/*
10 changes: 10 additions & 0 deletions ansible/roles/collectd-textfile-monitoring/handlers/main.yml
@@ -0,0 +1,10 @@
---
- name: restart collectd
  ansible.builtin.service:
    name: collectd
    state: restarted

- name: restart plugin script
  ansible.builtin.shell: |
    pkill -u {{ collectd_script_user }} -f {{ collectd_script_path }}/{{ collectd_script_name }}.sh
  failed_when: false
4 changes: 4 additions & 0 deletions ansible/roles/collectd-textfile-monitoring/meta/main.yml
@@ -0,0 +1,4 @@
---
dependencies:
  - role: get-ec2-facts
  - role: amazon-cloudwatch-agent-collectd
@@ -0,0 +1,18 @@
---
- name: copy collectd config
  ansible.builtin.template:
    src: "{{ collectd_script_name }}.conf.j2"
    dest: "/etc/collectd.d/{{ collectd_script_name }}.conf"
    owner: root
    mode: 0644
  notify:
    - restart collectd

- name: copy collectd plugin script
  ansible.builtin.template:
    src: "{{ collectd_script_name }}.sh.j2"
    dest: "{{ collectd_script_path }}/{{ collectd_script_name }}.sh"
    owner: root
    mode: 0755
  notify:
    - restart plugin script
6 changes: 6 additions & 0 deletions ansible/roles/collectd-textfile-monitoring/tasks/main.yml
@@ -0,0 +1,6 @@
---
- import_tasks: configure_collectd.yml
  tags:
    - ec2provision
    - ec2patch
  when: ansible_distribution in ['RedHat', 'OracleLinux']
@@ -0,0 +1,4 @@
LoadPlugin exec
<Plugin exec>
  Exec "{{ collectd_script_user }}" "{{ collectd_script_path }}/{{ collectd_script_name }}.sh"
</Plugin>
@@ -0,0 +1,27 @@
#!/bin/bash
# Managed by collectd-textfile-monitoring ansible role
# If manually editing, just kill script and collectd will respawn
# e.g. pkill -u {{ collectd_script_user }} -f {{ collectd_script_path }}/{{ collectd_script_name }}.sh

HOSTNAME="${HOSTNAME:-localhost}"
INTERVAL="${INTERVAL:-{{ collectd_script_interval }}}"

while sleep "$INTERVAL"; do
  now=$(date +%s)
  for file in {{ collectd_textfile_monitoring_paths }}; do
{% raw %}
    # Keep only "name value" lines, one array element per line
    IFS=$'\n'
    metrics=($(grep -E "^[[:alnum:]_]+[[:space:]]+[[:digit:]]+" "$file"))
    unset IFS
    file_last_modified=$(date -r "$file" +%s)
    secs_since_last_modified=$((now - file_last_modified))

    num_metrics=${#metrics[@]}
    for ((i=0; i<num_metrics; i++)); do
      # Split the line on whitespace: metric[0] is the name, metric[1] the value
      metric=(${metrics[i]})
      echo "PUTVAL $HOSTNAME/textfile_monitoring/gauge-${metric[0]} interval=$INTERVAL N:${metric[1]}"
      echo "PUTVAL $HOSTNAME/textfile_monitoring/duration-${metric[0]} interval=$INTERVAL N:${secs_since_last_modified}"
    done
{% endraw %}
  done
done
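
To see what the loop body emits for one file, the following standalone sketch runs a single pass over a scratch file. It is a simplified POSIX `sh` rewrite rather than the template itself: the function name `emit_putval`, the `while read` loop in place of the bash arrays, and the fixed host and interval arguments are all assumptions for illustration. It relies on `date -r FILE` as provided by GNU date, which the script above also uses.

```shell
#!/bin/sh
# Hypothetical single-pass version of the loop body, for experimentation only.
# Usage: emit_putval FILE HOST INTERVAL NOW
emit_putval() {
    file=$1 host=$2 interval=$3 now=$4
    modified=$(date -r "$file" +%s) || return 1   # file mtime as epoch seconds
    age=$((now - modified))
    # Same filter as the template: "name value" lines whose value starts with a digit
    grep -E '^[[:alnum:]_]+[[:space:]]+[[:digit:]]+' "$file" |
    while read -r name value rest; do
        echo "PUTVAL $host/textfile_monitoring/gauge-$name interval=$interval N:$value"
        echo "PUTVAL $host/textfile_monitoring/duration-$name interval=$interval N:$age"
    done
}

# Example run against a scratch file
demo=$(mktemp)
printf 'nomis_batch_failure_status 0\n' > "$demo"
emit_putval "$demo" localhost 60 "$(date +%s)"
```

Each matching line yields one `gauge` and one `duration` value, which is why a single entry in the text file shows up as two metrics.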
3 changes: 2 additions & 1 deletion ansible/roles/collectd/files/types.db.custom
@@ -1 +1,2 @@
-bool value:GAUGE:0:1
+bool value:GAUGE:0:1
+exitcode value:GAUGE:U:U