
[BUG] Logging component backup fails #2707

Closed · 7 of 10 tasks
przemyslavic opened this issue Oct 14, 2021 · 3 comments

Comments

przemyslavic (Collaborator) commented Oct 14, 2021

Describe the bug
I'm not able to back up the logging component.
It fails when checking cluster health:

---
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
    snapshot_name: >-
      {{ ansible_date_time.iso8601_basic_short | replace('T','-') }}
  vars:
    uri_template: &uri
      client_cert: /etc/elasticsearch/kirk.pem
      client_key: /etc/elasticsearch/kirk-key.pem
      validate_certs: false
      body_format: json

- debug: var=snapshot_name

- name: Check cluster health
  uri:
    <<: *uri
    url: "{{ elasticsearch_endpoint }}/_cluster/health"
    method: GET
  register: uri_response
  until: uri_response is success
  retries: 12
  delay: 5

14:29:27 INFO cli.engine.ansible.AnsibleCommand - TASK [backup : Check cluster health] *******************************************
14:29:28 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (12 retries left).
14:29:34 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (11 retries left).
14:29:39 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (10 retries left).
14:29:45 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (9 retries left).
14:29:51 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (8 retries left).
14:29:57 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (7 retries left).
14:30:03 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (6 retries left).
14:30:08 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (5 retries left).
14:30:14 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (4 retries left).
14:30:20 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (3 retries left).
14:30:26 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (2 retries left).
14:30:31 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (1 retries left).
14:30:37 ERROR cli.engine.ansible.AnsibleCommand - An exception occurred during task execution. To see the full traceback, use -vvv. The error was: IOError: [Errno 2] No such file or directory
14:30:37 ERROR cli.engine.ansible.AnsibleCommand - fatal: [qa-azuretestfull-logging-vm-0]: FAILED! => {"attempts": 12, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: An unknown error occurred: [Errno 2] No such file or directory", "redirected": false, "status": -1, "url": "https://10.1.3.4:9200/_cluster/health"}

When executed directly on the logging VM:

root@ec2-x-x-x-x:/etc/elasticsearch# curl -k --cert kirk.pem --key kirk-key.pem https://10.1.3.109:9200/_cluster/health
curl: (58) could not load PEM client certificate, OpenSSL error error:02001002:system library:fopen:No such file or directory, (no key found, wrong pass phrase, or wrong file format?)

This is due to the removal of the demo certificates: epiphany-admin.pem and epiphany-admin-key.pem should now be used instead of kirk.pem and kirk-key.pem, which no longer exist (a corrected snippet is sketched after the cluster health output below).

root@ec2-x-x-x-x:/etc/elasticsearch# ls -la /etc/elasticsearch/
total 84
drwxr-s---   5 root elasticsearch  4096 Oct 15 07:37 .
drwxr-xr-x 100 root root           4096 Oct 15 07:40 ..
-rw-r--r--   1 root elasticsearch    76 Oct 15 07:36 .elasticsearch.keystore.initial_md5sum
drwxr-x---   2 root elasticsearch  4096 Oct 15 07:37 csr
-rw-rw----   1 root elasticsearch   199 Oct 15 07:36 elasticsearch.keystore
-rw-rw----   1 root elasticsearch  4621 Oct 15 07:37 elasticsearch.yml
-rw-rw----   1 root elasticsearch  4547 Oct 15 07:36 elasticsearch.yml.20433.2021-10-15@07:37:23~
-rw-------   1 root elasticsearch  1708 Oct 15 07:37 epiphany-admin-key.pem
-rw-r-----   1 root elasticsearch  1285 Oct 15 07:37 epiphany-admin.pem
-rw-r-----   1 root elasticsearch  1704 Oct 15 07:37 epiphany-node-ec2-x-x-x-x.eu-west-1.compute.amazonaws.com-key.pem
-rw-r--r--   1 root elasticsearch  1464 Oct 15 07:37 epiphany-node-ec2-x-x-x-x.eu-west-1.compute.amazonaws.com.pem
-rw-r-----   1 root elasticsearch  1233 Oct 15 07:37 epiphany-root-ca.pem
-rw-rw----   1 root elasticsearch  2600 Oct 15 07:36 jvm.options
-rw-rw----   1 root elasticsearch  2581 Oct 15 07:36 jvm.options.19923.2021-10-15@07:37:00~
drwxr-s---   2 root elasticsearch  4096 Jan 13  2021 jvm.options.d
-rw-rw----   1 root elasticsearch 11021 Jan 13  2021 log4j2.properties
drwxr-x---   2 root elasticsearch  4096 Oct 15 07:37 private

root@ec2-x-x-x-x:/etc/elasticsearch# curl -s -k --cert epiphany-admin.pem --key epiphany-admin-key.pem https://10.1.3.109:9200/_cluster/health | jq
{
  "cluster_name": "EpiphanyElastic",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 6,
  "active_shards": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 4,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 60
}
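
For reference, a minimal sketch of the corrected "Set helper facts" task, assuming the backup role keeps the structure shown above; only the certificate paths change, and the exact paths should be confirmed against the role:

---
# Sketch only: the same task as in the failing role, with the removed demo
# certificates replaced by the Epiphany admin certificate pair.
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
    snapshot_name: >-
      {{ ansible_date_time.iso8601_basic_short | replace('T','-') }}
  vars:
    uri_template: &uri
      client_cert: /etc/elasticsearch/epiphany-admin.pem      # was kirk.pem
      client_key: /etc/elasticsearch/epiphany-admin-key.pem   # was kirk-key.pem
      validate_certs: false
      body_format: json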

Additionally, at this point only single-node backup is supported. Multi-node cluster backup should be considered:
#1416
#1417

How to reproduce
Steps to reproduce the behavior:

  1. Deploy a cluster with 1 logging VM
  2. Run backup command: epicli backup -b /shared/build/testbackup -f /shared/build/backup-config.yml

Expected behavior
The backup should complete successfully.

Config files
Backup config:

kind: configuration/backup
title: Backup Config
name: default
provider: azure
specification:
  components:
    load_balancer:
      enabled: false
    logging:
      enabled: true
    monitoring:
      enabled: false
    postgresql:
      enabled: false
    rabbitmq:
      enabled: false
    kubernetes:
      enabled: false

Recovery config:

kind: configuration/recovery
title: Recovery Config
name: default
provider: azure
specification:
  components:
    load_balancer:
      enabled: false
      snapshot_name: latest
    logging:
      enabled: true
      snapshot_name: latest
    monitoring:
      enabled: false
      snapshot_name: latest
    postgresql:
      enabled: false
      snapshot_name: latest
    rabbitmq:
      enabled: false
      snapshot_name: latest

Environment

  • Cloud provider: [all]
  • OS: [all]

epicli version: 1.3.0dev
Previous versions are probably also affected.



DoD checklist

  • Changelog updated (if affected version was released)
  • COMPONENTS.md updated / doesn't need to be updated
  • Automated tests passed (QA pipelines)
    • apply
    • upgrade
  • Case covered by automated test (if possible)
  • Idempotency tested
  • Documentation updated / doesn't need to be updated
  • All conversations in PR resolved
  • Backport tasks created / doesn't need to be backported
rafzei (Contributor) commented Nov 5, 2021

I think it's clear: the cert and key are hardcoded in the backup role.
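
One possible direction (variable and file names below are hypothetical, not the actual fix from the commits referenced later) would be to move the certificate paths into role defaults so the task no longer hardcodes them:

# Sketch only, hypothetical variable and file names.
# roles/backup/defaults/main.yml
es_client_cert: /etc/elasticsearch/epiphany-admin.pem
es_client_key: /etc/elasticsearch/epiphany-admin-key.pem

# In the task file, reference the variables instead of hardcoded paths:
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
  vars:
    uri_template: &uri
      client_cert: "{{ es_client_cert }}"
      client_key: "{{ es_client_key }}"
      validate_certs: false
      body_format: json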

plirglo self-assigned this Nov 9, 2021
plirglo added 3 commits to plirglo/epiphany that referenced this issue Nov 15, 2021
przemyslavic self-assigned this Nov 23, 2021
plirglo added a commit that referenced this issue Nov 23, 2021:
* Fix for #2707

* Fix logging service start
przemyslavic (Collaborator, Author) commented:
✔️ Backup and recovery of the logging component work fine.
Note that only single-node backup/recovery is supported for now.

plirglo (Contributor) commented Nov 30, 2021

Created backport issue: #2764

plirglo closed this as completed Dec 2, 2021