
[BUG] Logging component backup fails #2707

Closed · 7 of 10 tasks
przemyslavic opened this issue Oct 14, 2021 · 3 comments

Comments

przemyslavic (Collaborator) commented Oct 14, 2021

Describe the bug
I'm not able to back up the logging component.
It fails when checking cluster health:

---
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
    snapshot_name: >-
      {{ ansible_date_time.iso8601_basic_short | replace('T','-') }}
  vars:
    uri_template: &uri
      client_cert: /etc/elasticsearch/kirk.pem
      client_key: /etc/elasticsearch/kirk-key.pem
      validate_certs: false
      body_format: json

- debug: var=snapshot_name

- name: Check cluster health
  uri:
    <<: *uri
    url: "{{ elasticsearch_endpoint }}/_cluster/health"
    method: GET
  register: uri_response
  until: uri_response is success
  retries: 12
  delay: 5

14:29:27 INFO cli.engine.ansible.AnsibleCommand - TASK [backup : Check cluster health] *******************************************
14:29:28 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (12 retries left).
14:29:34 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (11 retries left).
14:29:39 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (10 retries left).
14:29:45 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (9 retries left).
14:29:51 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (8 retries left).
14:29:57 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (7 retries left).
14:30:03 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (6 retries left).
14:30:08 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (5 retries left).
14:30:14 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (4 retries left).
14:30:20 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (3 retries left).
14:30:26 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (2 retries left).
14:30:31 ERROR cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: Check cluster health (1 retries left).
14:30:37 ERROR cli.engine.ansible.AnsibleCommand - An exception occurred during task execution. To see the full traceback, use -vvv. The error was: IOError: [Errno 2] No such file or directory
14:30:37 ERROR cli.engine.ansible.AnsibleCommand - fatal: [qa-azuretestfull-logging-vm-0]: FAILED! => {"attempts": 12, "changed": false, "elapsed": 0, "msg": "Status code was -1 and not [200]: An unknown error occurred: [Errno 2] No such file or directory", "redirected": false, "status": -1, "url": "https://10.1.3.4:9200/_cluster/health"}

When executed directly on the logging VM:

root@ec2-x-x-x-x:/etc/elasticsearch# curl -k --cert kirk.pem --key kirk-key.pem https://10.1.3.109:9200/_cluster/health
curl: (58) could not load PEM client certificate, OpenSSL error error:02001002:system library:fopen:No such file or directory, (no key found, wrong pass phrase, or wrong file format?)

This is due to the removal of the demo certificates: epiphany-admin.pem and epiphany-admin-key.pem should now be used instead of kirk.pem and kirk-key.pem, which no longer exist (a corrected snippet is sketched after the cluster health output below).

root@ec2-x-x-x-x:/etc/elasticsearch# ls -la /etc/elasticsearch/
total 84
drwxr-s---   5 root elasticsearch  4096 Oct 15 07:37 .
drwxr-xr-x 100 root root           4096 Oct 15 07:40 ..
-rw-r--r--   1 root elasticsearch    76 Oct 15 07:36 .elasticsearch.keystore.initial_md5sum
drwxr-x---   2 root elasticsearch  4096 Oct 15 07:37 csr
-rw-rw----   1 root elasticsearch   199 Oct 15 07:36 elasticsearch.keystore
-rw-rw----   1 root elasticsearch  4621 Oct 15 07:37 elasticsearch.yml
-rw-rw----   1 root elasticsearch  4547 Oct 15 07:36 elasticsearch.yml.20433.2021-10-15@07:37:23~
-rw-------   1 root elasticsearch  1708 Oct 15 07:37 epiphany-admin-key.pem
-rw-r-----   1 root elasticsearch  1285 Oct 15 07:37 epiphany-admin.pem
-rw-r-----   1 root elasticsearch  1704 Oct 15 07:37 epiphany-node-ec2-x-x-x-x.eu-west-1.compute.amazonaws.com-key.pem
-rw-r--r--   1 root elasticsearch  1464 Oct 15 07:37 epiphany-node-ec2-x-x-x-x.eu-west-1.compute.amazonaws.com.pem
-rw-r-----   1 root elasticsearch  1233 Oct 15 07:37 epiphany-root-ca.pem
-rw-rw----   1 root elasticsearch  2600 Oct 15 07:36 jvm.options
-rw-rw----   1 root elasticsearch  2581 Oct 15 07:36 jvm.options.19923.2021-10-15@07:37:00~
drwxr-s---   2 root elasticsearch  4096 Jan 13  2021 jvm.options.d
-rw-rw----   1 root elasticsearch 11021 Jan 13  2021 log4j2.properties
drwxr-x---   2 root elasticsearch  4096 Oct 15 07:37 private

root@ec2-x-x-x-x:/etc/elasticsearch# curl -s -k --cert epiphany-admin.pem --key epiphany-admin-key.pem https://10.1.3.109:9200/_cluster/health | jq
{
  "cluster_name": "EpiphanyElastic",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 6,
  "active_shards": 6,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 4,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 60
}
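
For reference, a minimal sketch of the corrected "Set helper facts" task, assuming the backup role keeps the structure shown above; only the certificate paths change, and the exact paths should be confirmed against the role:

---
# Sketch only: the same task as in the failing role, with the removed demo
# certificates replaced by the Epiphany admin certificate pair.
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
    snapshot_name: >-
      {{ ansible_date_time.iso8601_basic_short | replace('T','-') }}
  vars:
    uri_template: &uri
      client_cert: /etc/elasticsearch/epiphany-admin.pem      # was kirk.pem
      client_key: /etc/elasticsearch/epiphany-admin-key.pem   # was kirk-key.pem
      validate_certs: false
      body_format: json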

Additionally, at this point only single-node backup is supported. Multi-node cluster backup should be considered:
#1416
#1417

How to reproduce
Steps to reproduce the behavior:

  1. Deploy a cluster with 1 logging VM
  2. Run backup command: epicli backup -b /shared/build/testbackup -f /shared/build/backup-config.yml

Expected behavior
The backup should complete successfully.

Config files
Backup config:

kind: configuration/backup
title: Backup Config
name: default
provider: azure
specification:
  components:
    load_balancer:
      enabled: false
    logging:
      enabled: true
    monitoring:
      enabled: false
    postgresql:
      enabled: false
    rabbitmq:
      enabled: false
    kubernetes:
      enabled: false

Recovery config:

kind: configuration/recovery
title: Recovery Config
name: default
provider: azure
specification:
  components:
    load_balancer:
      enabled: false
      snapshot_name: latest
    logging:
      enabled: true
      snapshot_name: latest
    monitoring:
      enabled: false
      snapshot_name: latest
    postgresql:
      enabled: false
      snapshot_name: latest
    rabbitmq:
      enabled: false
      snapshot_name: latest

Environment

  • Cloud provider: [all]
  • OS: [all]

epicli version: 1.3.0dev
Previous versions are probably also affected.



DoD checklist

  • Changelog updated (if affected version was released)
  • COMPONENTS.md updated / doesn't need to be updated
  • Automated tests passed (QA pipelines)
    • apply
    • upgrade
  • Case covered by automated test (if possible)
  • Idempotency tested
  • Documentation updated / doesn't need to be updated
  • All conversations in PR resolved
  • Backport tasks created / doesn't need to be backported
rafzei (Contributor) commented Nov 5, 2021

I think it's clear: the cert and key are hardcoded in the backup role.
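
One possible direction (variable and file names below are hypothetical, not the actual fix from the commits referenced later) would be to move the certificate paths into role defaults so the task no longer hardcodes them:

# Sketch only, hypothetical variable and file names.
# roles/backup/defaults/main.yml
es_client_cert: /etc/elasticsearch/epiphany-admin.pem
es_client_key: /etc/elasticsearch/epiphany-admin-key.pem

# In the task file, reference the variables instead of hardcoded paths:
- name: Set helper facts
  set_fact:
    elasticsearch_endpoint: >-
      https://{{ ansible_default_ipv4.address }}:9200
  vars:
    uri_template: &uri
      client_cert: "{{ es_client_cert }}"
      client_key: "{{ es_client_key }}"
      validate_certs: false
      body_format: json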

plirglo self-assigned this Nov 9, 2021
plirglo added 3 commits to plirglo/epiphany that referenced this issue Nov 15, 2021
przemyslavic self-assigned this Nov 23, 2021
plirglo added a commit that referenced this issue Nov 23, 2021:
* Fix for #2707

* Fix logging service start
przemyslavic (Collaborator, Author) commented:
✔️ Backup and recovery of the logging component work fine.
Note that only single-node backup/recovery is supported for now.

plirglo (Contributor) commented Nov 30, 2021

Created backport issue: #2764

plirglo closed this as completed Dec 2, 2021