Error: Failed to retrieve PID of executing child process #12084

Open
KLIM8D opened this issue Jul 27, 2023 · 28 comments · Fixed by #13697
Labels: Bug (Confirmed to be a bug)

Comments

@KLIM8D

KLIM8D commented Jul 27, 2023

  • Distribution: Ubuntu
  • Distribution version: 20.04
    • Kernel version: 5.15.0-76-generic #83~20.04.1-Ubuntu
    • LXC version: 5.0.2
    • LXD version: 5.15 (snap)
    • Storage backend in use: ceph

I have a cluster with 4 nodes. There are 44 containers spread across the nodes, and at the moment I'm getting this error message when trying to run lxc exec on 19 of the containers. Apart from 3 containers, the other 41 use ceph rbd for storage.

I can start the container and exec works for a while, then all of a sudden I get the error message. Restarting the container helps and I can get a shell or whatever again, but then after a number of hours it happens again.

$ lxc exec amqp-1 bash
Error: Failed to retrieve PID of executing child process

lxc info

config:
  cluster.https_address: 10.0.0.4:8443
  core.https_address: 10.0.0.4:8443
  images.auto_update_interval: "0"
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- snapshot_schedule_aliases
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
- resources_network_firmware
- backup_compression_algorithm
- ceph_data_pool_name
- container_syscall_intercept_mount
- compression_squashfs
- container_raw_mount
- container_nic_routed
- container_syscall_intercept_mount_fuse
- container_disk_ceph
- virtual-machines
- image_profiles
- clustering_architecture
- resources_disk_id
- storage_lvm_stripes
- vm_boot_priority
- unix_hotplug_devices
- api_filtering
- instance_nic_network
- clustering_sizing
- firewall_driver
- projects_limits
- container_syscall_intercept_hugetlbfs
- limits_hugepages
- container_nic_routed_gateway
- projects_restrictions
- custom_volume_snapshot_expiry
- volume_snapshot_scheduling
- trust_ca_certificates
- snapshot_disk_usage
- clustering_edit_roles
- container_nic_routed_host_address
- container_nic_ipvlan_gateway
- resources_usb_pci
- resources_cpu_threads_numa
- resources_cpu_core_die
- api_os
- container_nic_routed_host_table
- container_nic_ipvlan_host_table
- container_nic_ipvlan_mode
- resources_system
- images_push_relay
- network_dns_search
- container_nic_routed_limits
- instance_nic_bridged_vlan
- network_state_bond_bridge
- usedby_consistency
- custom_block_volumes
- clustering_failure_domains
- resources_gpu_mdev
- console_vga_type
- projects_limits_disk
- network_type_macvlan
- network_type_sriov
- container_syscall_intercept_bpf_devices
- network_type_ovn
- projects_networks
- projects_networks_restricted_uplinks
- custom_volume_backup
- backup_override_name
- storage_rsync_compression
- network_type_physical
- network_ovn_external_subnets
- network_ovn_nat
- network_ovn_external_routes_remove
- tpm_device_type
- storage_zfs_clone_copy_rebase
- gpu_mdev
- resources_pci_iommu
- resources_network_usb
- resources_disk_address
- network_physical_ovn_ingress_mode
- network_ovn_dhcp
- network_physical_routes_anycast
- projects_limits_instances
- network_state_vlan
- instance_nic_bridged_port_isolation
- instance_bulk_state_change
- network_gvrp
- instance_pool_move
- gpu_sriov
- pci_device_type
- storage_volume_state
- network_acl
- migration_stateful
- disk_state_quota
- storage_ceph_features
- projects_compression
- projects_images_remote_cache_expiry
- certificate_project
- network_ovn_acl
- projects_images_auto_update
- projects_restricted_cluster_target
- images_default_architecture
- network_ovn_acl_defaults
- gpu_mig
- project_usage
- network_bridge_acl
- warnings
- projects_restricted_backups_and_snapshots
- clustering_join_token
- clustering_description
- server_trusted_proxy
- clustering_update_cert
- storage_api_project
- server_instance_driver_operational
- server_supported_storage_drivers
- event_lifecycle_requestor_address
- resources_gpu_usb
- clustering_evacuation
- network_ovn_nat_address
- network_bgp
- network_forward
- custom_volume_refresh
- network_counters_errors_dropped
- metrics
- image_source_project
- clustering_config
- network_peer
- linux_sysctl
- network_dns
- ovn_nic_acceleration
- certificate_self_renewal
- instance_project_move
- storage_volume_project_move
- cloud_init
- network_dns_nat
- database_leader
- instance_all_projects
- clustering_groups
- ceph_rbd_du
- instance_get_full
- qemu_metrics
- gpu_mig_uuid
- event_project
- clustering_evacuation_live
- instance_allow_inconsistent_copy
- network_state_ovn
- storage_volume_api_filtering
- image_restrictions
- storage_zfs_export
- network_dns_records
- storage_zfs_reserve_space
- network_acl_log
- storage_zfs_blocksize
- metrics_cpu_seconds
- instance_snapshot_never
- certificate_token
- instance_nic_routed_neighbor_probe
- event_hub
- agent_nic_config
- projects_restricted_intercept
- metrics_authentication
- images_target_project
- cluster_migration_inconsistent_copy
- cluster_ovn_chassis
- container_syscall_intercept_sched_setscheduler
- storage_lvm_thinpool_metadata_size
- storage_volume_state_total
- instance_file_head
- instances_nic_host_name
- image_copy_profile
- container_syscall_intercept_sysinfo
- clustering_evacuation_mode
- resources_pci_vpd
- qemu_raw_conf
- storage_cephfs_fscache
- network_load_balancer
- vsock_api
- instance_ready_state
- network_bgp_holdtime
- storage_volumes_all_projects
- metrics_memory_oom_total
- storage_buckets
- storage_buckets_create_credentials
- metrics_cpu_effective_total
- projects_networks_restricted_access
- storage_buckets_local
- loki
- acme
- internal_metrics
- cluster_join_token_expiry
- remote_token_expiry
- init_preseed
- storage_volumes_created_at
- cpu_hotplug
- projects_networks_zones
- network_txqueuelen
- cluster_member_state
- instances_placement_scriptlet
- storage_pool_source_wipe
- zfs_block_mode
- instance_generation_id
- disk_io_cache
- amd_sev
- storage_pool_loop_resize
- migration_vm_live
- ovn_nic_nesting
- oidc
- network_ovn_l3only
- ovn_nic_acceleration_vdpa
- cluster_healing
- instances_state_total
- auth_user
- security_csm
- instances_rebuild
- numa_cpu_placement
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
auth_user_name: REDACTED
auth_user_method: unix
environment:
  addresses:
  - 10.0.0.4:8443
  architectures:
  - x86_64
  - i686
  certificate: |
      REDACTED
  certificate_fingerprint: REDACTED
  driver: lxc | qemu
  driver_version: 5.0.2 | 8.0.0
  firewall: nftables
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    idmapped_mounts: "true"
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    seccomp_listener_continue: "true"
    shiftfs: "false"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.15.0-76-generic
  lxc_features:
    cgroup2: "true"
    core_scheduling: "true"
    devpts_fd: "true"
    idmapped_mounts_v2: "true"
    mount_injection_file: "true"
    network_gateway_device_route: "true"
    network_ipvlan: "true"
    network_l2proxy: "true"
    network_phys_macvlan_mtu: "true"
    network_veth_router: "true"
    pidfd: "true"
    seccomp_allow_deny_syntax: "true"
    seccomp_notify: "true"
    seccomp_proxy_send_notify_fd: "true"
  os_name: Ubuntu
  os_version: "20.04"
  project: default
  server: lxd
  server_clustered: true
  server_event_mode: full-mesh
  server_name: compute-lxc-host-1
  server_pid: 743649
  server_version: "5.15"
  storage: ceph | cephfs | lvm
  storage_version: 17.2.5 | 17.2.5 | 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08)
    / 4.45.0
  storage_supported_drivers:
  - name: ceph
    version: 17.2.5
    remote: true
  - name: cephfs
    version: 17.2.5
    remote: true
  - name: cephobject
    version: 17.2.5
    remote: true
  - name: dir
    version: "1"
    remote: false
  - name: lvm
    version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.45.0
    remote: false
  - name: zfs
    version: 2.1.5-1ubuntu6~22.04.1
    remote: false
  - name: btrfs
    version: 5.16.2
    remote: false

lxc config show amqp-1 --expanded

architecture: x86_64
config:
  image.build_ver: 10f16957b38b3df1cb525889296835c4868c3d4661a7fcd1040d78da1c364379
  image.name: base_image
  image.os: ubuntu
  image.version: "20.04"
  limits.cpu: 12-19
  limits.memory: 12GB
  security.devlxd: "false"
  security.nesting: "true"
  security.protection.delete: "true"
  security.syscalls.deny_compat: "true"
  volatile.base_image: e03956f34d66c8dd43c7bc85d07345335e89d6b1b28b3fef2c70ced96017749e
  volatile.cloud-init.instance-id: a076608b-aca3-472a-9e81-5e7c5c0f7ed0
  volatile.eth0.host_name: vethed6ade5b
  volatile.eth0.hwaddr: 00:16:3e:8c:59:71
  volatile.eth1.host_name: vethc72de7a7
  volatile.eth1.hwaddr: 00:16:3e:12:ad:e4
  volatile.idmap.base: "0"
  volatile.idmap.current: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.idmap.next: '[{"Isuid":true,"Isgid":false,"Hostid":1000000,"Nsid":0,"Maprange":1000000000},{"Isuid":false,"Isgid":true,"Hostid":1000000,"Nsid":0,"Maprange":1000000000}]'
  volatile.last_state.idmap: '[]'
  volatile.last_state.power: RUNNING
  volatile.uuid: e8b176c3-7b65-4696-a098-fb27086498d3
  volatile.uuid.generation: e8b176c3-7b65-4696-a098-fb27086498d3
devices:
  eth0:
    name: eth0
    network: lxdbr0
    type: nic
  eth1:
    name: eth1
    nictype: bridged
    parent: br0
    type: nic
  root:
    path: /
    pool: cephpool
    type: disk
ephemeral: false
profiles:
- ceph
- vlan
stateful: false
description: ""

lxc monitor

location: compute-lxc-host-1
metadata:
  context:
    id: c02c13cd-3839-44d7-b6ff-5918ab19d4b3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler started
timestamp: "2023-07-27T18:17:15.501516869+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.502796588+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/events
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.505814424+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: POST
    protocol: unix
    url: /1.0/instances/amqp-1/exec
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.506461362+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    id: 21a8c997-1f44-4367-b5de-e9d9ef9a0ad3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler started
timestamp: "2023-07-27T18:17:16.507238122+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: New operation
timestamp: "2023-07-27T18:17:16.523119245+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Started operation
timestamp: "2023-07-27T18:17:16.523232686+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context: {}
  level: debug
  message: Waiting for exec websockets to connect
timestamp: "2023-07-27T18:17:16.523382177+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Pending
  status_code: 105
  updated_at: "2023-07-27T18:17:16.517874299+02:00"
project: default
timestamp: "2023-07-27T18:17:16.523198189+02:00"
type: operation


location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Running
  status_code: 103
  updated_at: "2023-07-27T18:17:16.517874299+02:00"
project: default
timestamp: "2023-07-27T18:17:16.523295708+02:00"
type: operation


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83/websocket?secret=fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.527555079+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connecting to operation
timestamp: "2023-07-27T18:17:16.527615106+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connected to operation
timestamp: "2023-07-27T18:17:16.527857206+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connecting to operation
timestamp: "2023-07-27T18:17:16.528521127+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83/websocket?secret=46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.528447454+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Connected to operation
timestamp: "2023-07-27T18:17:16.528669575+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    ip: '@'
    method: GET
    protocol: unix
    url: /1.0/operations/47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    username: user
  level: debug
  message: Handling API request
timestamp: "2023-07-27T18:17:16.529379566+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    instance: amqp-1
    instanceType: container
    project: default
  level: error
  message: Failed to retrieve PID of executing child process
timestamp: "2023-07-27T18:17:16.53379002+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Updated metadata for operation
timestamp: "2023-07-27T18:17:16.533915644+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: ""
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
    return: -1
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Running
  status_code: 103
  updated_at: "2023-07-27T18:17:16.533893962+02:00"
project: default
timestamp: "2023-07-27T18:17:16.533984728+02:00"
type: operation


location: compute-lxc-host-1
metadata:
  context:
    class: websocket
    description: Executing command
    err: Failed to retrieve PID of executing child process
    operation: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
    project: default
  level: debug
  message: Failure for operation
timestamp: "2023-07-27T18:17:16.534054213+02:00"
type: logging


location: compute-lxc-host-1
metadata:
  class: websocket
  created_at: "2023-07-27T18:17:16.517874299+02:00"
  description: Executing command
  err: Failed to retrieve PID of executing child process
  id: 47030eed-2f0d-4cd8-bb3c-b4eea03f6a83
  location: compute-lxc-host-1
  may_cancel: false
  metadata:
    command:
    - ls
    environment:
      HOME: /root
      LANG: C.UTF-8
      PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
      TERM: screen-256color
      USER: root
    fds:
      "0": 46ac3977e021d59fb28e57013fd66cecee35e7ebfe62253f8efe4eab123c7921
      control: fb5aa23ec0b33d34be70fc80f5228d4b41dde01359be1a887a770de9b3e8514b
    interactive: true
    return: -1
  resources:
    containers:
    - /1.0/instances/amqp-1
    instances:
    - /1.0/instances/amqp-1
  status: Failure
  status_code: 400
  updated_at: "2023-07-27T18:17:16.533893962+02:00"
project: default
timestamp: "2023-07-27T18:17:16.534132375+02:00"
type: operation


location: compute-lxc-host-1
metadata:
  context:
    listener: 21a8c997-1f44-4367-b5de-e9d9ef9a0ad3
    local: /var/snap/lxd/common/lxd/unix.socket
    remote: '@'
  level: debug
  message: Event listener server handler stopped
timestamp: "2023-07-27T18:17:16.537145693+02:00"
type: logging

@tomponline
Member

This comes up from time to time and is usually caused by a process on your system periodically cleaning the tmp directory.

@tomponline added the "Incomplete" (Waiting on more information from reporter) label Jul 27, 2023
@KLIM8D
Author

KLIM8D commented Jul 27, 2023

@tomponline Thanks! That seems about right; it could also be the reason why I'm only seeing this now after switching to the snap LXD version.

But I'm a bit confused: what should the contents of /usr/lib/tmpfiles.d/snapd.conf be? At the moment it contains:

D! /tmp/snap-private-tmp 0700 root root -

Should I change the D! to an x, or add the line x /tmp/snap-private-tmp/snap.lxd above it? What should be restarted afterwards?

@KLIM8D
Author

KLIM8D commented Jul 27, 2023

I've added it above and tested with SYSTEMD_LOG_LEVEL=debug /usr/bin/systemd-tmpfiles --clean

Ignoring entry D! "/tmp/snap-private-tmp" because --boot is not specified.
Running clean action for entry x /tmp/snap-private-tmp/snap.lxd
Running clean action for entry D /tmp

The systemd-tmpfiles man page states:

If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles(8) will take lines with an exclamation mark only into consideration, if the --boot option is given.

But does the third line mean that the snap.lxd directory was still cleaned, even though D! was specified for its parent directory? Just trying to understand and make sure it won't happen again.

@tomponline
Member

Looking at man tmpfiles.d it says:

D     /directory/to/create-and-remove          mode user group cleanup-age -
x     /path-or-glob/to/ignore/recursively      -    -    -     -           -
...
If the exclamation mark ("!") is used, this line is only safe to execute during boot, and can break a running system. Lines without the exclamation mark are presumed to be safe to execute at any time, e.g. on package upgrades. systemd-tmpfiles will take lines with an exclamation mark only into consideration, if the --boot option is given.

So you would want to add a new line to /usr/lib/tmpfiles.d/snapd.conf or add a new file to /usr/lib/tmpfiles.d/.
Whether to use D! or x I'm not sure, but looking now, LXD creates /tmp/snap-private-tmp/snap.lxd/tmp/, so it's possible that using D! /tmp/snap-private-tmp/snap.lxd/ would still cause /tmp/snap-private-tmp/snap.lxd/tmp to be cleaned normally; using x is probably the safer choice.
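For reference, a minimal sketch of that approach, assuming a new drop-in file (the name lxd-snap-tmp.conf below is just an example) rather than editing the packaged snapd.conf:

# create an override (example name) that excludes the LXD snap's private tmp subtree
printf 'x /tmp/snap-private-tmp/snap.lxd\n' > /etc/tmpfiles.d/lxd-snap-tmp.conf

# re-run the cleaner with debug logging, as done earlier in this thread,
# to confirm the new entry is picked up
SYSTEMD_LOG_LEVEL=debug /usr/bin/systemd-tmpfiles --clean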

@tomponline
Member

Chatting with the snapd team, there's been a suggestion that perhaps LXD should use the XDG_RUNTIME_DIR path for its runtime state files (i.e. /run/user/0/snap. in the case of daemons) to keep systemd-tmpfiles from periodically cleaning up these files.

@tomponline added this to the lxd-5.17 milestone Jul 31, 2023
@KLIM8D
Author

KLIM8D commented Aug 1, 2023

Unfortunately it didn't solve the error first described; after a few days the "Error: Failed to retrieve PID..." is back.

Contents of /usr/lib/tmpfiles.d/snapd.conf

x /tmp/snap-private-tmp/snap.lxd
D! /tmp/snap-private-tmp 0700 root root -

Trying console instead

$ lxc console mongo-2
To detach from the console, press: <ctrl>+a q
Error: Error opening config file: "loading config file for the container failed"
Error: write /dev/pts/ptmx: file already closed

@tomponline
Member

You reloaded the systemd-tmpfiles service after making the change?

@KLIM8D
Author

KLIM8D commented Aug 1, 2023

Yes, pretty sure I did. Reloaded systemd with systemctl daemon-reload just to be sure, and ran systemctl restart systemd-tmpfiles-clean.timer.

Just did it again on all LXD hosts. I'll restart all the faulty containers and see if it happens again within a day or two.

@tomponline
Member

You could try stopping and disabling the service for a while too and see if that helps.
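For reference, a rough sketch of taking the periodic clean-up out of the picture for a while, assuming the stock Ubuntu unit name systemd-tmpfiles-clean.timer:

# stop and disable the periodic tmpfiles clean-up for the duration of the test
systemctl stop systemd-tmpfiles-clean.timer
systemctl disable systemd-tmpfiles-clean.timer

# verify nothing is scheduled any more
systemctl list-timers --all | grep tmpfiles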

@KLIM8D
Author

KLIM8D commented Aug 3, 2023

The instances restarted two days ago are now giving this error again. It's not all of the restarted containers that fail; for some of them I'm still able to use "lxc exec". So it's a bit inconsistent.

I'm beginning to doubt it has anything to do with the systemd-tmpfiles-clean job. Is anything special happening with or "inside" the snap package? That was an older version though, but I never had the issue with non-snap LXD.

@KLIM8D
Author

KLIM8D commented Aug 6, 2023

What's strange is that I have one LXD cluster host where not a single instance has this issue, and its containers have been running for many days. I haven't spotted the difference between that cluster host and the others yet, other than it has the "database standby" role.

@KLIM8D
Author

KLIM8D commented Aug 14, 2023

Found this thread (https://discuss.linuxcontainers.org/t/forkexec-failed-to-load-config-file/16220/9) and checked my hosts to see whether the same thing applied here, i.e. that the lxc.conf file was missing. Interestingly, on all 3 hosts where containers fail, this config file is missing from the /var/snap/lxd/common/lxd/logs/CONTAINER directory. And on the host where I don't see this problem, the lxc.conf file is there for all the containers.

On one of the failing LXD hosts, this is the content of forkexec.log in the container directory

Failed to load config file /var/snap/lxd/common/lxd/logs/db-lb-3/lxc.conf for /var/snap/lxd/common/lxd/containers/db-lb-3

There are only forkexec.log and lxc.log in any of the container directories within .../lxd/logs

# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
/var/snap/lxd/common/lxd/logs/db-lb-3:
forkexec.log  lxc.log

Could this be the reason I'm getting this error?

After restarting the container, all these files are present again:

# ls -l /var/snap/lxd/common/lxd/logs/db-lb-3/
total 100
-rw------- 1 root root  6483 Aug 14 10:36 console.log
-rw-r--r-- 1 root root   122 Aug 14 10:10 forkexec.log
-rw-r--r-- 1 root root     0 Aug 14 10:36 forkstart.log
-rw-r----- 1 root root  2508 Aug 14 10:36 lxc.conf
-rw-r----- 1 root root 80587 Aug 14 10:36 lxc.log
-rw-r----- 1 root root  2939 Aug 14 10:36 lxc.log.old
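A quick way to see how widespread this is on a host (a sketch, assuming the snap path used above):

# list container log directories that are missing lxc.conf
for dir in /var/snap/lxd/common/lxd/logs/*/; do
    [ -f "${dir}lxc.conf" ] || echo "missing lxc.conf: $(basename "$dir")"
done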

@jonoinnz

We've also been experiencing the same problem in our environments. Basically, we created clusters, and even out of the box, if we leave a cluster running for a while we see the same error when trying to exec into the containers. We can reboot the containers to bash into them again, but that is not convenient in a production environment for obvious reasons. It would be great if we could be given some instructions on how to go about troubleshooting this.

@tomponline modified the milestones: lxd-5.17, lxd-5.18 Aug 21, 2023
@KLIM8D
Author

KLIM8D commented Aug 21, 2023

Manually restoring lxc.conf in /var/snap/lxd/common/lxd/logs/<CONTAINER> makes a faulty container that gives this error accept exec commands again. I just took the NIC values from lxc config show and grabbed the PID for the lxc.hook.pre-start line.

So the question is, why does this file suddenly disappear...?

@petrosssit

Thank you @KLIM8D

We have the same issue and are thinking of a workaround for now: keep a copy of lxc.conf when a container starts, have a way of alerting when lxc.conf disappears, and put it back.

Having it in a logs directory is certainly odd; a /run subdirectory would be more appropriate, I would say.

On top of that, we seem to have occasional issues with log rotation in this directory. We had lxc.log files recreated with 0 bytes while the lxc monitor process kept the old file open, filling up /var, and only a restart of the container released the space of the already-deleted file.

I am not sure whether this misbehaviour of the log rotation may also delete the lxc.conf in the same directory. I have not figured out where the log rotation configuration for these log files is kept.
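A rough sketch of that workaround, assuming the snap layout used elsewhere in this thread (the backup directory and the 60-second interval are just examples):

#!/bin/sh
# keep copies of each container's lxc.conf and put them back if they vanish
LOGDIR=/var/snap/lxd/common/lxd/logs
BACKUP=/var/backups/lxd-lxc-conf
mkdir -p "$BACKUP"

while true; do
    for dir in "$LOGDIR"/*/; do
        name=$(basename "$dir")
        if [ -f "${dir}lxc.conf" ]; then
            # refresh the saved copy while the file still exists
            cp -p "${dir}lxc.conf" "$BACKUP/$name.lxc.conf"
        elif [ -f "$BACKUP/$name.lxc.conf" ]; then
            # lxc.conf disappeared: alert via syslog and restore the saved copy
            logger -t lxc-conf-watch "restoring lxc.conf for $name"
            cp -p "$BACKUP/$name.lxc.conf" "${dir}lxc.conf"
        fi
    done
    sleep 60
done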

@KLIM8D
Author

KLIM8D commented Aug 31, 2023

Apparently it is the entire contents, or at least all of the container directories in /var/snap/lxd/common/lxd/logs/, that get cleaned up by something. That would be fine if it weren't for the fact that lxc.conf is needed.

I tried setting a rule in auditd to see what deleted the lxc.conf file, and to my surprise no container directory was in there at all. Running exec for any of the containers creates a directory for that container in /var/snap/lxd/common/lxd/logs/, but it only contains forkexec.log and lxc.log.

@petrosssit So maybe your workaround script should also check whether the directory is there or not.

# ls /var/snap/lxd/common/lxd/logs/
dnsmasq.lxdbr0.log  lxd.log 

After running exec for the test-redis container:

# ls /var/snap/lxd/common/lxd/logs/*
/var/snap/lxd/common/lxd/logs/dnsmasq.lxdbr0.log  /var/snap/lxd/common/lxd/logs/lxd.log

/var/snap/lxd/common/lxd/logs/test-redis:
forkexec.log  lxc.log
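For anyone else trying the auditd approach, a sketch of a watch rule plus the follow-up query (assumes auditd is installed and running):

# record writes, deletions and attribute changes under the LXD log directory
auditctl -w /var/snap/lxd/common/lxd/logs/ -p wa -k lxd-logs

# once lxc.conf has disappeared, look up which process touched the path
ausearch -k lxd-logs -i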

@tomponline modified the milestones: lxd-5.18, lxd-5.19 Sep 20, 2023
@tomponline removed this from the lxd-5.19 milestone Oct 10, 2023
@nfournil

Same issue with LXC/LXD 5.0.2 from Debian Bookworm ... lxc.conf (written to /var/log/lxd/[container]/lxc.conf) disappears.

Any more information on this?

@nfournil

To be more precise about my setup, I use CEPH as the storage backend.

To dig further, I've added a script that periodically checks for the presence of lxc.conf (1 test every 10s) and keeps all API logs: the file disappears when the cleanup task for expired backups starts.

I've also seen this commit, which is not present in the refactored code (protection of lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0

Hope we can find out more about this bug...

Nice workaround to regenerate lxc.conf: lxc config show [container] > /var/log/lxd/[container]/lxc.conf

@capriciousduck

capriciousduck commented Oct 30, 2023

@nfournil A simple empty lxc.conf file works too.

touch /var/snap/lxd/common/lxd/logs/[container]/lxc.conf

¯\_(ツ)_/¯

Edit: "lxc exec" with an empty lxc.conf works, but data gets added to that lxc.conf file later (right after I did an lxc exec). So it's not really empty the whole time.

@tomponline
Member

I've also seen this commit, which is not present in the refactored code (protection of lxc.conf during delete operations): https://github.com/canonical/lxd/pull/4010/files?diff=split&w=0

Hi, please could you clarify what you mean here?

@nfournil

nfournil commented Nov 6, 2023 via email

@nfournil

nfournil commented Nov 8, 2023

I've added more aggressive logging (1 check per second on whether the lxc.conf file exists) and I can confirm the file disappears when the API logs this:

location: srv05-r2b-fl1
metadata:
  context:
    class: task
    description: Expiring log files
    operation: 42d4066c-3b40-4649-867b-637556984e7a
    project: ""
  level: debug
  message: Started operation
timestamp: "2023-11-08T11:00:47.092443683+01:00"
type: logging

Less than half a second later, the file doesn't exist anymore...
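A sketch of how that correlation test can be reproduced (replace amqp-1 with an affected container; the path assumes the snap layout, or /var/log/lxd on the Debian package as mentioned above):

# stream LXD's debug events to a file in the background
lxc monitor --type=logging --pretty > /root/lxd-monitor.log 2>&1 &

# poll once per second until the container's lxc.conf vanishes
while [ -e /var/snap/lxd/common/lxd/logs/amqp-1/lxc.conf ]; do sleep 1; done
date
echo "lxc.conf gone - check /root/lxd-monitor.log for the 'Expiring log files' task"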

@tomponline added the "Bug" (Confirmed to be a bug) label and removed the "Incomplete" (Waiting on more information from reporter) label Nov 10, 2023
@tomponline self-assigned this Mar 21, 2024
@tomponline
Member

@mihalicyn I think if we cherry-pick this lxc/incus#361 it should help.

@tomponline
Member

Reopening due to #12084 (comment)

@tomponline reopened this Jul 2, 2024
@tomponline
Member

Looks like instanceLogDelete is preventing lxc.conf from being deleted:

lxd/lxd/instance_logs.go

Lines 344 to 346 in 8f2d07c

if !strings.HasSuffix(file, ".log") || file == "lxc.log" || file == "qemu.log" {
	return response.BadRequest(fmt.Errorf("Only log files excluding qemu.log and lxc.log may be deleted"))
}

@tomponline
Member

One solution to this might be to stop using tmp and log dirs for .conf files, similar to lxc/incus#426

@C0rn3j

C0rn3j commented Sep 19, 2024

Ran into the same error on snap LXD 6.1, but seemingly not for the same reason; deleting a container got stuck and broke it.

A simple LXD daemon reload (not a restart, which kills containers) helped there: systemctl reload snap.lxd.daemon.
