
[Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FITKO PoC #544

Closed
anjastrunk opened this issue Apr 2, 2024 · 34 comments
Labels
SCS-VP10 Related to tender lot SCS-VP10

Comments

@anjastrunk
Contributor

anjastrunk commented Apr 2, 2024

Provide a productive SCS cluster as a PoC for FITKO. The cluster MUST be set up with Yaook, the open source lifecycle management tool for OpenStack and K8s, and must be SCS-compliant.

In contrast to #414, the productive SCS cluster is set up on bare metal.

@anjastrunk anjastrunk added question Further information is requested SCS-VP10 Related to tender lot SCS-VP10 and removed question Further information is requested labels Apr 2, 2024
@anjastrunk anjastrunk changed the title [Other] Setup productive SCS cluster at Cloud&Heat with Yaook [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal Apr 2, 2024
@martinmo
Member

martinmo commented Apr 8, 2024

(Note: we initially used #414 as the sole issue to track both the provisioning of the virtual and the bare metal test clusters. For the sake of better documentation, we retroactively created this separate issue for the bare metal setup.)

The current state is that the installation is complete and we have a working bare metal Yaook cluster. Thanks to the contributions of @cah-hbaum, this OpenStack cluster is already SCS-compliant, i.e., it fulfills the stabilized standards of the IaaS track (details can be found in #415 and in https://github.com/SovereignCloudStack/standards/tree/do-not-merge/scs-compliant-yaook/Informational). Here's a recap of what happened until we got to this point:

Initial preparation for the Yaook bare metal installation started at the beginning of March. This involved a "rehearsal" of the installation procedure on an existing, simpler test cluster, because this was my first time conducting such an install and the hardware for the actual installation had not yet been commissioned.

During the rehearsal we already ran into network setup issues that we needed to work around:

  • The first issues were related to VLAN tagging and caused connectivity problems between the Yaook management cluster and the install node. They were (probably) caused by incompatibilities or misconfiguration of the network interfaces of the virtual install node and/or KVM (Proxmox). The exact cause is not known, because we gave up debugging at some point and switched to a bare metal install node, which worked immediately.
  • The second set of issues was related to PXE booting for the actual nodes that are supposed to be provisioned by Ironic. These turned out to be firmware issues (a subset of the servers "forgot" their PXE configuration after reboot) as well as misleading and well-hidden BIOS settings (especially PXE boot timeouts).

The installation of the bare metal test cluster was conducted between March 11th and March 19th, but we again bumped into a lot of technical difficulties. Debugging and fixing these was a bit more time-consuming than usual because I am not yet 100% accustomed to the interactions of all the components.

  • Because this is going to be a productive cluster, the network setup is a lot more complex than in the rehearsal install (more redundancy, stricter isolation). In addition to separate VLANs and subnets for different purposes, we also use several dedicated switches (e.g., for the Ceph nodes). It took several iterations and debugging sessions to get everything right, i.e., to ensure that all components that are supposed to communicate could actually talk to each other.
  • Some trial and error was needed to get the netplan part of the cloud-init configuration right, partly because I misunderstood the configuration. (This is very unfortunate and we will make this more robust and easier to verify in the future, e.g., by switching to interface selection by MAC address via the match keyword; see the sketch after this list.)
  • During provisioning with Ironic, a subset of the nodes repeatedly ended up in the clean failed state. It took some time to debug, but the Yaook bare metal logs contained the hint ("certificate not yet valid") and we finally figured out this was caused by an extremely out-of-sync hardware clock.
  • A similar firmware/BIOS-related error that cost us time was a still-active hardware RAID configuration on another subset of the nodes, which also led to provisioning failures with Ironic.
  • Finally, we had to troubleshoot some software components that had worked flawlessly before: e.g., during the automated install we ran into the problem that the K8s APT repositories had been moved. Additionally, the CNI plugin (Calico) installation failed initially, which we fixed by switching to a different release.
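To illustrate the match-by-MAC idea from the netplan item above, here is a minimal, hypothetical sketch; the file path, interface name, MAC address, VLAN ID and addresses are placeholders and not the values used in this cluster:

# Select the interface by MAC address instead of relying on kernel interface names.
cat <<'EOF' > /etc/netplan/50-cloud-init.yaml
network:
  version: 2
  ethernets:
    mgmt0:
      match:
        macaddress: "aa:bb:cc:dd:ee:01"
      set-name: mgmt0
      dhcp4: false
      addresses: [10.0.10.11/24]
  vlans:
    vlan1000:
      id: 1000
      link: mgmt0
      addresses: [10.0.20.11/24]
EOF
netplan apply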

The next step will involve moving the hardware from our facility to its final location.

@anjastrunk
Contributor Author

@cah-hbaum Please provide the YAML output for SCS-compatible IaaS v4 to prove the cluster is SCS-compliant.

@anjastrunk anjastrunk changed the title [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FitKom PoC Apr 10, 2024
@anjastrunk anjastrunk changed the title [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FitKom PoC [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FitKo PoC Apr 10, 2024
@anjastrunk anjastrunk changed the title [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FitKo PoC [Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FITKO PoC Apr 10, 2024
@martinmo
Member

The next step will involve moving the hardware from our facility to its final location.

Status update:

  • The hardware was successfully moved to the final location (I am unsure if I am allowed to reveal the location yet – please be patient, more info will come soon) and the cluster is stable.
  • The next step is a proper network link; we currently use a temporary link.

@shmelkin

shmelkin commented May 2, 2024

Status update for multiple working days:

  • Uplink configured
  • Providernet (1000) configured on computes and controllers
  • Network, Subnet, Router configured in OpenStack
  • Firewalls configured for routing
  • Tested an "Ubuntu 22.04" VM with a publicly routable IP (see the sketch after this list)
  • Configured and ran the SCS conformance tests on the cluster
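For reference, a hedged sketch of the steps behind the items above; all names, the provider network, flavor and addresses are placeholders and not necessarily what we used:

# Network, subnet, router and a test VM with a floating IP (placeholder names/values).
openstack network create test-net
openstack subnet create test-subnet --network test-net --subnet-range 192.168.100.0/24
openstack router create test-router
openstack router set test-router --external-gateway provider-1000
openstack router add subnet test-router test-subnet
openstack server create --image "Ubuntu 22.04" --flavor SCS-2V-4 --network test-net test-vm
openstack floating ip create provider-1000
openstack server add floating ip test-vm <FLOATING_IP>
ping -c 3 <FLOATING_IP>   # verify the VM is reachable from outside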

@shmelkin

shmelkin commented May 2, 2024

Status update:

  • Configured the OpenStack API and made it publicly available
  • Configured monitoring on the site-wide cluster and connected it to the C&H global monitoring
  • Started installing yk8s as a functionality proof
  • Benchmarked a VM:
------------------------------------------------------------------------
Benchmark Run: Thu May 02 2024 07:01:31 - 07:29:25
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       49374945.7 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     7331.2 MWIPS (8.8 s, 7 samples)
Execl Throughput                               4367.4 lps   (29.7 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1700297.0 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          465706.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       4516692.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2643777.7 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 249035.2 lps   (10.0 s, 7 samples)
Process Creation                               4239.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                   3404.9 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   7774.8 lpm   (60.0 s, 2 samples)
System Call Overhead                        2399837.6 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   49374945.7   4230.9
Double-Precision Whetstone                       55.0       7331.2   1333.0
Execl Throughput                                 43.0       4367.4   1015.7
File Copy 1024 bufsize 2000 maxblocks          3960.0    1700297.0   4293.7
File Copy 256 bufsize 500 maxblocks            1655.0     465706.0   2813.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    4516692.6   7787.4
Pipe Throughput                               12440.0    2643777.7   2125.2
Pipe-based Context Switching                   4000.0     249035.2    622.6
Process Creation                                126.0       4239.2    336.4
Shell Scripts (1 concurrent)                     42.4       3404.9    803.0
Shell Scripts (8 concurrent)                      6.0       7774.8  12958.0
System Call Overhead                          15000.0    2399837.6   1599.9
                                                                   ========
System Benchmarks Index Score                                        1995.8

------------------------------------------------------------------------
Benchmark Run: Thu May 02 2024 07:29:25 - 07:57:36
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      396360992.9 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    58344.0 MWIPS (9.8 s, 7 samples)
Execl Throughput                              20889.7 lps   (29.9 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks      12927118.6 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks         3677514.2 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks      22497528.7 KBps  (30.0 s, 2 samples)
Pipe Throughput                            21037325.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                1958050.1 lps   (10.0 s, 7 samples)
Process Creation                              44864.0 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  65052.2 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   9420.2 lpm   (60.0 s, 2 samples)
System Call Overhead                       19695065.7 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  396360992.9  33964.1
Double-Precision Whetstone                       55.0      58344.0  10608.0
Execl Throughput                                 43.0      20889.7   4858.1
File Copy 1024 bufsize 2000 maxblocks          3960.0   12927118.6  32644.2
File Copy 256 bufsize 500 maxblocks            1655.0    3677514.2  22220.6
File Copy 4096 bufsize 8000 maxblocks          5800.0   22497528.7  38788.8
Pipe Throughput                               12440.0   21037325.6  16911.0
Pipe-based Context Switching                   4000.0    1958050.1   4895.1
Process Creation                                126.0      44864.0   3560.6
Shell Scripts (1 concurrent)                     42.4      65052.2  15342.5
Shell Scripts (8 concurrent)                      6.0       9420.2  15700.3
System Call Overhead                          15000.0   19695065.7  13130.0
                                                                   ========
System Benchmarks Index Score                                       13756.0

@berendt
Member

berendt commented May 2, 2024

@shmelkin Can you please share how you benchmarked the VM? I would like to add this to the docs as a sample benchmark. We have only documented fio at the moment.

@shmelkin

shmelkin commented May 3, 2024

@berendt I generally use the open source tool UnixBench for this.
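A rough sketch of how such a run can be reproduced inside the VM (upstream repository and the usual Ubuntu package names; versions may differ):

# Run UnixBench inside the test VM (Ubuntu 22.04).
sudo apt-get update && sudo apt-get install -y build-essential git
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench
./Run    # runs the suite once with 1 copy and once with one copy per CPU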

@shmelkin

shmelkin commented May 3, 2024

  • Issued Let's Encrypt certificates for the public OpenStack API endpoints
  • Debugged a network issue with OVN in the cluster, see the post from @horazont below for more info
  • Fixed the image parameter minRam for all SCS-compliant images (see the sketch after this list)
  • Ran Tempest tests against the cluster (WIP)
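A hedged sketch of the minRam fix, assuming it corresponds to the Glance min_ram image property (image name and value are placeholders):

# Hypothetical example: set the minimum RAM (in MiB) required to boot an image.
openstack image set --min-ram 512 "Ubuntu 22.04"
openstack image show "Ubuntu 22.04" -c min_ram   # verify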

@horazont
Member

horazont commented May 3, 2024

Okay, this is interesting. Basically it's a "regression" from the OVS setups we are used to.

In OpenvSwitch/L3-Agent based setups, the NAT rules for ingress (and I suppose also egress) traffic for floating IPs were set up no matter whether the port to which the floating IP was bound was ACTIVE or DOWN.

In OVN, the NAT rules are only set up when the port is up.

That breaks a specific use case, which is the use of VRRP/keepalived in VMs to implement custom load balancers or other HA solutions.

(In particular, this breaks yaook/k8s which we tried to run as a "burn in" test.)

I'll bring this up in next week's IaaS call.

@shmelkin

shmelkin commented May 7, 2024

  • Installed the health monitor on l1.cloudandheat.com: https://health.l1.cloudandheat.sovereignit.cloud:3000/
  • Created credentials/projects for SCS members so that they can prepare the SCS Summit provider exchange
  • Further investigated the OVN NAT topic (no solution yet)
  • Finished the rollout of yk8s as a burn-in test

@horazont
Member

horazont commented May 7, 2024

We looked more into the OVN issue and it seems the only viable workaround is using allowed-address on the non-VRRP port. This is somewhat sad; we'll discuss it in the IaaS call tomorrow. A sketch of the usual allowed-address pattern is below.
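For illustration, the generic allowed-address pattern for a VRRP VIP (the VIP address and port identifiers are placeholders; where the floating IP is bound is exactly the part under discussion, so it is left out here):

# Announce the VIP as an allowed address on the instance ports that run keepalived/VRRP.
openstack port set --allowed-address ip-address=192.168.100.200 <port-of-vrrp-instance-1>
openstack port set --allowed-address ip-address=192.168.100.200 <port-of-vrrp-instance-2>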

@berendt
Member

berendt commented May 8, 2024

In osism/terraform-base (used by osism/testbed) we do it this way (allowed-address) as well; VRRP is only used inside the virtual network and the managed VIP is only accessed from inside the same virtual network.

We do not reserve the VIPs by creating unassigned Neutron ports because we work with static IP addresses in osism/terraform-base. This is therefore not necessary.

It also looks as if this has always been the way independent of OVN. At least https://www.codecentric.de/wissens-hub/blog/highly-available-vips-openstack-vms-vrrp comes from a time when IMO there was no OVN in OpenStack (or OVN itself?).

Searched for some more references:

@artificial-intelligence
Contributor

I think @berendt is right: if this worked without allowed_address_pairs, you would have a security issue.

By default, strict filters only allow traffic from the configured subnets and the associated MACs to pass.
Via allowed_address_pairs you get an allowlist to extend this where needed, e.g., for VRRP.

If it worked the other way around, arbitrary L2 or L3 traffic would be allowed to flow, which would of course be insecure.

@shmelkin

shmelkin commented May 8, 2024

Summary of working day:

  • prepared the renaming of the API endpoints
  • tested using multiple FQDNs via ingress redirect (not successful yet)
  • tested using multiple ingresses to serve the same endpoint (not successful yet)
  • installed the Horizon dashboard (horizon.l1.cloudandheat.com, temporary name)
  • removed erroneous and stuck volumes

@shmelkin

Summary of multiple working days:

  • performed maintenance (deletion of stuck volumes) twice
  • renamed the API endpoint for Horizon
  • rolled out users for maintenance on the firewalls, YMC, install node and all controllers, computes, and storage nodes (for the C&H on-call duty) to make support for the SCS Summit possible
  • looked into the issue that Cinder volumes constantly get stuck (ongoing...)

@shmelkin

We will upgrade the deployment to the current release of Yaook and Kubernetes 1.27.10, which will hopefully solve the Cinder issue and provide a reasonably up-to-date version.

Today, I started by preparing the necessary artifacts (WIP):

  • mgmt-cluster repository and manifests
  • netplan configuration for different parts of the infrastructure
  • bare-metal-k8s repository and manifests
  • ch-lbaas
  • netbox backup

We also need to back up/snapshot the health monitor, which will be tested later today and done right before the upgrade.

I will document further progress here.

@martinmo
Member

martinmo commented Jun 4, 2024

As outlined in the last comment, the deployment was upgraded to the current release of Yaook in an effort to solve the Cinder issues.

Right now, the health monitor VM isn't running anymore and we're in the process of restoring its operation (this means the links and the badges in the README currently don't work or show the compliance test as failed).

@shmelkin

shmelkin commented Jun 4, 2024

After a whole lot of trouble, the deployment is ready and functional. We:

  • updated to Kubernetes 1.27.10 and OpenStack Zed with Yaook
  • resolved issues with the Ceph rollout via the Yaook LCM
  • resolved issues with Cinder's communication with Ceph
  • resolved issues with Glance's communication with Ceph
  • fixed the networking (PXE)

@berendt
Member

berendt commented Jun 4, 2024

  • updated to Kubernetes 1.27.10 and OpenStack Zed with Yaook

How do we deal with this? From SCS's point of view, we require OpenStack 2023.2 as the minimum version, not Zed.

@cah-hbaum
Contributor

  • updated to Kubernetes 1.27.10 and OpenStack Zed with Yaook

How do we deal with this? From SCS's point of view, we require OpenStack 2023.2 as the minimum version, not Zed.

Oh really? Where exactly is that mentioned? I couldn't find it during my quick search.

@fkr
Member

fkr commented Jun 4, 2024

@berendt With regard to the OpenStack version, there is (to my knowledge) currently nothing in the standards. We require OpenStack Powered Compute 2022.11 alongside the standards referenced here: https://docs.scs.community/standards/scs-compatible-iaas

Is there any functional reason that Zed (as in this case) would not be sufficient?

@berendt
Member

berendt commented Jun 4, 2024

So why do we put ourselves under the burden of going through every upgrade in the reference implementation if it is not necessary? I had actually assumed that we want to demand a very up-to-date OpenStack (and also Ceph and Kubernetes). We also require the CSPs to have upgraded within one month. So that wouldn't be necessary either?

@fkr
Member

fkr commented Jun 4, 2024

The discussion about the update window of the reference implementation is (imho) not directly connected, since that discussion is about providing support, security updates, etc. for the reference implementation.
If there is no functional reason and our conformance tests succeed (as well as OpenStack Powered Compute 2022.11), and the established standards are thus complied with, I see no reason to require a specific OpenStack version. Especially not if we require that certain APIs are fulfilled, since that basically allows compatibility on the API level (and would even allow an API wrapper if it behaves correctly). Am I overlooking something?

Having an up-to-date reference implementation is worth pursuing outside of pure standard conformance, imho.

@martinmo
Member

Update for multiple working days:

  • Health monitor is up and running and can be accessed via GitHub auth again
  • Issued new credentials for the compliance tests in GitHub and Zuul (Update auth_url endpoint and app credential id for poc-wgcloud #608)
  • The Cinder issues have been resolved with the upgrade and the IaaS compliance tests run successfully (this time including the IaaS scope v4 🎉)
  • Fixed a monitoring issue with Prometheus caused by insufficient pod resource limits, which led to an OOMKilled crash loop
  • Looked into Ceph performance issues, but found no obvious mistakes or "low-hanging fruits" (the issues can be seen in the "Resource wait" times in the OSHM)

@martinmo
Member

Today I noticed, via the health monitor, that the APImon loop couldn't create volumes anymore. The APImon logs showed that the volume quota was exceeded. Usually, the APImon performs a cleanup of old volumes, but it always failed with:

volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots, awaiting a transfer, or be disassociated from snapshots after volume transfer

This was caused by dangling volumes which were still shown as attaching, attached or in-use, although openstack server list didn't show any running APIMonitor instances anymore. These, in turn, seem to appear because instance creation fails with errors like:

Build of instance 68d71a43-e26e-4b34-8d93-37d5b29ab6bb aborted: Unable to update the attachment. (HTTP 500)

which – I think – leaves the Cinder database in an inconsistent state.

I stopped the apimon temporarily. Using cinder reset-state --state available --attach-status detached VOLUME_UUID for the respective volumes, I was able to reset the volume states so they could be deleted. I am now looking into the HTTP 500 "Unable to update the attachment" error.
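For the record, a sketch of that cleanup (the volume UUID is a placeholder):

# Clean up dangling volumes left behind by failed APImon runs.
openstack volume list --status in-use        # spot volumes that should no longer be attached
cinder reset-state --state available --attach-status detached <VOLUME_UUID>
openstack volume delete <VOLUME_UUID>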

@martinmo
Member

I can only reproduce this with the load generated by the APImon. Volume operations do not always fail, but they fail quite reliably. Based on my findings, there is something going on with the message queue.

In the cinder API logs, I see timeouts waiting for a reply for an operation, such as attachment_update. The volume manager says it couldn't send a reply because the queue was gone in the meantime. (Maybe the whole operation is just too slow and some kind of garbage collection for the queue kicks in too early?)

Here is a relevant example snippet from the cinder volume manager logs:

...
2024-06-11 14:12:49 INFO cinder.volume.manager Created volume successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:12:51 INFO cinder.volume.manager attachment_update completed successfully.
2024-06-11 14:13:51 WARNING oslo_messaging._drivers.amqpdriver reply_3a06eb20db5d46318534af475f7c46c5 doesn't exist, drop reply to 1ffd3e1fb22d4007bc6b16c5d784f430
2024-06-11 14:13:51 ERROR oslo_messaging._drivers.amqpdriver The reply 1ffd3e1fb22d4007bc6b16c5d784f430 failed to send after 60 seconds due to a missing queue (reply_3a06eb20db5d46318534af475f7c46c5). Abandoning...
2024-06-11 14:13:51 INFO cinder.volume.manager Terminate volume connection completed successfully.
2024-06-11 14:13:51 WARNING oslo_messaging._drivers.amqpdriver reply_3a06eb20db5d46318534af475f7c46c5 doesn't exist, drop reply to 73b1e4390fd74484ab8cbfbb7e376ad2
...

And the matching ERROR with the same message ID 1ffd3e1fb22d4007bc6b16c5d784f430 from the cinder API logs:

...
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 ERROR cinder.api.v3.attachments Unable to update the attachment.                                             
Traceback (most recent call last):                                                                                                                            
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get                                                       
    return self._queues[msg_id].get(block=True, timeout=timeout)                                                                                              
  File "/usr/local/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get                                                                           
    return waiter.wait()                                                                                                                                      
  File "/usr/local/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait                                                                          
    return get_hub().switch()                                                                                                                                 
  File "/usr/local/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch                                                                     
    return self.greenlet.switch()                                                                                                                             
_queue.Empty                                                                                                                                                  
During handling of the above exception, another exception occurred:                                                                                           
Traceback (most recent call last):                                                                                                                            
  File "/usr/local/lib/python3.8/site-packages/cinder/api/v3/attachments.py", line 250, in update                                                             
    self.volume_api.attachment_update(context,                                                                                                                
  File "/usr/local/lib/python3.8/site-packages/decorator.py", line 232, in fun                                                                                
    return caller(func, *(extras + args), **kw)                                                                                                               
  File "/usr/local/lib/python3.8/site-packages/cinder/coordination.py", line 200, in _synchronized                                                            
    return f(*a, **k)                                                                                                                                         
  File "/usr/local/lib/python3.8/site-packages/cinder/volume/api.py", line 2535, in attachment_update                                                         
    self.volume_rpcapi.attachment_update(ctxt,                                                                                                                
  File "/usr/local/lib/python3.8/site-packages/cinder/rpc.py", line 200, in _wrapper                                                                          
    return f(self, *args, **kwargs)                                                                                                                           
  File "/usr/local/lib/python3.8/site-packages/cinder/volume/rpcapi.py", line 479, in attachment_update                                                       
    return cctxt.call(ctxt,                                                                                                                                   
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 189, in call                                                               
    result = self.transport._send(                                                                                                                            
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send                                                               
    return self._driver.send(target, ctxt, message,                                                                                                           
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send                                                      
    return self._send(target, ctxt, message, wait_for_reply, timeout,                                                                                         
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send                                                     
    result = self._waiter.wait(msg_id, timeout,                                                                                                               
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait                                                      
    message = self.waiters.get(msg_id, timeout=timeout)                                                                                                       
  File "/usr/local/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get                                                       
    raise oslo_messaging.MessagingTimeout(                                                                                                                    
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 1ffd3e1fb22d4007bc6b16c5d784f430                                      
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO cinder.api.openstack.wsgi HTTP exception thrown: Unable to update the attachment.                       
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO cinder.api.openstack.wsgi https://cinder-api.yaook.svc:8776/v3/2fc014a6dd014c0bb3b53494a5b86fa9/attachments/a788db16-75b0-46e8-b407-e7415b455427 returned with HTTP 500
cinder-api-6494ffc69b-7jcwm: 2024-06-11 14:13:50 INFO eventlet.wsgi.server 10.2.1.10,127.0.0.1 "PUT /v3/2fc014a6dd014c0bb3b53494a5b86fa9/attachments/a788db16-75b0-46e8-b407-e7415b455427 HTTP/1.1" status: 500  len: 400 time: 60.0660729
...

martinmo added a commit to SovereignCloudStack/openstack-health-monitor that referenced this issue Jun 11, 2024
* this seems to be more stable, let's observe it further
* see SovereignCloudStack/standards#544

Signed-off-by: Martin Morgenstern <martin.morgenstern@cloudandheat.com>
@martinmo
Member

martinmo commented Jun 17, 2024

Summary for multiple working days:

  • Investigated the volume attachment troubles and did a cleanup of the leaked volumes from time to time.
  • Sanity-checked Ceph cluster health again -> everything happy.
  • Sanity-checked the Cinder RabbitMQ with rabbitmqctl cluster_health and with rabbitmq-check.py from the yaook debugbox -> everything happy.
  • Switched to a raw Debian 12 image for the apimon tests as suggested by Kurt, which helped speed up volume creation a lot.
  • Cross-checked/compared the Cinder and RabbitMQ configuration with other working deployments -> nothing suspicious.
  • Read through oslo_messaging and cinder sources to grasp how queues are created -> culprit found, it's a caching issue.
    • Previously, I assumed reply queues for RPC calls are always created on the fly according to the RPC reference from the nova docs: "Every consumer connects to a unique direct-based exchange via a unique exclusive queue; its life-cycle is limited to the message delivery; the exchange and queue identifiers are determined by a UUID generator".
    • However, this is not exactly true: oslo_messaging library creates reply queues with a random name on first use and caches the reply queue name in the RabbitDriver/AMQPDriverBase object, to avoid creating too many queues (see method _get_reply_q() in amqpdriver.py). So basically, there is a cached reply queue name for every cinder API worker (green)thread.
    • There is no code in cinder/oslo_messaging that ensures that the corresponding cached queue still exists.
      • side note: I also suspect there is no check if the queue was physically created and replicated to all rabbitmq replicas in the first place (and I reckon this would be hard to do, if performance matters).
    • During the weekly certificate rotation conducted by the Yaook operator, rabbitmq is reloaded/restarted to pick up the new certs. During this time, some of the reply queues get dropped but cinder API tries to use them anyways and fails.
      • e.g., [error] <0.27146.0> operation queue.declare caused a channel exception not_found: queue 'reply_ed3fbf13f2e7488c9cbd19f0ad13588d' in vhost '/' process is stopped by supervisor
      • observed problems with volume attachments start exactly at that point
  • Ultimately, a simple restart of the cinder-api processes helps (in Yaook terms, a rolling restart of the cinder-api deployment; see the sketch after this list) -> of course that's not a long-term solution.
    • The last spike of resource errors in the health monitor was caused again by this issue, because cert rotation was yesterday night.
    • Other services that use RPC calls can be affected by this as well, such as nova-scheduler, nova-conductor and so on.
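For completeness, a sketch of that interim mitigation; the namespace and deployment name are assumptions and may differ in the actual Yaook deployment:

# After certificate rotation, roll-restart the Cinder API pods so each worker
# re-creates its cached reply queue. Namespace/deployment names are assumptions.
kubectl -n yaook get deployments | grep cinder-api
kubectl -n yaook rollout restart deployment/cinder-api
kubectl -n yaook rollout status deployment/cinder-api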

@martinmo
Member

martinmo commented Jul 3, 2024

Update for the last few days:

  • Rolled out fixes for "RegreSSHion" vulnerability in openssh-server on cluster infra and oshm-driver instance (which hosts the health monitor).
  • Rolled out fixes for OSSA-2024-001 aka CVE-2024-32498 (see Yaook Security Advisory). We used the YAOOK_OP_VERSIONS_OVERRIDE approach to roll out the patched images for nova, nova-compute, cinder and glance without having to wait for the release of new Yaook Operator Helm Charts.
  • Fixed volume attachment quirk ("attached on None") of the oshm-driver instance.
  • Removed leaked volumes and restarted cinder-api again after the last cert rotation.
  • Further investigated a long-term fix for the occasional volume attachment issue caused by stopped RabbitMQ queues:
    • There are different types of RabbitMQ queues, such as classic and quorum. For HA clusters, the vendor recommends quorum queues. Classic queues are deprecated and known to cause troubles during restarts, e.g., see discussions here, here, here and here.
    • Yaook, by default, configures OpenStack components to use quorum queues via the rabbit_quorum_queue=true setting in the [oslo_messaging_rabbit] section.
    • However, checking with rabbitmqctl list_queues name type state reveals that classic queues are still used for the reply queues and some others. (It appears these are the transient queues; see the sketch after this list.)
    • In oslo.messaging, there are some semi recent changes related to this:
    • I am about to evaluate these settings in our virtualized Yaook setup, but I had to upgrade that to Zed first.
    • Note there is, to my knowledge, no other way to force RabbitMQ to only use quorum queues (e.g., via policy).
  • There are some other tuning knobs which I am currently evaluating, especially the worker counts in the affected services. It turns out that basically all services except Glance use the CPU count as the default.
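To make the queue-type check mentioned above concrete, here is a small sketch (to be run on one of the RabbitMQ nodes/pods); the config excerpt only illustrates the option named above and is not a literal copy of our configuration:

# List queue names, types and states; classic (non-quorum) queues show up here.
rabbitmqctl list_queues name type state | grep -v quorum
# For reference, the oslo.messaging option that Yaook sets by default (illustrative excerpt):
# [oslo_messaging_rabbit]
# rabbit_quorum_queue = true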

@martinmo
Member

martinmo commented Jul 5, 2024

Update wrt RabbitMQ / volume service issues:

  • rabbit_transient_quorum_queue can't be used; this setting is too new :(
  • Thanks to a coworker, we found another oslo.messaging setting worth tweaking: heartbeat_rate -> in Zed, default is 2, recommended is 3 (see https://review.opendev.org/c/openstack/oslo.messaging/+/875615)
  • In addition to that, I assembled all the necessary settings for sensible worker counts in the OpenStack services and carefully tested all of that in our virtualized SCS Yaook cluster (see the sketch after this list).
  • I rolled out the corresponding changes in the productive cluster (nova, neutron, glance, cinder, barbican). Because the Nova, Neutron and Cinder APIs were restarting, there was a very short interruption which can be seen in the Health Monitor, but everything is okay again now.
  • We will see on Monday whether these measures helped (remember: certificate rotation + RabbitMQ rolling restart is Sunday night).
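A hedged sketch of the direction of these changes; the option names are the upstream ones, but the concrete values and the override file are placeholders (in practice, Yaook manages this configuration through its own layering):

# Illustrative override with placeholder values: osapi_volume_workers caps the number
# of cinder-api workers, heartbeat_rate is the oslo.messaging setting mentioned above.
cat <<'EOF' > cinder-overrides.conf
[DEFAULT]
osapi_volume_workers = 4

[oslo_messaging_rabbit]
heartbeat_rate = 3
EOF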

@martinmo
Member

Update:

  • Our APImon now uses a dedicated OpenStack project, i.e., not the same project where the oshm-driver instance is running. Because we need to clean up leaked volume leftovers from time to time, this separation makes sense (see the sketch after this list).
  • The oslo.messaging config and worker count changes from my last comment didn't help, unfortunately. The RabbitMQ reply queue issues that appear after the regular certificate rotation + rolling restart remain. But, as I've written above, at least we know when it happens and what can be done to restore service.
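A sketch of that project separation (names, domain and quota values are placeholders):

# Dedicated project and user for the APImon; all names and values are placeholders.
openstack project create --domain default apimon
openstack user create --domain default --project apimon --password-prompt apimon
openstack role add --user apimon --project apimon member
openstack quota set --volumes 20 --gigabytes 200 apimon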

garloff pushed a commit to SovereignCloudStack/openstack-health-monitor that referenced this issue Jul 12, 2024
* Add scripts and configuration for poc-wgcloud
* Cleanup commented out stuff
* Ignore temporary apimon files
* Ignore compressed log files
* Revert changes to run-apimon-in-tmux.sh
* Reduce the load and use n=2 VMs
  -  this seems to be more stable, let's observe it further
  -  see SovereignCloudStack/standards#544

Signed-off-by: Martin Morgenstern <martin.morgenstern@cloudandheat.com>
@martinmo
Member

Update: today we rolled out a new RabbitMQ version and applied tuned RabbitMQ settings for all four (colocated) RabbitMQ instances (each in turn replicated 3x) to reduce CPU contention on the (OpenStack) control plane nodes.

@martinmo
Member

martinmo commented Jul 22, 2024

The last change finally did resolve our RabbitMQ issue. There are no more stopped reply queues left after the rolling restart (i.e., after RabbitMQ failover) and there are also no OpenStack API errors anymore, even during the restart.

@mbuechse
Contributor

Can this be closed?

@martinmo
Member

Actually, yes 🎉
