[Other] Setup productive SCS cluster at Cloud&Heat with Yaook on bare metal as FITKO PoC #544
(Note: we initially used #414 as the sole issue to track the provisioning of both the virtual and the bare metal test clusters. For the sake of better documentation, we retroactively created this separate issue for the bare metal setup.)

Current state: the installation is completed and we have a working bare metal Yaook cluster. Thanks to the contributions of @cah-hbaum, this OpenStack cluster is already SCS compliant, i.e., it fulfills the stabilized standards of the IaaS track (details can be found in #415 and in https://github.com/SovereignCloudStack/standards/tree/do-not-merge/scs-compliant-yaook/Informational).

Here's a recap of what happened until we got to this point: Initial preparation for the Yaook bare metal installation started at the beginning of March. This involved a "rehearsal" of the installation procedure on an existing, simpler test cluster, because this was my first time conducting such an install and the hardware for the actual installation had not yet been commissioned. Already during the rehearsal we had network setup issues we needed to work around:
The installation of the bare metal test cluster was conducted between March 11th and March 19th, but we again ran into a lot of technical difficulties. Debugging and fixing these was more time-consuming than usual because I am not yet 100% accustomed to the interactions of all the components.
The next step will involve moving the hardware from our facility to its final location.
@cah-hbaum Please provide the YAML output for SCS-compatible IaaS v4 to prove that the cluster is SCS compliant.
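For reference, a hypothetical way to produce that YAML with the compliance checker from this repository's Tests directory; the script and spec file names are taken from the repo, but the exact flags are assumptions and should be checked against the Tests README:

```sh
# Hypothetical invocation of the SCS compliance checker; flags are
# assumptions, only the file locations are taken from this repository.
python3 Tests/scs-compliance-check.py Tests/scs-compatible-iaas.yaml \
    --output report.yaml   # report.yaml would hold the requested YAML output
```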
Status update:
Status update for multiple working days:
Status update:
@shmelkin Can you please share how you benchmarked the VM? I would like to add it to the docs as a sample benchmark. We have only documented fio so far.
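For comparison, this is the kind of fio run we have documented so far; the device path, block size, and runtime are only example values:

```sh
# Example fio run against an attached volume (all parameters are examples);
# measures 4k random-write IOPS for 60 seconds using direct I/O.
fio --name=randwrite-test \
    --filename=/dev/vdb \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=4 \
    --runtime=60 --time_based \
    --group_reporting
```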
Okay, this is interesting. Basically, it's a "regression" compared to the OVS setups we are used to. In Open vSwitch/L3 agent based setups, the NAT rules for ingress (and I suppose also egress) traffic of floating IPs were set up no matter whether the port to which the floating IP was bound was ACTIVE or DOWN. In OVN, the NAT rules are only set up when the port is up. That breaks a specific use case: the use of VRRP/keepalived in VMs to implement custom load balancers or other HA solutions. (In particular, this breaks yaook/k8s, which we tried to run as a "burn-in" test.) I'll bring this up in next week's IaaS call.
We looked more into the OVN issue and it seems the only viable workaround is using allowed address pairs, as sketched below.
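A minimal sketch of that workaround with the OpenStack CLI, assuming a VIP of 192.0.2.10; network and port names are placeholders:

```sh
# Reserve the VIP in the subnet with an unassigned Neutron port
# (placeholder names; this step can be skipped with static addressing).
openstack port create --network demo-net \
    --fixed-ip ip-address=192.0.2.10 vrrp-vip-port

# Permit the VIP as an additional address on the ports of both
# keepalived VMs so the traffic passes the anti-spoofing filters.
openstack port set --allowed-address ip-address=192.0.2.10 vm1-port
openstack port set --allowed-address ip-address=192.0.2.10 vm2-port
```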
In osism/terraform-base (used by osism/testbed) we do it this way (allowed-address) as well (VRRP is only used inside the virtual network and the managed VIP is only accessed from inside the same virtual network):
We do not reserve the VIPs by creating unassigned Neutron ports because we work with static IP addresses in osism/terraform-base, so this is not necessary there. It also looks as if this has always been the way, independent of OVN. At least https://www.codecentric.de/wissens-hub/blog/highly-available-vips-openstack-vms-vrrp comes from a time when, IMO, there was no OVN in OpenStack (or no OVN at all?). I searched for some more references:
I think @berendt is right; if this worked without allowed address pairs, something would be wrong. By default, strict filters only allow the configured subnets and associated MACs to pass. If it worked the other way around, arbitrary L2 or L3 traffic would be allowed to flow, which would of course be insecure.
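For completeness, the insecure alternative would be switching off port security on the port entirely, which drops those filters altogether; shown here only to illustrate the point, not as a recommendation (port name is a placeholder):

```sh
# Disables all MAC/IP anti-spoofing filters on the port; security groups
# must be removed first. Insecure, illustration only.
openstack port set --no-security-group --disable-port-security vm1-port
```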
Summary of working day:
Summary of multiple working days:
We will upgrade the deployment to the current release of Yaook and Kubernetes 1.27.10 to hopefully solve the Cinder issue and provide an up-to-date-ish version. Today I started by preparing the necessary artifacts (WIP).
We also need to back up/snapshot the health monitor, which will be tested later today and done right before the upgrade. I will document further progress here.
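For the snapshot, something along these lines should do; the server and image names are placeholders for the actual deployment:

```sh
# Create a snapshot image of the health monitor VM before the upgrade
# (names are placeholders, not the actual resource names).
openstack server image create --name healthmonitor-pre-upgrade healthmonitor
```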
As outlined in the last comment, the deployment was upgraded to the current release of Yaook in an effort to solve the Cinder issues. Right now, the health monitor VM isn't running anymore and we're in the process of restoring its operation (this means the links and badges in the README currently don't work or show the compliance test as failed).
After a whole lot of trouble, the deployment is ready and functional.
How do we deal with this? From SCS's point of view, we require OpenStack 2023.2, not Zed, as the minimum version.
Oh really? Where exactly is that mentioned? I couldn't find it during my quick search.
@berendt Regarding the OpenStack version, there is currently nothing in the standards (to my knowledge). We require OpenStack Powered Compute 2022.11 alongside the standards referenced here: https://docs.scs.community/standards/scs-compatible-iaas. Is there any functional reason why Zed (as in this case) would not be sufficient?
So why do we take on the burden of going through every upgrade in the reference implementation if it is not necessary? I had actually assumed that we want to demand a very up-to-date OpenStack (and also Ceph and Kubernetes). We also require the CSPs to have upgraded within one month. Would that no longer be necessary?
The discussion about the update window of the reference implementation is (IMHO) not directly connected, since that discussion is about providing support, security updates, etc. for the reference implementation. Having an up-to-date reference implementation is worth pursuing beyond pure standard conformance, IMHO.
Update for multiple working days:
Today I noticed via the health monitor that the volume service was misbehaving. This was caused by dangling volumes which were still shown as in-use, which, I think, leaves the Cinder database in an inconsistent state. I stopped the apimon temporarily.
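A sketch of how such dangling volumes could be found and reset with admin credentials; the state value and volume ID are examples, and resetting the state only touches the Cinder DB, not the backend:

```sh
# List volumes across all projects to spot dangling ones (admin only).
openstack volume list --all-projects --long

# Force a volume's state in the Cinder DB back to a sane value
# (example only; this does not clean up backend attachments).
openstack volume set --state available <volume-id>
```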
I can only reproduce this with the load generated by the apimon. Volume operations do not always fail, but they do so quite reliably. Based on my findings, something is going on with the message queue. In the Cinder API logs, I see timeouts waiting for a reply to an operation. Here is a relevant example snippet from the Cinder volume manager logs:
And the matching ERROR with the same message ID 1ffd3e1fb22d4007bc6b16c5d784f430 from the Cinder API logs:
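One way to inspect the oslo.messaging reply queues for such stuck replies; the namespace and pod names are assumptions for a Yaook deployment and need to be adjusted:

```sh
# List the reply_* queues with their state, backlog, and consumer count
# (namespace/pod names are assumptions; adjust to your deployment).
kubectl -n yaook exec rabbitmq-0 -- \
    rabbitmqctl list_queues name state messages consumers | grep '^reply_'
```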
Summary for multiple working days:
Update for the last few days:
Update wrt RabbitMQ / volume service issues:
Update:
Commit referenced in this issue:

* Add scripts and configuration for poc-wgcloud
* Cleanup commented out stuff
* Ignore temporary apimon files
* Ignore compressed log files
* Revert changes to run-apimon-in-tmux.sh
* Reduce the load and use n=2 VMs
  - this seems to be more stable, let's observe it further
  - see SovereignCloudStack/standards#544

Signed-off-by: Martin Morgenstern <martin.morgenstern@cloudandheat.com>
Update: today we rolled out a new RabbitMQ version and applied tuned RabbitMQ settings for all four (colocated) RabbitMQ instances (each in turn replicated 3x) to reduce CPU contention on the (OpenStack) control plane nodes.
The last change finally resolved our RabbitMQ issue. There are no more stopped reply queues left after a rolling restart (i.e., after RabbitMQ failover), and there are no OpenStack API errors anymore, even during the restart.
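A quick way to verify this after a rolling restart, again with assumed namespace and pod names; any queue left in a non-running state would indicate the old problem:

```sh
# Check overall cluster health and look for queues that are not running
# after a failover (namespace/pod names are assumptions).
kubectl -n yaook exec rabbitmq-0 -- rabbitmqctl cluster_status
kubectl -n yaook exec rabbitmq-0 -- \
    rabbitmqctl list_queues name state | grep -vw running || true
```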
Can this be closed?
Actually, yes 🎉
Provide a productive SCS cluster as a PoC for FITKO. The cluster MUST be set up with Yaook, the open source lifecycle management tool for OpenStack and K8s, and MUST be SCS compliant.
In contrast to #414, the productive SCS cluster is set up on bare metal.