Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky e2e test TestFailoverPlayground/*stop_podinfo_on_eu_cluster #1684

Merged
merged 1 commit into from
Aug 2, 2024

Conversation

abaguas
Copy link
Collaborator

@abaguas abaguas commented Aug 2, 2024

The FailoverPlaygroundTest where podinfo is stopped on one of the clusters has been flaky. Example run 1 and run2.

The test deploys a GSLB with failover strategy on two clusters. Then, the app is stopped on the EU cluster and the test sometimes fails because the following line returns an error: err = instanceEU.WaitForAppIsStopped(). This function calls a chain of functions that ends up on the WaitForApp function expecting the app to have 0 replicas and 0 DNSEndpoint targets. In some situations this is the case. However, since the app is running on the US cluster, in other situations the failover happens very quickly (good feature of K8GB) and the US targets are already part of the DNSEndpoints targets: example 1, example 2.
The DNSEndpoint taken from the links above looks as follows, where the targets are the IP addresses of the US cluster:

{
     "apiVersion": "externaldns.k8s.io/v1alpha1",
     "kind": "DNSEndpoint",
     ...
     "spec": {
         "endpoints": [
             {
                 "dnsName": "playground-failover.cloud.example.com",
                 "labels": {
                     "strategy": "failover"
                 },
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.7",
                     "172.18.0.8"
                 ]
             }
         ]
     }
 }

This is an expected output, and should be accepted. To better understand the proposed solution here is how the DNSEndpoint resource looked like before the failover:

 {
     "apiVersion": "externaldns.k8s.io/v1alpha1",
     "kind": "DNSEndpoint",
     ...
     "spec": {
         "endpoints": [
             {
                 "dnsName": "localtargets-playground-failover.cloud.example.com",
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.4",
                     "172.18.0.5"
                 ]
             },
             {
                 "dnsName": "playground-failover.cloud.example.com",
                 "labels": {
                     "strategy": "failover"
                 },
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.4",
                     "172.18.0.5"
                 ]
             }
         ]
     }
 }

Above we can see that in addition to the main playground-failover.cloud.example.com domain there is also a localtargets-playground-failover.cloud.example.com. This localtargets-* domain disappears once the app is stopped, which indicates that the controller learnt that the app was scaled to 0 replicas.
The proposed fix is therefore to check for the targets of the localtargets-* domain. This shows that the K8GB controller behaved as expected and does not depend on the synchronization of records between clusters.

@abaguas abaguas marked this pull request as draft August 2, 2024 15:29
@abaguas abaguas force-pushed the fix/stoppodinfotest branch from b85d0e8 to c8226c2 Compare August 2, 2024 15:30
@@ -449,7 +449,7 @@ func (i *Instance) waitForApp(predicate func(instances int) bool, stop bool) (er
i.w.t.Logf("Wait for coreDNS to be filled by local targets %s", i.w.state.gslb.host)
for n := 0; n < maxRetries/2; n++ {
localTargets := i.GetLocalTargets()
if len(localTargets) == 0 {
if (!stop && len(localTargets) == 0) || (stop && len(localTargets) != 0) {
Copy link
Collaborator Author

@abaguas abaguas Aug 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is rather a follow up of #1682 since the number of localTargets should be 0 if the action is stop. It helps keeping the code more understandable but it didn't cause any issues since in that case the code would return already on line 431.

@abaguas abaguas force-pushed the fix/stoppodinfotest branch from c8226c2 to b03e6d9 Compare August 2, 2024 16:05
Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>
@abaguas abaguas force-pushed the fix/stoppodinfotest branch from b03e6d9 to ac5e10b Compare August 2, 2024 16:06
@abaguas abaguas marked this pull request as ready for review August 2, 2024 16:31
Copy link
Member

@ytsarev ytsarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another great find 👍 Thank you so much!

@ytsarev ytsarev merged commit ba5dea3 into k8gb-io:master Aug 2, 2024
15 checks passed
@abaguas abaguas deleted the fix/stoppodinfotest branch November 1, 2024 11:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants