Fix flaky e2e test `TestFailoverPlayground/*stop_podinfo_on_eu_cluster` #1684

abaguas · 2024-08-02T15:29:41Z

The TestFailoverPlayground subtest in which podinfo is stopped on one of the clusters has been flaky. Example runs: run 1 and run 2.

The test deploys a GSLB with the failover strategy on two clusters. The app is then stopped on the EU cluster, and the test sometimes fails because the following line returns an error: err = instanceEU.WaitForAppIsStopped(). This function calls a chain of functions that ends in the waitForApp function, which expects the app to have 0 replicas and the DNSEndpoint to have 0 targets. Sometimes this is the case. However, because the app is still running on the US cluster, the failover sometimes happens so quickly (a good feature of K8GB) that the US targets are already part of the DNSEndpoint targets: example 1, example 2.
The DNSEndpoint taken from the links above looks as follows, where the targets are the IP addresses of the US cluster:

{
     "apiVersion": "externaldns.k8s.io/v1alpha1",
     "kind": "DNSEndpoint",
     ...
     "spec": {
         "endpoints": [
             {
                 "dnsName": "playground-failover.cloud.example.com",
                 "labels": {
                     "strategy": "failover"
                 },
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.7",
                     "172.18.0.8"
                 ]
             }
         ]
     }
 }

This is an expected output and should be accepted. To better understand the proposed solution, here is what the DNSEndpoint resource looked like before the failover:

 {
     "apiVersion": "externaldns.k8s.io/v1alpha1",
     "kind": "DNSEndpoint",
     ...
     "spec": {
         "endpoints": [
             {
                 "dnsName": "localtargets-playground-failover.cloud.example.com",
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.4",
                     "172.18.0.5"
                 ]
             },
             {
                 "dnsName": "playground-failover.cloud.example.com",
                 "labels": {
                     "strategy": "failover"
                 },
                 "recordTTL": 5,
                 "recordType": "A",
                 "targets": [
                     "172.18.0.4",
                     "172.18.0.5"
                 ]
             }
         ]
     }
 }

Above we can see that, in addition to the main playground-failover.cloud.example.com domain, there is also a localtargets-playground-failover.cloud.example.com domain. This localtargets-* domain disappears once the app is stopped, which indicates that the controller has learnt that the app was scaled to 0 replicas.
The proposed fix is therefore to check the targets of the localtargets-* domain instead. An empty set of local targets shows that the K8GB controller behaved as expected, and the check does not depend on the synchronization of records between clusters.
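
To illustrate the idea, here is a minimal, self-contained sketch that extracts only the localtargets-* targets from the two DNSEndpoint documents shown above. It is not the code used by the test suite: the helper localTargetsOf, the trimmed structs, and the constants beforeFailover and afterFailover are hypothetical names introduced for this example, and the JSON is reduced to the fields that matter here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// endpoint and dnsEndpoint mirror only the DNSEndpoint fields used in this sketch.
type endpoint struct {
	DNSName string   `json:"dnsName"`
	Targets []string `json:"targets"`
}

type dnsEndpoint struct {
	Spec struct {
		Endpoints []endpoint `json:"endpoints"`
	} `json:"spec"`
}

// Trimmed copies of the two resources from the description (hypothetical constants).
const beforeFailover = `{"spec":{"endpoints":[
 {"dnsName":"localtargets-playground-failover.cloud.example.com","targets":["172.18.0.4","172.18.0.5"]},
 {"dnsName":"playground-failover.cloud.example.com","targets":["172.18.0.4","172.18.0.5"]}]}}`

const afterFailover = `{"spec":{"endpoints":[
 {"dnsName":"playground-failover.cloud.example.com","targets":["172.18.0.7","172.18.0.8"]}]}}`

// localTargetsOf (hypothetical helper) collects the targets of every endpoint
// whose dnsName starts with "localtargets-" and ignores all other records.
func localTargetsOf(raw []byte) ([]string, error) {
	var ep dnsEndpoint
	if err := json.Unmarshal(raw, &ep); err != nil {
		return nil, err
	}
	var targets []string
	for _, e := range ep.Spec.Endpoints {
		if strings.HasPrefix(e.DNSName, "localtargets-") {
			targets = append(targets, e.Targets...)
		}
	}
	return targets, nil
}

func main() {
	for name, doc := range map[string]string{"before": beforeFailover, "after": afterFailover} {
		targets, err := localTargetsOf([]byte(doc))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s failover: %d local targets\n", name, len(targets))
	}
}
```

After the failover the main record still carries the two US addresses, but the local target list is empty, which is the condition a "stopped" check can rely on regardless of how fast the other cluster takes over.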

abaguas · 2024-08-02T16:04:42Z

terratest/utils/extensions.go

@@ -449,7 +449,7 @@ func (i *Instance) waitForApp(predicate func(instances int) bool, stop bool) (er
 	i.w.t.Logf("Wait for coreDNS to be filled by local targets %s", i.w.state.gslb.host)
 	for n := 0; n < maxRetries/2; n++ {
 		localTargets := i.GetLocalTargets()
-		if len(localTargets) == 0 {
+		if (!stop && len(localTargets) == 0) || (stop && len(localTargets) != 0) {


This part is more of a follow-up to #1682, since the number of localTargets should be 0 if the action is stop. It helps keep the code more understandable, but the old condition didn't cause any issues because in that case the code would already have returned on line 431.
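
The new condition can be read in isolation as a small predicate. The sketch below assumes that the branch guarded by the condition means "not in the desired state yet, keep waiting"; the body of that branch is not visible in the diff, and the function name shouldRetry is hypothetical.

```go
package main

import "fmt"

// shouldRetry (hypothetical name) evaluates the condition from the diff:
// when starting the app we keep waiting while there are no local targets,
// when stopping it we keep waiting while local targets are still present.
func shouldRetry(stop bool, localTargets []string) bool {
	return (!stop && len(localTargets) == 0) || (stop && len(localTargets) != 0)
}

func main() {
	some := []string{"172.18.0.4", "172.18.0.5"}
	fmt.Println("start, no targets:  ", shouldRetry(false, nil))
	fmt.Println("start, with targets:", shouldRetry(false, some))
	fmt.Println("stop, no targets:   ", shouldRetry(true, nil))
	fmt.Println("stop, with targets: ", shouldRetry(true, some))
}
```

The expression is logically equivalent to stop != (len(localTargets) == 0); the longer form used in the diff spells out both cases explicitly.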

Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>

ytsarev

Another great find 👍 Thank you so much!

abaguas requested review from donovanmuller, k0da, kuritka, ytsarev and jkremser as code owners August 2, 2024 15:29

abaguas marked this pull request as draft August 2, 2024 15:29

abaguas force-pushed the fix/stoppodinfotest branch from b85d0e8 to c8226c2 Compare August 2, 2024 15:30

abaguas force-pushed the fix/stoppodinfotest branch from c8226c2 to b03e6d9 Compare August 2, 2024 16:05

Fix flaky e2e test TestFailoverPlayground/*stop_podinfo_on_eu_cluster

ac5e10b

Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>

abaguas force-pushed the fix/stoppodinfotest branch from b03e6d9 to ac5e10b Compare August 2, 2024 16:06

abaguas marked this pull request as ready for review August 2, 2024 16:31

ytsarev approved these changes Aug 2, 2024

ytsarev merged commit ba5dea3 into k8gb-io:master Aug 2, 2024
15 checks passed

abaguas deleted the fix/stoppodinfotest branch November 1, 2024 11:09
