Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle stale IMDS metadata for secondary IPs #1177

Merged
merged 1 commit into from
Sep 1, 2020

Conversation

mogren
Copy link
Contributor

@mogren mogren commented Aug 28, 2020

Related issues: #1070, #1094

Description of changes:
If the ENI metadata returned from IMDS does not match with what we have in the datastore, we need to verify with EC2 API.

  • Always store the Primary IP of each ENI that gets added
  • If SetupENINetwork(), always remove the ENI we just tried to add from the datastore
  • Call EC2 if the number of secondary IPs is not what we expect
  • If IMDS returns no ENIs at all, abort the reconcile since the primary ENI should never be gone
  • When a Pod IP is out of cool down, only call EC2 once for that ENI
  • When an ENI is completely missing from IMDS, still remove it from the datastore

Successful E2E tests: https://app.circleci.com/pipelines/github/mogren/amazon-vpc-cni-k8s/812/workflows/673a9bc3-0122-4972-b276-dec44ec8f546/jobs/1639

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mogren mogren force-pushed the handle-stale-imds branch 6 times, most recently from 7acdd38 to 8923096 Compare August 29, 2020 00:11
pkg/ipamd/ipamd.go Show resolved Hide resolved
if err != nil {
// Failed to set up the ENI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you planning to test this? One way to test this would be add a sleep before netLink.LinkSetUp and delete the ENI and make sure eniConfig doesn't have this ENI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all we should add a unit test for it.

For manual testing, deleting the ENI and check the logs and introspection endpoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you need to delete ENI in such a way that setupENI calls fails.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added debug code:

	if rand.Intn(2) == 0 {
		log.Debug("XXXX Failing randomly!")
		return errors.New("random linkByMac failure injected")
	}

Output:

{"level":"debug","ts":"2020-08-31T21:29:50.258Z","caller":"ipamd/ipamd.go:476","msg":"Reconcile existing ENI eni-01d689e8a83c6feba IP pool"}
{"level":"debug","ts":"2020-08-31T21:29:50.258Z","caller":"ipamd/ipamd.go:476","msg":"XXXXX Deleted the last IP from attachedENI: %+v"}
{"level":"warn","ts":"2020-08-31T21:29:50.258Z","caller":"ipamd/ipamd.go:968","msg":"Instance metadata does not match data store! ipPool: [192.168.159.31 192.168.134.205 192.168.152.224 192.168.134.35 192.168.129.132 192.168.134.155 192.168.149.75 192.168.128.237 192.168.150.15 192.168.147.91 192.168.156.197 192.168.131.24 192.168.133.50 192.168.152.119], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"192.168.131.160\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.149.75\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.205\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.128.237\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.150.15\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.152.224\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.35\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.129.132\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.156.197\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.131.24\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.155\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.147.91\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.159.31\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.133.50\"\n}]"}
{"level":"debug","ts":"2020-08-31T21:29:50.258Z","caller":"ipamd/ipamd.go:968","msg":"We need to check the ENI status by calling the EC2 control plane."}
{"level":"debug","ts":"2020-08-31T21:29:50.352Z","caller":"ipamd/ipamd.go:1026","msg":"Reconcile and skip primary IP 192.168.131.160 on ENI eni-01d689e8a83c6feba"}
{"level":"debug","ts":"2020-08-31T21:29:50.352Z","caller":"ipamd/ipamd.go:476","msg":"Reconcile existing ENI eni-074c28aba8340f164 IP pool"}
{"level":"debug","ts":"2020-08-31T21:29:50.352Z","caller":"ipamd/ipamd.go:476","msg":"XXXXX Deleted the last IP from attachedENI: %+v"}
{"level":"warn","ts":"2020-08-31T21:29:50.352Z","caller":"ipamd/ipamd.go:968","msg":"Instance metadata does not match data store! ipPool: [192.168.145.78 192.168.145.183 192.168.133.18 192.168.155.72 192.168.143.232 192.168.132.174 192.168.145.239 192.168.133.98 192.168.157.218 192.168.143.209 192.168.153.245 192.168.139.151 192.168.150.195 192.168.150.150], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"192.168.153.162\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.155.72\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.143.232\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.132.174\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.145.78\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.145.239\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.133.98\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.150.195\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.157.218\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.143.209\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.133.18\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.153.245\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.150.150\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.145.183\"\n}]"}
{"level":"debug","ts":"2020-08-31T21:29:50.352Z","caller":"ipamd/ipamd.go:968","msg":"We need to check the ENI status by calling the EC2 control plane."}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:1026","msg":"Reconcile and skip primary IP 192.168.153.162 on ENI eni-074c28aba8340f164"}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:476","msg":"Reconcile and add a new ENI {eni-0c57b7eecd642377d 06:ec:7b:a5:81:5f %!s(int=3) 192.168.128.0/19 [{\n  Primary: true,\n  PrivateIpAddress: \"192.168.159.161\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.144.13\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.154.110\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.147.175\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.138.96\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.145.34\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.146.227\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.155.154\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.137.157\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.149.94\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.142.62\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.129.31\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.149.50\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.178\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.140.51\"\n}]}"}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:742","msg":"DataStore Add an ENI eni-0c57b7eecd642377d"}
{"level":"info","ts":"2020-08-31T21:29:50.453Z","caller":"networkutils/network.go:704","msg":"Setting up network for an ENI with IP address 192.168.159.161, MAC address 06:ec:7b:a5:81:5f, CIDR 192.168.128.0/19 and route table 3"}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"networkutils/network.go:704","msg":"XXXX Failing randomly!"}
{"level":"info","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:754","msg":"RemoveENIFromDataStore eni-0c57b7eecd642377d: IP address pool stats: free 0 addresses, total: 28, assigned: 17"}
{"level":"error","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:476","msg":"IP pool reconcile: Failed to set up ENI eni-0c57b7eecd642377d network: failed to set up ENI eni-0c57b7eecd642377d network: random linkByMac failure injected"}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:476","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2020-08-31T21:29:50.453Z","caller":"ipamd/ipamd.go:476","msg":"IP Address Pool stats: total: 28, assigned: 17"}

pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
log.Errorf("Failed to fetch ENI IP addresses! Aborting reconcile of ENI %s", eni)
return
}
attachedENIIPs = ec2Addresses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the IP is just got allocated? It will be assigned to attachedENIIPs and might get deleted right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the secondary IP just got added outside of IPAMD, so it's not yet seen in the ipPool, but shows up in IMDS metadata. Then we will also call EC2 to verify, then this reconcile function will add it when looping over attachedENIIPs:

c.dataStore.AddIPv4AddressToStore(eni, strPrivateIPv4)

log.Warnf("Instance metadata does not match data store! ipPool: %v, metadata: %v", ipPool, attachedENIIPs)
log.Debugf("We need to check the ENI status by calling the EC2 control plane.")
// Call EC2 to verify IPs on this ENI
ec2Addresses, err := c.awsClient.GetIPv4sFromEC2(eni)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we planning to testing this code path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a unit test step:

{"level":"debug","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd_test.go:669","msg":"IP Address Pool stats: total: 2, assigned: 0"}
{"level":"debug","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd_test.go:689","msg":"Reconciling ENI/IP pool info because time since last 81.451µs <= 0s"}
{"level":"debug","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd_test.go:689","msg":"Reconcile existing ENI eni-00000000 IP pool"}
{"level":"warn","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd.go:963","msg":"Instance metadata does not match data store! ipPool: [10.10.10.12 10.10.10.13], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"10.10.10.11\"\n}]"}
{"level":"debug","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd.go:963","msg":"We need to check the ENI status by calling the EC2 control plane."}
{"level":"error","ts":"2020-08-29T19:52:32.526-0700","caller":"ipamd/ipamd.go:963","msg":"Failed to fetch ENI IP addresses! Aborting reconcile of ENI eni-00000000"}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking if we can manually test the change once ? (Could be done by adding some sleep and triggering the code path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding a temporary change where we just delete the last IP returned by IMDS should work to verify this code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug build with the following change:

	log.Debugf("Reconcile existing ENI %s IP pool", attachedENI.ENIID)
	badIMDS := attachedENI
	addressCount := len(badIMDS.IPv4Addresses)
	if addressCount > 1 {
		badIMDS.IPv4Addresses = badIMDS.IPv4Addresses[:addressCount-1]
	}
	log.Debugf("XXXXX Deleted the last IP from attachedENI: %+v")
	// Reconcile IP pool
	c.eniIPPoolReconcile(eniIPPool, badIMDS)
{"level":"debug","ts":"2020-08-31T21:25:39.922Z","caller":"ipamd/ipamd.go:476","msg":"Reconciling ENI/IP pool info because time since last 1m0.127726217s <= 1m0s"}
{"level":"debug","ts":"2020-08-31T21:25:39.923Z","caller":"ipamd/ipamd.go:908","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2020-08-31T21:25:39.923Z","caller":"awsutils/awsutils.go:568","msg":"Found ENI MAC address: 06:66:05:b5:c1:71"}
{"level":"debug","ts":"2020-08-31T21:25:39.924Z","caller":"awsutils/awsutils.go:582","msg":"Using device number 0 for primary eni: eni-01d689e8a83c6feba"}
{"level":"debug","ts":"2020-08-31T21:25:39.924Z","caller":"awsutils/awsutils.go:568","msg":"Found ENI: eni-01d689e8a83c6feba, MAC 06:66:05:b5:c1:71, device 0"}
{"level":"debug","ts":"2020-08-31T21:25:39.924Z","caller":"awsutils/awsutils.go:588","msg":"Found CIDR 192.168.128.0/19 for ENI 06:66:05:b5:c1:71"}
{"level":"debug","ts":"2020-08-31T21:25:39.924Z","caller":"awsutils/awsutils.go:588","msg":"Found IP addresses [192.168.131.160 192.168.149.75 192.168.134.205 192.168.128.237 192.168.150.15 192.168.152.224 192.168.134.35 192.168.129.132 192.168.156.197 192.168.131.24 192.168.134.155 192.168.147.91 192.168.159.31 192.168.133.50 192.168.152.119] on ENI 06:66:05:b5:c1:71"}
{"level":"debug","ts":"2020-08-31T21:25:39.924Z","caller":"awsutils/awsutils.go:568","msg":"Found ENI MAC address: 06:8f:fe:97:a2:9b"}
{"level":"debug","ts":"2020-08-31T21:25:39.925Z","caller":"awsutils/awsutils.go:568","msg":"Found ENI: eni-074c28aba8340f164, MAC 06:8f:fe:97:a2:9b, device 2"}
{"level":"debug","ts":"2020-08-31T21:25:39.926Z","caller":"awsutils/awsutils.go:588","msg":"Found CIDR 192.168.128.0/19 for ENI 06:8f:fe:97:a2:9b"}
{"level":"debug","ts":"2020-08-31T21:25:39.926Z","caller":"awsutils/awsutils.go:588","msg":"Found IP addresses [192.168.153.162 192.168.155.72 192.168.143.232 192.168.132.174 192.168.145.78 192.168.145.239 192.168.133.98 192.168.150.195 192.168.157.218 192.168.143.209 192.168.133.18 192.168.153.245 192.168.150.150 192.168.145.183 192.168.139.151] on ENI 06:8f:fe:97:a2:9b"}
{"level":"debug","ts":"2020-08-31T21:25:39.926Z","caller":"ipamd/ipamd.go:476","msg":"Reconcile existing ENI eni-01d689e8a83c6feba IP pool"}
{"level":"debug","ts":"2020-08-31T21:25:39.926Z","caller":"ipamd/ipamd.go:476","msg":"XXXXX Deleted the last IP from attachedENI: %+v"}
{"level":"warn","ts":"2020-08-31T21:25:39.926Z","caller":"ipamd/ipamd.go:968","msg":"Instance metadata does not match data store! ipPool: [192.168.134.35 192.168.129.132 192.168.134.155 192.168.159.31 192.168.134.205 192.168.152.224 192.168.150.15 192.168.147.91 192.168.149.75 192.168.128.237 192.168.133.50 192.168.156.197 192.168.131.24 192.168.152.119], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"192.168.131.160\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.149.75\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.205\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.128.237\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.150.15\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.152.224\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.35\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.129.132\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.156.197\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.131.24\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.134.155\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.147.91\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.159.31\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"192.168.133.50\"\n}]"}
{"level":"debug","ts":"2020-08-31T21:25:39.926Z","caller":"ipamd/ipamd.go:968","msg":"We need to check the ENI status by calling the EC2 control plane."}
{"level":"debug","ts":"2020-08-31T21:25:39.993Z","caller":"ipamd/ipamd.go:1026","msg":"Reconcile and skip primary IP 192.168.131.160 on ENI eni-01d689e8a83c6feba"}
{"level":"debug","ts":"2020-08-31T21:25:39.993Z","caller":"ipamd/ipamd.go:476","msg":"Reconcile existing ENI eni-074c28aba8340f164 IP pool"}

pkg/ipamd/ipamd.go Show resolved Hide resolved
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
@mogren mogren force-pushed the handle-stale-imds branch 2 times, most recently from 4d632b0 to ad210ef Compare August 30, 2020 03:41
@mogren mogren requested a review from SaranBalaji90 August 30, 2020 03:49
@mogren mogren force-pushed the handle-stale-imds branch from ad210ef to 7ddba73 Compare August 31, 2020 16:59
@mogren mogren changed the title Don't trust IMDS metadata for secondary IPs Handle stale IMDS metadata for secondary IPs Aug 31, 2020
pkg/ipamd/ipamd.go Outdated Show resolved Hide resolved
@@ -635,3 +637,182 @@ func TestPodENIConfigFlag(t *testing.T) {
disabled = enablePodENI()
assert.False(t, disabled)
}

func TestNodeIPPoolReconcileBadIMDSData(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

Copy link
Contributor

@jayanthvn jayanthvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@SaranBalaji90 SaranBalaji90 merged commit 32c8ee7 into aws:master Sep 1, 2020
}

// Add all known attached IPs to the datastore
seenIPs := c.verifyAndAddIPsToDatastore(eni, attachedENIIPs, needEC2Reconcile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have given different name for this - ipFromEC2, ownedByEC2

// from the EC2 instance outside of the control of ipamd. If this happens, there's nothing
// we can do other than force all pods to be unassigned from the IPs on this eni.
// This scenario can occur if the reconciliation process discovered this ENI was detached
// from the EC2 instance outside of the control of ipamd. If this happens, there's nothing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

French spacing? That's quite a controversial typographical decision ... :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants