instances: initial implementation of instancesV2 interface #131
Hi @nicolehanjing. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
// parseInstanceIDFromProviderID parses the node's instance ID based on the well-known provider ID format:
// * aws://<availability-zone>/<instance-id>
// This function always assumes a valid providerID format was provided.
func parseInstanceIDFromProviderID(providerID string) (string, error) {
@andrewsykim is that intended to only parse well-formatted providerID? Should I take care of invalid cases?
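One way the invalid cases could be handled is sketched below. This is not the function from the PR: `parseInstanceID` is a hypothetical helper, and it assumes that only the two-part form `aws://<availability-zone>/<instance-id>` quoted in the comment above is accepted and that instance IDs begin with `i-`. Any other variant is rejected with an error.

```go
package main

import (
	"fmt"
	"strings"
)

// parseInstanceID extracts the instance ID from a provider ID of the form
// aws://<availability-zone>/<instance-id> and returns an error for anything else.
// Hypothetical helper for illustration; not the implementation from this PR.
func parseInstanceID(providerID string) (string, error) {
	const prefix = "aws://"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("provider ID %q does not start with %q", providerID, prefix)
	}

	// After the scheme we expect exactly "<availability-zone>/<instance-id>".
	parts := strings.Split(strings.TrimPrefix(providerID, prefix), "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", fmt.Errorf("provider ID %q is not of the form aws://<availability-zone>/<instance-id>", providerID)
	}

	// Assumption: EC2 instance IDs start with "i-".
	if !strings.HasPrefix(parts[1], "i-") {
		return "", fmt.Errorf("provider ID %q has an unexpected instance ID %q", providerID, parts[1])
	}
	return parts[1], nil
}
```

Calling `parseInstanceID("aws://us-west-2a/i-0123456789abcdef0")` yields the instance ID, while a bare instance ID or a provider ID with a missing segment yields an error instead of a silently wrong value.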
pkg/providers/v2/instances.go
Outdated
if ec2Instance.State != nil {
	state := aws.StringValue(ec2Instance.State.Name)
	if state == ec2.InstanceStateNameTerminated || state == ec2.InstanceStateNameStopping || state == ec2.InstanceStateNameStopped {
@andrewsykim A little confused on how we define "shutdown"
I feel that states after "shutting down" (terminated, stopping, stopped) should all be considered as a "shutdown" state, is that right? 🤔
I think "shutdown" is really referring to "stopped" here. The key difference from "terminated" is that a stopped instance can go back to the running state, whereas terminated instances are gone for good.
So we should remove the check for ec2.InstanceStateNameTerminated here.
gotcha, thanks for the explanation!
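The distinction drawn in this thread can be sketched as a small predicate. This is an illustration under assumptions, not the PR's code: the state names are written as plain strings so the sketch does not need the AWS SDK (the PR uses the `ec2.InstanceStateName*` constants), `isShutdown` is a hypothetical name, and treating "stopping" as shutdown follows the suggestion to drop only the terminated check.

```go
package main

// EC2 instance state names as plain strings, so the sketch has no SDK
// dependency. The PR itself uses the ec2.InstanceStateName* constants.
const (
	stateRunning    = "running"
	stateStopping   = "stopping"
	stateStopped    = "stopped"
	stateTerminated = "terminated"
)

// isShutdown reports whether an instance counts as "shutdown": it is stopping
// or stopped and could return to the running state. A terminated instance is
// gone for good, so it is deliberately not treated as shutdown here.
// Hypothetical helper for illustration.
func isShutdown(state string) bool {
	return state == stateStopping || state == stateStopped
}
```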
f82b19d
to
0d1ef93
Compare
pkg/providers/v2/instances.go
Outdated
var err error
var ec2Instance *ec2.Instance
if node.Spec.ProviderID == "" {
	// TODO: support node name policy other than private DNS names
Not sure if I have enough context here, can you share more input? :) @andrewsykim
In the existing implementation, only private DNS is allowed for a node's name (see kubernetes/kubernetes#52241). We should allow other naming policies. I think a reasonable starting point is allowing the node name to be either the private DNS or the instance name
I'm thinking that maybe we need a ConfigMap or something to store this information. For this PR maybe just implementing with private DNS is sufficient.
gotcha, thanks for the context!
After looking into kops and EKS in some detail, I don't think we would want "arbitrary" names, or at least if we do, it would have to come with huge caveats around node security.
In particular, EKS and kops-controller match an instance ID attestation with the privateDNSName, via an STS-based authentication webhook in EKS using aws-iam-authenticator, or via issuance of a certificate with the privateDNSName as the node name on presentation of the AWS instance identity document in kops-controller's case (CAPI will likely also implement the latter).
In theory, this means we can securely support instanceID, privateDNSName or other unique identifiers on the EC2 DescribeInstances, but beyond that, it would have to come with big "you don't want to use this" warnings.
Also, why would we use a ConfigMap vs. say ComponentConfig for the controller?
EKS currently has a lot of built-in assumptions on private DNS. That being said, instance ID is attractive because it's guaranteed to be unique. I'm ok with making it possible for other names, but we will need to be able to restrict it for EKS to stuff available to DescribeInstances, like @randomvariable said. I don't think a ConfigMap is a good idea though. It seems like that suggests we want to allow on-the-fly reconfiguration, which for EKS we definitely wouldn't want. Guessing, but if we allowed customers to make changes here then we'd probably want it protected behind an EKS API, which would be easier to do if it was configuration passed in via flags or a ComponentConfig file on disk, and required a restart.
> Also, why would we use a ConfigMap vs. say ComponentConfig for the controller?
I'm not particularly tied to using ConfigMap here, but if we use a config file it should ideally be yaml/json and not INI. I'm not sure ComponentConfig is relevant here since this is a config file read by aws-cloud-controller-manager, not a config for any of its options/flags.
> EKS currently has a lot of built-in assumptions on private DNS. That being said, instance ID is attractive because it's guaranteed to be unique. I'm ok with making it possible for other names, but we will need to be able to restrict it for EKS to stuff available to DescribeInstances
Sounds like there needs to be a broader discussion on this topic for sure. @nicolehanjing since we know for sure that we still want to support private DNS, let's get this PR only working with that and make sure we have a follow-up PR to support instance ID and other naming policies.
gotcha! thanks for the info!
0d1ef93
to
ad6dcd6
Compare
a14a944
to
72c968f
Compare
	}
}

func TestInstanceExists(t *testing.T) {
@andrewsykim Updated the unit tests, let me know if you have any other suggestions :)
We'll need some follow-up issues to do the following, but they are not blocking this PR:
/ok-to-test
+1, we should definitely add the first one before removing the alpha gating env var (but in a follow-up PR): cloud-provider-aws/cmd/aws-cloud-controller-manager/main.go, lines 104 to 108 in 4eef54c
pkg/providers/v2/instances.go
Outdated
func (i *instances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	var err error
	if node.Spec.ProviderID == "" {
		_, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
I wouldn't mind some level 4 (or maybe higher) logging here that printed a line "looking for node X by private DNS name".
gotcha, will add!
pkg/providers/v2/instances.go
Outdated
	}
}

_, err = i.getInstanceByProviderID(ctx, node.Spec.ProviderID)
Another log line at 4 or higher for "looking for node by provider ID".
pkg/providers/v2/instances.go
Outdated
	if err != nil {
		return false, err
	}
}
If the instance does exist by private DNS name, and the provider ID is empty, we are still falling through this block and calling getInstanceByProviderID()? Should we have an else statement for the provider ID check?
Updated! I wrapped the logic in an if-else block, which ensures we only call getInstanceByProviderID() when the providerID is not empty.
pkg/providers/v2/instances.go
Outdated
if node.Spec.ProviderID == "" {
	_, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
	if err == cloudprovider.InstanceNotFound {
		return false, nil
If we group this entire thing into an if else, then we can also add a single err != nil and err == cloudprovider.InstanceNotFound at the end of the function, and add logs (maybe at level 6?) for not found instances.
pkg/providers/v2/instances.go
Outdated
var ec2Instance *ec2.Instance
if node.Spec.ProviderID == "" {
	// TODO: support node name policy other than private DNS names
	ec2Instance, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
Instead of having the getInstanceByPrivateDNSName and then getInstanceByProviderID logic in multiple places, can we put all that logic into a getInstance(ctx context.Context, node *Node) (*ec2.Instance, error) function? That way it will be easier to refactor or add other forms of getInstance in the future.
The only difference between these two functions is the EC2 request. I think we could put the logic into a getInstance(ctx context.Context, node *Node) (*ec2.Instance, error) function, but inside that function we need to differentiate the request based on the type of node info given.
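A rough sketch of how a single function might build a different request depending on the node info follows. The names are assumptions made for illustration: `describeRequest` is a simplified stand-in for `ec2.DescribeInstancesInput`, and `buildDescribeRequest` is a hypothetical helper that takes an already parsed instance ID. `private-dns-name` is the DescribeInstances filter name for the private DNS lookup.

```go
package main

// describeRequest is a hypothetical, simplified stand-in for
// ec2.DescribeInstancesInput.
type describeRequest struct {
	InstanceIDs []string
	Filters     map[string][]string
}

// buildDescribeRequest queries by instance ID when one is known and falls
// back to filtering on the private DNS name (the node name) otherwise.
func buildDescribeRequest(nodeName, instanceID string) describeRequest {
	if instanceID != "" {
		return describeRequest{InstanceIDs: []string{instanceID}}
	}
	return describeRequest{
		Filters: map[string][]string{"private-dns-name": {nodeName}},
	}
}
```

Everything after the request is built (calling DescribeInstances, checking the result count, inspecting the state) can then be shared between the two cases.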
94c075e
to
7ca94a4
Compare
// getInstance returns the instance if the instance with the given node info still exists.
// If false an error will be returned, the instance will be immediately deleted by the cloud controller manager.
func (i *instances) getInstance(ctx context.Context, node *v1.Node) (*ec2.Instance, error) {
@nckturner Updated!
- added the logs
- unified the two functions getInstanceByProviderID and getInstanceByPrivateDNSName into getInstance; the only difference is the EC2 request input
- likewise unified the two functions InstanceShutdownByProviderID and InstanceShutdownByPrivateDNSName into InstanceShutdown
PTAL! :)
pkg/providers/v2/instances.go
Outdated
request = &ec2.DescribeInstancesInput{
	InstanceIds: []*string{aws.String(instanceID)},
	Filters: []*ec2.Filter{
		newEc2Filter("instance-state-name", aliveFilter...),
It may be better to describe without this aliveFilter and instead filter out terminated instances locally like v1 did. I do not have numbers or documentation, but my understanding is that filters can affect DescribeInstances performance. cc kubernetes/kubernetes#78140
gotcha, will update! Thanks for the context
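Filtering locally, as suggested, might look like the sketch below. It assumes a pared-down `instance` type in place of `*ec2.Instance` and a hypothetical `dropTerminated` helper; it is not the v1 code referred to above.

```go
package main

// instance is a pared-down, hypothetical stand-in for *ec2.Instance.
type instance struct {
	ID    string
	State string
}

// dropTerminated returns the instances whose state is not "terminated",
// preserving their order. The API call itself carries no state filter.
func dropTerminated(in []instance) []instance {
	out := make([]instance, 0, len(in))
	for _, inst := range in {
		if inst.State != "terminated" {
			out = append(out, inst)
		}
	}
	return out
}
```

The trade-off is that terminated instances still come back over the wire, in exchange for a simpler DescribeInstances request.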
7ca94a4
to
c7dafbc
Compare
Really minor comments
pkg/providers/v2/instances.go
Outdated
_, err := i.getInstance(ctx, node)

if err == cloudprovider.InstanceNotFound {
	klog.V(6).Infof("instance not found")
Log the node name here for the instance:
klog.V(6).Infof("instance not found for node: %s", node.Name)
nodeName := "ip-192-168-0-1.ec2.internal"

tests := []struct {
These tests are looking great :)
pkg/providers/v2/instances_test.go
Outdated
tests := []struct {
	name string
	node *v1.Node
	expectedEc2Output *ec2.DescribeInstancesOutput
I would call this field mockedEC2Output instead, since we are not actually validating against this as an "expected" value.
same applies for other tests
gotcha, will update!
c7dafbc
to
1c1ac06
Compare
/retest
ccc5ee3
to
278380e
Compare
@andrewsykim updated! PTAL :)
pkg/providers/v2/instances.go
Outdated
if err == ErrInstanceTerminated {
	klog.V(6).Infof("instance terminated for node: %s", node.Name)
	return true, nil
ErrInstanceTerminated is for terminated instances, which I don't think applies to the "shutdown" case. Checking ec2.InstanceStateNameStopped might be enough for checking shutdown.
Oh I see you're already checking that below. In that case for terminated state this should return false, nil or false, err
I think this whole block can be simplified actually to:
ec2Instance, err := i.getInstance(ctx, node)
if err != nil {
	return false, err
}
gotcha, updated!
278380e
to
9945452
Compare
func (i *instances) InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error) {
	ec2Instance, err := i.getInstance(ctx, node)
	if err != nil {
		return false, err
@andrewsykim Updated the shutdown checks, PTAL!
Let me know if you have any other suggestions :)
}

state := instances[0].State.Name
if *state == ec2.InstanceStateNameTerminated {
If DescribeInstances returns only 1 instance and that instance has state InstanceStateNameTerminated, then I think we should treat it similar to the case of len(instances) == 0 and here return nil, cloudprovider.InstanceNotFound.
gotcha, updated!
I was thinking to have a different log for terminated instances, but as "terminated" is just an intermediate state I agree that we should treat it the same as 'not exist'
pkg/providers/v2/instances.go
Outdated
}

if len(instances) > 1 {
	return nil, fmt.Errorf("getInstance: multiple instances found")
nit: this should use errors.New
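Taken together, the review comments on the result handling (no instances, more than one instance, a single terminated instance) suggest roughly the following decision logic. The sketch is self-contained and makes assumptions: `instance` is a simplified stand-in for `*ec2.Instance`, `errInstanceNotFound` stands in for `cloudprovider.InstanceNotFound`, and `pickInstance` is a hypothetical name. The fixed error message is built with `errors.New`, as the nit suggests, since it has no format arguments.

```go
package main

import "errors"

// errInstanceNotFound stands in for cloudprovider.InstanceNotFound.
var errInstanceNotFound = errors.New("instance not found")

// errMultipleInstances is a constant message, so errors.New is enough.
var errMultipleInstances = errors.New("getInstance: multiple instances found")

// instance is a pared-down, hypothetical stand-in for *ec2.Instance.
type instance struct {
	ID    string
	State string
}

// pickInstance reduces a DescribeInstances-style result to one instance.
// Zero results and a single terminated result are both reported as not found.
func pickInstance(instances []instance) (*instance, error) {
	if len(instances) == 0 {
		return nil, errInstanceNotFound
	}
	if len(instances) > 1 {
		return nil, errMultipleInstances
	}
	if instances[0].State == "terminated" {
		return nil, errInstanceNotFound
	}
	return &instances[0], nil
}
```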
9945452
to
c71bade
Compare
I think this is a reasonable starting point. Nothing added here is binding so we can revisit some decisions in a follow-up PR. Some functionality we need to revisit is:
- Cluster tagging support
- Support instance ID as the node name
- Configuration API to store the various things that are configurable.
Thanks @nicolehanjing
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andrewsykim, nicolehanjing The full list of commands accepted by this bot can be found here. The pull request process is described here
What type of PR is this?
/kind feature
What this PR does / why we need it:
Take over the initial work here: #127
This is a first pass at implementing Instances.
Some TODOs:
Which issue(s) this PR fixes:
Part of #125
Special notes for your reviewer:
Does this PR introduce a user-facing change?: