
feat: ability to emit Kubernetes events #411

Merged · 17 commits · Apr 28, 2021

Conversation

trutx (Contributor) commented on Apr 15, 2021:

Fixes #414

Description of changes:

This PR adds the ability to emit Kubernetes events when interruption signals are received and when actions like cordon or drain are taken on nodes.

Kubernetes events are very useful for monitoring. At New Relic we push all cluster events up to New Relic One using an eventrouter and then create alerts and dashboards based on those events. This PR makes NTH emit a Kubernetes event every time a signal is received from AWS and every time an action is taken on the node NTH is running on, regardless of whether it succeeds or fails.

The involvedObject for these events is the Kubernetes node NTH runs on, so they are published in the default namespace. To list them, run `kubectl get events --field-selector source=aws-node-termination-handler`.

Additionally, annotations can be attached to each event. Such annotations are useful for filtering and discovery, e.g. each event can be stamped with an annotation stating the cluster the event belongs to, the team owning the cluster, etc. These annotations can later be used in the New Relic UI to filter events down to a specific subset via an NRQL query, or to filter the results of a `kubectl get events` query piped to a processor like jq or yq.

@@ -320,11 +335,14 @@ func drainOrCordonIfNecessary(interruptionEventStore *interruptioneventstore.Sto
log.Err(err).Msgf("node '%s' not found in the cluster", nodeName)
} else {
log.Err(err).Msg("There was a problem while trying to cordon and drain the node")
metrics.NodeActionsInc("cordon-and-drain", nodeName, err)
trutx (author) commented on Apr 15, 2021:

Added the metrics line here too so a metric is also created when an error occurs.

trutx (author) commented on Apr 15, 2021:

Looks like tests pass and fail randomly; they passed locally before I pushed the PR. Will check tomorrow.

trutx (author) commented on Apr 16, 2021:

Sometimes the last CordonAndDrain event wasn't caught because the test exited too quickly. Adding a retry loop solved the issue.

trutx (author) commented on Apr 21, 2021:

I tried this PR in a real-world scenario at work and immediately missed the emitting node's AZ and instance type in the events, which I plan to use for reports like interruptions by AZ and interruptions by instance type.

So I took a second pass here and added a default set of annotations to all events, basically all the data that was already gathered from IMDS by ec2metadata.GetNodeMetadata().

haugenj self-requested a review on April 22, 2021 at 14:40.
haugenj (Contributor) left a comment:

Thanks for the PR! I was using events to debug something the other day, so I can see how this would be useful.

Overall this looks pretty good to me, and I appreciate you taking the time to clean up some of the docs and the Prometheus test. Would you mind posting an example of how these get rendered with `kubectl get events ...`?

// Emit a Kubernetes event for the current node and with the given event type, reason and message
func (r K8sEventRecorder) Emit(eventType, eventReason, eventMsgFmt string, eventMsgArgs ...interface{}) {
if r.enabled {
r.AnnotatedEventf(r.node, r.annotations, eventType, eventReason, eventMsgFmt, eventMsgArgs...)
Contributor:

In queue processor mode, `node` might not be the node that's being cordoned and drained; is that intentional? Perhaps we should pass `node` in?

trutx (author) replied on Apr 23, 2021:

Good point, I totally missed that. I revamped it so that instead of fetching the Node from the API when initialising the recorder, we just create an ObjectReference to a Node object with a parameterised name. That name is passed in every time recorder.Emit() is called and comes from the nodeName stored in each interruption event, which makes it valid for both the IMDS and SQS processors, if I'm not mistaken.
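For illustration, here is a minimal, self-contained Go sketch of that approach built on client-go's record.EventRecorder; the struct, parameter names and namespace choice are assumptions for the sketch, not necessarily the PR's final code.

package observability

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// K8sEventRecorder wraps client-go's EventRecorder (sketch only). Because the
// involved object is just an ObjectReference built from the node name passed
// in at call time, no API lookup of the Node is needed and the same recorder
// works for both the IMDS and SQS processors.
type K8sEventRecorder struct {
	record.EventRecorder
	annotations map[string]string
	enabled     bool
}

// Emit publishes an annotated event against the node identified by nodeName.
func (r K8sEventRecorder) Emit(nodeName, eventType, eventReason, eventMsgFmt string, eventMsgArgs ...interface{}) {
	if !r.enabled {
		return
	}
	node := &corev1.ObjectReference{
		Kind:      "Node",
		Name:      nodeName, // comes from the interruption event, not from a cached API object
		Namespace: corev1.NamespaceDefault,
	}
	r.AnnotatedEventf(node, r.annotations, eventType, eventReason, eventMsgFmt, eventMsgArgs...)
}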

annotations["public-ipv4"] = nodeMetadata.PublicIP
annotations["region"] = nodeMetadata.Region

// Parse extra annotations
Contributor:

nit - you can remove the comments from here through the end of the function, they're just stating what the next lines show

trutx (author):

Done

}

// Create default annotations
// Worth iterating over nodeMetadata fields using reflect? (trutx)
Contributor:

imo it's fine as you've done it

}
metrics.NodeActionsInc("uncordon", nodeName, err)
recorder.Emit(observability.Normal, observability.UncordonReason, observability.UncordonMsg)
Contributor:

should this be in an `else`?

trutx (author):

Good catch, did it everywhere but here. Fixed.

Comment on lines 45 to 51
verbs:
- create
- get
- list
- patch
- update
- watch
Contributor:

are all these necessary?

trutx (author):

Probably not. I removed all of them except `create`.

brycahta (Contributor) left a comment:

Thanks a lot for the contribution and for enhancing the existing code!

trutx (author) commented on Apr 23, 2021:

🤔 I pushed changes to my fork but this PR doesn't sync.

newrelic-forks@f6786aa

trutx (author) commented on Apr 23, 2021:

Ok there we go.

🤔 pushed changes to my fork but this PR doesn't sync

newrelic-forks@f6786aa


## Caveats

### Default annotations in Queue Processor Mode
trutx (author):

Added this to document that in SQS mode the annotations will match those of the Node where NTH runs and not those of the Node where the interruption, drain or other event actually happens.

Not sure if that's the best approach or whether we should just not create the default, IMDS-based annotations when running in SQS mode. Happy to code that if preferred.

Contributor:

Thanks for the helpful explanations 👍🏼

Similar to @haugenj, I'd like to see some examples of events before making that call. If the annotations introduce a lot of noise in Queue Processor mode, then maybe we omit them.

trutx (author) commented on Apr 23, 2021:

Overall looks pretty good to me, and I appreciate you taking the time to clean up some of the docs and the prometheus test. Would you mind posting an example of how these get rendered with kubectl get events ... ?

I'll post an example and a screenshot when I have something to show. I'm waiting for actual rebalance or interruption events to happen in my test clusters, probably when the US wakes up, since the clusters live in us-east-2.


## Default annotations

If `emit-kubernetes-events` is enabled, AWS Node Termination Handler will automatically inject a set of annotations into each event it emits. These annotations are gathered from the underlying host's IMDS endpoint and enrich each event with information about the host that emitted it.
Contributor:

Could you elevate your note here about Queue Processor mode?

Ex: "Note: In Queue Processor mode, these annotations will reflect the node running NTH, not the node receiving the events. See Caveats for more information."

trutx (author):

Done.

Feel free to edit/correct any of the documentation wording at will. A native speaker's English will always be better than mine.



trutx (author) commented on Apr 23, 2021:

I just realised how relevant it is to have the instance lifecycle available, so I added it.

Still waiting for a real interruption or recommendation event to happen. The problem is that the event TTL in Kubernetes is 1 hour by default unless modified, and it's too much effort for me to change that in a ~550-node cluster, so I guess we'll have to wait and be lucky enough to catch one before it expires.

@@ -75,9 +75,6 @@ func (m SQSMonitor) ec2StateChangeToInterruptionEvent(event EventBridgeEvent, me
InstanceID: ec2StateChangeDetail.InstanceID,
Description: fmt.Sprintf("EC2 State Change event received. Instance went into %s at %s \n", ec2StateChangeDetail.State, event.getTime()),
}
if err != nil {
trutx (author):

Adding this error handling here made the ec2-state-change-sqs-test test fail. I'm not sure swallowing this error is the way to go, but that's out of the scope of this PR. I'm just replacing err with _ for the sake of the linter, but IMO this should be revisited.

The same applies to pkg/monitor/sqsevent/rebalance-recommendation-event.go and pkg/monitor/sqsevent/spot-itn-event.go.
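For context, a tiny hypothetical sketch of the alternative (logging the error instead of discarding it with _); resolveNodeName and the surrounding function are made-up stand-ins for illustration, not this package's real code:

package sqsevent

import (
	"fmt"

	"github.com/rs/zerolog/log"
)

// resolveNodeName is a hypothetical stand-in for whatever lookup currently
// produces the error that gets discarded.
func resolveNodeName(instanceID string) (string, error) {
	return "", fmt.Errorf("node lookup for %s not implemented in this sketch", instanceID)
}

func buildDescription(instanceID, state, eventTime string) string {
	nodeName, err := resolveNodeName(instanceID)
	if err != nil {
		// Surface the failure instead of swallowing it; the description is
		// still built from the information we do have.
		log.Err(err).Msgf("could not resolve node name for instance %s", instanceID)
	}
	return fmt.Sprintf("EC2 State Change event received. Instance %s (node %s) went into %s at %s", instanceID, nodeName, state, eventTime)
}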

trutx (author) commented on Apr 26, 2021:

Here's an anonymized example of a rebalance recommendation that just happened:

$ k get ev --field-selector source=aws-node-termination-handler
LAST SEEN   TYPE     REASON                    OBJECT                                            MESSAGE
13m         Normal   RebalanceRecommendation   node/ip-10-1-2-3.us-east-2.compute.internal   Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z
13m         Normal   PreDrain                  node/ip-10-1-2-3.us-east-2.compute.internal   Pre-drain task successfully executed
13m         Normal   CordonAndDrain            node/ip-10-1-2-3.us-east-2.compute.internal   Node successfully cordoned and drained
$ k get ev --field-selector source=aws-node-termination-handler -o wide
LAST SEEN   TYPE     REASON                    OBJECT                                            SUBOBJECT   SOURCE                                                                     MESSAGE                                                                                FIRST SEEN   COUNT   NAME
13m         Normal   RebalanceRecommendation   node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z   13m          1       ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a
13m         Normal   PreDrain                  node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Pre-drain task successfully executed                                                   13m          1       ip-10-1-2-3.us-east-2.compute.internal.1679710f48cde7e6
13m         Normal   CordonAndDrain            node/ip-10-1-2-3.us-east-2.compute.internal               aws-node-termination-handler, ip-10-1-2-3.us-east-2.compute.internal   Node successfully cordoned and drained                                                 13m          1       ip-10-1-2-3.us-east-2.compute.internal.167971124e6d316c
$ k get ev --field-selector source=aws-node-termination-handler,reason=RebalanceRecommendation -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "v1",
            "count": 1,
            "eventTime": null,
            "firstTimestamp": "2021-04-26T15:10:51Z",
            "involvedObject": {
                "kind": "Node",
                "name": "ip-10-1-2-3.us-east-2.compute.internal",
                "namespace": "default"
            },
            "kind": "Event",
            "lastTimestamp": "2021-04-26T15:10:51Z",
            "message": "Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z \n",
            "metadata": {
                "annotations": {
                    "account-id": "123456789012",
                    "availability-zone": "us-east-2a",
                    "instance-id": "i-abcdef12345678901",
                    "instance-life-cycle": "spot",
                    "instance-type": "m5.8xlarge",
                    "local-hostname": "ip-10-1-2-3.us-east-2.compute.internal",
                    "local-ipv4": "10.1.2.3",
                    "public-hostname": "",
                    "public-ipv4": "",
                    "region": "us-east-2"
                },
                "creationTimestamp": "2021-04-26T15:10:51Z",
                "name": "ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a",
                "namespace": "default",
                "resourceVersion": "1488622620",
                "selfLink": "/api/v1/namespaces/default/events/ip-10-1-2-3.us-east-2.compute.internal.1679710f4148ba6a",
                "uid": "1adcd988-2e9b-4422-bc54-c89982d46d43"
            },
            "reason": "RebalanceRecommendation",
            "reportingComponent": "",
            "reportingInstance": "",
            "source": {
                "component": "aws-node-termination-handler",
                "host": "ip-10-1-2-3.us-east-2.compute.internal"
            },
            "type": "Normal"
        }
    ],
    "kind": "List",
    "metadata": {
        "resourceVersion": "",
        "selfLink": ""
    }
}

trutx (author) commented on Apr 26, 2021:

And here's what a few days' worth of events in a couple of clusters look like in New Relic One, combined with cluster autoscaler and node events.
[Screenshot 2021-04-26 at 18:01:50: New Relic One dashboard of cluster events]
All "by AZ", "by instance type", "by account ID" filters are based on the annotations attached to each event, meaning you could set any arbitrary annotations you might need via the extra annotations parameter and then use them to filter events.

brycahta (Contributor) left a comment:

Thanks for the updates and the sample. After seeing the sample, could you make three minor updates? After that, it'll be good from my side 👍

"message": "Rebalance recommendation received. Instance will be cordoned at 2021-04-26T15:10:48Z \n",

Could you update the Description for sqs-events to contain the instanceID, to avoid confusion? This probably should've been done when these events were first added.

The code links are as follows:

  1. Rebalance recommendation event
     ex: fmt.Sprintf("Rebalance recommendation event received. Instance %s will be cordoned at %s \n", rebalanceRecDetail.InstanceID, event.getTime())
  2. EC2 state change event
  3. Spot ITN event

To close out the other open question, I am good with leaving the annotations in for both IMDS and SQS modes.

trutx (author) commented on Apr 26, 2021:

Done. Still struggling with go-report-card-test though; funnily enough, reducing the cyclomatic complexity made the grade go down to 93.9%, and now the tests don't pass because of that.

brycahta previously approved these changes on Apr 26, 2021 and left a comment:

LGTM 👍🏼

trutx (author) commented on Apr 27, 2021:

Here's the output of two consecutive runs of make go-report-card-test with not a single change in between:

$ make go-report-card-test
/Users/rtorrents/repos/newrelic-forks/aws-node-termination-handler//test/go-report-card-test/run-report-card-test.sh
[+] Building 6.2s (10/10) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                            2.9s
 => => transferring dockerfile: 37B                                                                                                                                             0.1s
 => [internal] load .dockerignore                                                                                                                                               0.1s
 => => transferring context: 2B                                                                                                                                                 0.0s
 => [internal] load metadata for docker.io/library/golang:1                                                                                                                     0.0s
 => [1/6] FROM docker.io/library/golang:1                                                                                                                                       0.0s
 => CACHED [2/6] WORKDIR /app                                                                                                                                                   0.0s
 => CACHED [3/6] RUN go get github.com/gojp/goreportcard                                                                                                                        0.0s
 => CACHED [4/6] RUN cd /go/src/github.com/gojp/goreportcard && (make install || cd / && curl -L https://git.io/vp6lP | sh)                                                     0.0s
 => CACHED [5/6] RUN go get github.com/gojp/goreportcard/cmd/goreportcard-cli                                                                                                   0.0s
 => CACHED [6/6] RUN go get -u golang.org/x/tools/cmd/goimports                                                                                                                 0.0s
 => exporting to image                                                                                                                                                          0.8s
 => => exporting layers                                                                                                                                                         0.0s
 => => writing image sha256:08b623c0fc5ca3b455985be47730dbc0f29a1a680d3379b533cef36f73535d09                                                                                    0.0s
 => => naming to docker.io/library/go-report-card-cli                                                                                                                           0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
✅ goimports found no formatting errors in go source files
2021/04/27 09:17:48 ERROR: (misspell) exit status 2
Grade: A+ (99.7%)
Files: 46
Issues: 1
gofmt: 100%
go_vet: 100%
gocyclo: 97%
	cmd/node-termination-handler.go
		Line 54: warning: cyclomatic complexity 38 of function main() is high (> 15) (gocyclo)
golint: 100%
license: 100%
ineffassign: 100%
misspell: 0%
$ make go-report-card-test
/Users/rtorrents/repos/newrelic-forks/aws-node-termination-handler//test/go-report-card-test/run-report-card-test.sh
[+] Building 0.2s (10/10) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                            0.0s
 => => transferring dockerfile: 37B                                                                                                                                             0.0s
 => [internal] load .dockerignore                                                                                                                                               0.0s
 => => transferring context: 2B                                                                                                                                                 0.0s
 => [internal] load metadata for docker.io/library/golang:1                                                                                                                     0.0s
 => [1/6] FROM docker.io/library/golang:1                                                                                                                                       0.0s
 => CACHED [2/6] WORKDIR /app                                                                                                                                                   0.0s
 => CACHED [3/6] RUN go get github.com/gojp/goreportcard                                                                                                                        0.0s
 => CACHED [4/6] RUN cd /go/src/github.com/gojp/goreportcard && (make install || cd / && curl -L https://git.io/vp6lP | sh)                                                     0.0s
 => CACHED [5/6] RUN go get github.com/gojp/goreportcard/cmd/goreportcard-cli                                                                                                   0.0s
 => CACHED [6/6] RUN go get -u golang.org/x/tools/cmd/goimports                                                                                                                 0.0s
 => exporting to image                                                                                                                                                          0.0s
 => => exporting layers                                                                                                                                                         0.0s
 => => writing image sha256:08b623c0fc5ca3b455985be47730dbc0f29a1a680d3379b533cef36f73535d09                                                                                    0.0s
 => => naming to docker.io/library/go-report-card-cli                                                                                                                           0.0s

Use 'docker scan' to run Snyk tests against images to find vulnerabilities and learn how to fix them
✅ goimports found no formatting errors in go source files
2021/04/27 09:22:51 ERROR: (misspell) exit status 2
2021/04/27 09:25:51 ERROR: (ineffassign) exit status 2
Grade: A+ (93.9%)
Files: 46
Issues: 1
gofmt: 100%
go_vet: 100%
gocyclo: 97%
	cmd/node-termination-handler.go
		Line 54: warning: cyclomatic complexity 38 of function main() is high (> 15) (gocyclo)
golint: 100%
license: 100%
ineffassign: 0%
misspell: 0%
❌ Test failed to meet go-report-card threshold score of: 98
make: *** [go-report-card-test] Error 1

I tried that several times and it looks totally random to me, and it's caused by third-party software anyway.

Sometimes we get the 2021/04/27 09:25:51 ERROR: (ineffassign) exit status 2 error and then ineffassign: 0%, which causes the overall grade to be A+ (93.9%); that's below 98, so the test fails. Other times we don't get the error, ineffassign is at 100%, and the test just passes.

brycahta (Contributor) replied:

Interesting-- thanks for looking into this. I'm good merging the PR after @haugenj reviews and we can look into the fix.

trutx (author) commented on Apr 27, 2021:

I spent some time trying to reproduce the error but couldn't reach a consistent outcome; on my laptop I can't get it to either always succeed or always fail. What does look consistent is the GitHub CI outcome: it always fails.

There have been changes in https://github.com/gojp/goreportcard recently, which might be related, although I've tried pinning the module to an older commit and the results were still unpredictable.

@@ -4,6 +4,7 @@ go 1.15

require (
github.com/aws/aws-sdk-go v1.33.1
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
Contributor:

@brycahta do we need to add to the third-party licenses for this?

Contributor:

Yeah, nice catch.

It should be added here, following a similar format.

trutx (author):

Done.

case "error":
default:
return config, fmt.Errorf("Invalid log-level passed: %s Should be one of: info, debug, error", config.LogLevel)
if err := validateLogLevel(strings.ToLower(config.LogLevel)); err != nil {
Contributor:

I don't think this adds value as a separate function; I'd prefer you revert this.

trutx (author):

It doesn't add much value; I just did it to reduce the cyclomatic complexity, otherwise go-report-card-test would fail.

Contributor:

I say revert it and don't worry about the report card test; we'll take care of that afterwards.

trutx (author):

Done.


Default annotation values are gathered from the IMDS endpoint local to the Node on which AWS Node Termination Handler runs. This is fine when running in IMDS Processor Mode, since an AWS Node Termination Handler Pod is deployed to all Nodes via a `DaemonSet` and each Node emits the events related to itself with its own default annotations.

However, when running in Queue Processor Mode, AWS Node Termination Handler is deployed to a number of Nodes (1 replica by default) via a `Deployment`. In that case the default annotation values will be gathered from the Node(s) running AWS Node Termination Handler, so the default annotations stamped on all events will match those of the Node from which the event was emitted, not those of the Node the event is about.
Contributor:

Are the annotations worth anything in this case? It seems both misleading and unhelpful to print values for a different node than the event actually affected. Perhaps we should just omit the default annotations when running in queue processor mode, thoughts?

Contributor:

I'm trying to think whether you would want to know the metadata of the node running NTH in queue processor mode. If not, it looks like the host instance running NTH is already included in the event here, so we could omit them:

"source": {
                "component": "aws-node-termination-handler",
                "host": "ip-10-1-2-3.us-east-2.compute.internal"
            }

It seems both misleading and unhelpful to print values for a different node than the event actually affected

I also agree, but the latest commit should address this. Now the message shown in the event will include the instanceId being affected:

"message": "Rebalance recommendation received. Instance **<instance-id>** will be cordoned at 2021-04-26T15:10:48Z \n",

trutx (author):

User-provided annotations will still be valuable (think of annotations like cluster name, cluster owner, environment, etc.), but the data coming from IMDS might not make much sense and I agree it can be confusing, hence the notes in the doc. Probably the right move is to omit the IMDS-based annotations when in SQS mode.

trutx (author):

The last commit should do it. I also updated the doc to reflect these changes.
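Roughly, the idea looks like the following self-contained sketch (struct and field names are assumptions for illustration, not the repository's actual ec2metadata types): IMDS-derived annotations are only added when not in Queue Processor mode, while user-provided extra annotations are always kept.

package observability

// NodeMetadata is a simplified stand-in for the data gathered from IMDS.
type NodeMetadata struct {
	AccountID         string
	AvailabilityZone  string
	InstanceID        string
	InstanceLifeCycle string
	InstanceType      string
	Region            string
}

// buildAnnotations returns the annotations stamped on every event.
func buildAnnotations(metadata NodeMetadata, queueProcessorMode bool, extra map[string]string) map[string]string {
	annotations := make(map[string]string)
	if !queueProcessorMode {
		// IMDS data describes the local host, which only matches the affected
		// node in IMDS mode (DaemonSet), so it is skipped in SQS mode.
		annotations["account-id"] = metadata.AccountID
		annotations["availability-zone"] = metadata.AvailabilityZone
		annotations["instance-id"] = metadata.InstanceID
		annotations["instance-life-cycle"] = metadata.InstanceLifeCycle
		annotations["instance-type"] = metadata.InstanceType
		annotations["region"] = metadata.Region
	}
	// User-provided annotations (cluster name, owner, environment, ...) remain
	// useful in both modes, so they are always added.
	for k, v := range extra {
		annotations[k] = v
	}
	return annotations
}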

haugenj (Contributor) left a comment:

👍 LGTM, thanks!

haugenj merged commit d698755 into aws:main on Apr 28, 2021.
imuqtadir (Contributor) commented:

FYI, this PR reverted some of the changes for #388

spkane commented on May 26, 2021:

@trutx Nice work on this. I was just looking for this feature and am glad to see that you have made it happen!

trutx (author) commented on May 27, 2021:

@trutx Nice work on this. I was just looking for this feature and am glad to see that you have made it happen!

Stay tuned, Pod events coming soon! cc/ @dthorsen
