
Multiple events for the same node cause a crash #307

Closed
Niksko opened this issue Nov 27, 2020 · 7 comments · Fixed by #313

Niksko commented Nov 27, 2020

Currently I'm sending both EC2 state change events and ASG lifecycle events to the NTH in queue processor mode. This sounds like it should be supported; however, in practice what seems to be happening is:

  1. An ASG lifecycle event is handled. The node is cordoned and drained.
  2. An EC2 state change event for termination comes through; the node is no longer there, so the NTH crashes.

I'm not convinced that the NTH needs to crash if it can't find a node, but that aside, is there any other workaround for this? Or should I just pick EC2 state change events OR ASG lifecycle events to handle?

bwagner5 (Contributor) commented:

We've discussed a related situation previously in #272.

I think #297 could be used to mitigate this situation, but we may not be able to run multiple replicated NTH pods reliably.

I'd be interested to see whether using EC2 instance tags to turn NTH management on and off could mitigate this (although I'm not sure that would be the right use of the feature, or that it would work in every case).

I'm not opposed to removing the crash and just logging the event if the node isn't found. I don't think there's a workaround at this time.
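
To make that suggestion concrete, here is a rough sketch of what "log instead of crash" could look like when the node lookup fails. The names (`handleEvent`, `cordonAndDrain`, `ErrNodeNotFound`) are illustrative assumptions, not NTH's actual identifiers:

```go
// Hypothetical sketch of the "log instead of crash" idea; not NTH's real code.
package main

import (
	"errors"
	"log"
)

var ErrNodeNotFound = errors.New("node not found")

// cordonAndDrain stands in for NTH's cordon-and-drain step.
func cordonAndDrain(nodeName string) error {
	if nodeName == "" {
		return ErrNodeNotFound
	}
	// ... cordon and drain via the Kubernetes API ...
	return nil
}

func handleEvent(nodeName string) {
	err := cordonAndDrain(nodeName)
	if errors.Is(err, ErrNodeNotFound) {
		// Previously this path would os.Exit(1); logging keeps the handler alive
		// so later events for other nodes can still be processed.
		log.Printf("node %q not found, skipping event", nodeName)
		return
	}
	if err != nil {
		log.Printf("failed to drain node %q: %v", nodeName, err)
	}
}

func main() {
	handleEvent("") // second event for a node that has already been terminated
}
```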

bwagner5 added the Type: Bug label on Nov 27, 2020
Niksko (Author) commented Nov 30, 2020

I'd prefer the crash removal if possible. I can understand why not being able to find a node might have been worth crashing over during the early design, but increasingly this seems to be a common scenario.

universam1 (Contributor) commented:

Agreed, we're facing the same issue - in our case we basically see the pod crashing all the time.

paalkr commented Nov 30, 2020

I still observe this issue with 1.11 if I enable both ASG lifecycle events

      EventPattern:
        source:
        - "aws.autoscaling"
        detail-type:
        - "EC2 Instance-terminate Lifecycle Action"

and EC2 state change events

      EventPattern:
        source:
        - "aws.ec2"
        detail-type:
        - "EC2 Instance State-change Notification"
        detail:
          state:
          - "shutting-down"

universam1 (Contributor) commented:

@haugenj is there anything we can do to get a fix? Would you accept a PR?

haugenj (Contributor) commented Dec 9, 2020

PRs are always welcome!

If you could include a test to validate this scenario, that would be 💯

universam1 (Contributor) commented:

@haugenj fixed this issue in #313.

Background: rather than just removing the os.Exit(1), I traced the problem all the way back and found that the nodeName was actually empty but never verified. The existing unit test was flawed in this regard.

The empty nodeName comes from EC2.DescribeInstances returning an empty PrivateDnsName, simply because the instance is no longer running.

So I created a custom error that allows that case to be handled.
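
For readers following along, a minimal sketch of the approach described above (illustrative names only, not the actual #313 diff): treat an empty PrivateDnsName in the EC2.DescribeInstances response as a distinct, recoverable error rather than passing an empty node name along until something crashes. The `getNodeName` helper and `ErrNodeNameEmpty` sentinel are assumptions made for this example:

```go
// Illustrative sketch only; names and structure are assumptions, not the PR #313 diff.
package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// ErrNodeNameEmpty marks the case where the instance is present in the API
// response but its PrivateDnsName is empty, e.g. because it has shut down.
var ErrNodeNameEmpty = errors.New("node name is empty")

// getNodeName extracts the Kubernetes node name (the private DNS name) from a
// DescribeInstances result and returns a typed error instead of an empty string.
func getNodeName(out *ec2.DescribeInstancesOutput, instanceID string) (string, error) {
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			if aws.StringValue(inst.InstanceId) != instanceID {
				continue
			}
			name := aws.StringValue(inst.PrivateDnsName)
			if name == "" {
				return "", fmt.Errorf("instance %s: %w", instanceID, ErrNodeNameEmpty)
			}
			return name, nil
		}
	}
	return "", fmt.Errorf("instance %s not found", instanceID)
}

func main() {
	// Instance already shutting down, so EC2 reports no private DNS name.
	out := &ec2.DescribeInstancesOutput{
		Reservations: []*ec2.Reservation{{
			Instances: []*ec2.Instance{{
				InstanceId:     aws.String("i-0123456789abcdef0"),
				PrivateDnsName: aws.String(""),
			}},
		}},
	}
	if _, err := getNodeName(out, "i-0123456789abcdef0"); errors.Is(err, ErrNodeNameEmpty) {
		log.Printf("skipping event: %v", err) // instead of calling os.Exit(1)
	}
}
```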
