
Multiple events for the same node cause a crash #307

Closed
Niksko opened this issue Nov 27, 2020 · 7 comments · Fixed by #313

Niksko commented Nov 27, 2020

Currently I'm sending both EC2 state change events and ASG lifecycle events to the NTH in queue processor mode. This sounds like it should be supported; however, in practice what seems to be happening is:

  1. An ASG lifecycle event is handled. The node is cordoned and drained.
  2. An EC2 state change event for termination comes through; the node is no longer there, so the NTH crashes.

I'm not convinced that the NTH needs to crash if it can't find a node, but that aside, is there any other workaround for this? Or should I just pick EC2 state change events OR ASG lifecycle events to handle?

bwagner5 (Contributor) commented:

We've discussed a related situation previously in #272.

I think #297 could be used to mitigate this situation, but we may not be able to run multiple replicated NTH pods reliably.

I'd be interested to see whether using EC2 instance tags to turn NTH management on and off could mitigate this (although I'm not sure that would be the right use of the feature, or that it would work in every case).

I'm not opposed to removing the crash and just logging the event if the node isn't found. I don't think there's a workaround at this time.
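
To make that suggestion concrete, here is a rough sketch of what "log instead of crash" could look like when the node lookup fails. The names (`handleEvent`, `cordonAndDrain`, `ErrNodeNotFound`) are illustrative assumptions, not NTH's actual identifiers:

```go
// Hypothetical sketch of the "log instead of crash" idea; not NTH's real code.
package main

import (
	"errors"
	"log"
)

var ErrNodeNotFound = errors.New("node not found")

// cordonAndDrain stands in for NTH's cordon-and-drain step.
func cordonAndDrain(nodeName string) error {
	if nodeName == "" {
		return ErrNodeNotFound
	}
	// ... cordon and drain via the Kubernetes API ...
	return nil
}

func handleEvent(nodeName string) {
	err := cordonAndDrain(nodeName)
	if errors.Is(err, ErrNodeNotFound) {
		// Previously this path would os.Exit(1); logging keeps the handler alive
		// so later events for other nodes can still be processed.
		log.Printf("node %q not found, skipping event", nodeName)
		return
	}
	if err != nil {
		log.Printf("failed to drain node %q: %v", nodeName, err)
	}
}

func main() {
	handleEvent("") // second event for a node that has already been terminated
}
```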

bwagner5 added the Type: Bug label on Nov 27, 2020
Niksko (Author) commented Nov 30, 2020

I'd prefer the crash removal if possible. I can understand why not being able to find a node might have been worth crashing over during the early design, but increasingly this seems to be a common scenario.

universam1 (Contributor) commented:

Agreed, we're facing the same issue - in our case we basically see the pod crashing all the time.

paalkr commented Nov 30, 2020

I still observe this issue with 1.11 if I enable both ASG lifecycle events

      EventPattern:
        source:
        - "aws.autoscaling"
        detail-type:
        - "EC2 Instance-terminate Lifecycle Action"

and EC2 state change events

      EventPattern:
        source:
        - "aws.ec2"
        detail-type:
        - "EC2 Instance State-change Notification"
        detail:
          state:
          - "shutting-down"

universam1 (Contributor) commented:

@haugenj is there anything we can do to get a fix? Would you accept a PR?

haugenj (Contributor) commented Dec 9, 2020

PRs are always welcome!

If you could include a test to validate this scenario, that would be 💯

universam1 (Contributor) commented:

@haugenj fixed this issue in #313.

Background: rather than just removing the os.Exit(1), I traced the problem all the way back and found that the nodeName was actually empty but never verified. The existing unit test was flawed in this regard.

The empty nodeName comes from EC2.DescribeInstances returning an empty PrivateDnsName, simply because the instance is no longer running.

So I created a custom error that allows that case to be handled.
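
For readers following along, a minimal sketch of the approach described above (illustrative names only, not the actual #313 diff): treat an empty PrivateDnsName in the EC2.DescribeInstances response as a distinct, recoverable error rather than passing an empty node name along until something crashes. The `getNodeName` helper and `ErrNodeNameEmpty` sentinel are assumptions made for this example:

```go
// Illustrative sketch only; names and structure are assumptions, not the PR #313 diff.
package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// ErrNodeNameEmpty marks the case where the instance is present in the API
// response but its PrivateDnsName is empty, e.g. because it has shut down.
var ErrNodeNameEmpty = errors.New("node name is empty")

// getNodeName extracts the Kubernetes node name (the private DNS name) from a
// DescribeInstances result and returns a typed error instead of an empty string.
func getNodeName(out *ec2.DescribeInstancesOutput, instanceID string) (string, error) {
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			if aws.StringValue(inst.InstanceId) != instanceID {
				continue
			}
			name := aws.StringValue(inst.PrivateDnsName)
			if name == "" {
				return "", fmt.Errorf("instance %s: %w", instanceID, ErrNodeNameEmpty)
			}
			return name, nil
		}
	}
	return "", fmt.Errorf("instance %s not found", instanceID)
}

func main() {
	// Instance already shutting down, so EC2 reports no private DNS name.
	out := &ec2.DescribeInstancesOutput{
		Reservations: []*ec2.Reservation{{
			Instances: []*ec2.Instance{{
				InstanceId:     aws.String("i-0123456789abcdef0"),
				PrivateDnsName: aws.String(""),
			}},
		}},
	}
	if _, err := getNodeName(out, "i-0123456789abcdef0"); errors.Is(err, ErrNodeNameEmpty) {
		log.Printf("skipping event: %v", err) // instead of calling os.Exit(1)
	}
}
```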
