HDDS-4766: Recon resets the Operational State of datanodes to IN_SERVICE #1857
What changes were proposed in this pull request?
When a datanode is decommissioned or put into maintenance, its new state is persisted to the datanode.yaml file. On a cluster with Recon enabled, the datanode repeatedly receives conflicting commands to change its operational state.
This happens because Recon delegates the processing of DN heartbeats received by ReconNodeManager to an instance of SCMNodeManager running inside Recon. SCMNodeManager checks that the state reported by the datanode matches the state held in SCM memory, and if they don't match, it issues a command to the DN to update its state.
In this case, Recon always tries to set the DN state back to IN_SERVICE.
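The mismatch check described above can be illustrated with a minimal sketch. The class, enum, and method names here are hypothetical and simplified for illustration; they are not the actual Ozone types. The sketch assumes SCM has recorded the decommission while Recon's in-memory copy still holds IN_SERVICE.

```java
import java.util.Optional;

// Hypothetical, simplified model of the reported-vs-in-memory state check.
class OpStateCheckSketch {
  // Simplified stand-in for the datanode operational states.
  enum OpState { IN_SERVICE, DECOMMISSIONING, IN_MAINTENANCE }

  // Returns the state the datanode is told to adopt, or empty if the
  // reported state already matches the manager's in-memory state.
  static Optional<OpState> commandFor(OpState reported, OpState inMemory) {
    return reported == inMemory ? Optional.empty() : Optional.of(inMemory);
  }

  public static void main(String[] args) {
    OpState reported = OpState.DECOMMISSIONING;   // what the DN persisted
    OpState scmMemory = OpState.DECOMMISSIONING;  // SCM knows about the decommission
    OpState reconMemory = OpState.IN_SERVICE;     // assumption: Recon was never told

    System.out.println("SCM:   " + commandFor(reported, scmMemory));
    System.out.println("Recon: " + commandFor(reported, reconMemory));
  }
}
```

Under these assumptions SCM issues no command, while Recon asks the datanode to return to IN_SERVICE. Once the datanode complies, its next report disagrees with SCM instead, so the two managers keep overriding each other.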
Recon sub-classes the SCMNodeManager where this event is produced. Recon filters events it is allowed to send via the onMessage interface on SCMNodeManager, but the newly added event for decommission did not use that interface and hence bypassed the filter.
This change pushes the event through the onMessage interface so that Recon's filter applies to it as well.
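The bypass and the fix can be sketched as follows. This is a simplified illustration, not the actual Ozone code: the class names, the `CommandType` values, the allow-list contents, and the `routeThroughOnMessage` flag are all assumptions made for the example. The point is only that a subclass filter in `onMessage` has no effect on commands the base class queues directly.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a command filter in a subclass being bypassed.
class CommandFilterSketch {
  enum CommandType { REREGISTER, SET_OPERATIONAL_STATE }

  // Stand-in for the base node manager that produces the command.
  static class BaseNodeManager {
    final List<CommandType> sent = new ArrayList<>();
    private final boolean routeThroughOnMessage;

    BaseNodeManager(boolean routeThroughOnMessage) {
      this.routeThroughOnMessage = routeThroughOnMessage;
    }

    // Single entry point that subclasses can override to filter commands.
    void onMessage(CommandType cmd) {
      sent.add(cmd);
    }

    // Called when the reported state differs from the in-memory state.
    void handleStateMismatch() {
      if (routeThroughOnMessage) {
        onMessage(CommandType.SET_OPERATIONAL_STATE); // after the fix
      } else {
        sent.add(CommandType.SET_OPERATIONAL_STATE);  // before: skips the filter
      }
    }
  }

  // Stand-in for the Recon subclass, which only forwards allowed commands.
  static class FilteringNodeManager extends BaseNodeManager {
    // Assumed allow-list for illustration only.
    private static final Set<CommandType> ALLOWED = EnumSet.of(CommandType.REREGISTER);

    FilteringNodeManager(boolean routeThroughOnMessage) {
      super(routeThroughOnMessage);
    }

    @Override
    void onMessage(CommandType cmd) {
      if (ALLOWED.contains(cmd)) {
        super.onMessage(cmd);
      }
    }
  }

  public static void main(String[] args) {
    FilteringNodeManager before = new FilteringNodeManager(false);
    before.handleStateMismatch();
    System.out.println("before: " + before.sent); // prints "before: [SET_OPERATIONAL_STATE]"

    FilteringNodeManager after = new FilteringNodeManager(true);
    after.handleStateMismatch();
    System.out.println("after:  " + after.sent);  // prints "after:  []"
  }
}
```

Routing every outgoing command through one overridable method keeps the filtering decision in a single place, so commands added later are filtered by default rather than by remembering to opt in.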
What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-4766
How was this patch tested?
Existing tests and manually verified in a Docker cluster.