Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-4766: Recon resets the Operational State of datanodes to IN_SERVICE #1857

Merged
merged 2 commits into from
Feb 2, 2021

Conversation

sodonnel
Copy link
Contributor

What changes were proposed in this pull request?

When a datanode is decommission or put to maintenance, its new state is persisted into the datanode.yaml file. When running on a cluster with Recon enabled, we can see conflicting commands are received repeatedly on the Datanode, eg:

datanode_3  | 2021-01-29 16:26:20,009 [EndpointStateMachine task thread for scm/172.24.0.6:9861 - 0 ] INFO endpoint.HeartbeatEndpointTask: Received SCM set operational state command. State: DECOMMISSIONED Expiry: 0 id 3645344
datanode_3  | 2021-01-29 16:26:50,012 [EndpointStateMachine task thread for recon/172.24.0.3:9891 - 0 ] INFO commands.SetNodeOperationalStateCommand: Create a new command to set op state IN_SERVICE 0 id is 3675347

This is happening because Recon delegates processing the DN heartbeats received by ReconNodeManager to an instance of SCMNodeManager running inside Recon. SCMNodeManager checks the reported state of the datanode matches the SCM memory state, and if they don't match, it issues a command to the DN to update its state.

In this case, Recon always tries to set the DN state back to IN_SERVICE.

Recon sub-classes the SCMNodeManager where this event is produced. Recon filters events it is allowed to send via the onMessage interface on SCMNodeManager, but the newly added event for decommission did not use that interface and hence bypassed the filter.

This change pushes the even over the onMessage interface to avoid this problem.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4766

How was this patch tested?

Existing tests and manually verified in a Docker cluster.

Copy link
Contributor

@avijayanhwx avijayanhwx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @sodonnel. I was wondering if we can also add a command filter at the Recon SCM heartbeat processing method. Just before returning the list of commands, we could filter them for just Recon allowed commands. That way, we don't need to depend on a specific implementation under the hood. What are your thoughts on that?

@sodonnel
Copy link
Contributor Author

@avijayanhwx Thanks for taking a look. I think that makes sense, as it would be easy for someone else to add a command in SCM in the future without using the onMessage interface, if the command originates within SCMNodeManager as this decommission related one did. I will make this change and push a new commit.

Copy link
Contributor

@swagle swagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@sodonnel sodonnel merged commit d054faa into apache:master Feb 2, 2021
Copy link
Contributor

@prashantpogde prashantpogde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@prashantpogde
Copy link
Contributor

Could we have a test case where SCM state for a data node doesn't match with the reported Datanode state ?

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Mar 11, 2021
…ICE (apache#1857)

(cherry picked from commit d054faa)
Change-Id: Ib35069184edacfd73483cc2a760ae3b3d3d5bc71
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants