HDDS-4766: Recon resets the Operational State of datanodes to IN_SERVICE #1857

sodonnel · 2021-01-29T20:39:59Z

What changes were proposed in this pull request?

When a datanode is decommissioned or put into maintenance, its new state is persisted into the datanode.yaml file. When running on a cluster with Recon enabled, we can see that conflicting commands are repeatedly received on the datanode, e.g.:

datanode_3  | 2021-01-29 16:26:20,009 [EndpointStateMachine task thread for scm/172.24.0.6:9861 - 0 ] INFO endpoint.HeartbeatEndpointTask: Received SCM set operational state command. State: DECOMMISSIONED Expiry: 0 id 3645344
datanode_3  | 2021-01-29 16:26:50,012 [EndpointStateMachine task thread for recon/172.24.0.3:9891 - 0 ] INFO commands.SetNodeOperationalStateCommand: Create a new command to set op state IN_SERVICE 0 id is 3675347

This happens because Recon delegates processing of the DN heartbeats received by ReconNodeManager to an instance of SCMNodeManager running inside Recon. SCMNodeManager checks that the state reported by the datanode matches the state held in SCM memory, and if they don't match, it issues a command to the DN to update its state.

In this case, Recon always tries to set the DN state back to IN_SERVICE.
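The mismatch check described above can be sketched roughly as follows. This is a minimal illustration and not the actual SCMNodeManager code: the class, the enum and the `reconcile` method are hypothetical names, and the sketch assumes that the node manager inside Recon still holds IN_SERVICE in memory because it was never told about the decommission.

```java
import java.util.Optional;

public class OpStateReconcileSketch {
  /** Simplified, hypothetical subset of datanode operational states. */
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED, IN_MAINTENANCE }

  /**
   * Returns the state the datanode should be told to adopt, or empty when the
   * reported state already matches what the manager holds in memory.
   */
  static Optional<OpState> reconcile(OpState inMemory, OpState reported) {
    return inMemory == reported ? Optional.empty() : Optional.of(inMemory);
  }

  public static void main(String[] args) {
    // SCM processed the decommission, so its memory matches the report
    // and no corrective command is needed.
    System.out.println("SCM:   "
        + reconcile(OpState.DECOMMISSIONED, OpState.DECOMMISSIONED));

    // Assumption: Recon's copy still says IN_SERVICE, so the same check
    // produces a command that pushes the datanode back to IN_SERVICE.
    System.out.println("Recon: "
        + reconcile(OpState.IN_SERVICE, OpState.DECOMMISSIONED));
  }
}
```

Under that assumption, the same check that keeps SCM and the datanode consistent is what makes Recon undo the decommission on every heartbeat.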

Recon sub-classes the SCMNodeManager where this event is produced. Recon filters events it is allowed to send via the onMessage interface on SCMNodeManager, but the newly added event for decommission did not use that interface and hence bypassed the filter.

This change pushes the event over the onMessage interface to avoid this problem.
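To illustrate why routing through onMessage matters, here is a simplified sketch. All class and method names other than `onMessage` are hypothetical, and the allow-list contents are an assumption made for the example rather than a statement about what Recon actually permits.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class OnMessageFilterSketch {
  /** Hypothetical command types; the real set is larger. */
  enum CommandType { REREGISTER, SET_NODE_OPERATIONAL_STATE }

  /** Stand-in for the base node manager that produces commands. */
  static class BaseNodeManager {
    private final List<CommandType> queue = new ArrayList<>();

    /** Single entry point that subclasses can override to filter commands. */
    public void onMessage(CommandType cmd) {
      queue.add(cmd);
    }

    /** Before the fix: the command is queued directly, skipping onMessage. */
    void stateMismatchBypassingFilter() {
      queue.add(CommandType.SET_NODE_OPERATIONAL_STATE);
    }

    /** After the fix: the command goes through onMessage like the others. */
    void stateMismatchViaOnMessage() {
      onMessage(CommandType.SET_NODE_OPERATIONAL_STATE);
    }

    List<CommandType> pending() {
      return new ArrayList<>(queue);
    }
  }

  /** Stand-in for the Recon subclass that only lets some commands through. */
  static class FilteringNodeManager extends BaseNodeManager {
    // Assumed allow-list, for illustration only.
    private static final Set<CommandType> ALLOWED =
        EnumSet.of(CommandType.REREGISTER);

    @Override
    public void onMessage(CommandType cmd) {
      if (ALLOWED.contains(cmd)) {
        super.onMessage(cmd);
      }
    }
  }

  public static void main(String[] args) {
    FilteringNodeManager before = new FilteringNodeManager();
    before.stateMismatchBypassingFilter();
    System.out.println("bypassing onMessage: " + before.pending());

    FilteringNodeManager after = new FilteringNodeManager();
    after.stateMismatchViaOnMessage();
    System.out.println("via onMessage:       " + after.pending());
  }
}
```

The subclass filter only sees commands that pass through the overridable method, so any command queued directly inside the base class escapes it.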

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4766

How was this patch tested?

Existing tests and manually verified in a Docker cluster.

avijayanhwx

Thanks for working on this @sodonnel. I was wondering if we can also add a command filter at the Recon SCM heartbeat processing method. Just before returning the list of commands, we could filter them for just Recon allowed commands. That way, we don't need to depend on a specific implementation under the hood. What are your thoughts on that?

sodonnel · 2021-01-31T14:31:18Z

@avijayanhwx Thanks for taking a look. I think that makes sense, as it would be easy for someone else to add a command in SCM in the future without using the onMessage interface, if the command originates within SCMNodeManager as this decommission related one did. I will make this change and push a new commit.
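The filtering discussed in this exchange, applied to the heartbeat response rather than at the point where commands are produced, might look roughly like the sketch below. The `processHeartbeat` signature, the class names and the allow-list are all simplified assumptions for illustration and do not reflect the actual Ozone method signatures.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class HeartbeatFilterSketch {
  /** Hypothetical command types; the real set is larger. */
  enum CommandType { REREGISTER, SET_NODE_OPERATIONAL_STATE }

  /** Stand-in for the base node manager, however it builds its commands. */
  static class BaseNodeManager {
    List<CommandType> processHeartbeat() {
      return Arrays.asList(
          CommandType.REREGISTER,
          CommandType.SET_NODE_OPERATIONAL_STATE);
    }
  }

  /** Stand-in for the Recon subclass that filters the final response. */
  static class FilteringNodeManager extends BaseNodeManager {
    // Assumed allow-list, for illustration only.
    private static final Set<CommandType> ALLOWED =
        EnumSet.of(CommandType.REREGISTER);

    @Override
    List<CommandType> processHeartbeat() {
      // Filter just before returning, so it no longer matters which
      // internal path the base class used to queue each command.
      return super.processHeartbeat().stream()
          .filter(ALLOWED::contains)
          .collect(Collectors.toList());
    }
  }

  public static void main(String[] args) {
    System.out.println(new FilteringNodeManager().processHeartbeat());
  }
}
```

Filtering at the response boundary acts as a backstop: a command added in the future without going through onMessage would still be dropped before it reaches the datanode.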

swagle

+1 LGTM

prashantpogde

lgtm

prashantpogde · 2021-02-02T18:34:17Z

Could we have a test case where the SCM state for a datanode doesn't match the state reported by the datanode?

avijayanhwx reviewed Jan 30, 2021

S O'Donnell added 2 commits February 1, 2021 14:35

Fix for HDDS-4766

9391921

Filter commands which are not allowed from the heartbeat response and…

85a27a4

… add a test

sodonnel force-pushed the HDDS-4766 branch from 397dbb3 to 85a27a4 February 1, 2021 14:35

swagle approved these changes Feb 2, 2021

sodonnel merged commit d054faa into apache:master Feb 2, 2021

prashantpogde approved these changes Feb 2, 2021

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Mar 11, 2021

HDDS-4766: Recon resets the Operational State of datanodes to IN_SERV…

415a2c2

…ICE (apache#1857) (cherry picked from commit d054faa) Change-Id: Ib35069184edacfd73483cc2a760ae3b3d3d5bc71
