Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce clustermap change listener to propagate remote replica addition/removal events #1368

Merged
merged 8 commits into from
Feb 10, 2020

Conversation

jsjtzyy
Copy link
Contributor

@jsjtzyy jsjtzyy commented Jan 28, 2020

With "move replica", Helix cluster manager should be able to dynamically add/remove remote replica infos and therefore clustermap will be dynamically changed. External components like ReplicationManager needs to update thread assignment and some in-mem data structures in repsonse to such changes. This PR introduces clustermap change listener which allows ReplicationManager to listen to any change in
clustermap and take actions accordingly. For now, there is only one method in clustermap
change listener: onReplicaAddedOrRemoved(), we could add more methods like onReplicaLeadershipChange etc if needed.


Remote replica addition/removal in clustermap is currently supported by dynamic cluster change handler only.( #1355 )

@jsjtzyy jsjtzyy self-assigned this Jan 28, 2020
@codecov-io
Copy link

codecov-io commented Jan 28, 2020

Codecov Report

Merging #1368 into master will increase coverage by <.01%.
The diff coverage is 84.37%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1368      +/-   ##
============================================
+ Coverage     72.08%   72.09%   +<.01%     
- Complexity     6792     6805      +13     
============================================
  Files           489      489              
  Lines         38530    38651     +121     
  Branches       4902     4920      +18     
============================================
+ Hits          27775    27865      +90     
- Misses         9417     9437      +20     
- Partials       1338     1349      +11
Impacted Files Coverage Δ Complexity Δ
...om.github.ambry.replication/ReplicationEngine.java 79.04% <ø> (ø) 47 <0> (ø) ⬇️
....github.ambry.clustermap/StaticClusterManager.java 71.02% <0%> (-0.3%) 83 <0> (ø)
...thub.ambry.clustermap/CompositeClusterManager.java 59.72% <0%> (-1.71%) 22 <0> (ø)
...m.github.ambry.clustermap/HelixClusterManager.java 87.08% <33.33%> (-0.49%) 57 <0> (ø)
...va/com.github.ambry.replication/PartitionInfo.java 80% <76%> (-10.91%) 8 <5> (+3)
...m.github.ambry.replication/ReplicationMetrics.java 94.48% <85.71%> (-0.5%) 46 <5> (+3)
...m.github.ambry.replication/ReplicationManager.java 91.89% <95.89%> (+3.43%) 19 <4> (+2) ⬆️
...in/java/com.github.ambry.clustermap/Partition.java 76.92% <0%> (-3.3%) 28% <0%> (-1%)
...github.ambry.rest/AsyncRequestResponseHandler.java 87.87% <0%> (-2.36%) 21% <0%> (ø)
...java/com.github.ambry.network/SSLTransmission.java 70.46% <0%> (-0.93%) 76% <0%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfb751b...7310645. Read the comment docs.

@jsjtzyy jsjtzyy changed the title WIP: Introduce clustermap change listener to propagate remote replica addition/removal events Introduce clustermap change listener to propagate remote replica addition/removal events Jan 29, 2020
@jsjtzyy jsjtzyy marked this pull request as ready for review January 29, 2020 06:12
@jsjtzyy jsjtzyy requested a review from ankagrawal January 29, 2020 19:53
@@ -153,6 +153,11 @@ public MetricRegistry getMetricRegistry() {
return metricRegistry;
}

@Override
public void registerClusterMapListener(ClusterMapChangeListener clusterMapChangeListener) {
// no op for static cluster map.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we dont really support, do you think throwing an exception might work here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will make the change

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this isnt addressed yet.

@Override
public void onReplicaAddedOrRemoved(List<ReplicaId> addedReplicas, List<ReplicaId> removedReplicas) {
// 1. wait for start() to complete
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it looks like we have some duplicate code between this method and addReplica/removeReplica methods. Can we try to refactor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addReplica(), removeReplica() are methods for changing replicas on local node. They create a new PartitionInfo along with all peer RemoteReplicaInfos. onReplicaAddedOrRemoved() basically modifies existing PartitionInfo by adding/removing certain RemoteReplicaInfo. Refactoring will be considered in a separate PR after we have verified functionality of this change.

* concurrently update remote replica infos.
*/
@Override
public void onReplicaAddedOrRemoved(List<ReplicaId> addedReplicas, List<ReplicaId> removedReplicas) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest adding a private class to serve as the implementation of ClusterMapChangeListener, just like PartitionStateChangeListener.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any benefits we can get by doing so? One probably reason I come up with is: when unifying all existing replication managers into a general replication manager, it requires 2 clustermap change listeners taking actions against changes in VCR and regular Ambry cluster. Feel free to add more thoughts here.

* @return {@code true} if remote replica info is added. {@code false} if it is already present
*/
boolean addReplicaInfoIfAbsent(RemoteReplicaInfo remoteReplicaInfo) {
lock.lock();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please lock getRemoveReplicaInfos method as well, this lock is protecting remoteReplicas list, we should lock it everywhere we use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, i will use read-write lock instead

if (partitionInfo.addReplicaInfoIfAbsent(remoteReplicaInfo)) {
replicationMetrics.addMetricsForRemoteReplicaInfo(remoteReplicaInfo);
}
replicaInfosToAdd.add(remoteReplicaInfo);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only add remoteReplicaInfo to the "toAdd" list when it was successfully added to partitionInfo, so we should move this statement in the if-else statement above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, will fix this.

Comment on lines 194 to 208
for (ReplicaId remoteReplica : removedPeerReplicas) {
PartitionInfo partitionInfo = partitionToPartitionInfo.get(remoteReplica.getPartitionId());
List<RemoteReplicaInfo> remoteReplicaInfos = new ArrayList<>(partitionInfo.getRemoteReplicaInfos());
for (RemoteReplicaInfo remoteReplicaInfo : remoteReplicaInfos) {
if (remoteReplicaInfo.getReplicaId().getDataNodeId() == remoteReplica.getDataNodeId()) {
logger.info("Removing remote replica {} on {} from replica threads.", remoteReplica.getReplicaPath(),
remoteReplica.getDataNodeId());
replicaInfosToRemove.add(remoteReplicaInfo);
if (partitionInfo.removeRelicaInfoIfPresent(remoteReplicaInfo)) {
replicationMetrics.removeMetricsForRemoteReplicaInfo(remoteReplicaInfo);
}
break;
}
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? we can just change the PartitionInfo's remove method to take a replica id as parameter, and looping the list in PartitionInfo, it's more aligned with addReplicaInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, will change removeRelicaInfoIfPresent

if (remoteReplicaInfo.getReplicaId().getDataNodeId() == remoteReplica.getDataNodeId()) {
logger.info("Removing remote replica {} on {} from replica threads.", remoteReplica.getReplicaPath(),
remoteReplica.getDataNodeId());
replicaInfosToRemove.add(remoteReplicaInfo);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we should put this statement within if-else statement as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

events

With "move replica", Helix cluster manager should be able to dynamically
add/remove remote replica infos and therefore clustermap will be
dynamically changed. External components like ReplicationManager needs
to update thread assignment and some in-mem data structures in repsonse
to such changes. This PR introduces clustermap change listener which
allows ReplicationManager to listen to any change in clustermap and take
actions accordingly. For now, there is only one method in clustermap
change listener: onReplicaAddedOrRemoved(), we could add more methods
like onReplicaLeadershipChange etc if needed.
@jsjtzyy jsjtzyy force-pushed the clustermap-callback branch from b12d085 to fa026e9 Compare February 7, 2020 02:14
Copy link
Collaborator

@ankagrawal ankagrawal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall.

@justinlin-linkedin justinlin-linkedin merged commit bc91abd into linkedin:master Feb 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants