[CELEBORN-1757] Add retry when sending RPC to LifecycleManager #3008

Open · wants to merge 21 commits into main

Conversation

@zaynt4606 zaynt4606 (Contributor) commented Dec 19, 2024

What changes were proposed in this pull request?

Retry sending RPCs to the LifecycleManager on TimeoutException.

Why are the changes needed?

RPC messages are processed by the Dispatcher's thread pool, whose numThreads depends on numUsableCores.
In some environments (e.g. Kubernetes), the LifecycleManager does not have enough dispatcher threads while the number of RPCs is large, so TimeoutExceptions occur.
This PR adds a retry when such TimeoutExceptions happen.
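For illustration, here is a minimal, self-contained Scala sketch of the retry pattern described above. The object and parameter names (`RpcRetrySketch`, `maxRetries`, `retryWaitMs`) are placeholders, not the PR's actual code; the real change reads these values from the CelebornConf entries discussed below.

```scala
import java.util.concurrent.TimeoutException

import scala.util.{Failure, Success, Try}

// Minimal sketch (not the PR's actual code): retry a single RPC call on
// TimeoutException, waiting retryWaitMs between attempts.
object RpcRetrySketch {
  def withTimeoutRetry[T](name: String, maxRetries: Int, retryWaitMs: Long)(call: () => T): T =
    Try(call()) match {
      case Success(result) => result
      case Failure(e: TimeoutException) if maxRetries > 0 =>
        // Back off briefly so the LifecycleManager dispatcher can drain its queue,
        // then send the same RPC (identified by `name` for logging) again.
        Thread.sleep(retryWaitMs)
        withTimeoutRetry(name, maxRetries - 1, retryWaitMs)(call)
      case Failure(e) =>
        // Out of retries, or a non-timeout failure: propagate.
        throw e
    }
}
```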

Does this PR introduce any user-facing change?

No.

Another way is to increase numThreads by adjusting the configuration celeborn.lifecycleManager.rpc.dispatcher.threads. This is more effective.

How was this patch tested?

Cluster testing.

@@ -29,7 +29,7 @@ license: |
| celeborn.<module>.io.enableVerboseMetrics | false | false | Whether to track Netty memory detailed metrics. If true, the detailed metrics of Netty PoolByteBufAllocator will be gotten, otherwise only general memory usage will be tracked. | | |
| celeborn.<module>.io.lazyFD | true | false | Whether to initialize FileDescriptor lazily or not. If true, file descriptors are created only when data is going to be transferred. This can reduce the number of open files. If setting <module> to `fetch`, it works for worker fetch server. | | |
| celeborn.<module>.io.maxRetries | 3 | false | Max number of times we will try IO exceptions (such as connection timeouts) per request. If set to 0, we will not do any retries. If setting <module> to `data`, it works for shuffle client push and fetch data. If setting <module> to `replicate`, it works for replicate client of worker replicating data to peer worker. If setting <module> to `push`, it works for Flink shuffle client push data. | | |
| celeborn.<module>.io.mode | EPOLL | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
| celeborn.<module>.io.mode | NIO | false | Netty EventLoopGroup backend, available options: NIO, EPOLL. If epoll mode is available, the default IO mode is EPOLL; otherwise, the default is NIO. | | |
Member

cc @SteNicholas Seems the doc generation depends on the developer environment

Member

It cannot pass the GA check; this needs to be reverted.

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 1 line in your changes missing coverage. Please review.

Project coverage is 33.02%. Comparing base (4aabe37) to head (f8cd555).
Report is 16 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...cala/org/apache/celeborn/common/CelebornConf.scala | 88.89% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
+ Coverage   32.88%   33.02%   +0.14%     
==========================================
  Files         331      331              
  Lines       19800    19851      +51     
  Branches     1780     1787       +7     
==========================================
+ Hits         6510     6554      +44     
- Misses      12929    12934       +5     
- Partials      361      363       +2     


initDataClientFactoryIfNeeded();
}

public <T> T callLifecycleManagerWithTimeoutRetry(Callable<T> callable, String name)
Contributor

Instead of making changes everywhere, do we want to simply make askSync/askAsync retry-aware, with the number of retries passed in as a parameter (for specific cases where we don't want retries, for example)?

@zaynt4606 zaynt4606 (Contributor Author) Dec 23, 2024

I agree with changing askSync/askAsync.
There are many exception-handling changes because setupLifecycleManagerRef will throw RpcTimeoutExceptions that we need to catch; I changed the exception type to RuntimeException.
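For illustration, a rough Scala sketch of that retry-aware ask, with the retry count passed as a parameter and an exhausted timeout surfaced as a RuntimeException. `AskableRef` and the local `RpcTimeoutException` are stand-ins for Celeborn's real RPC types, not the actual API.

```scala
import java.util.concurrent.TimeoutException

import scala.concurrent.duration.FiniteDuration
import scala.reflect.ClassTag

// Stand-in for the real RPC timeout exception type.
class RpcTimeoutException(message: String) extends TimeoutException(message)

// Stand-in for an RPC endpoint reference that supports synchronous ask.
trait AskableRef {
  def askSync[T: ClassTag](message: Any, timeout: FiniteDuration): T
}

object RetryAwareAsk {
  // Retry-aware askSync: callers that must not retry can pass retries = 0.
  // On exhaustion the timeout is rethrown as a RuntimeException, matching the
  // exception-type change mentioned above.
  def askSyncWithRetry[T: ClassTag](
      ref: AskableRef,
      message: Any,
      timeout: FiniteDuration,
      retries: Int,
      retryWaitMs: Long): T = {
    try {
      ref.askSync[T](message, timeout)
    } catch {
      case e: RpcTimeoutException if retries > 0 =>
        // Wait before retrying the same message.
        Thread.sleep(retryWaitMs)
        askSyncWithRetry[T](ref, message, timeout, retries - 1, retryWaitMs)
      case e: RpcTimeoutException =>
        throw new RuntimeException(s"RPC $message to LifecycleManager timed out after all retries", e)
    }
  }
}
```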

.withAlternative("celeborn.callLifecycleManager.maxRetries")
.categories("client")
.version("0.6.0")
.doc("Max retry times for client to reserve slots.")
Member

"to reserve slots."

Seems this is not only for reserving slots.

buildConf("celeborn.rpc.timeoutRetryWait")
.categories("network")
.version("0.6.0")
.doc("Wait time before next retry if RpcTimeoutException.")
Member

nit: if RpcTimeoutException => on RpcTimeoutException.

try {
limitZeroInFlight(mapKey, pushState);

Member

unnecessary change.

@@ -1700,13 +1700,12 @@ private void mapEndInternal(
throws IOException {
final String mapKey = Utils.makeMapKey(shuffleId, mapId, attemptId);
PushState pushState = getPushState(mapKey);

Member

unnecessary change.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.*;
Member

Seems like an unnecessary change? I do not see a new concurrent class used in this class.


val CLIENT_CALL_LIFECYCLEMANAGER_MAX_RETRIES: ConfigEntry[Int] =
buildConf("celeborn.client.callLifecycleManager.maxRetries")
.withAlternative("celeborn.callLifecycleManager.maxRetries")
Member

Seems there is no legacy config celeborn.callLifecycleManager.maxRetries, so withAlternative is not needed?

Member

Is it possible to reuse CLIENT_RPC_MAX_RETIRES?

  val CLIENT_RPC_MAX_RETIRES: ConfigEntry[Int] =
    buildConf("celeborn.client.rpc.maxRetries")
      .categories("client")
      .version("0.3.2")
      .doc("Max RPC retry times in LifecycleManager.")
      .intConf
      .createWithDefault(3)

@turboFei turboFei (Member) Dec 23, 2024

Too many parameters are hard to maintain.

@turboFei turboFei (Member) Dec 23, 2024

Maybe we can at least fall back from the specific client config to a client default config item.
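For illustration, a simplified sketch of such a fallback: the config names are the ones in this PR, but the lookup uses a plain Map instead of Celeborn's ConfigEntry machinery, and the default of 3 mirrors the CLIENT_RPC_MAX_RETIRES default shown above.

```scala
// Illustrative fallback sketch: prefer the specific config if the user set it,
// otherwise fall back to the general client RPC retry setting, then to its default.
object ConfFallbackSketch {
  private val ClientRpcMaxRetriesDefault = 3

  def callLifecycleManagerMaxRetries(userConf: Map[String, String]): Int =
    userConf.get("celeborn.client.callLifecycleManager.maxRetries")
      .orElse(userConf.get("celeborn.client.rpc.maxRetries"))
      .map(_.toInt)
      .getOrElse(ClientRpcMaxRetriesDefault)
}
```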

@@ -4884,6 +4886,23 @@ object CelebornConf extends Logging {
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefaultString("3s")

val RPC_TIMEOUT_RETRY_WAIT: ConfigEntry[Long] =
buildConf("celeborn.rpc.timeoutRetryWait")
Member

Seems celeborn.rpc.retryWait is enough.

Contributor Author

has been updated~
