
fix: reset waitId unexpectedly #847

Merged: 2 commits merged into master on Jun 20, 2022
Conversation

@killme2008 (Contributor) commented Jun 18, 2022

Motivation:

Try to fix #838 #842

Right now, when there are no more logs to replicate, the replicator waits for new logs by invoking the LogManager#wait(expectLogIndex, closure, threadId) method. The log manager then creates a WaitMeta, saves it in the in-memory map waitMap, and returns a valid waitId (starting from one) to the replicator. The replicator checks whether waitId >= 0 to avoid waiting more than once, so in the normal case there is at most one WaitMeta instance per replicator.
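
Below is a minimal, self-contained sketch of this bookkeeping, not the actual jraft code: SimpleLogManager, SimpleReplicator, pendingWaiters and the other names are simplified assumptions for illustration. It only shows the waitId >= 0 guard and how each registered wait adds one entry to the waitMap.

// Minimal sketch of the wait bookkeeping described above, NOT the actual
// jraft implementation; all class, field and method names are assumptions.
import java.util.HashMap;
import java.util.Map;

class WaitIdSketch {

    /** Callback fired when logs beyond the expected index arrive. */
    interface Closure {
        void run(int errCode);
    }

    /** One pending waiter registered by a replicator. */
    static class WaitMeta {
        final long expectedLogIndex;
        final Closure done;

        WaitMeta(long expectedLogIndex, Closure done) {
            this.expectedLogIndex = expectedLogIndex;
            this.done = done;
        }
    }

    /** Stand-in for the log manager's waitMap handling. */
    static class SimpleLogManager {
        private final Map<Long, WaitMeta> waitMap = new HashMap<>();
        private long nextWaitId = 1; // valid waitIds start from one

        synchronized long wait(long expectedLogIndex, Closure done) {
            final long waitId = nextWaitId++;
            waitMap.put(waitId, new WaitMeta(expectedLogIndex, done));
            return waitId;
        }

        synchronized int pendingWaiters() {
            return waitMap.size();
        }
    }

    /** Stand-in for the replicator's "wait at most once" guard. */
    static class SimpleReplicator {
        private final SimpleLogManager logManager;
        private long waitId = -1;

        SimpleReplicator(SimpleLogManager logManager) {
            this.logManager = logManager;
        }

        void waitMoreEntries(long nextIndex) {
            if (waitId >= 0) {
                return; // already waiting: at most one WaitMeta per replicator
            }
            waitId = logManager.wait(nextIndex - 1, errCode -> this.waitId = -1);
        }
    }

    public static void main(String[] args) {
        final SimpleLogManager lm = new SimpleLogManager();
        final SimpleReplicator r = new SimpleReplicator(lm);
        r.waitMoreEntries(10);
        r.waitMoreEntries(10); // no-op thanks to the waitId >= 0 check
        System.out.println("pending waiters: " + lm.pendingWaiters()); // prints 1
    }
}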

But when a user submits large log entries to the jraft node, bolt's connection check may fail because of the netty write-buffer limitation, as shown in the bolt log ~/logs/bolt/remoting-rpc.log:

2022-06-18 20:26:15,967 WARN  [?#] [Append-Entries-Thread-Send1] check failed. address: 192.168.165.240:5004, connection: com.alipay.remoting.Connection@d36d504
com.alipay.remoting.exception.RemotingException: Check connection failed for address: Origin url [192.168.165.240:5004], Unique key [192.168.165.240:5004]., maybe write overflow!
	at com.alipay.remoting.DefaultConnectionManager.check(DefaultConnectionManager.java:363)
	at com.alipay.remoting.rpc.RpcClient.checkConnection(RpcClient.java:447)
	at com.alipay.sofa.jraft.rpc.impl.BoltRpcClient.checkConnection(BoltRpcClient.java:79)
	at com.alipay.sofa.jraft.rpc.impl.AbstractClientService.checkConnection(AbstractClientService.java:90)
	at com.alipay.sofa.jraft.rpc.impl.core.DefaultRaftClientService.appendEntries(DefaultRaftClientService.java:119)
	at com.alipay.sofa.jraft.core.Replicator.sendEntries(Replicator.java:1681)
	at com.alipay.sofa.jraft.core.Replicator.sendEntries(Replicator.java:1595)
	at com.alipay.sofa.jraft.core.Replicator.onRpcReturned(Replicator.java:1366)

Then the replicator blocks for a while and resumes replication after the block timeout:

private static boolean onAppendEntriesReturned(...) {
    ....
    if (!status.isOk()) {
        // If the follower crashes, any RPC to the follower fails immediately,
        // so we need to block the follower for a while instead of looping until
        // it comes back or is removed
        // dummy_id is unlocked in block
        if (isLogDebugEnabled) {
            sb.append(" fail, sleep, status=") //
                .append(status);
            LOG.debug(sb.toString());
        }
        notifyReplicatorStatusListener(r, ReplicatorEvent.ERROR, status);
        if ((r.consecutiveErrorTimes++) % 10 == 0) {
            LOG.warn("Fail to issue RPC to {}, consecutiveErrorTimes={}, error={}", r.options.getPeerId(),
                r.consecutiveErrorTimes, status);
        }
        r.resetInflights();
        r.setState(State.Probe);
        // unlock in block
        r.block(startTimeMs, status.getCode());
        return false;
    }
    ....

After the block timeout, it resets the waitId and sends entries again:

static boolean continueSending(final ThreadId id, final int errCode) {
    ....
    r.waitId = -1; // reset the waitId
    if (errCode == RaftError.ETIMEDOUT.getNumber()) {
        r.blockTimer = null;
        // Send empty entries after block timeout to check the correct
        // _next_index, otherwise the replicator likely waits in
        // _wait_more_entries and no further logs would be replicated even if the
        // last_index of this follower is less than |next_index - 1|
        r.sendProbeRequest();
    } else if (errCode != RaftError.ESTOP.getNumber()) {
        // id is unlock in _send_entries
        r.sendEntries();
    }
}

So the waitId is reset unexpectedly, which creates too many WaitMeta instances in the LogManager. When new log entries are appended, the LogManager notifies each WaitMeta closure in a thread task, which can make the closure thread pool too busy and throw RejectedExecutionException.
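
The snippet below is a hedged illustration of that failure mode only; the pool size, queue bound, sleep and the 1000 leaked waiters are arbitrary assumptions, not jraft's real closure-executor configuration. It shows that once many waiters pile up, one wakeup task per WaitMeta is submitted on append and a bounded executor starts rejecting.

// Illustration only: a small bounded pool standing in for the closure executor.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WakeupStormSketch {
    public static void main(String[] args) {
        // Small, bounded pool standing in for the closure thread pool.
        ThreadPoolExecutor closurePool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(16));

        // Simulate a replicator that leaked many WaitMeta instances instead of one.
        List<Runnable> leakedWaiters = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            leakedWaiters.add(() -> {
                try {
                    Thread.sleep(5); // pretend the wakeup closure does some work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // On append, the log manager wakes every registered waiter with a task.
        int rejected = 0;
        for (Runnable waiter : leakedWaiters) {
            try {
                closurePool.execute(waiter);
            } catch (RejectedExecutionException e) {
                rejected++; // surfaces as the thread-pool-full errors in #838 / #842
            }
        }
        System.out.println("rejected wakeups: " + rejected);
        closurePool.shutdown();
    }
}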

Modification:

Only reset the waitId right before sending entries, and log the error status only the first time it occurs.
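
A minimal sketch of that idea follows; FixedReplicatorSketch, its fields and the placeholder error codes are assumptions for illustration, not the actual patch to Replicator in this PR.

// Sketch only; constants and method bodies are placeholders, not jraft code.
class FixedReplicatorSketch {
    static final int ETIMEDOUT = 1; // placeholder error codes, not RaftError's values
    static final int ESTOP     = 2;

    long waitId = -1;
    int  consecutiveErrorTimes = 0;

    /** Log the error status only the first time in a row of failures. */
    void onRpcError(int errCode) {
        if (consecutiveErrorTimes++ == 0) {
            System.err.println("RPC to follower failed, errCode=" + errCode);
        }
    }

    /** Resumes replication after a block; waitId is no longer reset up front. */
    boolean continueSending(int errCode) {
        if (errCode == ETIMEDOUT) {
            // Probe only: leave waitId alone so a WaitMeta registered before
            // the block is not orphaned in the LogManager.
            sendProbeRequest();
        } else if (errCode != ESTOP) {
            waitId = -1; // reset only on the path that actually sends entries
            sendEntries();
        }
        return true;
    }

    void sendProbeRequest() { /* issue an empty AppendEntries request */ }

    void sendEntries()      { /* replicate pending log entries */ }
}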

Result:

Fixes #838 #842

TODO

  • Guideline for submitting large log entries.

@killme2008 killme2008 requested a review from fengjiachun June 18, 2022 12:53
@killme2008 killme2008 changed the title fix: waitId may case memory leak when replicator blocks on network is… fix: waitId may cause memory leak Jun 18, 2022
@killme2008 killme2008 changed the title fix: waitId may cause memory leak fix: waitMeta leak Jun 18, 2022
@killme2008 killme2008 changed the title fix: waitMeta leak fix: reset waitId unexpectedly Jun 18, 2022
@fengjiachun (Contributor) left a comment
LGTM

@fengjiachun fengjiachun merged commit aec90ee into master Jun 20, 2022
@fengjiachun fengjiachun deleted the fix/waitmeta-leak branch June 20, 2022 03:24
@fengjiachun fengjiachun mentioned this pull request Jun 20, 2022
Successfully merging this pull request may close these issues.

Thread pool exhaustion investigation (线程池满问题排查)