Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix][MetaSchedule] RPCRunner timeout when queueing up #13963

Merged

Conversation

MasterJH5574
Copy link
Contributor

@MasterJH5574 MasterJH5574 commented Feb 12, 2023

This PR fixes the timeout rule of MetaSchedule RPCRunner.

Prior to this PR, the RPCRunner sets a timeout threshold for jobs submitted to popen pool. As a result, the jobs are timed since the time that they are sent to the remote side.

Consider the case where there is only a single device for measurement. In this case, all jobs can only be executed serially and jobs must queue up. Therefore, the previous timeout configuration means the time spent on queueing will also be counted. This causes some jobs, in the worst cases, gets timeout without even started to execute, and has negative impacts on RPC MetaSchedule tuning, from the perspectives of both efficiency and result performance.

To address the problem, this PR

  • removes the RPCRunner timeout requirement for PopenExecutor. Since now, the RPCRunner timeout will only be raised from the RPC server side.
  • changes the default maximum number of RPCRunner workers to 1, which is to avoid the case where too many jobs waiting in line.

Co-authored-by: Bohan Hou 32121147+spectrometerHBH@users.noreply.github.com

@tvm-bot
Copy link
Collaborator

tvm-bot commented Feb 12, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@MasterJH5574
Copy link
Contributor Author

cc @junrushao @tqchen

@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2023-02-11-rpc-runner-timeout-fix branch from e2d19eb to a2b0143 Compare February 12, 2023 07:44
@junrushao
Copy link
Member

please fix the lint

@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2023-02-11-rpc-runner-timeout-fix branch 2 times, most recently from a8d490d to 21e3655 Compare February 12, 2023 17:30
This PR fixes the timeout rule of MetaSchedule RPCRunner.

Prior to this PR, the RPCRunner sets a timeout threshold for jobs
submitted to popen pool. As a result, the jobs are timed since the
time that they are sent to the remote side.

Consider the case where there is only a single device for measurement.
In this case, all jobs can only be executed serially and jobs must
queue up. Therefore, the previous timeout configuration means the
time spent on queueing will also be counted. This causes some jobs,
in the worst cases, gets timeout without even started to execute,
and has negative impacts on RPC MetaSchedule tuning, from the
perspectives of both efficiency and result performance.

To address the problem, this PR
* removes the RPCRunner timeout requirement for PopenExecutor. Since
now, the RPCRunner timeout will only be raised from the RPC server
side.
* changes the default maximum number of RPCRunner workers to 1, which
is to avoid the case where too many jobs waiting in line.

Co-authored-by: Bohan Hou <32121147+spectrometerHBH@users.noreply.github.com>
@MasterJH5574 MasterJH5574 force-pushed the tvm-dev/2023-02-11-rpc-runner-timeout-fix branch from 21e3655 to 5477802 Compare February 12, 2023 17:50
@MasterJH5574
Copy link
Contributor Author

Note: I reverted the removal of timeout_sec field of RPCRunnerFuture, as I found it has uses on RPCRunnerMicro and Hexagon side, which I’m not familiar with and does not know the potential implications.

timeout_sec=self.session_timeout_sec,

@junrushao junrushao merged commit 8b2b165 into apache:main Feb 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants