[MetaSchedule] Tuning API cleanup & ergonomics #12895
Conversation
*force-pushed from ad84e8d to c4afc4d*
I like this change of decoupling compilation and tuning; the changes to default class usage also make sense. Please let me know when the PR is ready for review.
*force-pushed from be1b58b to 9c28959*
*force-pushed from fb0e91e to ea280df*
*force-pushed from 6943a88 to 83871fe*
Hey @masahi, I added … On the other hand, as a high-level API, I would prefer not to tweak …
*force-pushed from 83871fe to 612979f*
Yes, I agree with this. Part of the reason I didn't want to change …
*force-pushed from 612979f to d99ac85*
*force-pushed from 105fd0c to 7b71a2a*
@masahi I updated the PR with my latest understanding of the Hexagon pipeline. Would you mind taking another look? Thanks a lot!
@junrushao I made one comment, but otherwise the Hexagon change looks good to me. It didn't occur to me before that we can do …
This PR refactors the tuning APIs to improve developer ergonomics and enable new use cases.

## Introduction

**📅 Original behavior.** The original monolithic tuning API assumes that tuning is an end-to-end process that transforms an IRModule into a runtime Module. For example, the API below is designed for Relay end-to-end tuning:

```python
from tvm import meta_schedule as ms

ms.tune_relay(
    mod: IRModule,               # The Relay program
    params: Dict[str, NDArray],  # Parameters used in the Relay program
    target: Union[str, Target],  # Compilation target
    config: TuneConfig,          # Configuration, e.g. number of trials
    work_dir: str,               # Working directory
    ...
) -> runtime.Module: ...
```

**🤔 The challenge.** While striving to be "the" API that controls end-to-end tuning, this design ignores the fact that many users want to compile a neural network without going through the tuning process, and the fact that MetaSchedule is capable of doing so when supplied with a pre-tuned database.

**🆕 Our refactoring.** Therefore, this PR caters to those concrete needs by refactoring the monolithic API into 2 or 3 stages, depending on how it is used. Take `tune_relay` as an example: it is now split into 2 separate APIs, the first of which performs the slower tuning and returns a database, while the second takes a pre-tuned database for fast Relay compilation.

```python
ms.relay_integration.tune_relay(
    mod: IRModule,
    params: Dict[str, NDArray],
    target: Union[str, Target],
    work_dir: str,
    max_trials_global: int,
    ...
) -> Database: ...

ms.relay_integration.compile_relay(
    database: Database,
    mod: IRModule,
    target: Union[Target, str],
    params: Optional[Dict[str, NDArray]],
    ...
) -> runtime.Module: ...
```

## Upgrade guide

### If you are using `ms.tune_relay`

The original monolithic API is used as:

```python
lib = ms.tune_relay(
    mod=mod,
    target=ARGS.target,
    config=ms.TuneConfig(
        strategy="evolutionary",
        num_trials_per_iter=64,
        max_trials_per_task=ARGS.num_trials,
        max_trials_global=ARGS.num_trials,
        adaptive_training=ARGS.adaptive_training,
    ),
    runner=runner,
    work_dir=ARGS.work_dir,
    params=params,
    backend=ARGS.backend,
)
```

The new design is very similar, with 2 notable differences:
- The monolithic API is split into 2 separate APIs
- It no longer requires a second-level configuration, i.e. `TuneConfig`

As a concrete example, the API above should be written as:

```python
database = ms.relay_integration.tune_relay(
    mod=mod,
    target=ARGS.target,
    work_dir=ARGS.work_dir,
    max_trials_global=ARGS.num_trials,
    num_trials_per_iter=64,
    params=params,
    runner=runner,
    strategy="evolutionary",
)
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=ARGS.target,
    params=params,
    backend=ARGS.backend,
)
```

Please refer to the changes in `python/tvm/meta_schedule/testing/tune_relay.py` as a practical example.

### If you are using `ms.tune_extracted_tasks`

As a classic use case, fluent TVM users may want to extract tasks from Relay first and filter the tasks themselves before sending them to the tuning system. This usually involves 3 APIs:

```python
from tvm import meta_schedule as ms

# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.extract_task_from_relay(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Tuning
database = tune_extracted_tasks(
    extracted_tasks,
    ms.TuneConfig(...),
    work_dir=work_dir,
    num_threads=32,
    ...,
)

# API 3. Relay compilation with the database
with tvm.transform.PassContext(
    opt_level=3,
    config={"relay.backend.use_meta_schedule": True},
):
    lib = relay.build(relay_mod, target=target, params=params)
```

To provide more fine-grained control over the tuning system, we add an extra API that allows customizing the conversion from `ms.ExtractedTask` to `ms.TuneContext`. More specifically, after this refactoring, the APIs become:

```python
# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.relay_integration.extract_tasks(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Convert `ms.ExtractedTask` to `ms.TuneContext`
tasks: List[TuneContext]
task_weights: List[float]
tasks, task_weights = ms.relay_integration.extracted_tasks_to_tune_contexts(
    extracted_tasks=extracted_tasks,
    work_dir=work_dir,
    space="post-order-apply",  # gives the flexibility to customize per-task search space
    num_threads=32,
)

# API 3. Tuning
database = ms.tune.tune_tasks(
    tasks=tasks,
    task_weights=task_weights,
    work_dir=work_dir,
    max_trials_global=20000,
)

# API 4. Relay compilation
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=ARGS.target,
    params=params,
    backend=ARGS.backend,
)
```

Please refer to the changes in `tests/python/integration/test_meta_schedule_auto_tensorize.py` as a practical example.

### Misc changes

- `blocks` in `tune_tir` is moved to `ms.space.PostOrderApply(f_block_filter=...)`
- `adaptive_training` in `tune_{relay}/{tir}/{extracted_tasks}` is moved to `ms.cost_model.XGBModel(adaptive_training=...)`
- `sch_rules`/`postprocs`/`mutators` in `tune_{relay}/{tir}/{extracted_tasks}` are moved to `ms.space.PostOrderApply(...)`; when unspecified, a target-specific default is used.
- `default_config.py` is broken down into `tvm::meta_schedule::{ScheduleRule}/{Mutator}/{Postproc}::Default{LLVM}/{CPU}/{CUDA}`.
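To make the shape of the new workflow concrete, here is a toy sketch in plain Python (not TVM code; every name in it is hypothetical) of the tune-once / compile-many pattern the refactoring enables: the expensive search stage produces a database, and compilation is a cheap lookup that can be repeated, shared across sessions, or run with a pre-tuned database and no search at all.

```python
# Illustrative sketch of the tune/compile decoupling; not TVM code.
# "Database" here is a plain dict mapping workload names to the best
# schedule found so far -- a stand-in for ms.Database.
from typing import Callable, Dict


def tune(workloads: Dict[str, Callable[[int], float]],
         trials_per_workload: int) -> Dict[str, int]:
    """Expensive stage: search for the best 'schedule' per workload."""
    database = {}
    for name, cost_fn in workloads.items():
        # Pretend the search space is the integers 0..trials-1.
        database[name] = min(range(trials_per_workload), key=cost_fn)
    return database


def compile_with(database: Dict[str, int], workload_name: str) -> str:
    """Cheap stage: a pure database lookup, no search involved."""
    schedule = database.get(workload_name)
    if schedule is None:
        return f"{workload_name}: compiled with default schedule"
    return f"{workload_name}: compiled with tuned schedule {schedule}"


workloads = {"conv2d": lambda s: (s - 3) ** 2, "dense": lambda s: abs(s - 7)}
db = tune(workloads, trials_per_workload=10)  # slow, run once
print(compile_with(db, "conv2d"))             # fast, repeatable
print(compile_with(db, "softmax"))            # falls back without tuning records
```

The design choice mirrored here is that the two stages communicate only through the database, so either stage can be swapped out or skipped independently.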
*force-pushed from 7b71a2a to b3a0191*
```python
@pytest.mark.skip("Requires cascadelake")
def test_vnni_schedule_fn_tune():
```
This test is broken with the error:

```
    space=ms.space_generator.PostOrderApply(
        f_block_filter=None,
        sch_rules=None,
        postprocs=[],
        mutator_probs=None,
    ),
)
test_meta_schedule_vnni_integration.py:213:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../python/tvm/meta_schedule/space_generator/post_order_apply.py:53: in __init__
    sch_rules, postprocs, mutator_probs = _normalize_rules(sch_rules, postprocs, mutator_probs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
sch_rules = None, postprocs = [], mutator_probs = None

    def _normalize_rules(
        sch_rules: ScheduleRuleType,
        postprocs: PostprocType,
        mutator_probs: MutatorProbType,
    ) -> Tuple[
        Optional[List["ScheduleRule"]],
        Optional[List["Postproc"]],
        Optional[Dict["Mutator", float]],
    ]:
        # pylint: disable=import-outside-toplevel
        from ..mutator import Mutator
        from ..postproc import Postproc
        from ..schedule_rule import ScheduleRule
        # pylint: enable=import-outside-toplevel
>       assert sch_rules is not None
E       AssertionError
```
will send a fix
this should work:

```python
space=ms.space_generator.PostOrderApply(
    f_block_filter=None,
    sch_rules="from-target",
    postprocs=[],
    mutator_probs="from-target",
),
```
```diff
-    config = ms.TuneConfig(
-        strategy="replay_trace",
+    target = get_hexagon_target("v68")
+    database = ms.tir_integration.tune_tir(
```
Two uses of `tune_tir` in this file have an incorrect signature. I got the following errors:

```
E  TypeError: tune_tir() got an unexpected keyword argument 'sch_rules'
E  Check failed: (!checked_type.defined()) is false: Expected Map[meta_schedule.Mutator, FloatImm], but got Array
```
this should work:

```python
target = get_hexagon_target("v68")
database = ms.tir_integration.tune_tir(
    mod=workload,
    target=target,
    max_trials_global=8,
    num_trials_per_iter=8,
    max_trials_per_task=8,
    work_dir=work_dir,
    space=ms.space_generator.PostOrderApply(
        f_block_filter=None,
        sch_rules=sch_rules,
        postprocs=postprocs,
        mutator_probs={},
    ),
    builder=get_hexagon_local_builder(),
    runner=get_hexagon_rpc_runner(hexagon_launcher, number=10),
)
sch = ms.tir_integration.compile_tir(database, workload, target)
```
```python
def schedule_rule_dense_vnni(sch: Schedule, dense_block: BlockRV):
    _schedule_dense(m=None, do_tune=True)(sch, dense_block)
    return [sch]


register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
```
@junrushao @masahi or others, may I ask what the difference is between using the TE annotation as described (e.g. `attrs={"schedule_rule": "meta_schedule.dense_vnni"}`) plus a corresponding packed func defining the schedule to use, as opposed to just generating the space via `space=ms.space_generator.ScheduleFn(_schedule_dense, ...)`?

Is it that in this test case we allow auto-scheduling for all ops but apply special manual scheduling for certain ops (dense in this case), whereas if we use the `ScheduleFn` technique for generating a search space we do not allow other operators to be auto-scheduled? Thanks!
I think `ScheduleFnDatabase` is for a completely manual schedule, while the `register_func` way allows autotvm-style template-based tuning. At least that's what I wanted to demonstrate before this PR, or before `ScheduleFnDatabase` was introduced.
In this case I'm not referring to `ScheduleFnDatabase` as used in `test_vnni_schedule_fn_database`. I'm referring to what is done in the test `test_vnni_schedule_fn_tune`, which utilizes the TE compute `schedule_rule` attr annotation along with a global packed function for the schedule that matches the annotation value `meta_schedule.dense_vnni`. I'm wondering if there is any difference or advantage between using the TE attr annotation and packed func as opposed to specifying an alternate search space with `ScheduleFn`.
Hi Chris, the `ScheduleFn` space generator is designed to schedule all blocks in the whole `Schedule`; it is not block-specific. The annotation-based packed-func scheduling only works in the `PostOrderApply` space generator, which applies the annotated rule to that specific block and applies the default schedule rules (or the schedule rules given in the user interface) to the other, non-annotated blocks.

Therefore, creating a `ScheduleFn` takes more effort, and using the annotation-based scheduling is easier because you don't need to worry about scheduling the other blocks.
Performance Numbers
The PR is tested end-to-end on a subset of representative models to avoid potential regression.
Performance comparison on V100 (AWS P3.2xlarge).
Performance comparison on Intel Skylake (AWS C5.9xlarge):
In summary, no performance regression is observed after this refactoring.