[MetaSchedule] Tuning API cleanup & ergonomics #12895

Conversation

@junrushao (Member) commented Sep 25, 2022

This PR refactors the tuning APIs to improve developer ergonomics and to enable new use cases.

## Introduction

**📅 Original behavior.** The original monolithic tuning API assumes that tuning is an end-to-end process that transforms an IRModule into a runtime Module. For example, the API below is designed for Relay end-to-end tuning:

```python
from tvm import meta_schedule as ms

ms.tune_relay(
  mod: IRModule,              # The Relay program
  target: Union[str, Target], # The compilation target
  config: TuneConfig,         # Configuration, e.g. number of trials
  work_dir: str,              # Working directory for tuning logs
  ...
) -> runtime.Module: ...
```

**🤔 The challenge.** While striving to be "the" API that controls end-to-end tuning, this design ignores the fact that many users want to compile a neural network without going through the tuning process, and the fact that MetaSchedule is capable of doing so when supplied with a pre-tuned database.

**🆕 Our refactoring.** Therefore, this PR caters to those concrete needs by refactoring the monolithic API into 2 or 3 stages, depending on how it is used. Taking `tune_relay` as an example, it is now refactored into 2 separate APIs: the first performs the slower tuning step and returns a database, while the second takes a pre-tuned database for fast Relay compilation.

```python
ms.relay_integration.tune_relay(
    mod: IRModule,
    params: Dict[str, NDArray],
    target: Union[str, Target],
    work_dir: str,
    max_trials_global: int,
    ...
) -> Database: ...

ms.relay_integration.compile_relay(
    database: Database,
    mod: IRModule,
    target: Union[Target, str],
    params: Optional[Dict[str, NDArray]],
    ...
) -> runtime.Module: ...
```
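
The second API enables the "compile without tuning" workflow: if a pre-tuned database already exists, users can skip the expensive tuning step entirely. Below is a minimal sketch of that path, assuming the database was previously serialized as the default JSON files under `work_dir`; the file names and the `mod`/`params` variables here are placeholders, not part of this PR.

```python
from tvm import meta_schedule as ms

# Load a pre-tuned database instead of running tuning again.
database = ms.database.JSONDatabase(
    path_workload="work_dir/database_workload.json",
    path_tuning_record="work_dir/database_tuning_record.json",
)
# Fast Relay compilation using only the pre-tuned records.
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,          # the Relay IRModule to compile (placeholder)
    target="llvm -num-cores 16",
    params=params,
)
```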

## Upgrade guide

### If you are using `ms.tune_relay`

The original monolithic API is used as:

```python
lib = ms.tune_relay(
    mod=mod,
    target=ARGS.target,
    config=ms.TuneConfig(
        strategy="evolutionary",
        num_trials_per_iter=64,
        max_trials_per_task=ARGS.num_trials,
        max_trials_global=ARGS.num_trials,
        adaptive_training=ARGS.adaptive_training,
    ),
    runner=runner,
    work_dir=ARGS.work_dir,
    params=params,
    backend=ARGS.backend,
)
```

The new design is very similar, with 2 notable differences:

- The monolithic API is split into 2 separate APIs
- It no longer requires a second-level configuration object, i.e. `TuneConfig`

As a concrete example, the call above should now be written as:

```python
database = ms.relay_integration.tune_relay(
    mod=mod,
    target=ARGS.target,
    work_dir=ARGS.work_dir,
    max_trials_global=ARGS.num_trials,
    num_trials_per_iter=64,
    params=params,
    runner=runner,
    strategy="evolutionary",
)
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=mod,
    target=ARGS.target,
    params=params,
    backend=ARGS.backend,
)
```

Please refer to the changes in `python/tvm/meta_schedule/testing/tune_relay.py` for a practical example.

### If you are using `ms.tune_extracted_tasks`

In a classic use case, experienced TVM users may want to extract tasks from Relay first and filter them themselves before sending them to the tuning system. This usually involves 3 APIs:

```python
from tvm import meta_schedule as ms

# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.extract_task_from_relay(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Tuning
database = ms.tune_extracted_tasks(
    extracted_tasks,
    ms.TuneConfig(...),
    work_dir=work_dir,
    num_threads=32,
    ...,
)

# API 3. Relay compilation
with database, tvm.transform.PassContext(
    opt_level=3,
    config={"relay.backend.use_meta_schedule": True},
):
    lib = relay.build(relay_mod, target=target, params=params)
```

To provide more fine-grained control over the tuning system, we add an extra API that allows customizing the conversion from `ms.ExtractedTask` to `ms.TuneContext`. More specifically, after this refactoring, the APIs become:

```python
# API 1. Task extraction and filtering
extracted_tasks: List[ExtractedTask] = ms.relay_integration.extract_tasks(relay_mod, target, params)
extracted_tasks = [task for task in extracted_tasks if "conv2d" in task.task_name]

# API 2. Convert `ms.ExtractedTask` to `ms.TuneContext`
tasks: List[TuneContext]
task_weights: List[float]
tasks, task_weights = ms.relay_integration.extracted_tasks_to_tune_contexts(
    extracted_tasks=extracted_tasks,
    work_dir=work_dir,
    space="post-order-apply",  # gives the flexibility to customize the per-task search space
    num_threads=32,
)

# API 3. Tuning
database = ms.tune.tune_tasks(
    tasks=tasks,
    task_weights=task_weights,
    work_dir=work_dir,
    max_trials_global=20000,
)

# API 4. Relay compilation
lib = ms.relay_integration.compile_relay(
    database=database,
    mod=relay_mod,
    target=ARGS.target,
    params=params,
    backend=ARGS.backend,
)
```

Please refer to the changes in `tests/python/integration/test_meta_schedule_auto_tensorize.py` for a practical example.

### Misc changes

- `blocks` in `tune_tir` is moved to `ms.space_generator.PostOrderApply(f_block_filter=...)`
- `adaptive_training` in `tune_{relay}/{tir}/{extracted_tasks}` is moved to `ms.cost_model.XGBModel(adaptive_training=...)`
- `sch_rules`/`postprocs`/`mutators` in `tune_{relay}/{tir}/{extracted_tasks}` are moved to `ms.space_generator.PostOrderApply(...)`; when unspecified, a target-specific default is used (see the sketch below)
- `default_config.py` is broken down into `tvm::meta_schedule::{ScheduleRule}/{Mutator}/{Postproc}::Default{LLVM}/{CPU}/{CUDA}`.
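
Putting the relocated knobs together, here is a minimal sketch of where these options now live when tuning a single TIR workload. The workload `mod`, target string, trial counts, and work directory are placeholders; `"from-target"` requests the target-specific defaults.

```python
from tvm import meta_schedule as ms

# Search-space knobs that used to be `blocks=` / `sch_rules=` / `postprocs=` /
# `mutators=` arguments of the tuning entry points:
space = ms.space_generator.PostOrderApply(
    f_block_filter=None,        # was `blocks=` in tune_tir
    sch_rules="from-target",    # target-specific default schedule rules
    postprocs="from-target",
    mutator_probs="from-target",
)

# Cost-model knob that used to be `adaptive_training=`:
cost_model = ms.cost_model.XGBModel(adaptive_training=True)

database = ms.tir_integration.tune_tir(
    mod=mod,                    # a TIR workload to tune (placeholder)
    target="llvm -num-cores 16",
    work_dir="./tune_tmp",
    max_trials_global=64,
    space=space,
    cost_model=cost_model,
)
sch = ms.tir_integration.compile_tir(database, mod, "llvm -num-cores 16")
```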

## Performance Numbers

The PR is tested end-to-end on a subset of representative models to check for potential performance regressions.

Performance comparison on V100 (AWS P3.2xlarge):

| Model | MetaSchedule @ main (ms) | This PR (ms) | Difference |
|---|---|---|---|
| bert_base | 3.185650996 | 3.222358502 | -1.14% |
| resnet_50 | 1.588203344 | 1.586299171 | 0.12% |
| mobilenet_v2 | 0.4574400258 | 0.4596171817 | -0.47% |
| resnet_18 | 0.6853301584 | 0.6812821976 | 0.59% |
| mobilenet_v3 | 0.7230763281 | 0.7010596015 | 3.14% |
| wide_resnet_50 | 2.864763701 | 2.797114016 | 2.42% |
| densenet_121 | 2.330949968 | 2.332683173 | -0.07% |
| vgg_16 | 2.780654826 | 2.807344907 | -0.95% |

Performance comparison on Intel Skylake (AWS C5.9xlarge):

| Model | MetaSchedule @ main (ms) | This PR (ms) | Difference |
|---|---|---|---|
| bert_base | 12.15242064 | 12.37192344 | -1.77% |
| resnet_50 | 5.225000453 | 5.320676231 | -1.80% |
| mobilenet_v2 | 0.7461500253 | 0.753737067 | -1.01% |
| resnet_18 | 2.103578434 | 2.019274095 | 4.17% |
| mobilenet_v3 | 1.14312758 | 1.15862842 | -1.34% |
| wide_resnet_50 | 11.73288837 | 11.84455867 | -0.94% |
| densenet_121 | 14.90702895 | 15.41747371 | -3.31% |
| vgg_16 | 15.47650269 | 15.42590106 | 0.33% |

In summary, no performance regression is observed after this refactoring.

@junrushao junrushao marked this pull request as ready for review September 25, 2022 05:49
@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch 16 times, most recently from ad84e8d to c4afc4d on September 27, 2022 20:48
@zxybazh (Member) commented Sep 27, 2022

I like this change of decoupling compilation and tuning; the changes to the usage of the default classes also make sense. Please let me know when the PR is ready for review.

@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch 7 times, most recently from be1b58b to 9c28959 on September 28, 2022 04:20
@junrushao (Member Author)

@tqchen @Hzfengsy @spectrometerHBH @zxybazh @vinx13 @yelite The PR is ready for review. Please take a look!

@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch 3 times, most recently from fb0e91e to ea280df on September 29, 2022 00:26
@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch 2 times, most recently from 6943a88 to 83871fe on October 6, 2022 03:25
@junrushao (Member Author)

Hey @masahi, I added an executor parameter to extract_tasks and compile_relay, which controls the default value of relay.FuseOps.link_params in the pass configuration. It's quite confusing to me that executor is lifted out of the pass config and somehow controls the compilation process in a half-functioning way (it only works for GraphExecutor), and I am not sure if I'm using it correctly, so please feel free to suggest what the best way is :-)

On the other hand, as a high-level API, I would prefer not to tweak tune_relay by adding more parameters to it, given that we wanted to give a cleaner interface to introductory-level users. Instead, advanced users can always use extract_tasks + tune_tasks + compile_relay to get fine-grained control over the tuning process.

@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch from 83871fe to 612979f on October 6, 2022 06:06
@masahi (Member) commented Oct 6, 2022

> On the other hand, as a high-level API, I would prefer not to tweak tune_relay by adding more parameters to it, given that we wanted to give a cleaner interface to introductory-level users. Instead, advanced users can always use extract_tasks + tune_tasks + compile_relay to get fine-grained control over the tuning process.

Yes, I agree with this. Part of the reason I didn't want to change the extract_task API before was that the vast majority of users don't need to care about executor details. For Hexagon users, we can add a wrapper API in contrib/hexagon/meta_schedule to simplify the usage. We already require using a Hexagon-specific builder / runner, so the wrapper API can hide such details too.
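
For illustration, such a Hexagon wrapper might look like the sketch below; the name `tune_hexagon_relay` and its parameter list are hypothetical, not an existing API, but it reuses the `get_hexagon_local_builder` / `get_hexagon_rpc_runner` helpers that already appear in this PR's Hexagon tests.

```python
# Hypothetical wrapper sketch; `tune_hexagon_relay` is not an existing API.
from tvm import meta_schedule as ms
from tvm.contrib.hexagon.meta_schedule import (
    get_hexagon_local_builder,
    get_hexagon_rpc_runner,
)

def tune_hexagon_relay(mod, params, target, hexagon_launcher, work_dir, max_trials_global):
    """Tune a Relay module for Hexagon, hiding the Hexagon-specific builder/runner."""
    return ms.relay_integration.tune_relay(
        mod=mod,
        params=params,
        target=target,
        work_dir=work_dir,
        max_trials_global=max_trials_global,
        builder=get_hexagon_local_builder(),
        runner=get_hexagon_rpc_runner(hexagon_launcher, number=10),
    )
```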

@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch from 612979f to d99ac85 on October 6, 2022 14:28
@junrushao junrushao changed the title from "[MetaSchedule] UX: Tuning API cleanup & developer ergonomics" to "[MetaSchedule] Tuning API cleanup & ergonomics" on Oct 6, 2022
@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch 2 times, most recently from 105fd0c to 7b71a2a on October 6, 2022 20:58
@junrushao (Member Author)

@masahi I updated the PR with my latest understanding of the Hexagon pipeline. Would you mind taking another look? Thanks a lot!

@masahi (Member) commented Oct 7, 2022

@junrushao I made one comment, but otherwise the Hexagon change looks good to me. It didn't occur to me before that we can do mod = mod.with_attr(...) from the user script to avoid threading the executor through task extraction, tune_relay, etc.
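
For reference, the `with_attr` approach mentioned above can look roughly like this sketch; the executor name and options are placeholders showing the general pattern of attaching the executor to the module instead of threading it through the tuning APIs.

```python
from tvm import relay

# Attach the executor (and its options) to the IRModule itself, so that task
# extraction, tune_relay, and compile_relay need no extra `executor` argument.
executor = relay.backend.Executor("graph", {"link-params": True})
relay_mod = relay_mod.with_attr("executor", executor)
```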

@junrushao junrushao force-pushed the feature/2022-09-19/tune-api-refactoring branch from 7b71a2a to b3a0191 on October 7, 2022 02:46
@spectrometerHBH spectrometerHBH merged commit 6780c9f into apache:main Oct 7, 2022


@pytest.mark.skip("Requires cascadelake")
def test_vnni_schedule_fn_tune():
Member

This test is broken with the following error:

>               space=ms.space_generator.PostOrderApply(
                    f_block_filter=None,
                    sch_rules=None,
                    postprocs=[],
                    mutator_probs=None,
                ),
            )

test_meta_schedule_vnni_integration.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../python/tvm/meta_schedule/space_generator/post_order_apply.py:53: in __init__
    sch_rules, postprocs, mutator_probs = _normalize_rules(sch_rules, postprocs, mutator_probs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

sch_rules = None, postprocs = [], mutator_probs = None

    def _normalize_rules(
        sch_rules: ScheduleRuleType,
        postprocs: PostprocType,
        mutator_probs: MutatorProbType,
    ) -> Tuple[
        Optional[List["ScheduleRule"]],
        Optional[List["Postproc"]],
        Optional[Dict["Mutator", float]],
    ]:
        # pylint: disable=import-outside-toplevel
        from ..mutator import Mutator
        from ..postproc import Postproc
        from ..schedule_rule import ScheduleRule
    
        # pylint: enable=import-outside-toplevel
>       assert sch_rules is not None
E       AssertionError

Member Author

will send a fix

Member Author

this should work:

               space=ms.space_generator.PostOrderApply(
                    f_block_filter=None,
                    sch_rules="from-target",
                    postprocs=[],
                    mutator_probs="from-target",
                ),
            )

config = ms.TuneConfig(
strategy="replay_trace",
target = get_hexagon_target("v68")
database = ms.tir_integration.tune_tir(
Member

Two uses of tune_tir in this file have incorrect signatures. I got the following errors:

E               TypeError: tune_tir() got an unexpected keyword argument 'sch_rules' 
E             Check failed: (!checked_type.defined()) is false: Expected Map[meta_schedule.Mutator, FloatImm], but got Array

Member Author

this should work:

            target = get_hexagon_target("v68")
            database = ms.tir_integration.tune_tir(
                mod=workload,
                target=target,
                max_trials_global=8,
                num_trials_per_iter=8,
                max_trials_per_task=8,
                work_dir=work_dir,
                space=ms.space_generator.PostOrderApply(
                    f_block_filter=None,
                    sch_rules=sch_rules,
                    postprocs=postprocs,
                    mutator_probs={},
                ),
                builder=get_hexagon_local_builder(),
                runner=get_hexagon_rpc_runner(hexagon_launcher, number=10),
            )
            sch = ms.tir_integration.compile_tir(database, workload, target)

Comment on lines +192 to +196
def schedule_rule_dense_vnni(sch: Schedule, dense_block: BlockRV):
_schedule_dense(m=None, do_tune=True)(sch, dense_block)
return [sch]

register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
Contributor

@junrushao @masahi or others, may I ask what the difference is between using the TE annotation as described (e.g. attrs={"schedule_rule": "meta_schedule.dense_vnni"}) together with a corresponding packed func defining the schedule to use, as opposed to just generating the space via

space=ms.space_generator.ScheduleFn(
     _schedule_dense,
    ...
),

?

Is it that in this test case we allow auto scheduling for all ops but apply special manual scheduling for certain ops (dense in this case), whereas if we use the ScheduleFn technique for generating a search space we do not allow other operators to be auto scheduled? Thanks!

Member

I think ScheduleFnDatabase is for a completely manual schedule, while the register_func way allows autotvm-style template-based tuning. At least that's what I wanted to demonstrate before this PR, or before ScheduleFnDatabase was introduced.

Contributor

In this case I'm not referring to ScheduleFnDatabase as used in test_vnni_schedule_fn_database. I'm referring to what is done in the test test_vnni_schedule_fn_tune, which uses the TE compute schedule_rule attr annotation along with a global packed function for the schedule that matches the annotation value meta_schedule.dense_vnni. I'm wondering if there is any difference or advantage between using the TE attr annotation and packed func as opposed to specifying an alternate search space with ScheduleFn.

Member

Hi Chris, the ScheduleFn space generator is designed to schedule all blocks in the whole Schedule; it is not block-specific. The annotation-based packed-func scheduling only works with the PostOrderApply space generator, which essentially applies the annotated rule to that specific block and applies the default schedule rules (or the schedule rules given in the user interface) to the other, non-annotated blocks.

Therefore, creating a ScheduleFn takes more effort, while using the annotation-based scheduling is easier because you don't need to worry about scheduling the other blocks.
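
To make the contrast concrete, here is a small sketch of how the two space generators would be constructed, assuming `schedule_all_blocks` is a user-written function that schedules every block and `schedule_rule_dense_vnni` is the packed func from this test:

```python
from tvm import register_func
from tvm import meta_schedule as ms

# Option 1: ScheduleFn — the user function must schedule the whole workload.
space_manual = ms.space_generator.ScheduleFn(schedule_all_blocks)

# Option 2: PostOrderApply — only the block annotated with
# `schedule_rule="meta_schedule.dense_vnni"` uses the registered packed func;
# all other blocks fall back to the default (target-specific) schedule rules.
register_func("meta_schedule.dense_vnni", schedule_rule_dense_vnni)
space_hybrid = ms.space_generator.PostOrderApply(
    f_block_filter=None,
    sch_rules="from-target",
    postprocs="from-target",
    mutator_probs="from-target",
)
```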

Contributor

Ahh, okay I see, thanks for the discussion @zxybazh @masahi, this is helpful.
