Asynchronous stage in software pipeline #80

Merged 14 commits on Jul 16, 2022
Conversation

masahi
Member

@masahi masahi commented Jun 17, 2022

Rendered

I'm looking for feedback, particularly on the synchronization model. Let me know if your target of interest can or cannot be supported by this approach!

@junrushao1994 @vinx13 @csullivan @tqchen @Hzfengsy @kparzysz-quic @wrongtest

@Hzfengsy
Member

Many thanks, @masahi! It's an outstanding contribution that pushes TVM performance to another level. My experiments show the async pipeline speeds up cutlass from 80 TFLOPS to 93 TFLOPS (fp16 GEMM, on a 3080). I am looking forward to the follow-up PRs.

One question: Can it speed up kernels using the wmma intrinsic, or those running on CUDA cores?

@masahi
Member Author

masahi commented Jun 18, 2022

@Hzfengsy On NVIDIA Ampere, the only asynchronous operation is the global-to-shared-memory copy, so kernels using wmma, mma, or CUDA cores can all benefit from it.

I have an MMA schedule with a multi-stage pipeline, where the global-to-shared-memory copy is 4x multi-buffered and asynchronous. The test case is here: https://github.com/masahi/tvm/blob/2b325394e951b8b38c9a24d9a4b7a8c6f6d749e7/tests/python/unittest/test_tir_transform_inject_software_pipeline.py#L1436. It generates interesting code (generated TIR after lowering to PTX async instructions), but currently I'm not getting good performance: the baseline MMA schedule without any pipelining or async gets 40 TFLOPS on a 3070, while this fancy schedule gets only 33 TFLOPS.

My next step is to get Ampere async GEMM actually perform on par with or better than the baseline.
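
For reference, a minimal sketch of the kind of schedule annotations involved (the loop name k0 and the stage values are illustrative, using the stage-based spelling of software_pipeline_async_stages that this thread later settles on, not necessarily the exact values in the linked test):

sch.annotate(k0, ann_key="software_pipeline_stage", ann_val=[0, 0, 3])
sch.annotate(k0, ann_key="software_pipeline_order", ann_val=[0, 1, 2])
sch.annotate(k0, ann_key="software_pipeline_async_stages", ann_val=[0])  # the global->shared copies in stage 0 become async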

@Hzfengsy
Member

My observation: async works well for fp16-16-16 (i.e., with an fp16 accumulator) but helps little for fp16-16-32. On the other hand, the best cutlass kernel only uses 3 stages on my machine.

I guess it is because of the shared memory usage.

@wrongtest-intellif

Many thanks~ This setting also seems to greatly benefit DMA synchronization handling in NPU workloads. For example, there could be "input DMA" - "computation" - "output DMA" pipelines, where each pipeline stage may occupy its own instruction queue, so explicit synchronization instructions have to be inserted correctly, like "input DMA waits for the last (i-1 or i-2) output DMA".

Here are two questions, just out of curiosity :)

  • What is the main purpose of the async_scope annotation? Is it for explicit semantic representation, a useful hint for lowering analysis, or something that affects the final CUDA codegen?
    If we only have data dependencies instead of explicit control-flow dependency annotations, could we still arrive at the same proper synchronizations?

  • For async_commit_stage/async_wait_stage, can I understand them as standard TIR intrinsics in the stage-pipeline setting, where the only thing vendors need to care about is how to lower / codegen them?

@masahi
Member Author

masahi commented Jun 20, 2022

Thanks @wrongtest for the questions!

What is the main purpose of the async_scope annotation? Is it for explicit semantic representation, a useful hint for lowering analysis, or something that affects the final CUDA codegen?

I've added a section explaining what async_scope is for. My intention is indeed more of "a useful hint for lowering analysis, or something that affects the final CUDA codegen". In particular, it is not used for sync insertion at all. I wouldn't call it "semantic representation", but it does make things explicit for later passes and also helps when eyeballing TIR dumps :)

If we only have data dependencies instead of explicit control-flow dependency annotations, could we still arrive at the same proper synchronizations?

What do you mean by "explicit control-flow dependency annotations" here? From the given annotation, we obviously need to determine read-write dependencies to tell which stmt is the consumer of which async stmt. In that sense, I think I'm already working with data dependencies only. I don't think the TIR software pipeline transform pass deals with control flow at all.

For async_commit_stage/async_wait_stage, can I understand them as standard TIR intrinsics in the stage-pipeline setting, where the only thing vendors need to care about is how to lower / codegen them?

Yes, if I understand your question, async_commit_stage/async_wait_stage are new TIR intrinsics:
https://github.com/masahi/tvm/blob/4f40d1de6aa00ddaa709d0a21e6685cb9e9f57af/include/tvm/tir/builtin.h#L673-L674

They are generated during sync insertion, and each backend can specify how to lower them, for example in CUDA:
https://github.com/masahi/tvm/blob/4f40d1de6aa00ddaa709d0a21e6685cb9e9f57af/src/target/source/intrin_rule_cuda.cc#L244-L248
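
As a rough illustration of what that lowering means in practice (pseudocode; the PTX in the comments is the Ampere cp.async instruction family this maps to on CUDA):

async_commit_stage(0)   # e.g., lowers to PTX "cp.async.commit_group;"
async_wait_stage(0, 3)  # e.g., lowers to PTX "cp.async.wait_group 3;", i.e.,
                        # block until at most 3 commit groups remain in flight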

@JosephTheOctonaut

Awesome stuff, @masahi !

Specific questions:

  • I don't think I saw it explicitly mentioned, but it is assumed/required that the backend executes committed chunks in order, right? E.g., if the program commits a chunk for i=0 to i=5 before hitting a wait, the hardware can't decide to execute the i=5 chunk first.
  • The index "software_pipeline_async_stages" refers to elements of the "software_pipeline_stage" list, correct? I found this a bit confusing, because the asynchronous "stages" are not the same as the pipeline "stages". (e.g., a pipeline with "software_pipeline_stage": [0, 0, 0, 1] could have "software_pipeline_async_stages": [2, 3] even though there are no pipeline stages 2 and 3)
  • The definition of async_commit_stage(i) is to "Group one or more invocations of async operations...". Is the model here that all uncommitted async operations at that point in the program are committed? Furthermore, are "async operations" defined as any statement in a pipeline stage that was identified as async?
    • I initially found it confusing that async_commit_stage(i) could appear inside and outside of an async_scope: section. It makes sense that it could be both, because (assuming my above assumptions are correct) it doesn't have to do with any lexical scoping, operating instead on the "hidden state" of whatever async operations are uncommitted. No specific question here, but maybe a more intuitive grouping is possible that would be syntactically visible.
  • Is it correct that async_wait_stage will only block the "main" thread of execution? For example, if (on a particular backend) an async pipeline stage is executed by another thread, and async_wait_stage appeared in that stage, which thread would it block? I'm assuming the answer is "whatever the backend lowering does," but I was curious if there was an intended semantics.

Small doc nits for clarity:

  • for k0 in range(125): I think the index should be i here
  • B_local[0] <- B_shared[(I + 1) % 4, ...] capital I used instead of i
  • async_scoppe instead of async_scope

Higher-level thoughts:

One concern is that an async system moved into TIR may be too narrow. You discuss this quite a bit, and make a lot of good points. This system clearly works very well for CUDA, but if there isn't a good transform to token-based systems, do we end up with two async systems in TIR? I'm not trying to suggest this shouldn't be incorporated, just genuinely curious what the long-term plans might be.

Importantly, we are not trying to propose a “general async semantics to TIR”. Rather the goal is to come up with an async design specifically for the TIR software pipeline transform.

This can be a bit difficult, as it would make sense for future systems to try to use the existing async system to whatever extent possible. Again, not trying to suggest a specific course of action, just pointing out that there's a difficult general-vs-specific trade-off here.

Regarding interoperability with a token-based model: you make an excellent point that a token system has expressibility issues in a TIR program, because it has to refer to a specific statement at a specific loop iteration. But it's also more intuitive. This contrasts with the implicit system here, which feels unintuitive but is quite succinct and easily expressible.

I'm still working through what a translation to/from a token system might look like, but I'm currently thinking that they're much closer than I initially thought. In either case, what we want (at a high level) is a way to refer to a specific chunk, blocking execution until that chunk has finished. Your comment about keeping a buffer in the token case made me realize that it ends up pretty similar: a token system waiting for chunk C from N iterations previous might use wait(C, i-N), which looks pretty similar to the wait(C, N) we would use here. (Just my preliminary thoughts; I'm not sure if there's a nicer abstraction to use in TIR that caters to both systems.)

It's interesting that you mention that there's no easy translation from tokens to counting (based on MLIR not having implemented one?), but you suspect the reverse could be simple. Does this suggest that the token system has less information encoded than the counting system? (I.e., we can go counting --> tokens but not the reverse because we lost information in the transformation.) Or is it just specifics of a PTX-like system, not a "counting system" in general, that make the translation to it hard?

@masahi
Member Author

masahi commented Jun 22, 2022

Thanks for the detailed feedback @JosephTheOctonaut! I'll update the doc accordingly, but here are my answers.

I don't think I saw it explicitly mentioned, but it is assumed/required that the backend executes committed chunks in order, right

Correct, commit groups must execute in FIFO order. But the order of completions within one commit group is not specified, following the PTX spec.
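
A small sketch of what the FIFO guarantee implies (pseudocode, using this proposal's intrinsics):

async_commit_stage(0)   # group g0
async_commit_stage(0)   # group g1
async_commit_stage(0)   # group g2
async_wait_stage(0, 1)  # allow at most 1 group in flight: by FIFO order, g0
                        # and g1 are complete here; only g2 may still be
                        # pending. Completion order *within* each group is
                        # unspecified.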

The index "software_pipeline_async_stages" refers to elements of the "software_pipeline_stage" list, correct?

Yes. I haven't put deep thought into the name choice; here I simply wanted to say "the index of the stmt/block in the list" provided to software_pipeline_stage and software_pipeline_order. @vinx13 was also confused, so I should come up with a better name. Maybe software_pipeline_async_statement_index?

The definition of async_commit_stage(i) is to "Group one or more invocations of async operations...". Is the model here that all uncommitted async operations at that point in the program are committed? Furthermore, are "async operations" defined as any statement in a pipeline stage that was identified as async?

Both correct. I'm using "commit" in the same sense as PTX here. In the doc, I'm probably using "async operations" where I should be using "async commit groups" to be exact. But I think I'm using "async commit groups" when the distinction matters.

I initially found it confusing that async_commit_stage(i) could appear inside and outside of an async_scope: section

Yes, I agree that it was confusing; I was a bit informal since it is just pseudocode. As you said, the exact placement doesn't matter, both in the illustration and in the implementation. I made it consistent in the doc. In the implementation, async_scope only encloses the operation itself.

Is it correct that async_wait_stage will only block the "main" thread of execution?

This is an interesting question that I haven't really thought about. I would expect that each async "engine" is represented by its own thread, so for example if a vector unit finds out that it needs to wait at some point, the thread that's running the vector unit should block. I hope this makes sense... I think this is a natural model, but as you said, I'm not sure if such details should be specified at the TIR level.

It's interesting that you mention that there's no easy translation from tokens to counting (based on MLIR not having implemented one?), but you suspect the reverse could be simple. Does this suggest that the token system has less information encoded than the counting system? (I.e., we can go counting --> tokens but not the reverse because we lost information in the transformation.) Or is it just specifics of a PTX-like system, not a "counting system" in general, that make the translation to it hard?

What I said is that the reverse seems more "feasible", not simple :) I would say the counting system relies on more implicit state in the HW, so going from tokens to counts requires uncovering that state from the given token alone. The lost information would be the ordering of async operations (or commit groups, to be exact), or any information about "other" tokens. Given only a token, we don't know (1) how many other async ops are in flight before that sync point and (2) how many of them can still be in flight after (the latter is required by PTX). I claim that this is a difficult problem in general, and I gave the MLIR bit as a data point. Maybe we can encode some information in the token itself, but I haven't really thought about it.

@wrongtest-intellif

wrongtest-intellif commented Jun 22, 2022

@masahi Hi~ thanks for the reply several days ago. After a bit of learning about the CUDA async interface cuda::pipeline, may I ask one more question? Thank you very much again!

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#with-memcpy_async-pipeline-pattern-multi
The tutorial says there are four primitives: producer_acquire, producer_commit, consumer_wait, consumer_release. What is their relation to the proposed async_commit_stage/async_wait_stage (if they are related)?

Contributor

@Lunderberg Lunderberg left a comment

Thank you for putting this together, and I really like it!

**Semantics of the proposed intrinsics**. “stage” refers to the same notion in the TIR software pipeline.
- `async_commit_stage(i)` : Group one or more invocations of async operations, and “commit” them to the `i`-th stage. The exact interpretation of “committing” can be up to each backend, but informally it signifies that a group of async operations is now in flight. The group of operations committed together is awaited as one chunk, and thus it constitutes the granularity at which the synchronization intrinsic discussed next operates.
Contributor

Would it be cleaner to express async_commit_stage as an annotation rather than a callable intrinsic? I'm thinking something like the following:

# A stage includes one or more async_scope blocks, and defines a
# group.  async_scope blocks may only occur directly within an
# async_commit_stage block, or as elements of a SeqStmt that occurs
# directly within an async_commit_stage block.  The group is committed
# at the site where it is defined.
with async_commit_stage(0):
    with async_scope:
        B[(i + 1) % 2] = A[i] + 1

This way, there's less backtracking needed for a reader to determine which scopes are being launched, and a runtime wouldn't need to maintain state describing the stages to be launched next time it encounters a call to async_commit_stage. This would also prevent cases where the scopes have been defined outside a conditional, but the async_commit_stage call exists inside a conditional.

Contributor

Ooh, this would also give a really clean notation for stages that consist of only a single scope.

# If only one async_scope exists, the annotation can be dropped.
with async_stage(0):
    B[(i + 1) % 2] = A[i] + 1

Member Author

Thanks, I think this is a great idea. We probably need a separate lowering pass for commit_stage to insert a target-specific commit at the right place, whereas currently the lowering is trivial (a line-by-line change). But it is certainly doable. I'll try to implement this while we resolve other discussion points.

Member Author

I've incorporated this suggestion into the doc and also in my implementation.

(The following is highly speculative) On the other hand, translation from “count” to “token” seems more feasible: At each synchronization point, a backend presumably maintains the number and the order of pending async operations. Given the count, it should be possible to derive the correct token from the corresponding ordered list of tokens.
Contributor

Though speculative, this makes sense to me. This could even be done within the TIR itself, with each "stage" having a unique integer value, and only ever using N=0 when waiting. In effect, the integer passed to commit/wait would be the unique token. (This assumes that there is minimal overhead in maintaining the existence of a "stage" that can be waited on.)

If sequentially assigned, I think this would also allow the token-based integer synchronization to be translated into the count-based synchronization. (e.g. If iteration i launches stage i + offset and wants N in-flight, it could wait on stage i + offset - N.)
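
A sketch of that translation (pseudocode; num_iters, offset, and N are the symbols from the comment above):

for i in range(num_iters):
    async_commit_stage(i + offset)       # the sequentially assigned ID acts as a token
    async_wait_stage(i + offset - N, 0)  # wait until the chunk from N iterations
                                         # ago has fully completed (count 0)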

Member Author

@masahi masahi Jun 22, 2022

I realized that I should make the notion of "stage" more precise. For example, if you look at the "Multi-stage pipelined GEMM" under https://github.com/masahi/tvm-rfcs/blob/async-pipe/rfcs/0077-async-pipeline.md#more-examples, and in particular at how A_shared and B_shared are produced and consumed, there are 4 "stages" in the traditional sense, one of which is overlapped with compute. But commit_stage and wait_stage only ever refer to stage 0, the async producer stage. This notion of stage might be TIR-software-pipeline specific, and it corresponds to the "0" in the annotation

sch.annotate(k0, ann_key="software_pipeline_stage", ann_val=[0, 0, 2, 3, 3])

I think the notion of "stage" you have in mind is the traditional one, for example when you say "stage i + offset", that would correspond to the index (i + 3) % 4 in the example.

On the other hand, I and this proposal talk about "stage" in the sense used by TIR software pipeline. So for example, if I am given an annotation "software_pipeline_stage", ann_val=[0, 0, 3], I'd say there are "two" stages in the TIR sense, even though the traditional sense would be four stages.

cc @vinx13


@masahi I believe I understand the intended usage of "stage" here, but could we use it like Eric is suggesting? Is there a limitation to having an arbitrary number of async stages?

@Lunderberg Unless I've misunderstood, this translation would require fully unrolling all loops, right? Because each time you call commit you need to pass in a new static integer?


An alternative to unrolling is to allow non-constant stage numbers. It might not be pretty or succinct, but you could thread variable(s) through the program to hold stage (fence) IDs.

Member

@vinx13 vinx13 Jun 22, 2022

@masahi's description is accurate. In the TIR software pipeline, we can group multiple statements into one stage. Here, a stage is only an annotation of how the loop should be shifted: the i-th iteration of the pipelined loop executes the (i - stage)-th iteration of the original loop.
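
A tiny worked example of that shift (pseudocode; load/compute are placeholder statements, with stages [0, 1]):

# Original loop: for i in range(n): load(i); compute(i)
# After pipelining, iteration i of the shifted loop runs the stage-s
# statement at its original iteration i - s, overlapping load and compute.
for i in range(n + 1):
    if i < n:
        load(i)         # stage 0: original iteration i
    if i >= 1:
        compute(i - 1)  # stage 1: original iteration i - 1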

Contributor

@masahi I believe I understand the intended usage of "stage" here, but could we use it like Eric is suggesting? Is there a limitation to having an arbitrary number of async stages?

That's correct. My goal was to map out what can be expressed using the commit/wait semantics as defined, whether it results in broader functionality than the pipelining use case, and whether that broader functionality is desired.

On the other hand, I and this proposal talk about "stage" in the sense used by TIR software pipeline. So for example, if I am given an annotation "software_pipeline_stage", ann_val=[0, 0, 3], I'd say there are "two" stages in the TIR sense, even though the traditional sense would be four stages.

Reading through again, I think there are two different levels of abstraction, and the use of the term "stage" at both levels may be causing my confusion. At the higher abstraction level with "software_pipeline_stage", each stage is defined by its value in the annotations. At the lower abstraction level with commit/wait, the first argument defines a set of work to be completed, or to be waited on. The first argument at the commit/wait abstraction level is generated from the stage at the "software_pipeline" abstraction level, but that doesn't mean it would be the only possible lowering.

@Lunderberg Unless I've misunderstood, this translation would require fully unrolling all loops, right? Because each time you call commit you need to pass in a new static integer?

An alternative to unrolling is to allow non-constant stage numbers. It might not be pretty or succinct, but you could thread variable(s) through the program to hold stage (fence) IDs.

@JosephTheOctonaut Correct, the non-constant stage numbers are what I had been picturing. For static graphs without unrolling loops, this would be some loop_iter + offset used to define the unique ID for each commit/wait.


(The following is highly speculative) On the other hand, translation from “count” to “token” seems more feasible: At each synchronization point, a backend presumably maintains the number and the order of pending async operations. Given the count, it should be possible to derive the correct token from the corresponding ordered list of tokens.

Importantly, we are not trying to propose a “general async semantics to TIR”. Rather the goal is to come up with an async design specifically for the TIR software pipeline transform. Hopefully, this would allow making assumptions that might not be reasonable in more general settings (control flow, relative order of operations etc), and simplify the implementation by building on what the TIR software pipeline already produces as part of transform. Hopefully, reading the explanation below should also convince one that the counting based sync is a natural fit (or “good enough”) for the TIR software pipeline.
Contributor

Since a unique integer "stage" value passed to commit/wait would be equivalent to creating and waiting on a fence, I think the current design does result in general async semantics. I'm not opposed to doing so, but if we want to avoid a generic async framework, we should make a more restrictive data structure for it.

@vinx13
Member

vinx13 commented Jun 22, 2022

@wrongtest producer_acquire, producer_commit, consumer_wait, and consumer_release are high-level CUDA C++ APIs. A CUDA pipeline at cuda::thread_scope::thread_scope_block is equivalent to the low-level PTX commit / wait; these four primitives are implemented with internal counting to emit a count-based async pipeline. (The CUDA C++ pipeline API can also be used at shared scope, which requires a more complicated implementation.)

@masahi
Member Author

masahi commented Jun 23, 2022

@JosephTheOctonaut, I'm going to change the definition of software_pipeline_async_stages from the current "index of the statement" to the "stage" itself, since the current one is confusing. Actually, I confused myself in our conversation today: I mistakenly said that the elements in software_pipeline_async_stages should be less than the maximum value of software_pipeline_stage.

For example, if we have three statements in the loop and we want to make the first two statements async, the current annotation is

sch.annotate(loop, ann_key="software_pipeline_stage", ann_val=[0, 0, 3])
sch.annotate(loop, ann_key="software_pipeline_async_stages", ann_val=[0, 1]) # 0, 1 are indices to the list of statements

this would become the following after this change.

sch.annotate(loop, ann_key="software_pipeline_stage", ann_val=[0, 0, 3])
sch.annotate(loop, ann_key="software_pipeline_async_stages", ann_val=[0]) # the stage "0" is async.

With this, the label "software_pipeline_async_stages" shouldn't be ambiguous. Also, this change makes it clear that statements in the same stage cannot mix sync and async.

But the notion of stage here is the TIR-specific one, which comes back to your question in #80 (comment)

I believe I understand the intended usage of "stage" here, but could we use it like Eric is suggesting?

I want to keep the terminology consistent between this RFC and the TIR software pipeline implementation, so if we want to change the meaning of "stage" in this proposal, I want to evaluate the feasibility of such a change to the implementation first. Generally, I agree that we should use the common terminology. cc @vinx13 @Lunderberg if they have any thoughts on this topic.

@Lunderberg
Contributor

I want to keep the terminology consistent between this RFC and the TIR software pipeline implementation, so if we want to change the meaning of "stage" in this proposal, I want to evaluate the feasibility of such a change to the implementation first. Generally, I agree that we should use the common terminology. cc @vinx13 @Lunderberg if they have any thoughts on this topic.

On the terminology side, I'm wondering if we want to have separate terminology at the different abstraction levels. At the level of "software_pipeline_async_stages" annotations, the "stages" term makes sense to me, but not at the commit/wait abstraction level.

At the commit/wait abstraction level, what if we rename it from "stage" to "queue"? If I'm reading correctly, that is the functionality provided by commit/wait: sequential calls to commit with the same value of i define a work queue to be executed sequentially, and calls with different values of i define work that can be performed independently. That the "queue" is produced from the "stage" feels more like an implementation detail of the lowering, rather than something inherent to the commit/wait functionality.
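
To make the "queue" reading concrete (pseudocode, with the renamed intrinsics):

async_commit_queue(0)   # group A enters queue 0
async_commit_queue(1)   # group B enters queue 1, independent of queue 0
async_commit_queue(0)   # group C enters queue 0, behind A
async_wait_queue(0, 0)  # A and C are complete here; nothing is implied about B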

@JosephTheOctonaut

@masahi

I want to keep the terminology between this RFC and the TIR software pipeline implementation consistent

@Lunderberg

On the terminology side, I'm wondering if we want to have separate terminology at the different abstraction levels

I agree with @Lunderberg here, but I'd also like to draw a distinction between the current intended usage of the system and how we might expand upon it later. Specifically, we "intend" that the stage i in commit_stage(i) and wait_stage(i) refers to software pipeline stages, but there is no technical requirement or reliance on the integer i corresponding to a pipeline stage.

Putting it in Eric's terms of "queues" (a terminology change I support), a standard usage would have one queue for each async stage, because you need to synchronize around the output of each async group. But we can imagine simple alternative usages that do not have this 1-to-1 mapping. E.g., in stage 0 we have 5 DMA units performing 5 parallel loads, which are used in 5 parallel pipelines; here, we'd want 5 queues, but they all correspond to stage 0.

if we want to change the meaning of "stage" in this proposal, I want to evaluate the feasibility of such a change to the implementation first

To double check, would this kind of proposed usage affect the implementation at all? I thought you said in our chat that the i passed to commit/wait could be arbitrary (though it's perhaps not the intended pattern).

@JosephTheOctonaut

I'm going to write up a few clarifying notes from the chat yesterday; please correct me if any of these are wrong.

  • In commit_stage(i) and wait_stage(i), the integer i is only used for the commit/wait system.
  • The integer i (as above) is currently considered to be static and constant, but a backend could treat it as a runtime value if necessary.
  • The current user interface is to write the "software_pipeline_async_stages" annotation on a schedule, but there seemed to be support for a sugared version that provides more intuition and visibility in the TIR.
  • The blocking instruction wait_stage(i, N) blocks the main thread / executor (i.e., it cannot be executed by a "worker" thread). This means that the "queues" (async stages) have their work dispatched to them as they execute, rather than populating a work queue in advance. This seems standard for an "online" host-dispatching-to-coprocessor model.

@masahi
Member Author

masahi commented Jun 23, 2022

At the commit/wait abstraction level, what if we rename it from "stage" to "queue"?

Yeah, my thinking has been that there is a 1-1 mapping between a stage and a queue. So I have no problem with this change.

To double check, would this kind of proposed usage affect the implementation at all?

Oh, what I meant was that we might want to change the terminology "stage" throughout our existing TIR software pipeline implementation first, not just in the implementation of this proposal, to align with more standard terminology and avoid potential confusion with this proposal.

But reading the discussions more carefully, I'm realizing that your suggestion is not necessarily about changing the existing use of "stage" in the TIR software pipeline, but rather about decoupling asynchrony (commit, wait) from the "stage" in this proposal. As the title of this proposal literally says, the original intention has been to bring asynchrony to the "stage" in the TIR software pipeline. So the current mechanics of commit / wait are naturally tied to the "stage" in the TIR sense, and the proposal / implementation are highly influenced by how the current TIR software pipeline works / is implemented. @Lunderberg rightfully hinted at this when he said: "That the "queue" is produced from the "stage" feels more like an implementation detail of the lowering, rather than something inherent to the commit/wait functionality."

I'll go through discussion comments more carefully today and think about how to incorporate the proposed suggestions. Thank you very much for the detailed feedback so far!

@masahi
Member Author

masahi commented Jun 24, 2022

Summarizing the current situation of the proposal and discussion points so far:

  • The original goal was (and still is) to bring asynchrony into the TIR software pipeline. Two intrinsics were introduced specifically with this goal in mind: rather than tackling the potentially more difficult problem of designing "general async semantics" for TIR, the mechanics of the intrinsics are strongly tied to the software pipeline (in particular, the notion of "stage").

  • But @Lunderberg and @JosephTheOctonaut suggest that the proposed async semantics may not be too far from a "general" one, if we can decouple asynchrony from the "stage".

  • So what we can do instead is first define more general semantics for commit / wait, and build the async pipeline on top of them (rather than the other way around). I really like this way of thinking!

  • The exact semantics of "wait" probably need more care. Currently, it is a bit ambiguous "who" gets to wait (the main thread, or a worker thread?). I was hoping that the current model allows worker-thread waits, but since the only HW I can test on for real right now is CUDA, my thinking has been unintentionally leaning toward the "main thread" model (despite the fact that I gave an example of multiple async stages which I intended to run "autonomously").

  • So far I haven't heard any opposition to adopting the counting-based model. If that holds, I'll proceed with the counting-based one. Whether or not a translation to a token-based one can be done is left for future work, so the translation is outside the scope of this proposal.

@JosephTheOctonaut @Lunderberg @vinx13 @junrushao1994

@masahi
Member Author

masahi commented Jun 24, 2022

Updated the doc to talk about commit / wait in terms of "queue". The change is in https://github.com/masahi/tvm-rfcs/blob/async-pipe/rfcs/0077-async-pipeline.md#making-parallelism-more-explicit-asynchronous-pipeline, after the example.

@masahi
Member Author

masahi commented Jun 27, 2022

I've incorporated @Lunderberg's suggestion #80 (comment) of making async_commit_queue a scope annotation rather than an intrinsic. This makes it clearer which async operations are attached to a given commit_queue.

I think I've addressed all the feedback from @JosephTheOctonaut and @Lunderberg so far. I'm not sure if the "main thread" issue is resolved by now, and I don't know what to do about it otherwise, so I'll leave it as it is for now.

@kparzysz-quic

kparzysz-quic commented Jul 6, 2022

* The original goal was (and still is) to bring asynchrony into the TIR software pipeline. Two intrinsics were introduced specifically with this goal in mind: rather than tackling the potentially more difficult problem of designing "general async semantics" for TIR, the mechanics of the intrinsics are strongly tied to the software pipeline (in particular, the notion of "stage").

The RFC states that the proposed mechanism is deliberately more general than what pipelining itself would require. Was that added after the feedback? I think that adding parallelization mechanisms that are specific to a particular scheduling operation is not the right thing to do, but it seems like the async queues could be used independently of pipelining.

@masahi
Member Author

masahi commented Jul 6, 2022

Was that added after the feedback?

Yes, this text is my attempt to incorporate the feedback I've received so far.

I think that adding parallelization mechanisms that are specific to a particular scheduling operation is not the right thing to do, but it seems like the async queues could be used independently of pipelining.

Yes, this is in agreement with the current direction.

@junrushao
Member

Hey guys, thank you all for the very fruitful and edifying discussion! If all the blocking issues have been addressed, let's explicitly approve this RFC and proceed with upstreaming this feature!

@masahi
Member Author

masahi commented Jul 13, 2022

@JosephTheOctonaut brought up a valid point: initially both commit and wait were TIR intrinsics; later I made async_commit an annotation based on the feedback in #80 (comment). Now it might make more sense to also make async_wait an annotation, in terms of consistency and ease of lowering.

Any thought on this question before merging?

@JosephTheOctonaut

To elaborate: my instinct is that they should both be annotations or both be intrinsics. Based on some preliminary sketching of lowering to other targets, I think the annotation route might be easier, because it lets you lower the entire asynchronous block at once. This can reduce the duplication of information that might be needed when lowering each intrinsic separately. Further, if the backend requires additional transformations of the block, or has a synchronization system that doesn't fit exactly with the commit/wait style, operating at the block level can provide more flexibility.

@masahi
Member Author

masahi commented Jul 13, 2022

Ok @JosephTheOctonaut, thanks for the suggestion. I've updated the wording and pseudocode examples to make async_wait_queue a scope annotation. There are non-trivial changes to the code examples, since async_wait now makes its scope explicit. For example, previously we had

   if i < 1:
      async_wait_group(1, 1)
   else:
      async_wait_group(1, 0)

   D[(i + 14) % 2] = C[(i + 14) % 2] + 1

This is now

   if i < 1:
      with async_wait_group(1, 1):
         D[(i + 14) % 2] = C[(i + 14) % 2] + 1
   else:
      with async_wait_group(1, 0):
         D[(i + 14) % 2] = C[(i + 14) % 2] + 1

@junrushao1994 @vinx13 I think all outstanding issues have been addressed, ready for a final look and merge.

@JosephTheOctonaut

Hmm, I hadn't thought of the duplication issue for branches in wait. Some initial ideas:

  • Would additional unrolling (as with the pipeline prologue and epilogue) help solve the issue? Or would it just create a new branch condition in the longer prologue?
  • Could we put a non-constant expression in the annotation? E.g., async_wait_group(1, 1 if i==0 else 0)?
  • Could we alter semantics around wait usage? Right now, it's guarding the statements in the block it annotates, but it also guards all sequentially later blocks as well, correct? So we could annotate empty blocks, with the subsequent blocks containing the compute statement. (This feels like an abuse of program structure and annotations to me, though.) Something like this:
with block():
  with block():
    T.where(i < 1)
    async_wait_group(1, 1)
  with block():
    T.where(i > 0)
    async_wait_group(1, 0)
  D[(i + 14) % 2] = C[(i + 14) % 2] + 1

@masahi
Member Author

masahi commented Jul 13, 2022

@JosephTheOctonaut You are right, in the example I gave, the if/else is inside the epilogue, so unrolling takes care of generating only one of the branches.

The second suggestion also works: right now I'm generating an IfThenElse TIR stmt when a predicate is non-trivial, but we could use builtin::if_then_else as the equivalent of a ternary operator. I need to check whether builtin::if_then_else can be eliminated after unrolling, though.
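
For concreteness, a sketch of that alternative (assuming T.if_then_else, TVMScript's spelling of builtin::if_then_else; whether unrolling can fold it away is exactly the open question above):

with async_wait_group(1, T.if_then_else(i < 1, 1, 0)):
    D[(i + 14) % 2] = C[(i + 14) % 2] + 1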

Right now, it's guarding the statements in the block it annotates, but it also guards all sequentially later blocks as well, correct?

Yes, this is correct.

Contributor

@csullivan csullivan left a comment

As @junrushao1994 mentioned earlier this week, it appears we have reached consensus and there have been no more mentions of blocking concerns. Thus let us approve and merge RFC #80.

Many thanks to @masahi @vinx13 @junrushao1994 @JosephTheOctonaut @Lunderberg @Hzfengsy @wrongtest-intellif @kparzysz-quic and all others involved for the great RFC and updates that came from these fruitful discussions.
