[Arc] Improve LowerState to never produce read-after-write conflicts #7703

fabianschuiki · 2024-10-14T02:42:08Z

This is a complete rewrite of the LowerState pass that makes the LegalizeStateUpdate pass obsolete.

The old implementation of LowerState produces arc.models that still contain read-after-write conflicts. This primarily happens because the pass simply emits arc.state_write operations that write updated values to simulation memory for each arc.state, and any user of arc.state would use an arc.state_read operation to retrieve the original value of the state before any writes occurred. Memories are similar. The Arc dialect considers arc.state_write and arc.memory_write operations to be deferred writes until the LegalizeStateUpdate pass runs, at which point they become immediate since the legalization inserts the necessary temporaries to resolve the read-after-write conflicts.

The previous implementation would also not handle state-to-output and state-to-side-effecting-op propagation paths correctly. When a model's eval function is called, registers are updated to their new value, and any outputs that combinatorially depend on those new values should also immediately update. Similarly, operations such as asserts or debug trackers should observe new values for states immediately after they have been written. However, since all writes are considered deferred, there is no way for LowerState to produce a mixture of operations that depend on a register's old state (because they are used to compute a register's new state), and on a new state because they are combinatorially derived values.

This new implementation of LowerState completely avoids read-after-write conflicts. It does this by changing the way modules are lowered in two ways:

Phases: The pass tracks in which phase of the simulation lifecycle a value is needed and allows for operations to have different lowerings in different phases. An arc.state operation for example requires its inputs, enable, and reset to be computed based on the old value they had, i.e. the value the end of the previous call to the model's eval function. The clock however has to be computed based on the new value it has in the current call to eval. Therefore, the ops defining the inputs, enable, and reset are lowered in the old phase, while the ops defining the clock are lowered in the new phase. The arc.state op lowering will then write its new value to simulation storage.

This phase tracking allows registers to be used as the clock for other registers: since the clocks require new values, registers serving as clock to others are lowered first, such that the dependent registers can immediately react to the updated clock. It also allows for module outputs and side-effecting ops based on arc.states to be scheduled after the states have been updated, since they depend on the state's new value.

The pass also covers the situation where an operation depends on a module input and a state, and feeds into a module output as well as another state. In this situation that operation has to be lowered twice: once for the old phase to serve as input to the subsequent state, and once for the new phase to compute the new module output.

In addition to the old and new phases representing the previous and current call to eval, the pass also models an initial and final phase. These are used for seq.initial and llhd.final ops, and in order to compute the initial values for states. If an arc.state op has an initial value operand it is lowered in the initial phase. Similarly for the ops in llhd.final. The pass places all ops lowered in the initial and final phases into corresponding arc.initial and arc.final ops. At a later point we may want to generate the *_initial, *_eval, and *_final functions directly.

No more clock trees: The new implementation also no longer generates arc.clock_tree and arc.passthrough operations. These were a holdover from the early days of the Arc dialect, where no eval function would be generated. Instead, the user was required to directly call clock functions. This was never able to model clocks changing at the exact same moment, or having clocks derived from registers and other combinatorial operations. Since Arc has since switched to generating an eval function that can accurately interleave the effects of different clocks, grouping ops by clock tree is no longer needed. In fact, removing the clock tree ops allows for the pass to more efficiently interleave the operations from different clock domains.

The Rocket core in the circt/arc-tests repository still works with this new implementation of LowerState. In combination with the MergeIfs pass the performance stays the same. (Actually 10% speedup on my workstation.)

I have renamed the implementation and test files to make the git diffs easier to read. The names will be changed back in a follow-up commit.

Fixes #6390.

uenoku · 2024-10-15T14:30:44Z

integration_test/arcilator/JIT/reg.mlir

@@ -41,6 +41,7 @@ func.func @main() {
  %step = arith.constant 1 : index

  arc.sim.instantiate @counter as %model {
+    arc.sim.step %model : !arc.sim.instance<@counter>


Is this change required to run initialization?

Yeah. In its current form, a call to the initial function only executes seq.initial blocks and initializes the storage for states. It does not propagate those changes to outputs and side-effecting ops though. I wasn't sure if the initial function should also call eval once after initialization (potentially triggering first state updates and other things to run immediately), or if the user might want more fine grained control over when the first eval happens. WDYT?

Yeah I see that makes sense. I agree that we want to avoid implicitly calling eval from initial

maerhart

Thanks a lot for implementing all this and sorry for my late review!

This makes a lot of sense to me because the state legalization has always been a pain point, is quite tricky to do properly for memories, and optimizations in between these two passes will probably do at least some degree of memory operation analysis anyway to avoid insertion of absurd amount of temporaries by state legalization.

I wonder if we can make this pass more modular in some way in the future such that arcilator can support new CIRCT core for frontend ops without having to modify this pass. I was thinking maybe a Op or dialect interface could work?

This also fixes #6390

maerhart · 2024-10-15T11:21:45Z

lib/Dialect/Arc/ArcTypes.cpp

+  if (llvm::isa<seq::ClockType>(innerType))
+    return success();


Unnecessary since it is now handled in getBitWidth?

Not in hw::getBitWidth though 😢

Ah right, you're calling hw::getBitWidth. Why not StateType::getBitWidth though?

Because that is not available at this point 😢 (it takes the type from the this parameter instead of a function argument). I should just factor this out into a static helper.

Let me fix that post-merge.

lib/Dialect/Arc/Transforms/LowerStateRewrite.cpp

maerhart · 2024-10-23T14:48:31Z

lib/Dialect/Arc/Transforms/LowerStateRewrite.cpp

+  // HACK: If the result comes from an op that has a "names" attribute, use that
+  // as a name for the allocation. This should no longer be necessary once we
+  // properly support the Debug dialect.
+  if (isa<StateOp, sim::DPICallOp>(result.getOwner()))


You're already skipping the TapOp here because it'd be too much work to add it, and we want to move to using the debug dialect anyway? Don't we rely on it when tracing is enabled in arc-tests?

The TapOp creates an AllocStateOp directly further down, where it also sets the name properly. This code up here is only for state-like ops that have to create an allocation when their def or a use is encountered for the first time.

lib/Dialect/Arc/Transforms/LowerStateRewrite.cpp

This is a complete rewrite of the `LowerState` pass that makes the `LegalizeStateUpdate` pass obsolete. The old implementation of `LowerState` produces `arc.model`s that still contain read-after-write conflicts. This primarily happens because the pass simply emits `arc.state_write` operations that write updated values to simulation memory for each `arc.state`, and any user of `arc.state` would use an `arc.state_read` operation to retrieve the original value of the state before any writes occurred. Memories are similar. The Arc dialect considers `arc.state_write` and `arc.memory_write` operations to be _deferred_ writes until the `LegalizeStateUpdate` pass runs, at which point they become _immediate_ since the legalization inserts the necessary temporaries to resolve the read-after-write conflicts. The previous implementation would also not handle state-to-output and state-to-side-effecting-op propagation paths correctly. When a model's eval function is called, registers are updated to their new value, and any outputs that combinatorially depend on those new values should also immediately update. Similarly, operations such as asserts or debug trackers should observe new values for states immediately after they have been written. However, since all writes are considered deferred, there is no way for `LowerState` to produce a mixture of operations that depend on a register's _old_ state (because they are used to compute a register's new state), and on a _new_ state because they are combinatorially derived values. This new implementation of `LowerState` completely avoids read-after-write conflicts. It does this by changing the way modules are lowered in two ways: **Phases:** The pass tracks in which _phase_ of the simulation lifecycle a value is needed and allows for operations to have different lowerings in different phases. An `arc.state` operation for example requires its inputs, enable, and reset to be computed based on the _old_ value they had, i.e. the value the end of the previous call to the model's eval function. The clock however has to be computed based on the _new_ value it has in the current call to eval. Therefore, the ops defining the inputs, enable, and reset are lowered in the _old_ phase, while the ops defining the clock are lowered in the _new_ phase. The `arc.state` op lowering will then write its _new_ value to simulation storage. This phase tracking allows registers to be used as the clock for other registers: since the clocks require _new_ values, registers serving as clock to others are lowered first, such that the dependent registers can immediately react to the updated clock. It also allows for module outputs and side-effecting ops based on `arc.state`s to be scheduled after the states have been updated, since they depend on the state's _new_ value. The pass also covers the situation where an operation depends on a module input and a state, and feeds into a module output as well as another state. In this situation that operation has to be lowered twice: once for the _old_ phase to serve as input to the subsequent state, and once for the _new_ phase to compute the new module output. In addition to the _old_ and _new_ phases representing the previous and current call to eval, the pass also models an _initial_ and _final_ phase. These are used for `seq.initial` and `llhd.final` ops, and in order to compute the initial values for states. If an `arc.state` op has an initial value operand it is lowered in the _initial_ phase. Similarly for the ops in `llhd.final`. The pass places all ops lowered in the initial and final phases into corresponding `arc.initial` and `arc.final` ops. At a later point we may want to generate the `*_initial`, `*_eval`, and `*_final` functions directly. **No more clock trees:** The new implementation also no longer generates `arc.clock_tree` and `arc.passthrough` operations. These were a holdover from the early days of the Arc dialect, where no eval function would be generated. Instead, the user was required to directly call clock functions. This was never able to model clocks changing at the exact same moment, or having clocks derived from registers and other combinatorial operations. Since Arc has since switched to generating an eval function that can accurately interleave the effects of different clocks, grouping ops by clock tree is no longer needed. In fact, removing the clock tree ops allows for the pass to more efficiently interleave the operations from different clock domains. The Rocket core in the circt/arc-tests repository still works with this new implementation of LowerState. In combination with the MergeIfs pass the performance stays the same. I have renamed the implementation and test files to make the git diffs easier to read. The names will be changed back in a follow-up commit.

fabianschuiki added the Arc Involving the `arc` dialect label Oct 14, 2024

fabianschuiki requested review from uenoku and dtzSiFive October 14, 2024 02:42

fabianschuiki requested a review from maerhart as a code owner October 14, 2024 02:42

fabianschuiki force-pushed the fschuiki/arc-merge-ifs branch from ab552e4 to e0068a6 Compare October 14, 2024 16:59

fabianschuiki force-pushed the fschuiki/arc-new-lower-state branch 2 times, most recently from e648462 to 821d12f Compare October 14, 2024 17:40

fzi-hielscher mentioned this pull request Oct 14, 2024

[arcilator] Add clock divider integration test #7705

Merged

uenoku reviewed Oct 15, 2024

View reviewed changes

fabianschuiki force-pushed the fschuiki/arc-new-lower-state branch from 821d12f to 2b87d2e Compare October 15, 2024 16:45

fabianschuiki force-pushed the fschuiki/arc-merge-ifs branch from e0068a6 to 46a9297 Compare October 15, 2024 16:45

fabianschuiki force-pushed the fschuiki/arc-new-lower-state branch from 2b87d2e to 8c64305 Compare October 15, 2024 17:18

fabianschuiki force-pushed the fschuiki/arc-merge-ifs branch from 46a9297 to ded454f Compare October 15, 2024 17:18

Base automatically changed from fschuiki/arc-merge-ifs to main October 15, 2024 18:38

fabianschuiki force-pushed the fschuiki/arc-new-lower-state branch 2 times, most recently from ca2369f to 401c3cf Compare October 21, 2024 18:46

fabianschuiki mentioned this pull request Oct 24, 2024

[Arc] Automatic module partition #7650

Merged

maerhart approved these changes Oct 25, 2024

View reviewed changes

fabianschuiki force-pushed the fschuiki/arc-new-lower-state branch from 401c3cf to c0246c2 Compare October 28, 2024 21:36

fabianschuiki merged commit 3181b03 into main Oct 28, 2024
4 checks passed

fabianschuiki deleted the fschuiki/arc-new-lower-state branch October 28, 2024 21:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Arc] Improve LowerState to never produce read-after-write conflicts #7703

[Arc] Improve LowerState to never produce read-after-write conflicts #7703

fabianschuiki commented Oct 14, 2024 •

edited

Loading

uenoku Oct 15, 2024

fabianschuiki Oct 15, 2024

uenoku Oct 17, 2024

maerhart left a comment

maerhart Oct 15, 2024

fabianschuiki Oct 28, 2024

maerhart Oct 28, 2024

fabianschuiki Oct 28, 2024

fabianschuiki Oct 28, 2024

maerhart Oct 23, 2024

fabianschuiki Oct 28, 2024

[Arc] Improve LowerState to never produce read-after-write conflicts #7703

[Arc] Improve LowerState to never produce read-after-write conflicts #7703

Conversation

fabianschuiki commented Oct 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maerhart left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fabianschuiki commented Oct 14, 2024 •

edited

Loading