
SRAM: repipeline the TLRAM into a 3 cycle RMW state machine #2582

Merged
merged 1 commit into master from sram-rmw-pipeline on Aug 13, 2020

Conversation

@terpstra (Contributor) commented on Jul 31, 2020

Here is a picture of the change to the pipeline:
https://app.lucidchart.com/invitations/accept/da44b89c-a93c-45e9-ba5e-cb6f9140d84e

Compared to the old pipeline, occupancy is increased from 2 cycles to 3 for:

  • atomics
  • sub-ECC-granularity writes
  • repaired ECC values

In exchange for this occupancy increase, a new register (REG) was added:
sram data output => REG => ecc-correction => ALU => sram write setup
This path was sufficiently long that it limited fMAX on many designs.
In designs without ECC and without atomics, this pipeline is optimized away.
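
As a rough, self-contained Chisel sketch of where the new register sits (eccCorrect and rmwAlu below are placeholders standing in for the real ECC repair and atomic ALU; this is not the actual TLRAM source):

import chisel3._

class RmwPipeSketch(w: Int) extends Module {
  val io = IO(new Bundle {
    val sramOut  = Input(UInt(w.W))  // raw SRAM read data
    val writeVal = Output(UInt(w.W)) // data driven back into the SRAM write port
  })
  private def eccCorrect(x: UInt): UInt = x // placeholder: ECC repair
  private def rmwAlu(x: UInt): UInt = ~x    // placeholder: atomic/RMW ALU
  // The new REG breaks the long combinational path:
  //   sram data output => REG => ecc-correction => ALU => sram write setup
  val reg = RegNext(io.sramOut)
  io.writeVal := rmwAlu(eccCorrect(reg))
}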

Compared to the old pipeline, response latency is unchanged (by default) for:

  • reads (1)
  • atomics (1)
  • writes (1)
  • ECC-repaired reads (2)
  • ECC-repaired atomics (2)

Added a knob (sramReg) to set latency for all operations to 2 (a sketch of both modes follows these lists).
With this knob disabled (the default), as in the old pipeline:

  • output data can flow uncorrected from the SRAM
  • output valid depends on correct ECC decode of SRAM output

With the knob enabled:

  • data flows from an ECC correction fed by registers
  • valid is a register
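
A minimal sketch of the two modes, with a trivial stand-in ECC so it elaborates; the signal names here are illustrative, not the actual TLRAM internals:

import chisel3._

class SramRegSketch(w: Int, sramReg: Boolean) extends Module {
  val io = IO(new Bundle {
    val addr  = Input(UInt(8.W))
    val ren   = Input(Bool())
    val valid = Output(Bool())
    val data  = Output(UInt(w.W))
  })
  val mem      = SyncReadMem(256, UInt(w.W))
  val rawData  = mem.read(io.addr, io.ren)
  val rawValid = RegNext(io.ren, false.B)
  def eccOk(x: UInt): Bool = true.B  // stand-in: ECC decode reports no repair needed
  def eccCorrect(x: UInt): UInt = x  // stand-in: ECC repair
  if (sramReg) {
    // knob enabled: valid is a register; data flows from a correction fed by registers
    io.valid := RegNext(rawValid, false.B)
    io.data  := eccCorrect(RegNext(rawData))
  } else {
    // knob disabled (default): uncorrected data flows straight from the SRAM;
    // valid depends on a correct ECC decode of the SRAM output
    io.valid := rawValid && eccOk(rawData)
    io.data  := rawData
  }
}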

Type of change: other enhancement
Impact: API addition (no impact on existing code)
Development Phase: implementation

Release Notes
Improved cycle time for designs involving TLRAM.

@terpstra terpstra added the WIP label Jul 31, 2020
@terpstra terpstra removed the WIP label Jul 31, 2020
@terpstra changed the title from "[WIP] SRAM: repipeline the TLRAM into a 3 cycle RMW state machine" to "SRAM: repipeline the TLRAM into a 3 cycle RMW state machine" on Jul 31, 2020
@terpstra (Contributor, Author) commented:

There were some follow-up changes I wanted to make that put down a combination of Xbar + TLRAM, but I think keeping those separate is probably a good idea.

@@ -23,6 +24,7 @@ class TLRAM(
atomics: Boolean = false,
beatBytes: Int = 4,
ecc: ECCParams = ECCParams(),
sramReg: Boolean = false, // drive SRAM data output directly into a register => 1 cycle longer response
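
For reference, a hedged usage fragment for the new parameter (assumes a diplomacy context with an implicit Parameters in scope; the address and widths below are made up for illustration):

// Hypothetical instantiation; only sramReg is new in this PR.
val ram = LazyModule(new TLRAM(
  address   = AddressSet(0x80000000L, 0xfff),
  atomics   = true,
  beatBytes = 8,
  sramReg   = true // drive SRAM data output into a register => 1 cycle longer response
))
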
Member commented:
pipeline instead of sramReg?

@terpstra (Contributor, Author) replied:

I'm not sure that name is much better. What is actually happening is that I remove the 'ecc-ok' fast path.

BTW, it has been suggested to me offline that perhaps in the pipelined case we should increase ECC repair latency to 3. Thus, we could take the output data uncorrected from registers. Of course, the valid signal would still need to include a suppression for the correction-required case. Do you think this change is worth it?

@terpstra (Contributor, Author) commented on Aug 1, 2020:

I think the trade-off is between latency in valid vs. data. The current design has a flop for valid and correction on the data. The proposal has flops for data and detection on valid. This is the same trade-off we made for the old and new 'fast path'. If it was better there, why is it not also better here?

d_raw_data := mem.read(addr, ren)
when (wen) { mem.write(addr, coded, sel) }
val index = Cat(mask.zip((addr >> log2Ceil(beatBytes)).asBools).filter(_._1).map(_._2).reverse)
r_raw_data := mem.read(index, ren) holdUnless RegNext(ren)
Member commented:

Not that it matters much, but a while back I defined mem.readAndHold(index, ren) to summarize this pattern
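
A sketch of that equivalence, assuming the readAndHold helper from freechips.rocketchip.util that the reviewer refers to (the module and sizes are illustrative):

import chisel3._
import freechips.rocketchip.util._

class ReadAndHoldSketch extends Module {
  val io = IO(new Bundle {
    val index = Input(UInt(8.W))
    val ren   = Input(Bool())
    val data  = Output(UInt(32.W))
  })
  val mem = SyncReadMem(256, UInt(32.W))
  // Long form, as written in this PR:
  //   io.data := mem.read(io.index, io.ren) holdUnless RegNext(io.ren)
  // Equivalent helper the reviewer mentions:
  io.data := mem.readAndHold(io.index, io.ren)
}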

* When stage D needs to perform a write (AMO, sub-ECC write, or ECC correction):
* - there is a WaW or WaR hazard vs. the operation in stage R
* - for sub-ECC writes and atomics, we ensure stage R has a bubble
* - for ECC correction, we cause stage R to be replayed (and reject stage A twice)
Member commented:

What's the rationale for not using replay for both cases? If using a bubble instead of a replay would improve performance, I'd understand. But since there's a structural hazard from the D stage on the following cycle anyway, is the bubble actually better than the replay?

@terpstra (Contributor, Author) replied:

You're right that the performance in terms of latency and throughput would be the same. However, if I replay, it requires reading the SRAM twice. I assumed that sub-ECC-granule writes might be common, so paying 2x power for them was unwise.
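
A minimal sketch of the bubble approach described above; all names are hypothetical and this shows only the general technique, not the TLRAM code:

import chisel3._

class BubbleSketch extends Module {
  val io = IO(new Bundle {
    val reqValid    = Input(Bool())
    val dNeedsWrite = Input(Bool()) // stage D must write back (AMO / sub-ECC write)
    val reqReady    = Output(Bool())
    val issueRead   = Output(Bool())
  })
  // Bubble: refuse a new request while stage D owns the SRAM port, so the
  // conflicting operation reads the SRAM exactly once.
  io.reqReady  := !io.dNeedsWrite
  io.issueRead := io.reqValid && io.reqReady
  // A replay would instead accept the request and re-read the SRAM after the
  // hazard clears: same latency/throughput here, but two SRAM reads (2x power).
}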

@terpstra terpstra merged commit 75823b3 into master Aug 13, 2020
@terpstra terpstra deleted the sram-rmw-pipeline branch August 13, 2020 05:18