Critical bug in flow control of dm_csrs #171

Scheremo · 2024-08-15T14:46:31Z

This PR fixes a longstanding bug that leads to crashing the JTAG Tap when polling on end of computation (EOC) registers due to the DM_CSRS signaling ready, and the SBCS.sbbusy being deasserted when a request arrives, while the SBCS is still busy.

We found this while debugging in Chimera; even though the JTAG Test methods we use read the SBCS.sbbusy bit before re-issuing a read request, we observe an sbbusyerror that we cannot recover from. This PR fixes this issue, by deasserting the req_ready. My guess is that the timing of sbbusyerror is somehow delayed with respect to the request, but I'll have to dig a bit deeper.

@Lore0599 @adimauro-iis

andreaskurth · 2024-08-16T09:14:28Z

Hi @Scheremo, thanks for raising this. Are you sure the problem you described ("crashing the JTAG TAP when polling on end of computation (EOC) registers") is caused by this mechanism?

My current understanding is that this is working as intended -- without the change you propose: In dm_csrs, when a request (dmi_req_valid_i) to write SBCS or to read or write data via SBA arrives while the dm_sba module is busy (sbbusy_i), the response is DTM_BUSY and the sbbusyerror status bit gets set. The requester is expected to react to such a response by waiting for some time to let dm_sba become non-busy (which must happen eventually as long as the network it's attached to isn't deadlocked) and by then clearing the sbbusyerror bit.

Introducing backpressure on DMI requests, which the change you suggest would do, isn't the correct way of handling busyness of the SBA module, as far as I interpret the RISC-V Debug Spec (p. 33).

Am I missing / misunderstanding something?

Scheremo · 2024-08-16T11:02:24Z

HI @andreaskurth, sorry for the very short initial description; I personally had to dig a bit deeper on this issue, and I'm not fully confident this fix is appropriate or spec compliant; from my current limited understanding the issue is that sbbusy does not rise "immediately". We poll the sbcs register from our testbench and only issue new requests once sbbusy is read as low, as asked for by the spec; this looks like it's working as intended on our end, but we observe an sbbusyerror after reading back sbbusy low.

From the spec's description of sbcs

"This bit goes high immediately when a read or write is requested for any reason, and does not go low until the access is fully completed."

I wouldn't expect the sbbusyerror condition to occur when we do not access the registers while sbbusy is high.

However, after digging through the open issues on this repo, I think this is related to this issue. I can't reproduce the original error condition on my end once I reduce the JTAG's clock frequency substantially. I'll check in with @Lore0599 later and see whether the test case in which we found this error originally resolves itself the same way.

andreaskurth · 2024-08-16T11:31:18Z

Instead of relying on sbbusy before you issue a request, you could just issue the request and determine from the response whether it was successful or has to be retried later (after sbbusyerror has been cleared). That's probably more robust, right?

Scheremo · 2024-08-16T12:13:57Z

Instead of relying on sbbusy before you issue a request, you could just issue the request and determine from the response whether it was successful or has to be retried later (after sbbusyerror has been cleared). That's probably more robust, right?

You're probably right! In either case, relaxing the JTAG frequency fixes our original issue - I think we can close this PR, in a realistic setting we can set the SoC's clock fast enough / the JTAG's clock slow enough that we don't run into this condition. I also saw you already have a note on clock frequency constraints in the README.md, so I think no further action is needed. Thanks for your time!

Fix failing ready condiiton

5b21f35

Scheremo requested review from bluewww and micprog August 15, 2024 14:47

Scheremo closed this Aug 16, 2024

Scheremo mentioned this pull request Aug 16, 2024

Revert Commit #69cc1bc pulp-platform/chimera#27

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Critical bug in flow control of dm_csrs #171

Critical bug in flow control of dm_csrs #171

Scheremo commented Aug 15, 2024 •

edited

Loading

andreaskurth commented Aug 16, 2024 •

edited

Loading

Scheremo commented Aug 16, 2024 •

edited

Loading

andreaskurth commented Aug 16, 2024

Scheremo commented Aug 16, 2024 •

edited

Loading

Critical bug in flow control of dm_csrs #171

Critical bug in flow control of dm_csrs #171

Conversation

Scheremo commented Aug 15, 2024 • edited Loading

andreaskurth commented Aug 16, 2024 • edited Loading

Scheremo commented Aug 16, 2024 • edited Loading

andreaskurth commented Aug 16, 2024

Scheremo commented Aug 16, 2024 • edited Loading

Scheremo commented Aug 15, 2024 •

edited

Loading

andreaskurth commented Aug 16, 2024 •

edited

Loading

Scheremo commented Aug 16, 2024 •

edited

Loading

Scheremo commented Aug 16, 2024 •

edited

Loading