Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Critical bug in flow control of dm_csrs #171

Closed
wants to merge 1 commit into from
Closed

Conversation

Scheremo
Copy link

@Scheremo Scheremo commented Aug 15, 2024

This PR fixes a longstanding bug that leads to crashing the JTAG Tap when polling on end of computation (EOC) registers due to the DM_CSRS signaling ready, and the SBCS.sbbusy being deasserted when a request arrives, while the SBCS is still busy.

We found this while debugging in Chimera; even though the JTAG Test methods we use read the SBCS.sbbusy bit before re-issuing a read request, we observe an sbbusyerror that we cannot recover from. This PR fixes this issue, by deasserting the req_ready. My guess is that the timing of sbbusyerror is somehow delayed with respect to the request, but I'll have to dig a bit deeper.

@Lore0599 @adimauro-iis

@Scheremo Scheremo requested review from bluewww and micprog August 15, 2024 14:47
@andreaskurth
Copy link
Contributor

andreaskurth commented Aug 16, 2024

Hi @Scheremo, thanks for raising this. Are you sure the problem you described ("crashing the JTAG TAP when polling on end of computation (EOC) registers") is caused by this mechanism?

My current understanding is that this is working as intended -- without the change you propose: In dm_csrs, when a request (dmi_req_valid_i) to write SBCS or to read or write data via SBA arrives while the dm_sba module is busy (sbbusy_i), the response is DTM_BUSY and the sbbusyerror status bit gets set. The requester is expected to react to such a response by waiting for some time to let dm_sba become non-busy (which must happen eventually as long as the network it's attached to isn't deadlocked) and by then clearing the sbbusyerror bit.

Introducing backpressure on DMI requests, which the change you suggest would do, isn't the correct way of handling busyness of the SBA module, as far as I interpret the RISC-V Debug Spec (p. 33).

Am I missing / misunderstanding something?

@Scheremo
Copy link
Author

Scheremo commented Aug 16, 2024

HI @andreaskurth, sorry for the very short initial description; I personally had to dig a bit deeper on this issue, and I'm not fully confident this fix is appropriate or spec compliant; from my current limited understanding the issue is that sbbusy does not rise "immediately". We poll the sbcs register from our testbench and only issue new requests once sbbusy is read as low, as asked for by the spec; this looks like it's working as intended on our end, but we observe an sbbusyerror after reading back sbbusy low.

From the spec's description of sbcs

"This bit goes high immediately when a read or write is requested for any reason, and does not go low until the access is fully completed."

I wouldn't expect the sbbusyerror condition to occur when we do not access the registers while sbbusy is high.

However, after digging through the open issues on this repo, I think this is related to this issue. I can't reproduce the original error condition on my end once I reduce the JTAG's clock frequency substantially. I'll check in with @Lore0599 later and see whether the test case in which we found this error originally resolves itself the same way.

@andreaskurth
Copy link
Contributor

Instead of relying on sbbusy before you issue a request, you could just issue the request and determine from the response whether it was successful or has to be retried later (after sbbusyerror has been cleared). That's probably more robust, right?

@Scheremo
Copy link
Author

Scheremo commented Aug 16, 2024

Instead of relying on sbbusy before you issue a request, you could just issue the request and determine from the response whether it was successful or has to be retried later (after sbbusyerror has been cleared). That's probably more robust, right?

You're probably right! In either case, relaxing the JTAG frequency fixes our original issue - I think we can close this PR, in a realistic setting we can set the SoC's clock fast enough / the JTAG's clock slow enough that we don't run into this condition. I also saw you already have a note on clock frequency constraints in the README.md, so I think no further action is needed. Thanks for your time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants