Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

race condition in SimDTM.v under Verilator; related to "taddr-4" in fesvr/dtm.cc, write_chunk #55

Open
smcpeak opened this issue Mar 2, 2020 · 0 comments

Comments

@smcpeak
Copy link

smcpeak commented Mar 2, 2020

vsrc/SimDTM.v contains this call to debug_tick:

      __exit = debug_tick(
        __debug_req_valid,
        __debug_req_ready,
        __debug_req_bits_addr,
        __debug_req_bits_op,
        __debug_req_bits_data,
        __debug_resp_valid,
        __debug_resp_ready,
        __debug_resp_bits_resp,
        __debug_resp_bits_data
      );

This reads inputs, such as __debug_req_valid, that may depend on its own outputs, such as __debug_resp_valid. This makes the circuit behavior undefined.

There are existing code lines that attempt to solve this by introducing delays, for example:

  assign #0.1 debug_req_valid = __debug_req_valid;

but the delay specifier ("#0.1") has no effect under Verilator.

One way to make this race condition visible is to add a diagnostic printout to debug.scala:

diff --git a/src/common/debug.scala b/src/common/debug.scala
index ead55bc..b21d391 100644
--- a/src/common/debug.scala
+++ b/src/common/debug.scala
@@ -169,6 +169,32 @@ class DebugModule(implicit val conf: SodorConfiguration) extends Module {
   dmstatus.allhalted := dmcontrol.haltreq
   dmstatus.allrunning := dmcontrol.resumereq
   io.dcpath.halt := dmstatus.allhalted && !dmstatus.allrunning
+
+  // Stronger condition.
+  val outercond =
+    ( (io.dmi.req.bits.op === DMConsts.dmi_OP_WRITE) &&
+      (decoded_addr(DMI_RegAddrs.DMI_COMMAND)) &&
+      io.dmi.req.valid );
+
+  // Weaker condition.  I call this "inner" because its two elements are
+  // syntactically contained inside the "outer" condition.
+  val innercond = 
+    ( (decoded_addr(DMI_RegAddrs.DMI_COMMAND)) &&
+      io.dmi.req.valid );
+
+  // This creates an extra use of 'innercond' that changes the way
+  // Verilator generates code.  It is needed to expose the bug.  The
+  // value also has to be used somewhere, which is why it appears in
+  // the "bug happened" printout below.
+  val anothercond = innercond && dmcontrol.haltreq;
+
+  // Test for the impossible combination.
+  when (outercond) {
+    when (!innercond) {
+      printf("bug happened! anothercond=%d\n", anothercond);
+    }
+  }
+
   when (io.dmi.req.bits.op === DMConsts.dmi_OP_WRITE){
     when((decoded_addr(DMI_RegAddrs.DMI_ABSTRACTCS)) && io.dmi.req.valid) {
       val tempabstractcs = wdata.asTypeOf(new ABSTRACTCSFields())

With the above change, we see printouts indicating the impossible condition happens:

emulator/rv32_1stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-simple
Instantiated DTM.
bug happened! anothercond=0
bug happened! anothercond=0
bug happened! anothercond=0
bug happened! anothercond=0
Cyc=          0 [1] pc=[80000000] W[r 0=80000004][1] Op1=[r 0][00000000] Op2=[r12][00000000] inst=[04c0006f]  J  DASM(04c0006f)
[...]

At first I thought this was a bug in Verilator and filed verilator/verilator#2177 for it, but was informed about the race condition in SimDTM.v, so I'm reporting it here.

To fix the bug, I propose (following the advice in the linked issue):

diff --git a/vsrc/SimDTM.v b/vsrc/SimDTM.v
index a5f8c23..1d968d6 100644
--- a/vsrc/SimDTM.v
+++ b/vsrc/SimDTM.v
@@ -46,12 +46,18 @@ module SimDTM(
   bit __debug_resp_ready;
   int __exit;
 
-  assign #0.1 debug_req_valid = __debug_req_valid;
-  assign #0.1 debug_req_bits_addr = __debug_req_bits_addr[6:0];
-  assign #0.1 debug_req_bits_op = __debug_req_bits_op[1:0];
-  assign #0.1 debug_req_bits_data = __debug_req_bits_data[31:0];
-  assign #0.1 debug_resp_ready = __debug_resp_ready;
-  assign #0.1 exit = __exit;
+  // Copy data from 'debug_tick' on the negative clock edge so it does
+  // not happen at the same time as a positive clock edge, which would
+  // create a dependency cycle.
+  always @(negedge clk)
+  begin
+    debug_req_valid     <= __debug_req_valid;
+    debug_req_bits_addr <= __debug_req_bits_addr[6:0];
+    debug_req_bits_op   <= __debug_req_bits_op[1:0];
+    debug_req_bits_data <= __debug_req_bits_data[31:0];
+    debug_resp_ready    <= __debug_resp_ready;
+    exit                <= __exit;
+  end
 
   always @(posedge clk)
   begin

That change eliminates the observed misbehavior in debug.scala. However, tests then fail. For example:

emulator/rv32_1stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-simple
Instantiated DTM.
Cyc=          0 [1] pc=[80000000] W[r30=00000002][1] Op1=[r 0][00000000] Op2=[r 2][00000000] inst=[34202f73]     DASM(34202f73)
[...]
Cyc=      29714 [1] pc=[00000060] W[r 5=00000060][1] Op1=[r 0][00000000] Op2=[r 0][00000060] inst=[00000297]     DASM(00000297)
*** FAILED *** (timeout) after 30000 cycles

The problem now is that the program does not get loaded correctly. The first instruction should be 0x04c0006f, not 0x34202f73, which is the second instruction. I tracked this down to riscv-isa-sim/fesvr/dtm.cc, write_chunk, which has this line:

  write(DMI_SBADDRESS0, taddr-4);

The problem is the mysterious -4. There is nothing in the RISCV debug specification to suggest that should be needed (on the contrary, it should not), and there isn't anything similar in read_chunk nor in the non-Sodor implementation of write_chunk. I speculate the -4 was added because it compensated for an effect of the race condition I'm describing. If I remove it, and also fix the race as described above, then programs work again.

Unfortunately, even then not all of the tests pass. The first to fail is rv32_3stage, rv32ui-p-xori:

emulator/rv32_3stage$ ./emulator +max-cycles=30000 ../../install/riscv-tests/rv32ui-p-xori
[...]
Cyc=        280 [1] pc=[80000048] W[r 0=8000004c][1] Op1=[r31][0000000b] Op2=[r25][00000000] inst=[ff9ff06f]     DASM(ff9ff06f)
terminate called after throwing an instance of 'std::runtime_error'
  what():  bad syscall #342274048501219439
Exit 255

I have not attempted to debug that failure yet. I figured I'd file what I have and get some feedback before going further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant