Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect handling of writeback value in multicycle instructions #10

Open
gs2git opened this issue Nov 7, 2024 · 1 comment
Open

Incorrect handling of writeback value in multicycle instructions #10

gs2git opened this issue Nov 7, 2024 · 1 comment

Comments

@gs2git
Copy link

gs2git commented Nov 7, 2024

Potential Issue with Multicycle Instruction Handling in cv32e40px Core

Summary

When a custom CV-X-IF instruction has a writeback value (write to rd), if the instruction takes more than one clock cycle to raise the result_valid flag, cv32e40px writes an incorrect value to rd.
This doesn't happen in cv32e40x tested under the same conditions.

Issue Description

I am encountering unexpected behavior when executing multiple multicycle instructions with writeback on the cv32e40px core. In contrast, the cv32e40x core produces the expected results under the same conditions. This observation was made while using a co-processor wrapper based on the CV-X-IF standard (v0.2.0) to enable custom instructions. To reproduce the issue, I introduced a delay to the result in the FPU RTL and created a test that computes values on the FPU, then moves them to X registers, followed by sequential stores that reuse the same X register.

Testing and Findings

It appears that, during multicycle instructions with writeback, the cv32e40px core may not be coordinating the results as anticipated. The core seems to perform stores before the results for the current instruction are available, instead of waiting for the valid data. Meanwhile, the co-processor appears to respond correctly, activating result_valid only when the result data is ready.

To reproduce this behavior, I introduced a delay in the RTL of the FPU (fpu_ss.sv), adding a 5-cycle delay pipeline to x_result_valid_o and x_result_o. Below is the relevant code snippet with the added modifications. Please substitute the existing Result Interface Signals section in the fpu_ss.sv file with the Result Interface Logic (Pre-Delay) below:

  // Define delay pipeline depth and pipeline arrays
  localparam int PIPE_DEPTH = 5;  // Number of cycles to delay the result output
  x_result_t x_result_delay_pipe [PIPE_DEPTH-1:0]; // Pipeline array for result data
  logic x_result_valid_delay_pipe [PIPE_DEPTH-1:0]; // Pipeline array for result valid signal
  
  x_result_t x_result_pre_delay; // Holds the result data before it enters the delay pipeline
  logic x_result_valid_pre_delay; // Holds the result valid signal before it enters the delay pipeline

  //...

  fpu_ss_controller #(
      .PULP_ZFINX(PULP_ZFINX),
      .INPUT_BUFFER_DEPTH(INPUT_BUFFER_DEPTH),
      .OUT_OF_ORDER(OUT_OF_ORDER),
      .FORWARDING(FORWARDING)
  ) fpu_ss_controller_i (
      //...
      // Result Interface
      .x_result_ready_i(x_result_ready_i),
      .x_result_valid_o(x_result_valid_pre_delay), // Modified output: 'x_result_valid' will be delayed 
      .csr_instr_i(csr_instr)
  );

  //...

  // -------------------------
  // Result Interface Logic (Pre-Delay)
  // -------------------------
  //
  // The following logic prepares the result data (`x_result_pre_delay`) and valid signal (`x_result_valid_pre_delay`)
  // before they are fed into the delay pipeline. This logic generates `x_result_pre_delay` and `x_result_valid_pre_delay`

  assign x_result_pre_delay.exc     = 1'b0;  // no errors can occur for now
  assign x_result_pre_delay.exccode = '0; // no errors can occur for now

  // Prepare the result data before it enters the delay pipeline
  always_comb begin
    x_result_pre_delay.data = fpu_result;
    if (csr_wb & ~fpu_out_valid & csr_wb & ~fpu_out_valid) begin
      x_result_pre_delay.data = 32'($unsigned(csr_wb_addr));
    end
  end

  // Assign destination register and instruction ID before entering delay pipeline
  always_comb begin
    x_result_pre_delay.rd = '0;
    x_result_pre_delay.id = '0;
    if (fpu_out_valid & x_result_valid_pre_delay & x_result_ready_i) begin
      x_result_pre_delay.rd = fpu_tag_out.addr;
      x_result_pre_delay.id = fpu_tag_out.id;
    end else if (x_result_valid_pre_delay & x_result_ready_i & ~fpu_out_valid) begin
      x_result_pre_delay.rd = csr_wb_addr;
      x_result_pre_delay.id = csr_wb_id;
    end
  end

  // Set write-enable flag before entering delay pipeline
  always_comb begin
    x_result_pre_delay.we = 1'b0;
    if ((fpu_out_valid & ~fpu_tag_out.rd_is_fp) | (csr_wb)) begin
      x_result_pre_delay.we = 1'b1;
    end
  end

  // Set additional control signals before entering delay pipeline
  always_comb begin
    x_result_pre_delay.ecswe   = '0;
    x_result_pre_delay.ecsdata = '0;
    if (fpu_out_valid & x_result_valid_pre_delay & x_result_ready_i & fpu_tag_out.rd_is_fp) begin
      x_result_pre_delay.ecswe   = 3'b010;
      x_result_pre_delay.ecsdata = 6'b001100;
    end
  end


  // -------------------------
  // Delay Pipeline Logic
  // -------------------------
  //
  // This logic implements a shift register to delay the result output (`x_result_o`) by `PIPE_DEPTH` cycles.
  // The result and its validity flag are shifted through `PIPE_DEPTH` stages, one per clock cycle.
      
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
        // Initialize the delay pipeline to zero on reset
        for (int i = 0; i < PIPE_DEPTH; i++) begin
            x_result_delay_pipe[i] <= '0;
            x_result_valid_delay_pipe[i] <= '0;
        end
    end else begin
        // Shift the result and valid signals through the delay pipeline
        x_result_delay_pipe[0] <= x_result_pre_delay; // Insert new result at the beginning
        x_result_valid_delay_pipe[0] <= x_result_valid_pre_delay; // Insert new valid signal at the beginning
        for (int i = 1; i < PIPE_DEPTH; i++) begin
            x_result_delay_pipe[i] <= x_result_delay_pipe[i-1]; // Shift data
            x_result_valid_delay_pipe[i] <= x_result_valid_delay_pipe[i-1]; // Shift valid
        end
    end
  end


  // -------------------------
  // Delayed Output Assignment
  // -------------------------
  //
  // The delayed outputs are assigned from the final stage of the pipeline.
  // This effectively delays the signals `x_result_o` and `x_result_valid_o` by `PIPE_DEPTH` cycles.

  assign x_result_o = x_result_delay_pipe[PIPE_DEPTH-1]; // Final delayed result data
  assign x_result_valid_o = x_result_valid_delay_pipe[PIPE_DEPTH-1]; // Final delayed valid signal

Test Case

To reproduce the observed behavior, I created a test case using a sequence that approximates the Fibonacci series. The core performs floating-point calculations via custom instructions and stores the results.

#include <stdio.h>

/** 
 * Compute the first 10 values of the sequence a * x^n with n = 1 to 10,
 * rounding each value to an integer.
 */
void compute_powers(float a, float x, int res[10]) {
    asm volatile (
        // Compute each power a * x^n using floating-point multiplication
        "FMUL.S ft0, %[a],%[x]\n\t"  // ft0 = a * x^1
        "FMUL.S ft1, ft0, %[x]\n\t"  // ft1 = a * x^2
        "FMUL.S ft2, ft1, %[x]\n\t"  // ft2 = a * x^3
        "FMUL.S ft3, ft2, %[x]\n\t"  // ft3 = a * x^4
        "FMUL.S ft4, ft3, %[x]\n\t"  // ft4 = a * x^5
        "FMUL.S ft5, ft4, %[x]\n\t"  // ft5 = a * x^6
        "FMUL.S ft6, ft5, %[x]\n\t"  // ft6 = a * x^7
        "FMUL.S ft7, ft6, %[x]\n\t"  // ft7 = a * x^8
        "FMUL.S ft8, ft7, %[x]\n\t"  // ft8 = a * x^9
        "FMUL.S ft9, ft8, %[x]\n\t"  // ft9 = a * x^10

        // Convert each floating-point result to integer (rounding) and store it in memory
        "FCVT.W.S t1, ft0, rne\n\t"  "SW t1,  0(%[res])\n\t"  // res[0] = t1  = round(ft0)
        "FCVT.W.S t1, ft1, rne\n\t"  "SW t1,  4(%[res])\n\t"  // res[1] = t1  = round(ft1)
        "FCVT.W.S t1, ft2, rne\n\t"  "SW t1,  8(%[res])\n\t"  // res[2] = t1  = round(ft2)
        "FCVT.W.S t1, ft3, rne\n\t"  "SW t1, 12(%[res])\n\t"  // res[3] = t1  = round(ft3)
        "FCVT.W.S t1, ft4, rne\n\t"  "SW t1, 16(%[res])\n\t"  // res[4] = t1  = round(ft4)
        "FCVT.W.S t1, ft5, rne\n\t"  "SW t1, 20(%[res])\n\t"  // res[5] = t1  = round(ft5)
        "FCVT.W.S t2, ft6, rne\n\t"  "SW t2, 24(%[res])\n\t"  // res[6] =  t2 = round(ft6)
        "FCVT.W.S t2, ft7, rne\n\t"  "SW t2, 28(%[res])\n\t"  // res[7] =  t2 = round(ft7)
        "FCVT.W.S t2, ft8, rne\n\t"  "SW t2, 32(%[res])\n\t"  // res[8] =  t2 = round(ft8)
        "FCVT.W.S t2, ft9, rne\n\t"  "SW t2, 36(%[res])\n\t"  // res[9] =  t2 = round(ft9)
        :
        : [a] "f" (a), [x] "f" (x), [res] "r" (res)
        : "t1", "t2",
          "ft0", "ft1", "ft2", "ft3", "ft4", "ft5", "ft6", "ft7", "ft8", "ft9",
          "memory"
    );
}


/** 
 * Main function to test the compute_powers function.
 * This test uses values that approximate the Fibonacci sequence:
 *    - x is the golden ratio ((1 + sqrt(5)) / 2)
 *    - a is scaled to start the sequence at the second Fibonacci number.
 * The resulting series should resemble the Fibonacci sequence from position 2 onward:
 *    1, 2, 3, 5, 8, 13, 21, 34, 55, 89
 */
int main(void) {
    float x = 1.6180339887; // Golden ratio approximation
    float a = 0.7236067977; // 1 / sqrt(5) * x (to skip first 1 in Fibonacci series)
    int fib[10] = {0}; // Array to store computed Fibonacci-like sequence values

    compute_powers(a, x, fib);

    // Print each computed result, resembling Fibonacci sequence starting at position 2
    for (int i=0; i<10; i++) printf("%d:%d\r\n", i, fib[i]);

    return 0;
}

Observations

  • cv32e40x core: Results are as expected.

image

  • cv32e40px core: The first result appears incorrect, and subsequent values are shifted, skipping "13."

image

Steps to Reproduce

Commands:
The commands I used are the following ones:

  1. make mcu-gen CPU=cv32e40px MEMORY_BANKS=4 MEMORY_BANKS_IL=4 BUS=NtoM MCU_CFG=mcu_cfg.hjson
  2. make verilator-sim FUSESOC_PARAM='--REMOVE_OBI_FIFO --X_EXT=1'
  3. make app PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc (where the project folder contains a main.c file with the Test Case provided above).
  4. make run-app-sim PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc

Waveform Analysis

The waveforms below show how result values are stored correctly in X register mem(6), with the result_valid flag being activated as expected.

  • cv32e40x waveforms):

{B10A21A7-D99F-440A-9BB2-CAF5FE68ECC8}

  • cv32e40px waveforms:

{3292F747-0DCE-4D90-979E-4674D9C990B2}

Component

Component:RTL: For issues in the RTL (e.g. for files in the rtl directory)

Steps to Reproduce

Please provide:

  1. git hash
  2. Command line: Steps to reproduce above
  3. Logfile and/or wave-dump info (screen shots can be useful): Waveform analysis above
@gs2git gs2git changed the title Potential Issue with Multicycle Instruction Handling in cv32e40px Core Incorrect handling of writeback value in multicycle instructions Dec 17, 2024
@gs2git
Copy link
Author

gs2git commented Dec 17, 2024

Did you have a chance to look into this issue?
I have edited the title to better describe the issue, and added a short summary describing the problem. It doesn't seem to be a problem on my custom instruction, but on the PX core.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant