You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Potential Issue with Multicycle Instruction Handling in cv32e40px Core
Summary
When a custom CV-X-IF instruction has a writeback value (write to rd), if the instruction takes more than one clock cycle to raise the result_valid flag, cv32e40px writes an incorrect value to rd.
This doesn't happen in cv32e40x tested under the same conditions.
Issue Description
I am encountering unexpected behavior when executing multiple multicycle instructions with writeback on the cv32e40px core. In contrast, the cv32e40x core produces the expected results under the same conditions. This observation was made while using a co-processor wrapper based on the CV-X-IF standard (v0.2.0) to enable custom instructions. To reproduce the issue, I introduced a delay to the result in the FPU RTL and created a test that computes values on the FPU, then moves them to X registers, followed by sequential stores that reuse the same X register.
Testing and Findings
It appears that, during multicycle instructions with writeback, the cv32e40px core may not be coordinating the results as anticipated. The core seems to perform stores before the results for the current instruction are available, instead of waiting for the valid data. Meanwhile, the co-processor appears to respond correctly, activating result_valid only when the result data is ready.
To reproduce this behavior, I introduced a delay in the RTL of the FPU (fpu_ss.sv), adding a 5-cycle delay pipeline to x_result_valid_o and x_result_o. Below is the relevant code snippet with the added modifications. Please substitute the existing Result Interface Signals section in the fpu_ss.sv file with the Result Interface Logic (Pre-Delay) below:
// Define delay pipeline depth and pipeline arrayslocalparam intPIPE_DEPTH=5; // Number of cycles to delay the result output
x_result_t x_result_delay_pipe [PIPE_DEPTH-1:0]; // Pipeline array for result data logic x_result_valid_delay_pipe [PIPE_DEPTH-1:0]; // Pipeline array for result valid signalx_result_t x_result_pre_delay; // Holds the result data before it enters the delay pipeline logic x_result_valid_pre_delay; // Holds the result valid signal before it enters the delay pipeline//...fpu_ss_controller#(
.PULP_ZFINX(PULP_ZFINX),
.INPUT_BUFFER_DEPTH(INPUT_BUFFER_DEPTH),
.OUT_OF_ORDER(OUT_OF_ORDER),
.FORWARDING(FORWARDING)
) fpu_ss_controller_i (
//...// Result Interface
.x_result_ready_i(x_result_ready_i),
.x_result_valid_o(x_result_valid_pre_delay), // Modified output: 'x_result_valid' will be delayed
.csr_instr_i(csr_instr)
);
//...// -------------------------// Result Interface Logic (Pre-Delay)// -------------------------//// The following logic prepares the result data (`x_result_pre_delay`) and valid signal (`x_result_valid_pre_delay`)// before they are fed into the delay pipeline. This logic generates `x_result_pre_delay` and `x_result_valid_pre_delay`assign x_result_pre_delay.exc =1'b0; // no errors can occur for nowassign x_result_pre_delay.exccode ='0; // no errors can occur for now// Prepare the result data before it enters the delay pipelinealways_combbegin
x_result_pre_delay.data = fpu_result;
if (csr_wb &~fpu_out_valid & csr_wb &~fpu_out_valid) begin
x_result_pre_delay.data =32'($unsigned(csr_wb_addr));
endend// Assign destination register and instruction ID before entering delay pipelinealways_combbegin
x_result_pre_delay.rd ='0;
x_result_pre_delay.id ='0;
if (fpu_out_valid & x_result_valid_pre_delay & x_result_ready_i) begin
x_result_pre_delay.rd = fpu_tag_out.addr;
x_result_pre_delay.id = fpu_tag_out.id;
endelseif (x_result_valid_pre_delay & x_result_ready_i &~fpu_out_valid) begin
x_result_pre_delay.rd = csr_wb_addr;
x_result_pre_delay.id = csr_wb_id;
endend// Set write-enable flag before entering delay pipelinealways_combbegin
x_result_pre_delay.we =1'b0;
if ((fpu_out_valid &~fpu_tag_out.rd_is_fp) | (csr_wb)) begin
x_result_pre_delay.we =1'b1;
endend// Set additional control signals before entering delay pipelinealways_combbegin
x_result_pre_delay.ecswe ='0;
x_result_pre_delay.ecsdata ='0;
if (fpu_out_valid & x_result_valid_pre_delay & x_result_ready_i & fpu_tag_out.rd_is_fp) begin
x_result_pre_delay.ecswe =3'b010;
x_result_pre_delay.ecsdata =6'b001100;
endend// -------------------------// Delay Pipeline Logic// -------------------------//// This logic implements a shift register to delay the result output (`x_result_o`) by `PIPE_DEPTH` cycles.// The result and its validity flag are shifted through `PIPE_DEPTH` stages, one per clock cycle.always_ff@(posedge clk_i ornegedge rst_ni) beginif (!rst_ni) begin// Initialize the delay pipeline to zero on resetfor (int i =0; i <PIPE_DEPTH; i++) begin
x_result_delay_pipe[i] <='0;
x_result_valid_delay_pipe[i] <='0;
endendelsebegin// Shift the result and valid signals through the delay pipeline
x_result_delay_pipe[0] <= x_result_pre_delay; // Insert new result at the beginning
x_result_valid_delay_pipe[0] <= x_result_valid_pre_delay; // Insert new valid signal at the beginningfor (int i =1; i <PIPE_DEPTH; i++) begin
x_result_delay_pipe[i] <= x_result_delay_pipe[i-1]; // Shift data
x_result_valid_delay_pipe[i] <= x_result_valid_delay_pipe[i-1]; // Shift validendendend// -------------------------// Delayed Output Assignment// -------------------------//// The delayed outputs are assigned from the final stage of the pipeline.// This effectively delays the signals `x_result_o` and `x_result_valid_o` by `PIPE_DEPTH` cycles.assign x_result_o = x_result_delay_pipe[PIPE_DEPTH-1]; // Final delayed result dataassign x_result_valid_o = x_result_valid_delay_pipe[PIPE_DEPTH-1]; // Final delayed valid signal
Test Case
To reproduce the observed behavior, I created a test case using a sequence that approximates the Fibonacci series. The core performs floating-point calculations via custom instructions and stores the results.
#include <stdio.h>/** * Compute the first 10 values of the sequence a * x^n with n = 1 to 10, * rounding each value to an integer.*/voidcompute_powers(float a, float x, int res[10]) {asmvolatile (
// Compute each power a * x^n using floating-point multiplication"FMUL.S ft0, %[a],%[x]\n\t"// ft0 = a * x^1"FMUL.S ft1, ft0, %[x]\n\t"// ft1 = a * x^2"FMUL.S ft2, ft1, %[x]\n\t"// ft2 = a * x^3"FMUL.S ft3, ft2, %[x]\n\t"// ft3 = a * x^4"FMUL.S ft4, ft3, %[x]\n\t"// ft4 = a * x^5"FMUL.S ft5, ft4, %[x]\n\t"// ft5 = a * x^6"FMUL.S ft6, ft5, %[x]\n\t"// ft6 = a * x^7"FMUL.S ft7, ft6, %[x]\n\t"// ft7 = a * x^8"FMUL.S ft8, ft7, %[x]\n\t"// ft8 = a * x^9"FMUL.S ft9, ft8, %[x]\n\t"// ft9 = a * x^10// Convert each floating-point result to integer (rounding) and store it in memory"FCVT.W.S t1, ft0, rne\n\t""SW t1, 0(%[res])\n\t"// res[0] = t1 = round(ft0)"FCVT.W.S t1, ft1, rne\n\t""SW t1, 4(%[res])\n\t"// res[1] = t1 = round(ft1)"FCVT.W.S t1, ft2, rne\n\t""SW t1, 8(%[res])\n\t"// res[2] = t1 = round(ft2)"FCVT.W.S t1, ft3, rne\n\t""SW t1, 12(%[res])\n\t"// res[3] = t1 = round(ft3)"FCVT.W.S t1, ft4, rne\n\t""SW t1, 16(%[res])\n\t"// res[4] = t1 = round(ft4)"FCVT.W.S t1, ft5, rne\n\t""SW t1, 20(%[res])\n\t"// res[5] = t1 = round(ft5)"FCVT.W.S t2, ft6, rne\n\t""SW t2, 24(%[res])\n\t"// res[6] = t2 = round(ft6)"FCVT.W.S t2, ft7, rne\n\t""SW t2, 28(%[res])\n\t"// res[7] = t2 = round(ft7)"FCVT.W.S t2, ft8, rne\n\t""SW t2, 32(%[res])\n\t"// res[8] = t2 = round(ft8)"FCVT.W.S t2, ft9, rne\n\t""SW t2, 36(%[res])\n\t"// res[9] = t2 = round(ft9):: [a] "f" (a), [x] "f" (x), [res] "r" (res)
:"t1", "t2",
"ft0", "ft1", "ft2", "ft3", "ft4", "ft5", "ft6", "ft7", "ft8", "ft9",
"memory"
);
}/** * Main function to test the compute_powers function. * This test uses values that approximate the Fibonacci sequence: * - x is the golden ratio ((1 + sqrt(5)) / 2) * - a is scaled to start the sequence at the second Fibonacci number. * The resulting series should resemble the Fibonacci sequence from position 2 onward: * 1, 2, 3, 5, 8, 13, 21, 34, 55, 89*/intmain(void) {float x =1.6180339887; // Golden ratio approximationfloat a =0.7236067977; // 1 / sqrt(5) * x (to skip first 1 in Fibonacci series) int fib[10] ={0}; // Array to store computed Fibonacci-like sequence valuescompute_powers(a, x, fib);
// Print each computed result, resembling Fibonacci sequence starting at position 2for (int i=0; i<10; i++) printf("%d:%d\r\n", i, fib[i]);
return0;
}
Observations
cv32e40x core: Results are as expected.
cv32e40px core: The first result appears incorrect, and subsequent values are shifted, skipping "13."
Steps to Reproduce
Commands:
The commands I used are the following ones:
make mcu-gen CPU=cv32e40px MEMORY_BANKS=4 MEMORY_BANKS_IL=4 BUS=NtoM MCU_CFG=mcu_cfg.hjson
make verilator-sim FUSESOC_PARAM='--REMOVE_OBI_FIFO --X_EXT=1'
make app PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc (where the project folder contains a main.c file with the Test Case provided above).
make run-app-sim PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc
Waveform Analysis
The waveforms below show how result values are stored correctly in X register mem(6), with the result_valid flag being activated as expected.
cv32e40x waveforms):
cv32e40px waveforms:
Component
Component:RTL: For issues in the RTL (e.g. for files in the rtl directory)
Steps to Reproduce
Please provide:
git hash
Command line: Steps to reproduce above
Logfile and/or wave-dump info (screen shots can be useful): Waveform analysis above
The text was updated successfully, but these errors were encountered:
gs2git
changed the title
Potential Issue with Multicycle Instruction Handling in cv32e40px Core
Incorrect handling of writeback value in multicycle instructions
Dec 17, 2024
Did you have a chance to look into this issue?
I have edited the title to better describe the issue, and added a short summary describing the problem. It doesn't seem to be a problem on my custom instruction, but on the PX core.
Potential Issue with Multicycle Instruction Handling in cv32e40px Core
Summary
When a custom CV-X-IF instruction has a writeback value (write to
rd
), if the instruction takes more than one clock cycle to raise theresult_valid
flag, cv32e40px writes an incorrect value tord
.This doesn't happen in cv32e40x tested under the same conditions.
Issue Description
I am encountering unexpected behavior when executing multiple multicycle instructions with writeback on the cv32e40px core. In contrast, the cv32e40x core produces the expected results under the same conditions. This observation was made while using a co-processor wrapper based on the CV-X-IF standard (v0.2.0) to enable custom instructions. To reproduce the issue, I introduced a delay to the result in the FPU RTL and created a test that computes values on the FPU, then moves them to X registers, followed by sequential stores that reuse the same X register.
Testing and Findings
It appears that, during multicycle instructions with writeback, the cv32e40px core may not be coordinating the results as anticipated. The core seems to perform stores before the results for the current instruction are available, instead of waiting for the valid data. Meanwhile, the co-processor appears to respond correctly, activating
result_valid
only when the result data is ready.To reproduce this behavior, I introduced a delay in the RTL of the FPU (
fpu_ss.sv
), adding a 5-cycle delay pipeline tox_result_valid_o
andx_result_o
. Below is the relevant code snippet with the added modifications. Please substitute the existingResult Interface Signals
section in thefpu_ss.sv
file with theResult Interface Logic (Pre-Delay)
below:Test Case
To reproduce the observed behavior, I created a test case using a sequence that approximates the Fibonacci series. The core performs floating-point calculations via custom instructions and stores the results.
Observations
Steps to Reproduce
Commands:
The commands I used are the following ones:
make mcu-gen CPU=cv32e40px MEMORY_BANKS=4 MEMORY_BANKS_IL=4 BUS=NtoM MCU_CFG=mcu_cfg.hjson
make verilator-sim FUSESOC_PARAM='--REMOVE_OBI_FIFO --X_EXT=1'
make app PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc
(where the project folder contains amain.c
file with the Test Case provided above).make run-app-sim PROJECT=test_xif_wb_with_fpu TARGET=sim ARCH=rv32imfc
Waveform Analysis
The waveforms below show how result values are stored correctly in X register
mem(6)
, with theresult_valid
flag being activated as expected.Component
Component:RTL: For issues in the RTL (e.g. for files in the rtl directory)
Steps to Reproduce
Please provide:
The text was updated successfully, but these errors were encountered: