Remove Comb Groups Pass Produces Unoptimal Designs #1232

andrewb1999 · 2022-11-06T15:47:33Z

I've been experimenting in Calyx to try to understand cases where even hand optimizing the program (mostly by chaining operations together) is not enough to meet a Vitis HLS design. One big limiting factor I have found is how while ... with ... loops are lowered to pure while loops during the remove comb groups pass. Here is an example of this issue:

This first program is a vadd optimized as much as possible while still using the higher level while ... with ... construct.

component main<"toplevel"=1>(@clk clk: 1, @reset reset: 1, @go go: 1) -> (@done done: 1) {
  cells {
    std_slice_2 = std_slice(32, 3);
    std_slice_1 = std_slice(32, 3);
    std_slice_0 = std_slice(32, 3);
    std_add_1 = std_add(32);
    std_add_0 = std_add(32);
    std_slt_0 = std_slt(32);
    @external(1) mem_2 = std_mem_d1(32, 8, 3);
    @external(1) mem_1 = std_mem_d1(32, 8, 3);
    @external(1) mem_0 = std_mem_d1(32, 8, 3);
    while_0_arg0_reg = std_reg(32);
  }
  wires {
    group assign_while_0_init_0 {
      while_0_arg0_reg.in = 32'd0;
      while_0_arg0_reg.write_en = 1'd1;
      assign_while_0_init_0[done] = while_0_arg0_reg.done;
    }
    comb group bb0_0 {
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
    }
    group bb0_5 {
      std_slice_2.in = while_0_arg0_reg.out;
      std_slice_1.in = while_0_arg0_reg.out;
      std_slice_0.in = while_0_arg0_reg.out;
      mem_2.addr0 = std_slice_2.out;
      mem_2.write_data = std_add_1.out;
      mem_2.write_en = 1'd1;
      std_add_1.left = mem_0.read_data;
      mem_0.addr0 = std_slice_1.out;
      std_add_1.right = mem_1.read_data;
      mem_1.addr0 = std_slice_0.out;
      while_0_arg0_reg.in = std_add_0.out;
      while_0_arg0_reg.write_en = 1'd1;
      std_add_0.left = while_0_arg0_reg.out;
      std_add_0.right = 32'd1;
      bb0_5[done] = mem_2.done;
    }
  }
  control {
    seq {
      assign_while_0_init_0;
      while std_slt_0.out with bb0_0 {
        bb0_5;
      }
    }
  }
}

This design produces a correct result in 37 cycles. However, this next design is hand optimized again after running the first design through the remove-comb-groups pass:

component main<"toplevel"=1>(@clk clk: 1, @reset reset: 1, @go go: 1) -> (@done done: 1) {
  cells {
    std_slice_2 = std_slice(32, 3);
    std_slice_1 = std_slice(32, 3);
    std_slice_0 = std_slice(32, 3);
    std_add_1 = std_add(32);
    std_add_0 = std_add(32);
    std_slt_0 = std_slt(32);
    @external mem_2 = std_mem_d1(32, 8, 3);
    @external mem_1 = std_mem_d1(32, 8, 3);
    @external mem_0 = std_mem_d1(32, 8, 3);
    while_0_arg0_reg = std_reg(32);
    @generated comb_reg = std_reg(1);
  }
  wires {
    group assign_while_0_init_0 {
      while_0_arg0_reg.in = 32'd0;
      while_0_arg0_reg.write_en = 1'd1;
      assign_while_0_init_0[done] = while_0_arg0_reg.done;
    }
    group bb0_5 {
      std_slice_2.in = while_0_arg0_reg.out;
      std_slice_1.in = while_0_arg0_reg.out;
      std_slice_0.in = while_0_arg0_reg.out;
      mem_2.addr0 = std_slice_2.out;
      mem_2.write_data = std_add_1.out;
      mem_2.write_en = 1'd1;
      std_add_1.left = mem_0.read_data;
      mem_0.addr0 = std_slice_1.out;
      std_add_1.right = mem_1.read_data;
      mem_1.addr0 = std_slice_0.out;
      while_0_arg0_reg.in = std_add_0.out;
      while_0_arg0_reg.write_en = 1'd1;
      std_add_0.left = while_0_arg0_reg.out;
      std_add_0.right = 32'd1;
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
      comb_reg.in = std_slt_0.out;
      comb_reg.write_en = 1'd1;
      bb0_5[done] = mem_2.done;
    }
    group bb0_00<"static"=1> {
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
      comb_reg.in = std_slt_0.out;
      comb_reg.write_en = 1'd1;
      bb0_00[done] = comb_reg.done ? 1'd1;
    }
  }

  control {
    seq {
      assign_while_0_init_0;
      seq {
        bb0_00;
        while comb_reg.out {
            bb0_5;
        }
      }
    }
  }

The main change between these two designs is that the comb group in the body of the loop has been chained together with the rest of the operations in bb0_5. This new design runs in 27 cycles, which is exactly the same as a non-pipelined HLS implementation.

Although it does not have a huge impact in latency in this case, it will become a big issue when considering large nested loops and pipelines, where the while loop may run thousands of times.

Now obviously this optimization cannot always be made, depending on the target frequency and how the iter arg of the while loop is incremented. If it is incremented with an addition this optimization should almost always be possible because the comparison delay is small.

I think the correct solution here unfortunately requires some analysis of delay and chaining (or just assume a user who wants full performance will not use while ... with ... constructs), but I figured it was worth pointing out this limitation with examples.

The text was updated successfully, but these errors were encountered:

andrewb1999 · 2022-11-16T01:20:11Z

I think the solution here is easier than I originally though. #911 should remove this extra cycle without needing any delay information.

sampsyo · 2022-11-19T19:56:57Z

Yes, this makes sense to me… #911 feels like the root cause here.

rachitnigam · 2023-04-29T13:28:00Z

Closing this because a combination of #1422 and the new static abstractions will allow us to represent the same code

andrewb1999 added the C: Calyx Extension or change to the Calyx IL label Nov 6, 2022

rachitnigam added S: Discussion needed Issues blocked on discussion AMC Needed for Andrew's memory compiler labels Nov 13, 2022

rachitnigam closed this as completed Apr 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Comb Groups Pass Produces Unoptimal Designs #1232

Remove Comb Groups Pass Produces Unoptimal Designs #1232

andrewb1999 commented Nov 6, 2022

andrewb1999 commented Nov 16, 2022

sampsyo commented Nov 19, 2022

rachitnigam commented Apr 29, 2023

Remove Comb Groups Pass Produces Unoptimal Designs #1232

Remove Comb Groups Pass Produces Unoptimal Designs #1232

Comments

andrewb1999 commented Nov 6, 2022

andrewb1999 commented Nov 16, 2022

sampsyo commented Nov 19, 2022

rachitnigam commented Apr 29, 2023