Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Comb Groups Pass Produces Unoptimal Designs #1232

Closed
andrewb1999 opened this issue Nov 6, 2022 · 3 comments
Closed

Remove Comb Groups Pass Produces Unoptimal Designs #1232

andrewb1999 opened this issue Nov 6, 2022 · 3 comments
Labels
AMC Needed for Andrew's memory compiler C: Calyx Extension or change to the Calyx IL S: Discussion needed Issues blocked on discussion

Comments

@andrewb1999
Copy link
Collaborator

I've been experimenting in Calyx to try to understand cases where even hand optimizing the program (mostly by chaining operations together) is not enough to meet a Vitis HLS design. One big limiting factor I have found is how while ... with ... loops are lowered to pure while loops during the remove comb groups pass. Here is an example of this issue:

This first program is a vadd optimized as much as possible while still using the higher level while ... with ... construct.

component main<"toplevel"=1>(@clk clk: 1, @reset reset: 1, @go go: 1) -> (@done done: 1) {
  cells {
    std_slice_2 = std_slice(32, 3);
    std_slice_1 = std_slice(32, 3);
    std_slice_0 = std_slice(32, 3);
    std_add_1 = std_add(32);
    std_add_0 = std_add(32);
    std_slt_0 = std_slt(32);
    @external(1) mem_2 = std_mem_d1(32, 8, 3);
    @external(1) mem_1 = std_mem_d1(32, 8, 3);
    @external(1) mem_0 = std_mem_d1(32, 8, 3);
    while_0_arg0_reg = std_reg(32);
  }
  wires {
    group assign_while_0_init_0 {
      while_0_arg0_reg.in = 32'd0;
      while_0_arg0_reg.write_en = 1'd1;
      assign_while_0_init_0[done] = while_0_arg0_reg.done;
    }
    comb group bb0_0 {
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
    }
    group bb0_5 {
      std_slice_2.in = while_0_arg0_reg.out;
      std_slice_1.in = while_0_arg0_reg.out;
      std_slice_0.in = while_0_arg0_reg.out;
      mem_2.addr0 = std_slice_2.out;
      mem_2.write_data = std_add_1.out;
      mem_2.write_en = 1'd1;
      std_add_1.left = mem_0.read_data;
      mem_0.addr0 = std_slice_1.out;
      std_add_1.right = mem_1.read_data;
      mem_1.addr0 = std_slice_0.out;
      while_0_arg0_reg.in = std_add_0.out;
      while_0_arg0_reg.write_en = 1'd1;
      std_add_0.left = while_0_arg0_reg.out;
      std_add_0.right = 32'd1;
      bb0_5[done] = mem_2.done;
    }
  }
  control {
    seq {
      assign_while_0_init_0;
      while std_slt_0.out with bb0_0 {
        bb0_5;
      }
    }
  }
}

This design produces a correct result in 37 cycles. However, this next design is hand optimized again after running the first design through the remove-comb-groups pass:

component main<"toplevel"=1>(@clk clk: 1, @reset reset: 1, @go go: 1) -> (@done done: 1) {
  cells {
    std_slice_2 = std_slice(32, 3);
    std_slice_1 = std_slice(32, 3);
    std_slice_0 = std_slice(32, 3);
    std_add_1 = std_add(32);
    std_add_0 = std_add(32);
    std_slt_0 = std_slt(32);
    @external mem_2 = std_mem_d1(32, 8, 3);
    @external mem_1 = std_mem_d1(32, 8, 3);
    @external mem_0 = std_mem_d1(32, 8, 3);
    while_0_arg0_reg = std_reg(32);
    @generated comb_reg = std_reg(1);
  }
  wires {
    group assign_while_0_init_0 {
      while_0_arg0_reg.in = 32'd0;
      while_0_arg0_reg.write_en = 1'd1;
      assign_while_0_init_0[done] = while_0_arg0_reg.done;
    }
    group bb0_5 {
      std_slice_2.in = while_0_arg0_reg.out;
      std_slice_1.in = while_0_arg0_reg.out;
      std_slice_0.in = while_0_arg0_reg.out;
      mem_2.addr0 = std_slice_2.out;
      mem_2.write_data = std_add_1.out;
      mem_2.write_en = 1'd1;
      std_add_1.left = mem_0.read_data;
      mem_0.addr0 = std_slice_1.out;
      std_add_1.right = mem_1.read_data;
      mem_1.addr0 = std_slice_0.out;
      while_0_arg0_reg.in = std_add_0.out;
      while_0_arg0_reg.write_en = 1'd1;
      std_add_0.left = while_0_arg0_reg.out;
      std_add_0.right = 32'd1;
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
      comb_reg.in = std_slt_0.out;
      comb_reg.write_en = 1'd1;
      bb0_5[done] = mem_2.done;
    }
    group bb0_00<"static"=1> {
      std_slt_0.left = while_0_arg0_reg.out;
      std_slt_0.right = 32'd8;
      comb_reg.in = std_slt_0.out;
      comb_reg.write_en = 1'd1;
      bb0_00[done] = comb_reg.done ? 1'd1;
    }
  }

  control {
    seq {
      assign_while_0_init_0;
      seq {
        bb0_00;
        while comb_reg.out {
            bb0_5;
        }
      }
    }
  }

The main change between these two designs is that the comb group in the body of the loop has been chained together with the rest of the operations in bb0_5. This new design runs in 27 cycles, which is exactly the same as a non-pipelined HLS implementation.

Although it does not have a huge impact in latency in this case, it will become a big issue when considering large nested loops and pipelines, where the while loop may run thousands of times.

Now obviously this optimization cannot always be made, depending on the target frequency and how the iter arg of the while loop is incremented. If it is incremented with an addition this optimization should almost always be possible because the comparison delay is small.

I think the correct solution here unfortunately requires some analysis of delay and chaining (or just assume a user who wants full performance will not use while ... with ... constructs), but I figured it was worth pointing out this limitation with examples.

@andrewb1999 andrewb1999 added the C: Calyx Extension or change to the Calyx IL label Nov 6, 2022
@rachitnigam rachitnigam added S: Discussion needed Issues blocked on discussion AMC Needed for Andrew's memory compiler labels Nov 13, 2022
@andrewb1999
Copy link
Collaborator Author

I think the solution here is easier than I originally though. #911 should remove this extra cycle without needing any delay information.

@sampsyo
Copy link
Contributor

sampsyo commented Nov 19, 2022

Yes, this makes sense to me… #911 feels like the root cause here.

@rachitnigam
Copy link
Contributor

Closing this because a combination of #1422 and the new static abstractions will allow us to represent the same code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AMC Needed for Andrew's memory compiler C: Calyx Extension or change to the Calyx IL S: Discussion needed Issues blocked on discussion
Projects
None yet
Development

No branches or pull requests

3 participants