Lesson 3: Local Analysis & Optimization #284

sampsyo · 2022-02-03T14:59:40Z

sampsyo
Feb 3, 2022
Maintainer

The tasks for this lesson include implementing basic dead-code elimination (DCE) and, as the main event, implementing local value numbering (LVN). I ❤️ LVN!

5hubh4m · 2022-02-07T01:04:18Z

5hubh4m
Feb 7, 2022

Dead Code Elimination

Implemented dead code elimination of both varieties (global use analysis and local reassignment analysis). The code for it is here. Tested the implementation using turnt. The tests are from examples/test/tdce in the Bril repository. The test folder is here.

Local Value Numbering

Implemented local value numbering based optimizations. I implemented the following optimizations:

Copy propagation
Common subexpression elimination
Constant folding

The code for it is here. Tested the implementation using turnt. The tests are from examples/test/lvn in the Bril repository. The test folder is here.

One caveat with testing LVN is that my implementation generates randomized variable names for clobbered identifiers. This can result in turnt reporting a test failing even though it's an identical program. The current workaround is to use the turnt --diff flag to manually verify that the only difference is in variable names.

I took the advice from @sampsyo's comment and added a Java property random.seed. If the program sees it exists it will try to set the seed using it. I also updated the test specification to supply this argument.

It can optimize all the examples shown in class. A good example is this test case which combines a lot of elements from all test cases.

@main {
  a: int = const 4;
  b: int = const 2;
                                  @main {
  # (a + b) * (a + b)               a: int = const 4;
  sum1: int = add a b;              b: int = const 2;
  sum2: int = add b a;              vefc8b: int = const 6;         @main {
  prod1: int = mul sum1 sum2;       sum2: int = id vefc8b;   DCE     prod1: int = const 36;
                              LVN   prod1: int = const 36;   ==>     print prod1;
  # Clobber both sums.        ==>   sum1: int = const 0;           }
  sum1: int = const 0;              sum2: int = id sum1;
  sum2: int = const 0;              sum3: int = id vefc8b;
                                    prod2: int = id prod1;
  # Use the sums again.             print prod1;
  sum3: int = add a b;            }
  prod2: int = mul sum3 sum3;

  print prod2;
}

Correctness

To verify the correctness of both DCE and LVN implementations, I created an additional test folder dce+lvn which tests the output of the LVN+DCE pass with brili and compares it to the unmodified program. This should test the correctness of the optimizations.

This test helped me catch a bug in my implementation. When going through the list of instructions I would maintain a map of assignments that have been renamed (because they will be clobbered in the future). However, I had forgotten to remove an assignment from the map after encountering the instruction that clobbers it.

Speed Increase

Across all the tests in dce+lvn, executing brili -p reports a combined total of 90 dynamic instructions issued. After an LVN + DCE pass, however, brili -p reports only 24 dynamic instructions issued, which is a ~73% decrease.

2 replies

sampsyo Feb 7, 2022
Maintainer Author

Awesome!! This sounds great. And thanks for the recap of the bug you ran into; that's a very easy problem to introduce.

It sounds like your fresh-name generator is nondeterministic. Makes sense, but testing becomes much easier to automate if you force things to be deterministic. Not that you need to do it for this task, but have you considered (optionally) fixing a seed so the PRNG always picks the same sequence of numbers, at least in some manner of testing mode?

Finally, while it's of course chill to use the internal Cornell GitHub, you are also welcome to put your code on the "real" GitHub and show it off to the world, if you want. No big deal either way.

5hubh4m Feb 7, 2022

Thanks for the idea! I didn’t mess around with seeds that much because I plan to refactor my implementation to deterministically generate an as yet unused identifier.
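One way to generate an as-yet-unused identifier deterministically is to count upward until a free name turns up. This is a minimal Python sketch, not the Java implementation discussed in this post; `fresh_name`, the `lvn.` prefix, and the `used` set are illustrative assumptions.

```python
def fresh_name(used, prefix="lvn."):
    """Return the first of prefix0, prefix1, ... that is not in `used`, and reserve it."""
    n = 0
    while f"{prefix}{n}" in used:
        n += 1
    name = f"{prefix}{n}"
    used.add(name)  # reserve it so the next call moves on
    return name


# Hypothetical set of names already present in the function being optimized.
used = {"a", "b", "lvn.0"}
first = fresh_name(used)   # "lvn.0" is taken, so this is "lvn.1"
second = fresh_name(used)  # the next free name, "lvn.2"
```

Because the result depends only on the set of names already in use, repeated runs on the same program produce identical output, which keeps snapshot tests stable.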

I wasn’t sure whether we could put our source code on our GitHub. Good to know! I will migrate it in time for Lecture 4 tasks! Thanks.

tonyjie · 2022-02-10T06:28:09Z

tonyjie
Feb 10, 2022

The code is here. The test cases can be run easily by following the README.

Dead Code Elimination

Implemented both DCE variants: the trivial one (global analysis only) and the extended one (with local reassignment analysis). I verified correctness, and checking static code size (wc -l) and dynamic instruction count (brili -p) confirms that the code is actually optimized.
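The local reassignment analysis can be sketched as a single scan over one basic block. This is an illustrative Python sketch over Bril's JSON instruction dicts, not the code linked above; `drop_killed_defs` is a hypothetical name, and it assumes instructions with a `dest` have no side effects.

```python
def drop_killed_defs(block):
    """One pass over a basic block: delete assignments that are
    overwritten before any instruction reads them."""
    last_def = {}  # variable -> index of its most recent, still-unread definition
    dead = set()   # indices of instructions to delete
    for i, instr in enumerate(block):
        # Handle reads first: `x = add x x` reads the old x before writing it.
        for arg in instr.get("args", []):
            last_def.pop(arg, None)
        dest = instr.get("dest")
        if dest is not None:
            if dest in last_def:
                dead.add(last_def[dest])  # the earlier definition was never read
            last_def[dest] = i
    return [ins for i, ins in enumerate(block) if i not in dead]


block = [
    {"op": "const", "dest": "a", "type": "int", "value": 100},
    {"op": "const", "dest": "a", "type": "int", "value": 42},
    {"op": "print", "args": ["a"]},
]
result = drop_killed_defs(block)  # the first `a` is removed
```

Definitions still unread at the end of the block are kept, since another block might read them.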

Local Value Numbering

First I implemented the basic LVN, and then extended it for three optimizations. They can be enabled using corresponding optional arguments.

-p: constant (copy) propagation.
-c: commutativity.
-f: constant folding.

As with DCE, I tested the optimization results by checking static code size and dynamic instruction count on various test cases. Note that constant folding does not decrease the dynamic instruction count, but a const op should be cheaper than the op it replaces.

Discussions

I'd like to mention some difficulties and bugs I ran into while programming. The difficulties come partly from the gap between the pseudocode and a real implementation.

When renaming overwritten variables, we also need to rename their uses in later instructions. My implementation therefore walks the block once up front and generates all the required new names before the real optimization loop.
The Table data structure described in the pseudocode needs to be built and maintained carefully. I use a list whose index is the Number and whose elements store the Value and Var: each element is a tuple of a namedtuple (the Value) and a string (the Var).
Handling functions with arguments: in my implementation, I need to store the arguments in the Table before entering the loop over the block.
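The table layout and the argument seeding described in the last two points might look roughly like this. It is a sketch under assumptions: the field names of `Value`, the placeholder `"arg"` op, and `init_table` are all hypothetical, not taken from the linked code.

```python
from collections import namedtuple

# For ordinary instructions, `args` would hold value numbers rather than names.
Value = namedtuple("Value", ["op", "args"])


def init_table(func_args):
    """Seed the LVN table with one row per function argument."""
    table = []    # row index == value number; each row is (Value, canonical var)
    var2num = {}  # variable name -> value number
    for name in func_args:
        var2num[name] = len(table)
        # Arguments have no defining instruction, so a placeholder op stands in.
        table.append((Value("arg", (name,)), name))
    return table, var2num


table, var2num = init_table(["x", "y"])
```

With the arguments registered up front, the first instruction that reads `x` or `y` finds a value number for it instead of failing the lookup.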

3 replies

sampsyo Feb 11, 2022
Maintainer Author

Thanks for the detailed recap! Not that it's all that important, but I'm curious about why you needed a two-pass deal to do the renaming first before applying the optimization. Was this to "look into the future" and see which variables would be reused? Otherwise, it seems like the "canonical variable name" generated during LVN on the fly would suffice.

tonyjie Feb 11, 2022

Yes, the renaming pass aims to check which instruction uses the "reused variables".
Considering this code snippet:

x = a + b;
y = a + b;
x = 0;

In my previous implementation, when I reach y = a + b, it is transformed into y = x. Then when I reach x = 0, I learn that the earlier x = a + b is being clobbered. I then rename the previous x = a + b to lvn.0 = a + b, but now the program is wrong because I still need to change y = x to y = lvn.0. The renaming pass exists to avoid this kind of error.

I know this pass is avoidable if I can directly know whether x will be overwritten in the future when I reach x = a + b. But this actually still needs an "implicit" pass to look into the whole block. The way I wrote it makes this pass "explicit" (probably inefficient, but it comes from the initial implementation and bugs :D)
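The "look into the future" step can be done with one backward scan that records, for each definition, whether the same variable is assigned again later in the block. This is a sketch with a hypothetical helper name, `will_be_overwritten`, over Bril JSON instruction dicts.

```python
def will_be_overwritten(block):
    """For each instruction, report whether its dest is assigned again
    later in the same block."""
    flags = [False] * len(block)
    seen_later = set()  # dests assigned by instructions after the current one
    for i in range(len(block) - 1, -1, -1):
        dest = block[i].get("dest")
        if dest is not None:
            flags[i] = dest in seen_later
            seen_later.add(dest)
    return flags


# The snippet from this reply: x = a + b; y = a + b; x = 0;
block = [
    {"op": "add", "dest": "x", "type": "int", "args": ["a", "b"]},
    {"op": "add", "dest": "y", "type": "int", "args": ["a", "b"]},
    {"op": "const", "dest": "x", "type": "int", "value": 0},
]
flags = will_be_overwritten(block)  # only the first definition of x is flagged
```

A definition whose flag is set can be given a fresh name the moment LVN reaches it, so `y = a + b` is rewritten directly in terms of the fresh name and nothing needs patching afterwards.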

sampsyo Feb 11, 2022
Maintainer Author

Makes sense! Thanks!

zzzDavid · 2022-02-10T08:08:02Z

zzzDavid
Feb 10, 2022

My implementation of DCE and LVN is here.

Dead Code Elimination

I implemented both local and global DCE in dce.py. The default is local DCE; adding -g enables global DCE. I added the tests from Bril and used turnt for testing. I verified the pass with brili to make sure it doesn't change execution results.

Local Value Numbering

I implemented constant propagation, common sub-expression elimination, and constant folding in LVN. I added tests from Bril and used turnt to verify the results.

Taking rename-fold.bril as an example, we can run DCE as a post-process of LVN to perform CP, CSE, CF:

$ bril2json < lvn/rename-fold.bril | python3 lvn.py | python3 dce.py | bril2txt

@main {
  v1: int = const 4;
  v2: int = const 0;
  mul1: int = mul v1 v2;
  add1: int = add v1 v2;
  v2: int = const 3;
  print mul1;
  print add1;
}

@main {
  v1: int = const 4;
  lvn.0: int = const 0;
  print lvn.0;
  print v1;
}

Discussion

Data structure

I used a straightforward dictionary {int : {'val': tuple, 'cname': str}} to represent the table. But a dictionary is not always convenient when we want to know whether a value tuple is already in the table. Therefore, I added another list, [tuple], whose index corresponds to the row in the table, to make looking up value tuples easy.

LVN: optimization on-the-fly

Instead of building the table first and then generating the optimized code, we can loop over the instructions only once and generate the transformed code on-the-fly. To do this, we need to carefully consider the following things:

Try to compute the result based on rules.
See if we need to rename the destination.
Copy values that are already in the table; if the value is a constant, change the current instruction to a constant.
Always update the environment with the original name, or the instructions coming next won't be able to find their arguments in the table.
Remember to update the instruction arguments with new canonical names.
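The five points above can be put together in a single forward loop. What follows is a sketch under assumptions, not the lvn.py linked in this post: it works on one basic block of Bril JSON instruction dicts with no labels, folds only add, sub, and mul, and assumes names of the form `lvn.N` are unused. The names `lvn_block`, `FOLD`, and `PURE` are illustrative.

```python
FOLD = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}
PURE = {"add", "sub", "mul", "div", "eq", "lt", "gt", "le", "ge", "not", "and", "or"}


def lvn_block(block):
    """One forward pass of LVN over a basic block (a list of instruction dicts)."""
    table = []    # value number -> (value tuple, canonical variable)
    val2num = {}  # value tuple -> value number
    var2num = {}  # variable name as written -> value number it currently holds

    # Backward pre-scan: which definitions are reassigned later in the block?
    later, clobbered = set(), [False] * len(block)
    for i in reversed(range(len(block))):
        dest = block[i].get("dest")
        if dest is not None:
            clobbered[i] = dest in later
            later.add(dest)

    def number_of(var):
        # A variable read before any local definition (a live-in) gets its own row.
        if var not in var2num:
            value = ("input", var)
            val2num[value] = var2num[var] = len(table)
            table.append((value, var))
        return var2num[var]

    def name_of(var, num):
        # Live-in rows keep the name as written, because the live-in itself
        # may be reassigned later; other rows use their canonical name.
        value, canon = table[num]
        return var if value[0] == "input" else canon

    out, fresh = [], 0
    for i, instr in enumerate(block):
        instr = dict(instr)
        args = instr.get("args", [])
        nums = [number_of(a) for a in args]
        if args:
            instr["args"] = [name_of(a, n) for a, n in zip(args, nums)]  # point 5
        dest, op = instr.get("dest"), instr["op"]
        if dest is None:
            out.append(instr)
            continue

        # Point 1: compute the value, folding constants where a rule exists.
        if op == "const":
            value = ("const", instr["type"], instr["value"])
        elif op == "id":
            value = table[nums[0]][0]
        elif op in PURE:
            operands = [table[n][0] for n in nums]
            if op in FOLD and all(v[0] == "const" for v in operands):
                value = ("const", instr["type"], FOLD[op](*(v[2] for v in operands)))
            else:
                value = (op, *nums)
        else:
            value = (op, "unique", i)  # e.g. calls: never treated as redundant

        # Point 2: a new value gets a row; rename the dest if it is clobbered later.
        is_new = value not in val2num
        if is_new:
            name = dest
            if clobbered[i]:
                name, fresh = f"lvn.{fresh}", fresh + 1
            val2num[value] = len(table)
            table.append((value, name))
            instr["dest"] = name
        num = val2num[value]

        # Point 3: constants become `const`; values already in the table become copies.
        if value[0] == "const":
            instr = {"op": "const", "dest": instr["dest"],
                     "type": instr["type"], "value": value[2]}
        elif not is_new:
            src = instr["args"][0] if value[0] == "input" else table[num][1]
            instr = {"op": "id", "dest": dest, "type": instr["type"], "args": [src]}

        var2num[dest] = num  # point 4: always keyed by the original name
        out.append(instr)
    return out


block = [
    {"op": "add", "dest": "x", "type": "int", "args": ["a", "b"]},
    {"op": "add", "dest": "y", "type": "int", "args": ["a", "b"]},
    {"op": "const", "dest": "x", "type": "int", "value": 0},
]
optimized = lvn_block(block)  # lvn.0 = add a b; y = id lvn.0; x = const 0
```

The sketch still needs one short backward scan to know which destinations will be clobbered, so "single pass" here refers to the rewriting loop. Unused copies it leaves behind are meant to be cleaned up by a following DCE pass.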

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Awesome! Good idea to use two data structures for the two kinds of indexing you needed to use. And thanks for the detailed discussion of how to avoid the need for multiple passes by optimizing "on the fly"!

chhzh123 · 2022-02-10T08:39:39Z

chhzh123
Feb 10, 2022

My implementation of DCE and LVN can be found here.

Dead Code Elimination (DCE)

DCE is relatively easy to implement, just following the pseudocode would be fine:

Traverse all the arguments of instructions to get the used list (stored in a set)
Traverse again to check whether the dest of instructions are in used list. If not, remove the instruction.
For the case of reassignment, first check the used list, and then check the definition list, and only retain the last definition.
Remember to iterate the algorithm until it converges.
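The first two steps and the iteration to convergence can be sketched as follows. This is a minimal Python sketch over Bril JSON instruction dicts; `dce_pass` and `dce` are hypothetical names, and the sketch assumes every instruction with a `dest` is free of side effects.

```python
def dce_pass(instrs):
    """One sweep: delete assignments whose destination is never read anywhere
    in the function. Returns (new_instrs, changed)."""
    used = set()
    for ins in instrs:
        used.update(ins.get("args", []))
    kept = [ins for ins in instrs if "dest" not in ins or ins["dest"] in used]
    return kept, len(kept) != len(instrs)


def dce(instrs):
    """Repeat dce_pass until nothing changes."""
    changed = True
    while changed:
        instrs, changed = dce_pass(instrs)
    return instrs


instrs = [
    {"op": "const", "dest": "a", "type": "int", "value": 1},
    {"op": "id", "dest": "b", "type": "int", "args": ["a"]},
    {"op": "id", "dest": "c", "type": "int", "args": ["b"]},
    {"op": "print", "args": ["a"]},
]
cleaned = dce(instrs)  # c goes in the first sweep, b in the second
```

The example shows why iterating matters: `b` only becomes dead once `c`, its sole reader, has been removed.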

I also added basic control flow support for DCE. Basically I just follow the control flow (jmp and br) and traverse those blocks from the root of CFG (depth-first search). If some blocks cannot be accessed, they will be removed. After removing dead blocks, I viewed the remaining blocks as a whole and did DCE. (This has limitations when the CFG has complex branches.)
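A sketch of the reachability search, including fall-through edges between adjacent blocks. It assumes the function has already been split into a non-empty list of `(label, instrs)` pairs in program order, with a made-up label for the entry block; `successors` and `reachable` are hypothetical names, not the code linked above.

```python
def successors(blocks, idx):
    """Labels of the blocks that control can reach directly from blocks[idx]."""
    _, instrs = blocks[idx]
    last_op = instrs[-1].get("op") if instrs else None
    if last_op in ("jmp", "br"):
        return list(instrs[-1]["labels"])
    if last_op == "ret":
        return []
    # No terminator: control falls through to the next block, if there is one.
    return [blocks[idx + 1][0]] if idx + 1 < len(blocks) else []


def reachable(blocks):
    """Depth-first search from the entry block; returns the set of live labels."""
    index = {label: i for i, (label, _) in enumerate(blocks)}
    seen, stack = set(), [blocks[0][0]]
    while stack:
        label = stack.pop()
        if label in seen:
            continue
        seen.add(label)
        stack.extend(successors(blocks, index[label]))
    return seen


blocks = [
    ("entry", [{"op": "jmp", "labels": ["end"]}]),
    ("dead", [{"op": "const", "dest": "a", "type": "int", "value": 1}]),
    ("end", [{"op": "print", "args": ["a"]}]),
]
live = reachable(blocks)  # {"entry", "end"}
```

A `br` contributes both of its labels regardless of what is known about the condition, so both arms of a branch always stay live in this sketch.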

The following shows the test status of the benchmarks from examples/test/tdce.

| Benchmark | Status | Comments |
| --- | --- | --- |
| `combo` | ✔️ | |
| `diamond` | ✔️ | Control flow (br) |
| `double-pass` | ✔️ | Same as `double-pass`, original test incorrect |
| `double` | ✔️ | |
| `reassign-dkp` | ✔️ | |
| `reassign` | ✔️ | Same as `reassign-dkp`, original test incorrect |
| `simple` | ✔️ | |
| `skipped` | ✔️ | Control flow (jmp), remove dead blocks |

Local Value Numbering (LVN)

LVN is trickier and requires careful consideration. I use a list to store the LVN table. Each element is a tuple containing the value and the canonical name. This takes O(1) to access a row by index but O(n) in the worst case to find a value. I am not sure whether there is a data structure that offers the best of both worlds, where lookup by index and lookup by key both take constant time (without storing redundant data). Reducing the search overhead would help if the program has lots of instructions.
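If the no-redundancy condition is relaxed, a common way to get both lookups cheaply is to keep a hash map from value to row number next to the list. This is a sketch with a hypothetical `LVNTable` class; dictionary lookups are constant time on average rather than in the worst case.

```python
class LVNTable:
    """LVN table indexed both by value number and by value."""

    def __init__(self):
        self.rows = []   # value number -> (value, canonical name)
        self.index = {}  # value -> value number (the redundant part)

    def lookup(self, value):
        """Return the value number for `value`, or None if it is not in the table."""
        return self.index.get(value)

    def add(self, value, name):
        """Append a new row and return its value number."""
        num = len(self.rows)
        self.rows.append((value, name))
        self.index[value] = num
        return num


t = LVNTable()
n = t.add(("add", 0, 1), "sum")
hit = t.lookup(("add", 0, 1))   # same number as n
miss = t.lookup(("mul", 0, 1))  # None: not seen yet
```

The extra storage is one dictionary entry per row, which is small next to the cost of a linear scan for every instruction in a long block.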

Based on LVN, I implemented all three: (1) common subexpression elimination, (2) copy propagation, and (3) constant propagation/folding. (1) only needs to follow the LVN algorithm, check whether the value has been computed before, and replace it if so. (2) is also straightforward: it only needs to find the initial definition of the variable. (3) is a little bit tricky. What I do is write specific computation rules for each arithmetic operation, but this is very inconvenient, and the if-else cases take up almost a third of the code. Also, to make constants pass through the program from top to bottom, I iterate the algorithm until the program no longer changes, just like what we did in DCE.

The attached pseudocode is actually a simplified version of the LVN algorithm. In my implementation, I pay more attention to the following things:

Function arguments should be also added to the LVN table before traversing the instructions.
For constructing a fresh variable name, I recorded a global counter and attached that number behind the variable names in order to prevent further name conflicts. For example, the original variable name is x, then the new name will be x_new0.
If a definition will be overwritten later, we should update not only the argument names but also the later instructions (before the next definition) that use this variable. Otherwise, we may not find the requested variable in the var2num dict.

The following shows the test status of the benchmarks from examples/test/lvn.

| Benchmark | Status | Comments |
| --- | --- | --- |
| `clobber-fold` | ✔️ | |
| `clobber` | ✔️ | Same as `clobber-fold` |
| `commute` | ✔️ | |
| `divide-by-zero` | ✔️ | |
| `fold-comparisons` | ✔️ | Original test incorrect |
| `idchain-nonlocal` | ✔️ | Control flow (jmp) + DCE |
| `idchain-prop` | ✔️ | |
| `idchain` | ✔️ | Same as `idchain-prop` |
| `logical-operators` | ✔️ | |
| `nonlocal` | ✔️ | Control flow (jmp) + DCE |
| `reassign` | ✔️ | |
| `redundant-dce` | ✔️ | |
| `redundant` | ✔️ | |
| `rename-fold` | ✔️ | |

Correctness

Apart from the above tests given in the bril repository, I also wrote additional test cases to test the correctness of my algorithms. The result of LVN is passed to DCE to do further optimization. All the cases are tested using turnt, including the trickier example given in class.

Here I give another example to show how my passes work, and it should cover all the cases in LVN and DCE.

@main {                                        
  x: int = const 14;                           
  y: int = const 2;                            
                                               
  # constant propagation                       
  div_xy: int = div x y;                       
                                           # lvn
  # copy propagation                       @main {
  div_xy2: int = id div_xy;                  x: int = const 14;                        
                                             y_new0: int = const 2;            # lvn + dce        
  # dead code                                div_xy: int = const 7;            @main {
  add_xy: int = add x y;                     div_xy2: int = div div_xy;          y_new1: int = const 8;
                                             add_xy: int = const 16;             print y_new1;
  # common subexpression elimination         mul_xy: int = const 49;             y: int = const 10;
  mul_xy: int = mul div_xy div_xy2;          y_new1: int = const 8;              print y;
                                             print y_new1;                       sub_xy: int = const -39;
  # reassignment                             y: int = const 10;                  print sub_xy;
  y: int = const 8;                          print y;                            equality: bool = const true;
  print y;                                   sub_xy: int = const -39;            print equality;
                                             print sub_xy;                     }
  # double reassignment                      result: int = const -39;                  
  y: int = const 10;                         equality: bool = const true;              
  print y;                                   print equality;                           
                                           }
  # further propagation                        
  sub_xy: int = sub y mul_xy;                  
  print sub_xy;                                
                                               
  # constant folding                           
  result: int = const -39;                     
  equality: bool = eq sub_xy result;           
  print equality;                              
}

Using this testbench, I also found a subtle mistake that I should use // to implement integer division and / to implement floating-point division. We do not record data types in the LVN table, but it indeed matters when doing constant folding.

For control flow testing, I used the diamond.bril case as shown below. Actually, the jmp operations can be further eliminated, which I leave as future work.

@main {
  a: int = const 47;           
  cond: bool = const true;     # after optimization
  br cond .left .right;        @main {
.left:                           jmp .left;
  a: int = const 1;            .left:
  jmp .end;                      a: int = const 1;
.right:                          jmp .end;
  a: int = const 2;            .end:
  jmp .end;                      print a;
.end:                          }
  print a;                     
}

5 replies

chhzh123 Feb 10, 2022

I should use dataflow analysis for CFG later :)

sampsyo Feb 11, 2022
Maintainer Author

Interesting idea to do some CFG-based dead code elimination! Removing dead blocks (which are never jumped to) is an important global simplification. I have a couple of questions about how this works:

Does your analysis track "fall-through" control flow? That is, if there are two blocks right next to each other without a jmp/br to connect them, does the second block still get considered "live"?
Once you've eliminated the dead blocks, how do you know whether assignments are dead (guaranteed to be overwritten in the future)? The read to a given variable could be several jumps away from the write, so (without doing something dataflow-like), this would seem to require analyzing all CFG paths forward from the write.

sampsyo Feb 11, 2022
Maintainer Author

Ah, I suppose I also have another comment about this:

Using this testbench, I also found a subtle mistake that I should use // to implement integer division and / to implement floating-point division. We do not record data types in the LVN table, but it indeed matters when doing constant folding.

It's true that, in Python, you have to take care to use integer division when you want to model the run-time semantics of div. However, I don't think you need to know the types of the values. If a Bril program ever wants FP division, it must use the fdiv opcode—not div. So you always know which division operator to use just by looking at the instruction itself, not the operand types.
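As a sketch of that point, a folding helper can pick the Python operator from the opcode alone. `fold_division` is a hypothetical name; leaving division by zero unfolded is a conservative choice on my part here, not something the Bril spec is being quoted on.

```python
def fold_division(op, a, b):
    """Fold a division whose operands are known constants.
    Returns the folded constant, or None to leave the instruction alone."""
    if b == 0:
        return None    # don't guess at run-time behavior; keep the instruction
    if op == "div":
        # Integer division. Python's // rounds toward negative infinity, so
        # results for negative operands are worth checking against the interpreter.
        return a // b
    if op == "fdiv":
        return a / b   # floating-point division
    raise ValueError(f"not a division opcode: {op}")


int_result = fold_division("div", 14, 2)         # 7
float_result = fold_division("fdiv", 7.0, 2.0)   # 3.5
unfolded = fold_division("div", 1, 0)            # None
```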

sampsyo Feb 11, 2022
Maintainer Author

I guess one way of putting my concern about the "block chaining" idea is this program:

@main(cond: bool) {
  br cond .left .right;
.left:
  a: int = const 4;
  jmp .end;
.right:
  a: int = const 2;
  jmp .end;
.end:
  print a;
}

…which your dce.py optimizes to:

@main(cond: bool) {
  br cond .left .right;
}

So that's a bit too aggressive in two different ways (we lost our print altogether, and we now try to jump to labels that have been deleted).

chhzh123 Feb 11, 2022

Does your analysis track "fall-through" control flow? That is, if there are two blocks right next to each other without a jmp/br to connect them, does the second block still get considered "live"?

Yes, I follow the connections between blocks and do a depth-first search. If a block is another's immediate successor, DFS will find a path to it. Once a block is reached by DFS, it is considered alive.

Once you've eliminated the dead blocks, how do you know whether assignments are dead (guaranteed to be overwritten in the future)? The read to a given variable could be several jumps away from the write, so (without doing something dataflow-like), this would seem to require analyzing all CFG paths forward from the write.

I think dead assignments should be eliminated by another pass? Simply removing dead blocks cannot eliminate variables that are dead but sit in a live block, and I indeed did not consider this case in my implementation.

…which your dce.py optimizes to:
@main(cond: bool) {
br cond .left .right;
}

Oh, this is not the expected behavior. In your piece of code, since we do not know whether cond is true or false at compile time, we should assume both branches can be reached, so there are no dead blocks in this case, and the generated code should remain the same. I'll fix it later.

anshumanmohan · 2022-02-10T20:14:46Z

anshumanmohan
Feb 10, 2022

My code is here.

My implementation of TDCE is straightforward. As suggested, I define "one pass" variants of the two strategies, compose those, and then iterate the composition to convergence. I do not support jumps and branches. Instead of throwing an error, I check for jumps and branches and, on finding any, correctly do nothing.
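The overall shape described here, one-pass variants composed and iterated to convergence with a guard for unsupported control flow, might look like this. It is a sketch, not the linked code: `optimize`, `has_control_flow`, and the stand-in pass `drop_unused_once` are hypothetical, and each pass is assumed to return a new list rather than mutate its input.

```python
def has_control_flow(instrs):
    """True if the function contains labels, jumps, or branches."""
    return any("label" in ins or ins.get("op") in ("jmp", "br") for ins in instrs)


def drop_unused_once(instrs):
    """Stand-in pass: one sweep removing assignments that are never read."""
    used = {arg for ins in instrs for arg in ins.get("args", [])}
    return [ins for ins in instrs if "dest" not in ins or ins["dest"] in used]


def optimize(instrs, passes):
    """Run `passes` in order, repeatedly, until a full round changes nothing."""
    if has_control_flow(instrs):
        return instrs  # unsupported input: leave the program untouched
    while True:
        before = instrs
        for run_pass in passes:
            instrs = run_pass(instrs)  # each pass must return a new list
        if instrs == before:
            return instrs


straight = [
    {"op": "const", "dest": "a", "type": "int", "value": 1},
    {"op": "id", "dest": "b", "type": "int", "args": ["a"]},
    {"op": "id", "dest": "c", "type": "int", "args": ["b"]},
    {"op": "print", "args": ["a"]},
]
branchy = [{"op": "jmp", "labels": ["end"]}, {"label": "end"}]
optimized = optimize(straight, [drop_unused_once])  # only `a` and the print remain
untouched = optimize(branchy, [drop_unused_once])   # returned as-is
```

In a real pipeline the `passes` list would hold the local and global one-pass variants, and later the LVN pass, in whatever order they are composed.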

I tested my implementation by running it on the suite of reference examples. There were a few programs that I did a "better" or comparable job on (modulo alpha renaming) so I used turnt's --save utility to clobber the expected values.

My implementation of LVN is organized in a similar way; I compose LVN with TDCE and iterate the composed method until convergence. I used a DataFrame for my table, which was a bit clunky and may have been a mistake. Anyway I got through it with a few getters and setters. As an extension to the basic algorithm, I expose the semantics of arithmetic, logic, identity, copying, and const, and realize profit.

Again, I tested my implementation against the suite of reference examples for TDCE and LVN. Where appropriate, I clobbered the expected .out values with my own. Introducing constant-folding and exposing the semantics of arithmetic collapses many of the previous examples into two-liners.

A few problems I ran into:

Introducing constant-folding and exposing the compiler to math rules raised the interesting question of what one should do when tasked with illegal math, e.g. division by zero. I currently deal with instructions that divide by zero by leaving them alone.
Composing and iterating compiler passes means that it's sometimes a bit tricky to see how different passes are working independently. Constant-folding in particular "magics away" lines of code that would have otherwise been evidence of other correctly-implemented optimizations. I believe my implementations to be correct because I worked on constant-folding towards the end, and so I had already tested other features against my .out files. However, this is clearly not a good long-term solution. In a similar vein, a few examples such as logical-operators.bril and fold-comparisons.bril are handled correctly by LVN, but then, before anyone can admire this, the code is cleaned up aggressively by TDCE. I made sure to get them right by turning TDCE off temporarily. However, the .out files I've submitted look rather sparse. I rejected a grungy solution involving many sets of runner files, test directories, and turnt.toml files, but, on glancing through some of the discussion above, I can see there's a better way using arguments etc.

2 replies

sampsyo Feb 11, 2022
Maintainer Author

Wow, interesting! By DataFrame, do you mean from pandas? That does seem nice because you can get convenient printing, but I can imagine it getting a little clunky (in my experience, pandas is always a little clunky, even when used as intended).

Leaving divide-by-zero instructions alone seems like a good idea—or perhaps the only idea, if you want your optimization to be correct in the sense that it always preserves program semantics (pending sampsyo/bril#139).

anshumanmohan Feb 11, 2022

Yup, from pandas! Tabulate was great for printing. I'm not particularly familiar with Python (using it in this course to get myself to learn) so it's all pretty clunky haha

charles-rs · 2022-02-10T21:24:48Z

charles-rs
Feb 10, 2022

My implementation is here

TDCE:

For dead code elimination, i wrote a function that collects all of the "used" variables, and then removes any instruction that assigns into a variable that isn't used, and then looped it until there was no update. I didn't worry about dead code wrt control flow, since that will be dealt with by LVN

CFG:

I made a CFG of basic blocks so that I could work on those, and then reconstructed the program. I also kept track of the initial order of the basic blocks, since there can be "fall through" when there is something like:

.lbl1:
       a :int = add v1 v2;
.lbl2:
      b :int = add v1 v3;

The reconstruction deals with removing dead code "between" blocks, but it isn't smart enough to get rid of blocks that are never jumped to. Oh well.

After making the CFG stuff (and a function that printed DOT code so I could check if it was right), I made a mapping function that took an "optimization" function as an argument, and then applied it to every basic block in the program, finally reconstructing them all in order. Nothing too fancy here; it would probably be nice to get rid of/merge labels when possible, so that we can have big basic blocks, but this didn't really seem to be an issue.

LVN

This was when I implemented LVN (didn't realize till lecture today that I didn't really need the CFG, but oh well, head start for next week). Basically just did the algo from the lecture, and then added copy propagation by checking if the op was id and then reusing the same value number. Constant folding was also pretty straightforward: in my translation from insn -> value, I checked if all the args were constant and then just returned the constant. Since I didn't get it quite right at first, I ended up with some funny programs like this:

a :int = const 1;
b :int  = const 5;
c : int = mul a b;

being optimized to

a :int = const 1;
b :int = const 5;
c :int = id b;

but I fixed it pretty easily. This was probably the most tedious part, as I had to write out a mini-interpreter for Bril insns, but that was only like 15 lines or something. I also added a check for instructions with two of the same argument, since there are sometimes special things to do: for instance, x - x = 0, and x < x is always false. Leveraged divide by zero being undefined behavior a couple times here: 5/0 will be 0, but 0/0 might be 1 sometimes, depending on the order of things.
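The same-argument check can be table-driven. This is a Python sketch rather than the Lisp implementation described here; `SAME_ARG_RESULT` and `simplify_same_args` are hypothetical names, the rules are for the integer ops only, and `div x x` is left out on purpose because x may be zero.

```python
# Result of `op x x` for integer operands, whatever the value of x.
SAME_ARG_RESULT = {"sub": 0, "eq": True, "le": True, "ge": True, "lt": False, "gt": False}


def simplify_same_args(op, arg_nums):
    """If both operands share a value number, return the constant that
    `op x x` always produces; otherwise return None."""
    if len(arg_nums) == 2 and arg_nums[0] == arg_nums[1]:
        return SAME_ARG_RESULT.get(op)
    return None


zero = simplify_same_args("sub", [3, 3])      # 0
never = simplify_same_args("lt", [3, 3])      # False
no_rule = simplify_same_args("add", [3, 3])   # None: x + x has no fixed result
different = simplify_same_args("sub", [3, 4]) # None: operands differ
```

Callers need to test the result with `is None`, since `0` and `False` are legitimate folded constants.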

I also dealt with commutative operations by sorting the values.
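Sorting the operand value numbers gives commutative expressions a single canonical key. This is a small Python sketch of the idea, not the linked Lisp code; the contents of `COMMUTATIVE` and the name `canonical_value` are assumptions.

```python
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}


def canonical_value(op, arg_nums):
    """Build the LVN value tuple, ordering operands of commutative ops."""
    if op in COMMUTATIVE:
        arg_nums = sorted(arg_nums)
    return (op, *arg_nums)


same = canonical_value("add", [2, 1]) == canonical_value("add", [1, 2])   # True
kept = canonical_value("sub", [2, 1]) == canonical_value("sub", [1, 2])   # False
```

With this key, `add a b` and `add b a` land on the same table row, while `sub` keeps its operand order.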

Testing strategy

Optimizations should preserve correctness, so I used turnt to check that the output of the program stayed the same after optimizing. This was pretty convenient, and only the divide by zero one changed, which again, I expect. Then, to see if I'm actually doing anything, I made a separate script that applies the opt, and then puts it through bril2txt, and i just looked at all the files, and didn't see anything that wasn't being optimized that should have, so i'll call that a win.

One difficulty i ran into was that the turnt/bril workflow is very different from the LISP workflow, the latter of which is very interactive and you do pretty much everything in a repl. I ended up loading in a few bril programs to test with, but in the end i felt a little torn between working in the REPL and then running all the tests with turnt, but by the end i felt pretty efficient.

2 replies

sampsyo Feb 11, 2022
Maintainer Author

Very cool; thanks! One small note on the whole CFG "reconstruction" business: preserving order is a perfectly good way to do it, but another one is to just insert jmps to implement fall-through. Then the order doesn't matter, and all control transfers are explicit. If that's something that appeals.
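A sketch of that alternative: after splitting into blocks, append a `jmp` to any block that does not already end in a terminator. It assumes blocks are `(label, instrs)` pairs in program order and that every block has a label; `add_explicit_jumps` is a hypothetical name, and closing the final block with a bare `ret` assumes a function with no return value.

```python
TERMINATORS = ("jmp", "br", "ret")


def add_explicit_jumps(blocks):
    """Return new blocks in which every control transfer is explicit."""
    result = []
    for i, (label, instrs) in enumerate(blocks):
        instrs = list(instrs)  # copy, so the input blocks are left untouched
        if not instrs or instrs[-1].get("op") not in TERMINATORS:
            if i + 1 < len(blocks):
                # Fall-through becomes a jump to the block that used to follow.
                instrs.append({"op": "jmp", "labels": [blocks[i + 1][0]]})
            else:
                # Falling off the end of the function becomes an explicit return.
                instrs.append({"op": "ret", "args": []})
        result.append((label, instrs))
    return result


blocks = [
    ("lbl1", [{"op": "add", "dest": "a", "type": "int", "args": ["v1", "v2"]}]),
    ("lbl2", [{"op": "add", "dest": "b", "type": "int", "args": ["v1", "v3"]}]),
]
explicit = add_explicit_jumps(blocks)  # lbl1 now ends with `jmp .lbl2`
```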

Way to exploit the pending divide-by-zero semantics issue in sampsyo/bril#139, BTW.

sampsyo Feb 11, 2022
Maintainer Author

As a slightly broader discussion on the inconvenience with Turnt: I feel the need to note that some of this friction is intentional. Part of the idea of Turnt-style "snapshot" testing is to force, or at least strongly encourage, a Unixy interface: tested tools kinda have to be able to take in a text file, do their work, and produce a new self-contained text file. One can have a debate about the positives and negatives of architecting compiler-like tools this way, but one undeniable advantage is that it makes for clear boundaries at which to reason about correctness (program equivalence). And it means that testing commands are slightly "self-documenting," in that they tell humans exactly what command line to run to invoke the compiler. So despite the inconvenience, I'm glad the workflow forced you to break out of typing manually at the Lisp REPL and make fully automated command-line wrappers.

alaiasolkobreslin · 2022-02-11T01:09:31Z

alaiasolkobreslin
Feb 11, 2022

DCE

My DCE implementation is here. I pretty much just implemented the pseudocode we saw in class. I tested my implementation on the examples in the tdce directory using turnt. I had to update a few of the .out files, as some of them did not remove all dead code (for example, one .out program only had one pass of DCE performed). I also verified that the code from the dce output had the same behavior as the original code for each example by running the Bril interpreter.

LVN

My LVN implementation is here. I had to use some data structures beyond what was described in the pseudocode to get things working: in addition to a table mapping value tuples to canonical variables, I also kept a list called table_list which stores canonical variables indexed by value numbers. This made it easier to retrieve a replacement for an argument based on the value number found in var2num. Something else worth mentioning is that with every definition instruction encountered, I added an entry to table with key ('id', num) and value (num, dest). This is so that when an id instruction is encountered using that same variable as the argument, the value tuple associated with that instruction can easily be looked up in the table. This was the trick I used to make copy propagation work.
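The ('id', num) trick can be sketched by looking only at the numbering step. This is an illustrative reconstruction, not the linked code: `number_block` is a hypothetical name, and the sketch assumes every variable is defined in the block before it is read and is assigned only once.

```python
def number_block(block):
    """Assign value numbers to the variables defined in one basic block."""
    table = {}       # value tuple -> (value number, canonical variable)
    table_list = []  # value number -> canonical variable
    var2num = {}     # variable -> value number
    for instr in block:
        dest = instr.get("dest")
        if dest is None:
            continue
        if instr["op"] == "const":
            value = ("const", instr["type"], instr["value"])
        else:
            value = (instr["op"], *(var2num[a] for a in instr.get("args", [])))
        if value in table:
            num, _ = table[value]
        else:
            num = len(table_list)
            table_list.append(dest)
            table[value] = (num, dest)
            # The trick: a later `id` of this variable builds the key ('id', num)
            # and therefore lands on this same row.
            table[("id", num)] = (num, dest)
        var2num[dest] = num
    return var2num, table_list


block = [
    {"op": "const", "dest": "a", "type": "int", "value": 1},
    {"op": "id", "dest": "b", "type": "int", "args": ["a"]},
    {"op": "id", "dest": "c", "type": "int", "args": ["b"]},
]
var2num, table_list = number_block(block)  # a, b, and c all share number 0
```

Because `b` and `c` resolve to the same number as `a`, any later use of them can be rewritten to `table_list[0]`, which is `a`, and the whole id chain becomes dead code.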

To test my implementation, I used turnt on many of the examples in the lvn directory. I set turnt to run lvn and then dce. I tested my implementation on the examples below, which all resulted in code with fewer instructions but the same behavior. I did not test my code on all examples in the lvn directory since I did not implement any additional optimizations.

clobber.bril
idchain.bril
reassign.bril
redundant-dce.bril
redundant.bril

Discussion

Some challenging parts of this assignment included figuring out how to tweak my algorithm to get copy propagation working as I mentioned above. Other than that, it took me a while to get turnt working since I didn't realize that comments in a bril program had an effect on turnt. I really would have liked to implement constant folding and common subexpressions with commutativity if I had more time.

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Great! Good idea, in general, to "duplicate" the data structure for the table so you can index into it in both necessary ways.

And thanks for the feedback about embedded comments controlling the way Turnt works. I would like to document that a bit more clearly in the future… 6120 may, in fact, need a standalone lesson on debugging/testing techniques.

JonathanDLTran · 2022-02-11T03:05:12Z

JonathanDLTran
Feb 11, 2022

Links

Here is my implementation of dce and here is my implementation of lvn. Tests are at dce-tests, lvn-tests and lvn-dce-tests.

Summary

I implemented trivial DCE with both the local deletion of overwritten instruction destinations within a basic block, and also added global deletion of unused variables. To run these, you can use python3 dce.py for both optimizations, and python3 dce.py -l for local and python3 dce.py -g for global.

I also implemented local value numbering with constant propagation, copy propagation, some basic simplification as an attempt to do common subexpression elimination and constant folding. All these optimization attempts are run in a single pass, and I did not add flags to individually enable these optimizations.

For constant propagation, every time a generated LVN value was a constant, instead of returning the variable, I replaced it with a constant. For copy propagation, every time an identity expression is generated, I replace it with the previous variable it refers to. For common subexpression elimination, I implemented a simple approach to commutative expressions like addition or multiplication. This allows add a b to be judged the same as add b a. In the case of constant folding, I did simple interpretation. If all of the expression's arguments were already constants, I evaluated the expression completely. Otherwise, I returned the same expression, with no constants substituted.

Testing

For testing, I first manually checked that the generated code and any printed output were correct. I then ran these tests using turnt to guard against regressions as more changes were made.

Some of the tests are smaller and attempt to test certain features, such as variables that are not used or variables that are immediately overwritten. Other tests combine multiple optimization opportunities. The lvn-dce-tests in particular leverage both lvn and dce opportunities, and run the lvn, and then the dce pass, in order to clean up the code. The results of these tests can be found in the directories dce-tests, lvn-tests, and lvn-dce-tests.

Optimization

To check for optimization, I ran tests in the lvn-dce-tests directory. I paid attention to whether the output was smaller than the input in terms of dynamic and static instructions. In general, I was pleased with the amount of redundancy that was eliminated.

An example of this is the original program:

# ARGS: 1
@main(x: int) {
    x: int = add x x;
    x: int = const 4;
    x: int = const 5;
    y: int = id x;
    z: int = id y;
    w: int = id z;
    print w;
    a: int = add x y;
    b: int = add y x;
    print a;
    print b;
    c: int = mul x y;
    d: int = mul y x;
    e: int = add c d;
    f: int = mul e e;
    g: int = sub f c;
    print g;
    g: int = const 5;
    print g;
    f: int = const 4;
    print f;
    e: int = const 3;
    print e;
}

which can have constant propagation, copy propagation, constant folding and common subexpression elimination applied. There are also examples of dead code and overwritten variables.

The result is:
@main(x: int) {
  x_2: int = const 4;
  x: int = const 5;
  print x;
  a: int = const 10;
  print a;
  print a;
  g_5: int = const 2475;
  print g_5;
  print x;
  print x_2;
  e: int = const 3;
  print e;
}

in which many of the assignments are removed, and the constants are able to be propagated effectively. In particular, only 5 of the statements are non-print statements while the remainder are print statements and cannot be eliminated.

Difficulties

Some of the difficulties I had included creating an unsound procedure for both dce and lvn, dealing with arguments in LVN, as well as not dealing with division by 0.

For dce, when I split into basic blocks, I removed all variables as dead at the end of a basic block. I realized these variables could persist past the end of a basic block and fixed the problem appropriately. In the case of lvn, I failed to take into account that variables can be reassigned multiple times. I changed this by creating unique variable names, by adding on a numerical index to the variable. I also failed to deal with arguments in LVN correctly, initially. I later added each argument into the LVN table before running the LVN optimization, to overcome this problem. For division by 0, I initially allowed the lvn pass to crash when the divisor was 0. After the question raised in class, I changed this so that the compiler would not raise an error or crash, and would instead just not optimize that specific instruction where the division by 0 occurs. The division by 0 will occur at runtime.
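
A minimal sketch of the divide-by-zero guard described above, assuming a hypothetical helper `fold_div` that the folding step calls for `div`. Rounding toward zero for negative operands is also an assumption about the interpreter's integer division, not something stated in this post.

```python
def fold_div(a, b):
    """Try to fold `div a b` at compile time.
    Returns None when the divisor is zero, so the caller leaves the
    instruction untouched and the error still happens at run time."""
    if b == 0:
        return None
    quotient = abs(a) // abs(b)
    # Assumption: integer division truncates toward zero.
    return quotient if (a < 0) == (b < 0) else -quotient
```

Returning a sentinel rather than raising keeps the pass itself from crashing while preserving the original program's behavior.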

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Thanks for the detailed writeup! I like your recap of the realization about variables that remain live at the end of a basic block; needless to say, this is something that only matters after you scale up your local analysis to run on multi-block programs. 👏 And good job dealing with the divide-by-zero issue; it's good that your optimization preserves the program's (error-triggering) semantics.

atucker · 2022-02-11T03:47:44Z

atucker
Feb 11, 2022

Testing

I extended the testing from the example tests by creating a script that runs all the bril programs in a directory, writes their outputs in a new directory, copies the programs to that directory, and creates a turnt config to show that transforming them with dce or lvn still works. After some light editing to give things arguments and such, this shows that on the example programs my dce and lvn implementations produce programs with the same outputs.

I also started some work to store the profile output, but still have to write the python code that runs the transformed programs to show that they always decrease the number of dynamic instructions.

Further testing is described in the relevant sections for DCE and LVN.

DCE

My DCE implementation is here. It's very basic and only eliminates the lines that never get used, without doing anything to try to look at the CFG or remove code that way.

It has two parts, one which eliminates unused variables, and another which eliminates redundant assigns within a basic block. The first is run repeatedly until there are no changes, and the second is run once at the end.

I think that it is correct because it matches the example test code outputs, except for on double-pass.out and reassign.out, where my implementation removes a bit more code.

LVN

My LVN implementation is here.

I did copy propagation and common subexpression elimination from class, with a few extensions:

Rearranging arguments in commutative operations into a canonical order
Constant propagation
Constant folding
Special-casing some of the constant folding so that if (for instance) you multiplied a variable argument by 0, it simplifies it to 0, rather than saying that the expression could be anything. The fanciest feature here is that (for instance) adding 0 to a variable argument can be turned into an id lookup of that variable.
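
The special cases in the list above could be sketched along these lines. The names `simplify` and `is_int_zero` and the return convention are hypothetical; `consts` is assumed to map a variable name to its known constant, with unknown variables simply absent.

```python
def is_int_zero(c):
    # Python treats False == 0 as true, so check the type explicitly.
    return type(c) is int and c == 0


def simplify(op, args, consts):
    """Return ("const", value), ("id", variable), or None when no
    special case applies. Only binary operations are handled."""
    if len(args) != 2:
        return None
    a, b = args
    ca, cb = consts.get(a), consts.get(b)
    if op == "mul" and (is_int_zero(ca) or is_int_zero(cb)):
        return ("const", 0)
    if op == "add":
        if is_int_zero(ca):
            return ("id", b)
        if is_int_zero(cb):
            return ("id", a)
    if op == "and":
        if ca is True:
            return ("id", b)
        if cb is True:
            return ("id", a)
        if ca is False or cb is False:
            return ("const", False)
    if op == "or":
        if ca is False:
            return ("id", b)
        if cb is False:
            return ("id", a)
        if ca is True or cb is True:
            return ("const", True)
    return None
```

Because the check only needs the instruction's arguments and the constants already known, it fits in the step where the value for an instruction is constructed.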

Getting basic LVN to work was pretty challenging until I realized that I was making a horrible mistake in putting (id num) into the table ever, since that goes against the whole spirit of trying to maintain a mapping from variable names to their "real" value. This resulted in a bunch of horrible messes of trying to unravel the id lookups while rewriting the instruction, rather than just knowing that every value that the main program is seeing is simplified, and that its table index is either to the effect operation where it gets defined, or to the first place that that constant gets used.

After that, constant propagation was pretty tricky, since it meant not just knowing that a value was already in the table, but knowing whether it was the result of an effectful operation, or if it was a constant.

Another challenge was self-inflicted -- to try to future-proof against all of the issues where trying to evaluate code on the constants could cause an error (which honestly might only be division by zero...) I initially put a lot of my code for constant folding inside a try/except Exception block. This meant that a lot of things silently failed, until I restricted the scope of that try/except to only one line that I knew worked.

With that all in place, the commutativity checks, constant folding, and special-casing felt straightforward because they all can be implemented in the step where we construct the value of an instruction, since all the values that you need are in the table and you don't really need to mutate anything or deal with control flow.

The constant folding shows up in many of the tests (basically everything that says fold) and produces reasonable outputs, and the special casing is best showcased in logical-operators.bril.

@main(arg1: bool, arg2: bool) {      ->     @main(arg1: bool, arg2: bool) {
  ...                                         ...
  no_fold1: bool = and t arg1;                no_fold1: bool = id arg1;
  no_fold2: bool = and arg1 t;                no_fold2: bool = id arg1;
  no_fold3: bool = or f arg1;                 no_fold3: bool = id arg1;
  no_fold4: bool = or arg1 f;                 no_fold4: bool = id arg1;
  no_fold5: bool = and arg1 arg2;             no_fold5: bool = and arg1 arg2;
  no_fold6: bool = or arg1 arg2;              no_fold6: bool = or arg1 arg2;
  no_fold7: bool = not arg1;                  no_fold7: bool = not arg1;
}                                           }

Overall, the changes between the original test code outputs and the result of my code can be found in this git commit: atucker/bril@88c504f

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Awesome summary; thanks!! I liked hearing about this realization especially much:

Getting basic LVN to work was pretty challenging until I realized that I was making a horrible mistake in putting (id num) into the table ever, since that goes against the whole spirit of trying to maintain a mapping from variable names to their "real" value

Indeed—that's a very subtle tweak but a crucial one to making the whole value-numbering "philosophy" pay off.

michaelmaitland · 2022-02-11T03:48:33Z

michaelmaitland
Feb 11, 2022

DCE

I implemented DCE using a trivial global analysis that removed unused variables and a trivial local analysis that removed variables that were overwritten before they were used.

Testing

There were two approaches used to gain confidence and understand the flaws of my solution. The first was to modify the turnt.toml file to use my DCE transformation in the examples/test/tdce directory. This helped me find and fix bugs in my solution until all tests but two passed. Upon manual inspection, those two tests were failing because the optimization I was doing was stronger than the optimization in the expected output. For these two tests, I compared the output of the original program to the output of the transformed program in the interpreter.

The second way I tested this transformation was by modifying the turnt.toml file in the benchmarks directory to emit bril code that was transformed by my DCE pass to the Bril interpreter. This helped me understand flaws in my solution such as division by zero. It helped me gain confidence that other tests were passing.

LVN

I implemented LVN which performed CSE and commutative CSE. In order to get commutative CSE working I sorted arguments for commutative operations using the python sorted function so that commutative arguments were always compared in an order that checked for commutative equality. I noticed that the sample solution only supported add and mul for commutativity, but I believe this can be extended to eq and and as well since they operate on primitive boolean values and there is no worry about something short circuiting (for and).

Testing

I tested this transformation in a similar way to testing DCE. I first used turnt against the test/lvn files. I got all the tests to pass except those that were meant to have constant prop, copy prop, and constant folding done on them. I did compare the output of my transformation vs the original for these examples to make sure I did not break anything. One problem I did have during this process was that my renaming of registers differed from the renaming of registers that the expected output had. I decided to conform to this standard with one extra stipulation: the lvn.X register generated could not be a register anywhere else in the program. Since lvn.X is a register that could be used by a human programmer, it's possible (although unlikely) that they chose to name a register with the same name at some other point in the program. If this occurs, I rename it as lvn.X + 1 until the condition holds that it does not exist elsewhere in the program. For example:

lvn.0 = 1 <=== generated by transformation (my transformation would try to rename it to lvn.1 instead)
b = 2
...
lvn.0 = 2 <=== user defined variable
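
The collision check described above can be sketched as a small helper. The name `fresh_name` is hypothetical, and `used_names` is assumed to hold every variable name that appears anywhere in the program.

```python
def fresh_name(used_names, prefix="lvn."):
    """Return the first `lvn.N` name that does not already occur in the
    program, and record it so later calls never hand it out again."""
    n = 0
    while f"{prefix}{n}" in used_names:
        n += 1
    name = f"{prefix}{n}"
    used_names.add(name)
    return name
```

With the example above, `lvn.0` is already taken by the user, so the first generated name would be `lvn.1`.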

I also tested against the benchmark programs and passed many of those tests. I did suffer from issues like divide by zero and the situation where variables were defined in one basic block with a use in another.

Discussion

Some difficulties I had were with division by zero and handling variables defined by other basic blocks that were not defined in the current block. For DCE, I removed instructions that divided by zero if they were unused throughout the entire function or if the variable that stored the result of division by zero was redefined before a use. A similar thing occurred in LVN. This is still a problem that is not addressed by my solution.

With regard to division by zero, I wonder what production compilers have to deal with, as there are many more side effects than just division by zero. For example, programs that call out to the OS seem like another class of side effects. A compiler writer must be aware of these types of side effects when implementing optimizations. It's clear that a good baseline test suite is beneficial to a compiler writer.

In addition to this, I had some difficulty getting LVN to work with variables that were defined in another basic block and used in another basic block before another definition. This was not something I originally accounted for in the pseudo-code and I did not realize this was a problem until the very end.

Next, testing was a bit difficult because the existing solutions for DCE and LVN had slightly different outputs than mine. In the future I plan to look at the example outputs to see what the optimizations are doing so that I can conform to their conventions and reuse those tests. Another thing that was difficult with regard to testing was understanding when my optimizations were stronger than the expected output and when the optimizations were weaker than the expected outputs (or wrong).

2 replies

sampsyo Feb 11, 2022
Maintainer Author

Wahoo! So glad you tried this out on all the benchmarks:

The second way I tested this transformation was by modifying the turnt.toml file in the benchmarks directory to emit bril code that was transformed by my DCE pass to the Bril interpreter. This helped me understand flaws in my solution such as division by zero. It helped me gain confidence that other tests were passing.

Using our amassed collection of code here seems like a wonderful way to gain some confidence that any optimization you have, regardless of its optimization potential, is at least correct.

And nice work being careful about naming collisions.

sampsyo Feb 11, 2022
Maintainer Author

Issues about divide-by-zero (and the general issue of side effects in optimizations) are indeed pretty tricky for all compilers. The natural way to deal with them, in many compilers, is to carefully consider that the possibility of an exception is part of the semantics of a program—so you can almost never delete or move a div instruction unless you can prove that the divisor is guaranteed to be nonzero. And then there's the wild world of C/C++, where undefined behavior reigns: there, you can assume that the divisor is always nonzero and optimize accordingly, including totally breaking the program if it's guaranteed to divide by zero.

andrewb1999 · 2022-02-11T04:31:53Z

andrewb1999
Feb 11, 2022

DCE

My DCE pass implements the trivial dce discussed in class, including both global and local trivial dce and iterating to convergence. The trivial dce was tested using a set of benchmarks and turnt that cover a number of different global and local dce cases. I also spot checked these tests to ensure the results are as expected.

LVN

My LVN pass performs commutative CSE and copy propagation. These options can be enabled with the -c and -p flags respectively. I started working on an implementation of constant folding, but did not have enough time to finish due to difficulty handling different literal types in rust. LVN is tested with another set of benchmarks and tests. Some of these benchmarks do not work currently because I did not finish constant folding.

While it is difficult to ensure that my LVN pass works in all circumstances, I spot checked a number of the benchmarks. For example, we can see the code below is optimized correctly after being passed through the tdce pass.
Before:

@main {
  a: int = const 4;
  b: int = const 2;

  # (a + b) * (a + b)
  sum1: int = add a b;
  sum2: int = add a b;
  prod1: int = mul sum1 sum2;

  # Clobber both sums.
  sum1: int = const 0;
  sum2: int = const 0;

  # Use the sums again.
  sum3: int = add a b;
  prod2: int = mul sum3 sum3;

  print prod2;
}

After:

@main {
  a: int = const 4;
  b: int = const 2;
  lvn.2: int = add a b;
  prod1: int = mul lvn.2 lvn.2;
  print prod1;
}

Discussion

The LVN implementation definitely took some time to implement in a real program, even when referencing the pseudo code. However, by far the most challenging aspect of this assignment for me was constant folding, because it requires actually accessing and manipulating constant literals. The current rust API is not very well suited to this task, as rust is strongly typed and will not allow you to pass a lambda that can act on multiple different literal types (ints and floats for example). I believe the proper way to implement this in rust is to extend the current API with trait implementations for each of the possible operators, allowing the user to just call the * operator on a Literal struct. I started working on this some, but it is not a small amount of work. I still don't have that much practice using rust in real programs, so this issue might just be from my lack of experience.

6 replies

andrewb1999 Feb 11, 2022

The issue comes up more for conciseness reasons than correctness. It would technically be correct to write a function like:

fn apply_fold(op : &String, a : &Literal, b : Option<&Literal>) -> Option<Literal> {
    match op.to_string().as_str() {
        "add" => { 
                if let Literal::Int(a) = a {
                    if let Literal::Int(b) = b.unwrap() {
                        Some(Literal::Int(a + b))
                    } else {
                        None
                    }
                } else {
                    None
                }
        },
...
        _ => None,
    }
}

Where every case handles all of these possible error cases, but this quickly becomes a gigantic function when there are more than a few ops to match against. What I wanted to do was implement a function like follows:

fn binop<T>(a : &Literal, b : &Literal, func : &dyn Fn(T, T) -> T) -> Option<Literal> {
    match a {
        Literal::Int(x) => {
            if let Literal::Int(y) = b {
                Some(func(x, y))
            } else {
                None
            }
        },
        Literal::Float(x) => {
            if let Literal::Float(y) = b {
                Some(func(x, y))
            } else {
                None
            }
        },
        Literal::Bool(x) => {
            if let Literal::Bool(y) = b {
                Some(func(x, y))
            } else {
                None
            }
        },
    }
}

This would allow the apply fold function to be much more concise:

fn apply_fold(op : &String, a : &Literal, b : Option<&Literal>) -> Option<Literal> {
    match op.to_string().as_str() {
        "add" => binop<Int>(a, b.unwrap(), |a, b| a + b),
...
        _ => None,
    }
}

However, as far as I can figure out, a function like the binop above is not possible in rust because there are no true generic lambdas like would be possible in an interpreted language. To compile, the function would have to be revised with all the Literal types in the function definition replaced with T, which brings us right back to the issue of having to match a bunch of stuff within the apply_fold match statement.

It's very possible I am missing a simple solution here from being in Python land for too long, but I hope that makes the issue more apparent.

My proposed solution would be to directly support these operators as traits in the Literal struct. This would let you simply write "add" => a + b and the potential errors involved with mismatched types would be handled internally.

sampsyo Feb 11, 2022
Maintainer Author

Interesting; thanks for elaborating! Yeah, I don't think you're missing anything—the need to explicitly check the types (via matching on the enums) is definitely an inconvenience that doesn't exist in a dynamically-typed language.

One silly idea you might consider is writing several versions of your binop helper above: one for each primitive type. If the two argument literals are not of the given type, then the function could just return None. Then you could use the right one depending on the operator, like this:

fn apply_fold(op : &String, a : &Literal, b : Option<&Literal>) -> Option<Literal> {
    match op.to_string().as_str() {
        "add" => binop_int(a, b.unwrap(), |a, b| a + b),
        "fadd" => binop_float(a, b.unwrap(), |a, b| a + b),
...
        _ => None,
    }
}

andrewb1999 Feb 11, 2022

That seems like a reasonable solution, I hadn't thought of that. I will give it a shot, thanks!

andrewb1999 Feb 11, 2022

That worked! Just pushed the updated code. Not sure how I didn't think of that solution before.

sampsyo Feb 11, 2022
Maintainer Author

Awesome. It is, admittedly, a little unnatural… doing foo_int and foo_float in a language where it's possible to do foo<T> for all types at once should always make one suspicious! 😂

ayakayorihiro · 2022-02-11T05:06:10Z

ayakayorihiro
Feb 11, 2022

Both of my implementations for DCE and LVN are in this directory. Both implementations are of the basic algorithms for DCE and LVN. In order to test that my implementation was correct, I used the pre-existing benchmarks and compared the output of my DCE and LVN passes to the outputs expected from the benchmarks. I also used my existing tests from Lesson 2, which helped me find a bug in a portion of my code that generated the list of basic blocks.

Due to some time constraints, I was not able to manually write further tests for my implementations, but in the future I would like to explore how to write both simple and effective tests for the various optimizations I am performing.

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Sounds great! Just out of curiosity, when you say "benchmarks," do you mean the benchmarks/ directory in the Bril repo? Or do you mean the LVN-specific tests? Not that it's a big deal; I'm just curious because checking correctness on all the benchmarks comes with a slightly higher confidence (maybe?) than just using the tests themselves.

gsvic · 2022-02-11T15:55:40Z

gsvic
Feb 11, 2022

My implementations for both DCE and LVN can be found here. The tests can be run using the Makefile in the tdce and lvn directories.

DCE

I implemented two versions of DCE, the trivial one, and one which also eliminates the reassignments. The implementations can be found here.

Test

In order to test both correctness and performance, I used turnt, and I am checking both if the outputs of the programs are the same, and if there is any difference in the number of dynamic instructions. The outputs are the same while the average number of total_dyn_inst is reduced. The table below depicts the average total_dyn_inst for each implementation.
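
As a rough illustration of how such averages could be computed, the sketch below assumes each program's profiler output has been captured as text containing a line of the form `total_dyn_inst: N`. The helper names are hypothetical and this is not the script used for the numbers in this post.

```python
import re


def parse_total_dyn_inst(profile_text):
    """Extract the dynamic instruction count from captured profiler output.
    Returns None when no count is present, e.g. because the run failed."""
    match = re.search(r"total_dyn_inst:\s*(\d+)", profile_text)
    return int(match.group(1)) if match else None


def mean_dyn_inst(profile_texts):
    """Average the counts over all runs that reported one."""
    counts = [c for c in map(parse_total_dyn_inst, profile_texts)
              if c is not None]
    return sum(counts) / len(counts) if counts else None
```

Runs without a count are skipped rather than treated as zero, so a crashing test does not make the optimized average look better than it is.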

Without DCE	With DCE
4.85	4.12

However, not all the tests pass in this implementation. Specifically, the failing ones are divide-by-zero, logical-operators and fold-comparisons.

LVN

My implementation of local value numbering can be found here. This is the basic implementation of LVN, but it also captures commutativity. However, it does not support branching yet, but I am planning to improve this.

Test

Similar to the previous testing approach, I used turnt in order to check the output validity, and the performance improvement. The results are depicted in the following table, as the average total_dyn_inst in the unoptimized and the optimized version of the code.

Unoptimized	With LVN & DCE
6.45	5.09

1 reply

sampsyo Feb 11, 2022
Maintainer Author

Cool! Just to be clear about supporting branching (by which I think you mean br instructions): remember that this is supposed to be a local analysis, so the only thing you need to do is "restrict" your analysis to a single basic block, i.e., stop if you hit a label or control flow operator. Handling control flow in general is the topic of L4.
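
To make the "restrict to a single basic block" idea concrete, here is a minimal sketch of splitting a function's instruction list into blocks, assuming Bril's JSON form where a label is a dict with a `label` key. The name `form_blocks` is chosen for illustration.

```python
TERMINATORS = {"jmp", "br", "ret"}


def form_blocks(instrs):
    """Split a flat instruction list into basic blocks. A block ends after
    a terminator, and a label always starts a new block."""
    blocks, current = [], []
    for ins in instrs:
        if "label" in ins:
            if current:
                blocks.append(current)
            current = [ins]
        else:
            current.append(ins)
            if ins["op"] in TERMINATORS:
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks
```

A local pass such as LVN can then run on each block independently, starting with an empty table each time.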

barabanshek · 2022-02-11T17:26:27Z

barabanshek
Feb 11, 2022

Both the DCE and LVN implementations are here: https://github.com/barabanshek/bril/tree/assignments/assignment_2 . Please use the test_* scripts to run tests. Some tests don't output numerical results (e.g. "division by zero", which outputs the exception); please run these tests manually to see that the outputs match.

Implementation status:

DCE - all tests pass
LVN: (1) common sub-expressions, (2) const propagation, (3) folding - all tests pass

Issues in the current implementation:

although it passes tests, the DCE optimization fails on some benchmarks (due to lack of time, I was not able to debug it)
I'm not sure if for all tests, the optimizations produce the best possible result, i.e. that I've covered all corner cases. For example, I forgot to cover this (from fold-comparisons.bril): 'should_fold1: bool = eq arg1 arg1' - the result of the program is correct, but the program is "under-optimized".

Example of the LVN + DCE pipeline on the test clobber-fold.bril

@main {
  a: int = const 4;
  b: int = const 2;
  sum1: int = add a b;
  sum2: int = add a b;
  prod1: int = mul sum1 sum2;
  sum1: int = const 0;
  sum2: int = const 0;
  sum3: int = add a b;
  prod2: int = mul sum3 sum3;
  print prod2;
}

After passing through LVN and DCE:

@main {
  sum1: int = const 6;
  prod1: int = mul sum1 sum1;
  print prod1;
}

After fixing a bug and allowing the passes to run until convergence, I got the fully optimized output:

@main {
  prod1: int = const 36;
  print prod1;
}

Discussion:

I'm wondering how we can make sure that optimizations cover as many corner cases as possible (not for correctness, but for optimization)? For now, the only solution I have in mind is manual coverage with very carefully crafted sets of tests, but what is the guarantee that these tests indeed cover everything?

1 reply

sampsyo Feb 11, 2022
Maintainer Author

I think the tricky part of your question at the end there is: what do you mean by "cover everything"? To test or verify that an optimization gets "everything," you need a way to formally specify what it should be able to optimize. (A straightforward definition, such as "there is no way to further optimize the program without breaking its semantics," seems too strong—a compiler that could meet that spec would be supernatural!)

yy665 · 2022-02-12T10:25:51Z

yy665
Feb 12, 2022

My TDCE Implementation and LVN Implementation are here.

DCE

My DCE implementation takes care of both Global unused instructions and local reassigned instructions (Both are done together in one pass). Testing on DCE is done by using turnt and the testing has covered all testcases under examples/test/tdce.

LVN

My LVN implementation includes value propagation, canonicalization, and constant folding. Testing on LVN is also done by using turnt. Most of the tests pass, except for the fold-comparison and logical-operators tests. Additionally, enabling constant folding fails divide-by-zero test, although I believe it's a good idea to throw the error at compile time. I am still actively debugging.

Meanwhile, I have spot checked some test cases. For example with commute.bril:
The original program is:

@main {
  a: int = const 4;
  b: int = const 2;
  sum1: int = add a b;
  sum2: int = add b a;
  prod: int = mul sum1 sum2;
  print prod;
}

After LVN (including value propagation, canonicalization, constant folding), and TDCE we get:

@main {
  _var_4: int = const 36;
  print _var_4;
}

which looks pretty aggressive in terms of instructions removed!

Discussion

Overall, I don't find TDCE and LVN conceptually hard. However, implementing LVN (and associated extensions) correctly requires a reasonable amount of attention to corner cases. I bet it must be much trickier to implement LVN and associated extensions in real compilers due to much more complicated typing and other factors. I was totally unaware of the cases covered by the fold-comparison and logical-operators tests when I was implementing LVN. Other than fixing those cases, I might need to improve my code quite a bit as the current code is far from elegant!

1 reply

sampsyo Feb 14, 2022
Maintainer Author

Thanks for the detailed writeup!

Additionally, enabling constant folding fails divide-by-zero test, although I believe it's a good idea to throw the error at compile time. I am still actively debugging.

I encourage you to read some other discussions in this thread about what to do in the case of divide-by-zero! They might be enlightening.

orkosinha · 2022-02-15T04:55:51Z

orkosinha
Feb 15, 2022

A bit (a lot 😬) late but here's my implementation of tdce and lvn.

Usage

usage: lesson3.py [-h] [-l] [-d]

options:
  -h, --help  show this help message and exit
  -l, --lvn   perform optimizations using local value numbering and pass of dce
  -d, --dce   perform trivial dead code elimination optimization and dce pass

In the context of other bril tools, I use bril2json < ./tests/lvn01.bril | ./lesson3.py -o | bril2txt to get the more readable versions of bril programs.

The most optimized program my implementation produces first uses my lvn implementation, then passes through local dce for each block, and finally the whole function is run through global dce.

Dead Code Elimination

Trivial dead code elimination is implemented from the pseudocode in class.

The pseudocode discussed in class was very straightforward to follow, but while testing my lvn implementation I found this example to be pretty interesting:

@main {
  a: int = const 4;
  b: int = const 2;
  sum1: int = add a b;
  prod1: int = mul sum1 sum1;
  sum1: int = const 0;
  print prod1;
}

Mostly because I failed to realize the kinds of cases trivial dead code elimination does not capture.

Local Value Numbering

Local value numbering is implemented with CSE exploiting commutativity and copy propagation.

I originally started by trying to follow the pseudocode as closely as possible, but it wasn't that simple. The first problem I ran into was the control flow of when to assign a new number and when to lookup in the table due to the variety of operations.

Variables that were declared before the block but used in the current block proved to be an issue. I took care of this by assigning a lvn number to these previously declared variables.
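
One way to picture this fix, as a sketch rather than the linked implementation: scan the block for variables that are read before being written, and give each one a table entry with an unknown value before LVN proper starts. The names `read_before_written` and `seed_table`, and the `(value, variable)` row layout, are assumptions for illustration.

```python
def read_before_written(block):
    """Variables the block reads before defining them, in first-use order.
    These must have been defined outside the block."""
    defined, incoming = set(), []
    for ins in block:
        for arg in ins.get("args", []):
            if arg not in defined and arg not in incoming:
                incoming.append(arg)
        if "dest" in ins:
            defined.add(ins["dest"])
    return incoming


def seed_table(block):
    """Give every incoming variable its own value number up front.
    Each table row is (value, canonical variable); the value is None
    because nothing is known about how the variable was computed."""
    var2num, table = {}, []
    for var in read_before_written(block):
        var2num[var] = len(table)
        table.append((None, var))
    return var2num, table
```

Function arguments fall out of the same scan, since they are also read without a prior definition in the block.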

Another major issue I ran into was handling re-defined variables. I used the following example discussed in class

@main {
    a: int = const 1;
    b: int = const 2;
    c: int = const 3;
    d: int = const 4;
    x: int = add a b;
    x: int = mul c d;
    y: int = add a b;
    print y;
}

I originally renamed the variable and wanted to rename all occurrences before the next assignment to the new name, but it was simpler to just add it to my var2num map. Eventually, this will get overwritten by the redefinition, so it seemed okay to do.
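
For comparison, the renaming approach mentioned at the start of the paragraph above needs to know which assignments will be clobbered later in the block. A minimal sketch of that check follows; the name `will_be_overwritten` is hypothetical and this is not what the linked implementation ended up doing.

```python
def will_be_overwritten(block):
    """For each instruction, report whether its destination is assigned
    again later in the same block. Scanning backwards means each
    destination only has to be looked up in a set."""
    seen = set()
    flags = [False] * len(block)
    for i in range(len(block) - 1, -1, -1):
        dest = block[i].get("dest")
        if dest is not None:
            flags[i] = dest in seen
            seen.add(dest)
    return flags
```

On the example above, only the first assignment to `x` is flagged, so it is the only destination that would need a fresh name.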

I was originally a bit ambitious and added a type to my value tuples to support constant propagation and folding constants as they came by matching the type, but I didn't get a chance to implement this.

Testing

I tested my optimizations using the ./tests/ directory and Turnt. Most of the tests are from the examples in class, but I also cherry-picked some of the benchmarks using core Bril and tested with Turnt. Although, I'm not sure if I'm using Turnt correctly, as I originally started with generating the .out files using Turnt with the following in my turnt.toml file

command = "bril2json < {filename} | brili"

and then modifying it to be

command = "bril2json < {filename} | ../lesson3.py -l | brili"

3 replies

sampsyo Feb 15, 2022
Maintainer Author

Interesting idea to consider eagerly renaming variables as soon as you generate new names! I think what you ended up with, just putting the old name into the var2num map, feels a little closer to the overarching "philosophy" of value numbering.

I'm curious what you were thinking about doing with the types. Is there a situation where knowing the type allows you to do more than you could with just knowing the value (which seems to imply the type)?

orkosinha Feb 15, 2022

Thanks! Tracking types was, I guess, kind of a holdover from my false assumption that core Bril had operations that could work on multiple types. That doesn't really happen: arithmetic and comparison operations only use int, and boolean operations only use bool, so tracking the type was not necessary. The specific situation I had in mind (and falsely assumed) was core Bril having a float type with the same operations as int. Its values could look similar to ints, but the type distinction would let us tell them apart when doing something like division.

sampsyo Feb 15, 2022
Maintainer Author

Ah, got it! Yeah, we made an intentional choice to distinguish div from fdiv so that stuff like this is easier.

susan-garry · 2022-02-17T20:15:21Z

susan-garry
Feb 17, 2022

Even later, but here is my implementation for dce and lvn!

Dead Code Elimination

I implemented both global and local reassignment here, but I ~~decided~~ initially tried to go off-script. My dead code elimination performs a backwards pass over the code, starting with the last instruction in a function. However, since later code may jump back to earlier code, this optimization would not work without control flow analysis. So I moved my buggy implementation to a file called dce_naive.py and implemented DCE in accordance with the pseudocode given in class, adding a new test case to catch similar errors.

Here is the buggy optimization I first wrote:
For each instruction in the core Bril language that produces a side effect (print, ret, call, jmp, and br), I add its arguments to a local set of working variables and a global set of working variables to keep track of variables that will be needed later in the code.
When we encounter an assignment to one of the variables in the local working set, we remove it from the local working set; if we encounter an assignment to a variable that is not in the local working set, then it is safe to remove that assignment. When we begin processing each block, we set the local working set equal to the global working set. This means that my implementation of dead code elimination is conservative, and would not be able to optimize something like

@main(input: bool) {
    br input .iftrue .iffalse;
  .iftrue:
    a: int = const 0;
    ret;
  .iffalse:
    a: int = const 1;
    ret a;
}

It cannot optimize this because it does not recognize that the true branch has no successors in the control flow graph. However, this approach has the benefit that it requires only one pass, and it would be easy to extend it to use a program's control flow graph to recognize when an instruction is not needed in any of a block's successors, as in the example above.
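Restricted to a single basic block, where the backward idea is sound, the pass might be sketched as follows. This is an illustration rather than the code in dce_naive.py: the function name `dce_backward` and the `live_out` parameter (the set of variables assumed to be needed after the block) are hypothetical.

```python
# Ops treated as having side effects, following the list in the post
# (Bril spells "return" as ret).
SIDE_EFFECT_OPS = {"print", "ret", "call", "jmp", "br"}


def dce_backward(block, live_out):
    """One backward pass over a basic block (list of instruction dicts).

    `live_out` is the set of variables assumed to be needed after the block.
    """
    live = set(live_out)
    kept = []
    for instr in reversed(block):
        dest = instr.get("dest")
        effectful = instr.get("op") in SIDE_EFFECT_OPS
        if dest is not None and dest not in live and not effectful:
            continue  # nothing later reads this assignment: drop it
        if dest is not None:
            live.discard(dest)  # earlier writes to dest are now dead
        live.update(instr.get("args", []))  # this instruction's operands are needed
        kept.append(instr)
    kept.reverse()
    return kept
```

For the `.iftrue` block in the example, passing an empty `live_out` (the block has no successors) removes `a: int = const 0`, while passing a conservative `live_out` containing `a` keeps it, which is the difference a control flow graph would make.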

Local Value Numbering

My implementation deals with common subexpression elimination and copy propagation, but not constant folding. I followed the pseudocode in class as closely as possible, but there are a few cases that my implementation does not optimize. To avoid the issue of renaming subsequent uses of a variable when it gets clobbered, I simply delete the old entry in the value table when the variable is clobbered, so that future instructions with the same subexpression cannot replace it with the old variable name. This means that

x: int = add a b;
x: int = id a;
y: int = add a b;

does not get optimized.
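A rough sketch of this "delete on clobber" strategy, keyed by variable names rather than value numbers, could look like the following. The representation is an assumption, since the post does not show it: `cse_by_name`, `PURE_OPS`, and the table layout are hypothetical, and the sketch also drops entries whose operands are clobbered so that it stays sound.

```python
# Ops assumed to be pure and therefore safe to reuse.
PURE_OPS = {"add", "sub", "mul", "div", "eq", "lt", "gt", "le", "ge",
            "and", "or", "not"}


def cse_by_name(block):
    """Common subexpression elimination over one basic block."""
    table = {}  # (op, arg names...) -> canonical variable holding the value
    out = []
    for instr in block:
        instr = dict(instr)
        dest = instr.get("dest")
        if dest is None:
            out.append(instr)
            continue
        key = (instr["op"], *instr["args"]) if instr["op"] in PURE_OPS else None
        if key in table:
            # Reuse the earlier result instead of recomputing it.
            instr = {"op": "id", "dest": dest, "type": instr["type"],
                     "args": [table[key]]}
        # Clobber: forget entries stored in dest or computed from dest.
        table = {k: v for k, v in table.items()
                 if v != dest and dest not in k[1:]}
        if key is not None and dest not in key[1:]:
            table.setdefault(key, dest)
        out.append(instr)
    return out
```

With this sketch, the three-instruction example above is left unchanged, because the entry for `add a b` is deleted when `x` is overwritten. Without the intervening write to `x`, the second `add a b` would become `id x`.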

I also did not explicitly give each value its own number. Instead, I kept track of them using canonical variables. This is because I initially intended to perform a separate pass to rename each variable with a unique identifier, but I was unable to get this code working in time.

Testing

I tested these algorithms using turnt and the provided example tests, and added a few tests of my own.

Discussion

I found dead code elimination and LVN conceptually difficult. I forgot how complex global analyses that utilize the structure of the control flow graph are, so I spent a while working on optimizations that I eventually abandoned because I decided they were a little outside the scope of this lesson.

1 reply

sampsyo Feb 18, 2022
Maintainer Author

Super interesting; thanks for the summary! Even if it looks "obvious in hindsight," I think there is a cool lesson here about how subtle the line is for global analyses/optimizations that you can accomplish without analyzing the CFG in a data-flow-like style.

Also interesting that you sidestepped the whole "reusing variable names" issues by just not optimizing those cases. I wonder how much simpler that made the implementation in the end.

Out of curiosity, since you didn't use actual numbers for the variables, is there a way to summarize how you represented values? Were they tuples like we discussed in class, except that the operands were represented as strings (variable names) instead of numbers?

thomasyang18 · 2022-05-14T20:23:47Z

thomasyang18
May 14, 2022

FINALLY DONE! TWO WHOLE WEEKS!

This was one of the hardest programming things I've ever worked on (or at least felt like it). I've written greater quantities of code, but in terms of rewriting and thinking and implementing and failing, this has to take the cake. I mean, other programming assignments/projects or whatever, I'm at least making visual progress, like the file size is going up and I'm making small tweaks here and there. This? Nah.

A lot of it probably could've been attributed to the fact that I thought I could do LVN better and tried implementing something myself by renaming everything beforehand and then trying some really scuffed no-row-indexed LVN, but just having it with 3 columns and following the pseudocode made it so much easier. Although doing it the "incorrect" way made me see the benefits of the current implementation.

No constant folding and no copy prop and all of that, although I may revisit those later if I'm feeling good about it.

After and before for clobber.bril:

@main {
  a: int = const 4;
  b: int = const 2;
  sum1_2: int = add a b;
  prod1: int = mul sum1_2 sum1_2;
  print prod1;
}
----------------------------------------------
#
@main {
  a: int = const 4;
  b: int = const 2;

  # (a + b) * (a + b)
  sum1: int = add a b;
  sum2: int = add a b;
  prod1: int = mul sum1 sum2;

  # Clobber both sums.
  sum1: int = const 0;
  sum2: int = const 0;

  # Use the sums again.
  sum3: int = add a b;
  prod2: int = mul sum3 sum3;

  print prod2;
}

DCE

Local

This one was fairly straightforward, although I thought about it backwards. When I implemented this for a previous course, the algorithm I came up with was to go backwards and do the consumption backwards, instead of forwards. In a sense, I think it's similar to going forwards, but idk, it makes more sense to me backwards.

"Consumption" in this sense means matching uses with declarations, so every use has a preceding declaration to match (if the block ends, that's also fine). If there's an excess declaration, the first use of the variable would've been matched already, so that declaration gets skipped and is therefore optimized away.

Global

This one was also straightforward, I think? I just used a few really restrictive criteria for what counts as dead code, and they were for me:

If there's no destination, then in the Bril language it's probably a very important instruction like a label or jump or something, so keep it
If the variable is used, then keep it
If the instruction has potential side effects like a call or a jump, don't optimize it away

I think this is enough to guarantee correctness in the Bril language.
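As a sketch, the three rules could be written as a single filter over a function's instructions. The name `global_dce` and the `SIDE_EFFECT_OPS` set are hypothetical, and the repeat-until-nothing-changes loop follows the class pseudocode rather than anything stated in this post.

```python
# Ops assumed to have side effects; instructions using them are never removed.
SIDE_EFFECT_OPS = {"print", "ret", "call", "jmp", "br"}


def global_dce(instrs):
    """Drop assignments whose destination is never used anywhere in the function."""
    while True:
        # Every variable that appears as an argument somewhere.
        used = {arg for instr in instrs for arg in instr.get("args", [])}
        kept = [
            instr for instr in instrs
            if "dest" not in instr                    # rule 1: no destination
            or instr["dest"] in used                  # rule 2: result is used
            or instr.get("op") in SIDE_EFFECT_OPS     # rule 3: side effects
        ]
        if len(kept) == len(instrs):
            return kept
        # Removing an instruction can make its operands unused, so repeat.
        instrs = kept
```

Each round can expose new dead instructions: once an unused `add` is gone, the constants that fed only that `add` become unused too.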

LVN

This one was really, really annoying as stated.

When I finally sat down and decided to actually implement the algorithm provided in class instead of trying to make my own, I still had a lot of difficulties resolving edge cases. I think the three most important ones were:

If there's no destination, I don't want to optimize the instruction at all
If the operation is a side effect operation, I don't want to optimize the instruction at all, but I want to keep the destination as a target for optimization
Variables that come from outside the function

My solution was, instead of making the table a map, to make it an array and store each object as a (tuple, string) tuple. The variables that came from outside the function and the variables that have a destination have just straight up no value in the tuple; it's empty. I don't know if this is the best idea, but I took the idea from the lecture because the first row was empty when storing variable x.
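One way to picture this array-shaped table: the row index serves as the value number, and a row with an empty tuple never matches a lookup. The helper name `find_or_add` below is hypothetical, and this is only a sketch of the data structure, not the full pass.

```python
def find_or_add(table, value, home):
    """Return the row number for `value`, appending a new row if needed.

    `table` is a list of (value tuple, canonical variable name) rows.
    An empty `value` marks an opaque row (for example, a variable that
    comes from outside), which always gets a row of its own.
    """
    if value:
        for num, (row_value, _home) in enumerate(table):
            if row_value == value:
                return num
    table.append((value, home))
    return len(table) - 1
```

Two opaque variables get distinct rows even though both carry an empty tuple, while two instructions with the same non-empty value tuple share one row. The linear scan is simpler than a map, at the cost of a slower lookup.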

There's probably some better way, but I don't want to think more about this problem for some time tbh.

(actually thinking about this, the side effect operation one is a combination of the first and last one, so I overcomplicated it, but agh whatever)

Testing

I tested by first confirming that my own test cases (I copied the ones from the lecture and made a few of my own involving jumps) and the ones in the lvn folder all ran fine and had the same outputs (or, in the case of divide-by-zero, that the code looked compilable). Then I looked at the difference between each file and its optimized output. I didn't look at the dyn-ops, but given that the code was much, much shorter, I'd imagine it'd be better.

1 reply

sampsyo May 15, 2022
Maintainer Author

This all sounds like a very reasonable solution to the real-world challenges!
