-
DCE
I first implemented a simple dead code elimination pass. For global DCE, I generated a set of all used variables in a first pass and removed any instructions writing to destinations not in this set. For local DCE, I needed to detect situations where a write is followed by another write with no read in between, and then delete the first write. For this, I looped through the instructions in reverse, adding each destination to a "written but not read" set and removing it once it was read; encountering a write whose destination is already in that set means the write is dead.
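As a minimal sketch (not this poster's actual code), the reverse scan might look like the following in Python, assuming Bril's JSON instruction format with optional "dest" and "args" fields; the guard against deleting calls is an extra assumption to avoid removing side effects.

```python
def local_dce_block(block):
    """Drop writes that are overwritten later in the block with no read in between."""
    written_not_read = set()  # dests written later with no intervening read
    kept = []
    for instr in reversed(block):
        dest = instr.get("dest")
        if dest is not None and dest in written_not_read and instr.get("op") != "call":
            continue  # a later write shadows this one and nothing reads it: delete
        if dest is not None:
            written_not_read.add(dest)
        for arg in instr.get("args", []):
            written_not_read.discard(arg)  # the value is read here, so keep its writer
        kept.append(instr)
    kept.reverse()
    return kept

if __name__ == "__main__":
    block = [
        {"dest": "a", "op": "const", "type": "int", "value": 1},
        {"dest": "a", "op": "const", "type": "int", "value": 2},
        {"op": "print", "args": ["a"]},
    ]
    print(local_dce_block(block))  # the first write to "a" is dropped
```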
LVN
LVN was definitely the majority of the work for this lesson. I decided to represent my table as a number of maps between numerical IDs, variable names (representatives), and values. Values were stored in a few ways, with constants getting their own representation. I also had the time to implement canonicalization, which covers commutative operators like add and mul.

Summary
By far the biggest challenge was implementing LVN and dealing with all of its edge cases. My main issues were with unknown instructions that generate values. Initially, I entirely ignored values created by unknown instructions, which led to problems when variables were renamed; after fixing that, a few specific instructions still caused problems.
-
Testing
I wrote my own testing tool that takes a bril-optimizing unix-shell-pipe-command and a bunch of bril programs, and checks that the optimization doesn't change the behavior of the programs and doesn't make them any worse (for a configurable definition of worse). I thought this would be worth the effort because I didn't want to manually parse the output of brench to make sure every program was getting faster.

DCE
I implemented my optimizations as Python scripts, and by far the best decision I made during my DCE implementation was to create decorators for global and local optimizations as well as for iterating to convergence (see the sketch below). That way I can implement each optimization as a single pass over a single basic block or function and push the shared logic of disassembling and reassembling full program files into the decorators. Besides that, this was pretty straightforward. I implemented the global and local dce passes we discussed in class and used my testing tool to confirm that they did not break or slow down any programs in the dce testing directory or the core benchmarks.
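This poster's decorators aren't shown, but a minimal Python sketch of the idea might look like this, assuming Bril's JSON format; form_blocks, local_pass, and remove_nops are hypothetical names used only for illustration.

```python
import json
import sys

TERMINATORS = {"jmp", "br", "ret"}

def form_blocks(instrs):
    """Split a function's instruction list into basic blocks."""
    block = []
    for instr in instrs:
        if "op" in instr:
            block.append(instr)
            if instr["op"] in TERMINATORS:
                yield block
                block = []
        else:  # a label starts a new block
            if block:
                yield block
            block = [instr]
    if block:
        yield block

def local_pass(block_fn):
    """Lift a block -> block transformation to a whole-program pass."""
    def run(program):
        for func in program["functions"]:
            new_instrs = []
            for block in form_blocks(func["instrs"]):
                new_instrs.extend(block_fn(block))
            func["instrs"] = new_instrs
        return program
    return run

@local_pass
def remove_nops(block):
    # stand-in block-level optimization: drop "nop" instructions
    return [instr for instr in block if instr.get("op") != "nop"]

if __name__ == "__main__":
    json.dump(remove_nops(json.load(sys.stdin)), sys.stdout, indent=2)
```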
LVN
This one was much more of an adventure. One fun thing my early lvn implementations did was abuse the boolean representation, rewriting values across types whenever they happened to compare equal. I thought that was kind of clever and I was proud of it, but as it turns out that isn't allowed, so I had to add some type information to my value representation. Obviously there were lots of other bugs as well, but I thought this one was the most fun. I tested my lvn implementation, both alone and together with my global dce implementation, on the tdce and lvn test directories as well as the core bril benchmarks. They do not break any of those programs, nor do they make them slower. When I run both passes together I can remove 144 (or 11%) of the static instructions and 7235 (approximately 0%) of the dynamic instructions in the core bril benchmarks. But running both passes on the combined contents of the tdce and lvn test directories (which are presumably designed to be highly optimizable by these passes) saves 59 dynamic instructions (51%) and 93 static instructions (58%).
-
I started with trivial dead code elimination (TDCE) to get rid of instructions that define variables never used. Global TDCE collects all the variables actually used in a function and removes any assignments that don't contribute to the final result. Local TDCE eliminates cases where a variable gets overwritten before it's ever read, which I handled by tracking the last write to each variable in a block and removing earlier, unused definitions. This helped clean up a lot of unnecessary instructions, but I realized some of these optimizations would be redundant once I implemented LVN.

LVN (w/ Copy Propagation)
LVN took more effort. The basic version eliminates redundant computations by keeping track of a value table that maps expressions to numerical IDs and canonical variables, meaning repeated calculations get replaced with a single stored value. I also added canonicalization for commutative operations like add and mul so that equivalent expressions (like a + b and b + a) weren't recomputed unnecessarily. Then I added copy propagation, which turned out to be trickier than expected. At first, my implementation wasn't fully resolving id chains, meaning redundant assignments stuck around. The fix was resolving each variable's original value before inserting instructions, so print copy3 would correctly turn into print x. Once that was working, TDCE was able to clean up the leftover assignments entirely.

I tested everything using brench with Bril's core benchmarks to make sure the optimizations worked without breaking anything. For copy propagation, I ran programs with multiple redundant id assignments and checked that only the necessary ones remained. LVN consistently reduced instruction count by eliminating redundant computations, though the actual improvement varied across benchmarks. The hardest part was finding the right balance between optimization and correctness. Early versions of LVN were too aggressive, removing function arguments and messing up control flow. Debugging involved carefully stepping through intermediate program output and adjusting LVN to preserve necessary computations while still optimizing effectively. Making sure function parameters and branch conditions weren't accidentally eliminated was a key fix.
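Picking up the id-chain issue described above, here is a minimal, self-contained Python sketch of resolving copies through chains of id instructions (a generic illustration, not this poster's code; the Bril JSON field names are the only assumption).

```python
def resolve_copy(var, copy_of):
    """Follow id chains (copy3 -> copy2 -> ... -> x) back to the original variable."""
    seen = set()
    while var in copy_of and var not in seen:
        seen.add(var)
        var = copy_of[var]
    return var

def copy_propagate_block(block):
    """Rewrite arguments through chains of plain `id` copies within one block."""
    copy_of = {}  # dest -> source for `id` instructions seen so far
    out = []
    for instr in block:
        if "args" in instr:
            instr = dict(instr, args=[resolve_copy(a, copy_of) for a in instr["args"]])
        dest = instr.get("dest")
        if dest is not None:
            # a redefinition invalidates copies into or out of this name
            copy_of.pop(dest, None)
            copy_of = {d: s for d, s in copy_of.items() if s != dest}
            if instr.get("op") == "id":
                copy_of[dest] = instr["args"][0]
        out.append(instr)
    return out
```

With this in place, a chain like x -> copy1 -> copy2 -> copy3 means print copy3 becomes print x, and trivial DCE can then delete the intermediate id instructions.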
-
I implemented global and local trivial dead code elimination here and a local value numbering pass here, which I coupled with copy propagation and optimization of arguments for commutative operations. Finally, because I was initially having some issues with Brench, I wrote a tiny correctness script to check my passes.

TDCE Passes
For global TDCE, my implementation is a single function that looks through all of the functions in the program. For each function, I used a higher-order function to gather a set of "used" argument variables (which is just a set of strings), and then made another pass through the function to get rid of all instructions that write to variables that are not in the "used" set. If I got rid of an instruction, I tail-recursively called this global TDCE function again. For local TDCE, my implementation is centered around a single helper function.
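As a rough Python rendering of the global TDCE described above (the original is not shown here and is not Python; a repeat-until-fixed-point loop stands in for the tail recursion):

```python
def global_tdce(func):
    """Delete instructions whose destination is never read anywhere in the function."""
    while True:
        used = set()
        for instr in func["instrs"]:
            used.update(instr.get("args", []))
        kept = [
            instr for instr in func["instrs"]
            if "dest" not in instr          # effect instructions (print, br, ...) stay
            or instr["dest"] in used        # the result is read somewhere
            or instr.get("op") == "call"    # calls may have side effects
        ]
        if len(kept) == len(func["instrs"]):
            return func
        func["instrs"] = kept
```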
LVN Pass
My LVN is implemented by looking through a basic block and maintaining the value table.

Correctness + Performance Improvements
Before I got Brench working, I used the script I mentioned earlier to ensure correctness across all benchmarks. After I nailed down a few corner cases, my script indeed ended up showing no mismatches. While at least half the benchmarks didn't get any benefit from my passes (but also didn't get any worse!), I'm happy with the average improvement.

Challenges
The hardest part was deriving general rules for nailing down corner cases. For example, I ran into an issue where I would sub-expression match something I shouldn't have. I think my work deserves a Michelin star because I implemented and thoroughly evaluated what was asked, plus a couple extensions.
-
I've attached my implementations for trivial DCE and LVN in this folder https://github.com/dhan0779/cs6120-impl/tree/main/l3. I used Brench for comparing the outputs for correctness and created my own script to test for dynamic instruction count changes.

Trivial DCE:

LVN: Another issue I had was figuring out when to use fresh variable names. We mentioned in class that we could rename all variables, but then we would have to create another map for resolving the canonical name at the very end. I decided the easiest way to resolve this was looping through the instructions in a basic block in reverse and checking whether variables were overwritten. This is similar to the pass I did in TDCE for checking whether a variable was being used or not. Implementing this resolved all issues with my basic LVN implementation. As a small optimization, I added a very naive function for constant folding (including some of my own test cases to make sure that this worked). When I first created the Value tuple, I would check if this Value could be folded into a constant by looking up possible values in my table.
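A minimal sketch of that kind of naive constant folding (a generic illustration, not the code linked above) might check whether every argument's value is a known constant and, if so, compute the result up front; division is deliberately left out so a fold can never divide by zero:

```python
import operator

FOLDABLE = {
    "add": operator.add, "sub": operator.sub, "mul": operator.mul,
    "eq": operator.eq, "lt": operator.lt, "gt": operator.gt,
    "le": operator.le, "ge": operator.ge,
    "and": lambda a, b: a and b, "or": lambda a, b: a or b,
    "not": lambda a: not a,
}

def try_fold(op, arg_consts):
    """Return the folded constant, or None if any argument is unknown or op can't fold."""
    if op not in FOLDABLE or any(v is None for v in arg_consts):
        return None
    return FOLDABLE[op](*arg_consts)

print(try_fold("add", [2, 3]))     # 5
print(try_fold("mul", [2, None]))  # None: one operand's constant value is unknown
```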
-
My implementations of DCE and LVN are here: https://github.com/mt-xing/cs6120/tree/main/l3

The implementation is split into three passes. First, a variable renaming pass handles renaming all variables in a block whose values are overwritten, renaming the earlier occurrence. This renaming works by using a colon in the generated names, which ordinary Bril programs are unlikely to contain. My code is in TypeScript, using Deno, and can be invoked from the command line.

My implementation of LVN also includes constant folding, copy propagation, and commutative reordering. The constant folding and commutative operators were very easy; I simply needed to examine which operator was being processed, and either sort the arguments for the commutative ones (add, multiply, equals) or just do the math itself. Note that, since the JSON format itself does not support 64-bit integers, my implementation (being written in TypeScript) will also fail to parse ints larger than can be safely stored in a double. This matches spec-compliant JSON behavior.

Copy propagation gave me the most trouble, specifically in handling live-ins. Since my variable renaming pass only works within one block, it can't rename the inbound variable, and so a naive implementation of copy propagation (which had assumed all rewrites were renamed) ended up clobbering variable values. I ultimately resolved this by choosing not to copy propagate any live-ins for the first copy.

For testing, I built my own testing harness, with a single command to run all test cases. My test library prints out a final optimization report with some simple statistics. Running against all my hand-crafted tests (which consist primarily of examples from class) as well as the entire bril benchmark folder, my final optimization averaged a staggering reduction in instruction count.

I believe my work here definitely deserves a Michelin star (and I'm not sure what the criteria are for multiple stars). Not only did I implement DCE and LVN, as well as all the optional extensions, but I also built a custom extensible testing framework that I intend to use for all my future assignments in this class. The ability to encode how much optimization to expect from each file in the test itself, and to change it for each optimization, allows me to code with high confidence in the correctness of my final solution and the efficiency of my optimizations.
-
Repo for TDCE + LVN (Joint work with @emmanueljs1)

We implemented trivial DCE (removing instructions whose results are never used again, and deleting instructions whose result is unused before reassignment) and LVN in OCaml.

Testing
Out of all 51 benchmarks in the Bril repo, our implementation optimizes 21 of them. (We used a Python script to parse Brench's output and to generate a plot visualizing the % reduction in dynamic instruction count resulting from our optimization.)

Challenges
When debugging LVN, we also adopted a few additional strategies. Overall, we think our work deserves a Michelin star because our optimization preserves the same program behavior for all the benchmarks and makes a non-trivial subset of them faster.
-
DCE
LVN + Optimizations
Verification/Performance
Challenges Encountered
-
DCE
I implemented a one-pass, forward version of DCE. The indices of instructions that can potentially be deleted are tracked along the way. To make the DCE work in a global context, we conservatively assume that the last assignment to a variable in each basic block may be used somewhere in a descendant block.

LVN
I implemented LVN with copy propagation, commutativity exploitation, and constant folding. To avoid the re-assignment problem, before doing LVN we scan the entire basic block and randomly rename variables; variables that either come from an ancestor block or may be used in a descendant block keep their names untouched. The basic LVN framework is similar to the pseudocode shown in class. For each instruction, we first try to query every argument in the table. When constructing the canonical tuple, the arguments of commutative operations are put in a canonical order.
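A small Python sketch of that canonicalization step (an illustration of the general idea, not this implementation): arguments are replaced by their value numbers and sorted for commutative operators, so add a b and add b a hash to the same table entry.

```python
COMMUTATIVE = {"add", "mul", "eq", "and", "or"}

def canonical_value(instr, var2num):
    """Build a hashable value tuple for an instruction from its argument value numbers."""
    op = instr["op"]
    if op == "const":
        # keep the type so an int and a bool with "equal" payloads never collide
        return ("const", instr["type"], instr["value"])
    nums = tuple(var2num[a] for a in instr.get("args", []))
    if op in COMMUTATIVE:
        nums = tuple(sorted(nums))
    return (op,) + nums

var2num = {"a": 1, "b": 2}
assert canonical_value({"op": "add", "args": ["a", "b"]}, var2num) == \
       canonical_value({"op": "add", "args": ["b", "a"]}, var2num)
```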
Testing
I have tested my code both on simplified cases I constructed manually, including those from the slides, and on the more general benchmark programs.

Conclusion
One of the most challenging parts was balancing the aggressiveness of the optimization against correctness. I also tried to back-propagate liveness information from the leaf nodes of the CFG to do a more aggressive global DCE, but I failed to handle the cyclic CFG case and gave up. I am planning to incorporate what I learned today, the worklist algorithm for liveness detection, into my DCE optimizer in the next lesson to see how far I can go with that. I think I deserve a Michelin Star because I tried my best to implement a good local optimizer (with all the bonus features, though there is still a lot of room for improvement as a global DCE) and my code is thoroughly tested on both manually constructed cases and general benchmarks.
-
Code: DCE is here, LVN is here

DCE: I implemented our pseudocode for forward Dead Code Elimination. The algorithm itself was pretty standard, though I did struggle a bit with implementing the iteration to convergence, since I had set up my defined_not_used function (getting variables that had been defined but not yet used) to operate on an entire function, while my rewritten function (getting instructions that had been overwritten) operated on each individual block. I eventually figured it out, though, and threaded the defined_not_used and rewritten calls in with each other to iterate to convergence.

LVN: As predicted, LVN was much trickier than DCE. I originally had renaming variables in an instruction and replacing an instruction with an id of another as two separate parts of my code; I realized I had to thread them together, though, because an instruction might be a direct copy of another only AFTER its variables are renamed. Also, translating between blocks, and realizing that there were unknown values of variables that were passed in from other blocks, was a real pain, as I had to make sure I didn't overwrite these. At first I had hoped to get away with renaming everything except the first definition of a variable, but I had to rename every instance except the last instance of the variable definition, to keep it consistent when passing that variable's value on to other blocks. Also, full credit: when creating new variables for the LVN program, I borrowed Michael Xing's idea of starting the variable with a colon, as users would be unlikely to do this in their programs. To make sure I always had unique new variable names, I just kept a globally incremented counter so that, even after the block had cleared out the table, there would always be a new variable name when we needed one.

Testing: In order to ensure correctness, I wrote a python script to run LVN and DCE against all the core benchmarks in the bril repo, checking to make sure the desired output remained the same. This helped me catch a lot of bugs that I didn't get with my toy example; I struggled for a while with what to do when a variable passed into a function/block gets rewritten within that block -- we don't want to rename it because then the args don't match the variables being used, but we also want to make sure anything reading the variable after its rewrite isn't accidentally referring to the wrong value. It also helped me catch memory issues; renaming variables and alloc-ing said variables does NOT mix well. Eventually, I had to put in a stopgap: my lvn does not optimize blocks that deal with alloc-ing variables and handling pointers. Still, when I first ran my test suite on the benchmarks, I failed 9 tests, with unique reasons behind nearly all of them, so checking against the benchmarks for correctness was definitely necessary. I also of course checked that my LVN correctly substituted variables for the example toy programs we saw in class, and I checked my DCE against the toy programs from class as well. I hope to add more to this testing suite I built over time.

Overall, I thought this assignment, and LVN in particular, was much harder than last week's. I do think I deserve a Michelin star because I spent several 1AM nights working to make my test suite work, and debugging various problems that I hadn't previously considered for my LVN; it was like every time I thought I was done, a new issue popped up. But I'm now confident in correctness as well as the performance increases of my programs.
-
DCE
https://github.com/devv64/6120/blob/main/lesson3/tdce.py

I extended my trivial dead code elimination algorithm from last lesson's work. Here I simply deleted instructions that are never used before they are reassigned and iterated to convergence. The actual algorithm was very straightforward; I ended up spending more time on setting up brench (I made some silly mistakes that significantly delayed this step) and figuring out how to dump the data to stdout properly.

LVN
https://github.com/devv64/6120/blob/main/lesson3/lvn.py

This was significantly trickier. I spent some time trying to implement this prior to the actual lecture, so I probably spent a lot more time figuring out some small details and the approach than I needed to. I also started out using different data structures than what I ended up with, so some refactoring there did also take some time. I ended up opting for a mapping (Variable:ID) to represent the cloud and a list of tuples (Variable, Value) to represent the table, with the list index as the ID; because I never remove any element from the list, these indices are unique. I ran into a weird case where my program was preferring to replace duplicate usages of an "id" function rather than the actual original value from the first "id" usage (starting from a: int = const 1;). This was a quick fix but I thought it was interesting.

Apart from one case (an issue is happening with the way python handles 1's and 0's vs. True's and False's -- it should be a simple fix but I'm running out of time! I am hoping to come back and fix this soon!), my pass has been shown to provide correct results based on the brench tool. In many cases, it ends up lowering the line count of programs, which is a great sign. It results in at least the same number of line removals as TDCE, and in some cases it removes more! I have also left comments in place where I can easily extend the functionality to support copy propagation, constant propagation, and eventually constant folding. These are all features I hope to add when I get a chance.

Once again, I really enjoyed this assignment. I spent a lot of time working out the small details to finally get a working implementation. Making progress on this was always a great feeling; every time I saw slight improvements gave me satisfaction and motivation. I think I deserve a Michelin star for my work because I believe my work meets the expectations defined in the lesson tasks and I had a great learning experience.
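For reference, here is a minimal Python sketch of that table layout (a generic illustration, not the linked code): the "cloud" maps variable names to value numbers, and the table is an append-only list of (value, canonical variable) rows whose index serves as the value number.

```python
class LVNState:
    """Environment (var -> value number) plus an append-only value table."""
    def __init__(self):
        self.table = []  # list of (value_tuple, canonical_var); index == value number
        self.env = {}    # variable name -> value number

    def lookup(self, value):
        """Return the value number of an existing row, or None."""
        for num, (val, _var) in enumerate(self.table):
            if val == value:
                return num
        return None

    def add(self, value, var):
        """Append a new row and point `var` at it; return the new value number."""
        self.table.append((value, var))
        self.env[var] = len(self.table) - 1
        return self.env[var]

    def canonical_var(self, num):
        return self.table[num][1]

state = LVNState()
n = state.add(("const", "int", 1), "a")
assert state.lookup(("const", "int", 1)) == n and state.canonical_var(n) == "a"
```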
-
Summary
I implemented trivial dead code elimination and used local value numbering for optimizing common subexpressions, copy propagation, and constant folding. I used an approach that was quite verbose and relied on a lot of code duplication, but that allowed me to implement instructions incrementally and test that the optimizations were working in isolation (i.e. I started out only optimizing add instructions, and only when that was correct did I move on to optimize mul operations, and so on and so forth). Although I'm happy with this approach for ease of debugging and testing, I'm not too happy about the amount of code duplication, and I'm exploring other approaches that could reduce it.

How it works
DCE: I used a trivial algorithm for global optimization: it tracks all the variables that are used in a function and removes all declarations of variables that are unused. On top of variables, I also removed any unused branches in the function.

LVN: The same approach is used for all the other core instructions (add, mul, div, sub, eq, le, ge, lt, gt, and, not, or), with the appropriate adjustments for each instruction (swapping in the right operation and exploiting commutativity in those instructions that have the property, like add and mul). A benefit and drawback of this approach is that there are no "automagical" optimizations for any new instructions (i.e. float operations are not optimized at all with the current implementation): every new instruction needs to implement a way of converting that instruction into a LocalValue and back into a BrilInstruction, with all the supported techniques modifying the instruction (constant folding, common subexpression elimination, commutativity if appropriate). This means more work but also fewer surprises out of the box.

Tests and results
I tested most implementations in isolation and worked incrementally (I used manual tests, not automated tests). I saw impact in some benchmarks, but not enough change in dynamic instruction counts for it to be significant; the largest (absolute) difference I saw came from a single benchmark.

Hardest part
The hardest part was to build an intuition of when to keep track of an instruction as a local value and when not to do so. I decided to define a simple program and what its output should be with simple local value numbering and no DCE, and that gave me a place to start before implementing more fun techniques. The rest of the task felt more like repetition of the techniques that were already working (i.e. not only supporting add and mul but all the other instructions in bril's core).

Michelin ⭐?
Yeah, I implemented DCE and LVN (with support for CSE, copy propagation, and constant folding). I hope the Michelin inspectors would agree with me.
-
dce
I implemented the dead code elimination algorithms that we talked about in class. One pass over the entire CFG eliminates assignments to variables that aren't used anywhere. Then, another pass over each block in the CFG eliminates definitions that get overwritten before being read. This was pretty straightforward as expected, and I didn't have any major bugs. I wrote some simple tests by hand to make sure that the pass was doing what I expected, which it was.

lvn
This pass was a bit trickier, and took some time to get right. I started out by following the pseudocode we developed in class, and the pass's main "table" is a hashmap keyed on values. For me, probably the hardest part was figuring out what to do about renaming. In class we used a scheme that only renames variables when necessary, which is when a variable will be overwritten later in the block. Past experience with implementing compilers has made me very scared of name collisions, so I thought it would be a good idea to just rename everything. This added some complexity to the pass that wasn't strictly necessary, and it made it harder for me to understand my own code, because any time I wanted to reference a variable I had to look up what it was renamed to, then look that up in the table. It got confusing figuring out when I should use the renamed name, the source name, or the table index when reconstructing programs, so I decided to scrap this idea and just go with what we discussed in class. This made things more straightforward, and I guess I'll have to hope that programs don't use variable names that collide with my generated ones.

Also, I had to wrap my head around what to do with instructions like function calls, pointer allocations, and pointer frees. We shouldn't eliminate sub-expressions that use these instructions, since they have side effects. I decided that the most straightforward thing to do would be to add these to the table, but if we find a match in the table for a value that is a call, alloc, or free, we shouldn't use it. This was able to solve the bugs that I saw arising from these types of instructions.

testing
I used brench to test my pass on all the benchmarks in the bril repo, with a similar setup to what was demonstrated in class. It was really nice to be able to run everything with a single command. When things were incorrect, I examined the code that my pass was generating and compared it to the original program. Usually this made it clear what was going wrong, and I was able to pinpoint the issue in my code.
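A tiny Python sketch of the "add it, but never reuse it" rule described above for side-effecting operations (an illustration built on the list-of-(value, variable)-rows layout used elsewhere in this thread, not this poster's hashmap):

```python
SIDE_EFFECTS = {"call", "alloc", "free"}

def find_reusable(table, value):
    """Look a value up in the table, refusing to reuse rows from side-effecting ops."""
    if value[0] in SIDE_EFFECTS:
        return None  # two calls (or allocs) are never interchangeable
    for num, (val, var) in enumerate(table):
        if val == value:
            return num, var
    return None

table = [(("call", "rand"), "x"), (("add", 0, 0), "y")]
assert find_reusable(table, ("call", "rand")) is None   # never reuse a call's result
assert find_reusable(table, ("add", 0, 0)) == (1, "y")  # ordinary values are reused
```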
-
TDCE + LVN Repo

Trivial Dead Code Elimination

Local Value Numbering

Testing
We paired our LVN implementation with DCE as a post-processing step and used brench to confirm that all bril benchmarks still behave the same. Overall, I think this work deserves a Michelin star because of our implementation of several of the bonus features (copy propagation, commutativity, and an attempt at constant folding), and we tested our LVN code rigorously after each implementation to ensure that all benchmarks were passing (which they initially weren't after implementing each bonus feature).
-
TDCE. This optimization was fairly straightforward. To make it easier, and hopefully simplify future tasks too, I added some helper functions for traversing the whole code in particular ways, using the visitor pattern. This made it very easy to e.g. collect all uses -- it became a one-liner. The code is pretty declarative too.

LVN. This one was far trickier, mostly because of all the cases involved: does the instruction have a destination? is it a call or an operation? is its value already in the table? was there a previous write to this variable that needs saving? etc. One other tricky part was generating fresh variable names for saving temporary values. I decided on a scheme where I renamed all program variables up front to freshly generated names, so my own temporaries could never collide with them.

Correctness testing. I am preferring to write my own scripts for testing rather than using Turnt, since I already know how to write Bash so it's lower effort. I made a simple script which runs each benchmark with and without a given optimization and checks that the outputs match.

Performance testing. I modified the script to also compare dynamic instruction count, and check that the optimized version is indeed faster. However, at first, it was often slower! This came from my transformation to and from a CFG, which ends up shuffling the basic blocks around (I was storing them in a sorted map, by label) and inserting jumps all over the place. So all the extra dynamic instructions were jumps. To fix this, I implemented an algorithm that orders the basic blocks in a statically-optimal order; that is, one that minimizes static instruction count. After doing this, every program in the core benchmarks folder ran at least as fast with my optimizations.

Conclusion. In summary, I implemented TDCE and LVN, as well as some code traversal helper functions which I think are neat, a simple testing/benchmarking script, and a CFG linearization function that chooses a smart way to order the basic blocks. My optimizations don't change program behavior, sometimes make it faster, and never make it slower. I think this deserves a Michelin star.
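The helpers themselves aren't shown, but a Python sketch of the same visitor idea (hypothetical names, not this poster's code) makes the "collect all uses" one-liner concrete:

```python
def visit_instrs(program, fn):
    """Apply `fn` to every instruction in every function of a Bril program."""
    for func in program["functions"]:
        for instr in func["instrs"]:
            fn(instr)

def all_uses(program):
    """Every variable that is ever read, collected via the visitor in one line."""
    uses = set()
    visit_instrs(program, lambda instr: uses.update(instr.get("args", [])))
    return uses

prog = {"functions": [{"name": "main", "instrs": [
    {"dest": "a", "op": "const", "type": "int", "value": 1},
    {"op": "print", "args": ["a"]},
]}]}
assert all_uses(prog) == {"a"}
```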
-
LVN 🎉

Note: This post has images and embedded interactive code previews which may take a few seconds to load.

Source Code
I spent a few hours configuring CI for both of the above optimizations. The CI takes a bit to run because of some benchmarks that required extra configuration. As usual, all the passes are scripts that either take in Bril programs from standard input or a file and produce Bril's textual representation. I made some small tweaks to the tooling, and I wrote a wrapper script that you can pipe the output of bril2json into.
My total time spent was 10 hours according to Toggl.

Trivial Dead Code Elimination
I implemented both dead code elimination passes shown in the video and tested them by running them over the existing test programs.

Local Value Numbering
I implemented LVN with a basic value interner. It supports every single Bril instruction and program and works on every single benchmark. To make sure of that, I tested my implementation by running it over all of the benchmarks. I had two issues when implementing LVN.
Note: One thing to note is that my strategy for coming up with new names is not entirely robust. I have a counter that strictly increments whenever a new temporary is needed (that is, before the final assignment in a basic block) and gets appended to the generated name.

Extensions

Commutativity
$ bril2json <../bril/examples/test/lvn/commute.bril | ../target/debug/lvn | diff ../bril/examples/test/lvn/commute.bril -
1,2c1
< # (a + b) * (b + a)
< @main {
---
> @main() {
6,7c5,6
< sum2: int = add b a;
< prod: int = mul sum1 sum2;
---
> sum2: int = id sum1;
> prod: int = mul sum1 sum1;

Constant Folding
$ bril2json <simple_fold.bril | ../target/debug/lvn | diff simple_fold.bril -
1c1
< @main {
---
> @main() {
4c4
< c: int = add a b;
---
> c: int = const 3;

Performance
Here's a quick comparison of how the optimization passes affect performance (I didn't include the LVN extensions). I wrote a simple script to collect the numbers, so you can regenerate the data yourself. I graphed the data -- that seemed like low-hanging fruit. The massive range of orders of magnitude meant that even after I split the data, some small benchmark times were not visible, and the low resolution meant not all benchmark names appear on the x-axis. The important thing to notice is that the green bar is smaller and the yellow bar is usually smaller, so TDCE / LVN + TDCE is doing something. (You can make out the key at the top of the first chart.)
-
DCE: source
LVN: source

I also implemented canonicalization for known commutative operations, converting values into a standardized string format to compare equality. I included copy propagation by adding a special case for id instructions.

Testing and Evaluation: I tested my TDCE implementation by comparing the output with the Bril TDCE tests using Turnt, and I eventually familiarized myself with Brench to more easily observe the performance difference over all of the benchmarks. Similarly, I tested the correctness of LVN (and LVN combined with DCE) using Brench, and I manually inspected many of the output programs to ensure that the renaming and copy propagation behavior was as expected. I observed the performance difference (in terms of dynamic instruction count) of LVN by running Brench over the core benchmarks. My optimization reduces the total dynamic instruction count by 56% over the LVN test suite, and by 14% over the core benchmarks.

Conclusions: Implementing LVN was a fun activity, as it allowed me to explore the edge cases of renaming and live-ins, and it was satisfying to observe the instruction count difference after optimization. Overall, I think my work deserves a Michelin star because of my implementations with extensions and thorough testing and analysis across all Bril benchmarks.
-
I read somewhere recently that "Python is the second best language for every programming task", so I decided to use Python for this task's implementation instead of OCaml; I think they were right. I <3 Python, but I digress.

TDCE
For my tdce, I implemented a local one that operates on basic blocks and a global one that operates on functions.

Local TDCE
This implementation was similar to the one covered in class, where you iterate until convergence. Each pass consists of finding which variables in the block get re-defined, checking if there are any usages of those variables in between definitions, and removing them if not.

Global TDCE
For this one, I tried two implementations, but ultimately settled on the simpler one because it didn't lead to timeouts. The simpler implementation was simply checking which definitions never get used in an entire function, and removing those definitions. The other implementation I attempted took it a step further: once a definition gets removed, it re-evaluates all the arguments it used, since those arguments now have one less use. I created several more data structures to keep track of instructions where variables were used and defined, and continually tried removing instructions. I had a lot of timeouts with this implementation, though.

LVN
This part definitely was the trickiest, for several reasons. There were many edge cases to consider, especially with constants. For the most part, this implementation followed the one demoed in class. For unknown variables, though, I make the assumption that they existed before the basic block, so I just add an entry in the table as a placeholder. Although Python was nice for writing fewer lines of code, it really came back to bite me hard for one tricky case that took a while to catch: Python considers "1 == True" to be True. I did implement some kind of mechanism to reduce computation if there was a renaming of a variable that held a number in the LVN table, and I did this by replacing the variable in the table with another variable pointing to this entry. I attempted to rename things with fresh variables, but it felt quite wrong for a local optimization, since I felt like renaming those variables would have consequences for future blocks. Update: implemented copy prop, commutativity, constant prop, and constant folding!

Correctness
I ran brench on the core Bril benchmarks to check for correctness and reductions in instruction count. All my optimizations preserve correctness, which is relieving; but the decrease in instruction count was a little disappointing. I think this may be due to people writing benchmarks that don't involve a lot of dead code. I did run my optimizations through a couple more contrived programs to make sure the optimizations were actually working, though. Update: after the extensions were implemented, the most dynamic instructions I cut was 531458, for the delannoy benchmark! That was pretty cool to see. The median instructions cut for the core benchmarks was still 1, though.
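Two of the details mentioned above -- placeholder table entries for variables defined before the block, and keeping int and bool constants apart despite Python's 1 == True -- might look like this as a minimal Python sketch (a generic illustration, not this poster's code):

```python
def const_key(instr):
    # Python treats 1 == True (and 0 == False), so keep the Bril type in the key
    # to stop an int 1 and a bool true from landing in the same table row.
    return ("const", instr["type"], instr["value"])

def number_of(table, env, var):
    """Value number for a variable; live-ins get a fresh opaque placeholder row."""
    if var not in env:
        table.append((("live-in", var), var))  # value unknown: assume it predates the block
        env[var] = len(table) - 1
    return env[var]

assert const_key({"type": "int", "value": 1}) != const_key({"type": "bool", "value": True})
```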
-
Deadcode Elimination
The original deadcode elimination was pretty simple. I just had to look through the instructions for writes to variables that were never read, or two consecutive writes to the same variable with no read in between them. I did stray away from the in-class pseudocode a bit, though - instead of going through instructions in a block in order and keeping track of candidates for elimination, as the in-class pseudocode did, I went through the instructions in reverse, adding them to a new list. I found this easier to code up because it meant that when we find an instruction to eliminate, we're already at the location to eliminate it by simply not adding it to the new list, and we don't need to do an awkward shift to eliminate previous instructions.

Local Value Numbering
I had a bare-bones implementation of LVN up pretty quickly. However, I kept running into edge cases I didn't consider. For example, for a bit my LVN code didn't see the difference between ints and booleans, so long as Python thought they were equal. This means a const int 0 was being copied from a const boolean false. Additionally, my code originally didn't understand that function calls and alloc can't be copied, so it would originally "optimize" a second call or alloc into a copy of an earlier one that looked the same.
I fixed this by adding checks for alloc and function calls, and assigning unique values to them so they would never interfere with each other in the value table. I tested my code for correctness by creating a simple bash script. I copied the benchmarks from the bril repository and ran the original bril code for each one, as well as my optimized bril code, making sure they both printed out the same thing. I found most of the bugs in my program when this test failed, and I manually went through the bril inputs and outputs until I found the error. Once I had LVN working, I decided to implement a couple of the optimizations that it makes possible. I was originally a bit disappointed looking at the optimization of some of the code I generated with ts2bril, because it was clearly suboptimal. The biggest-impact optimization I did was adding copy propagation. The code for this was pretty simple: when creating the instructions, I just need to check whether the instruction is an id and, if so, substitute the original variable it copies.
With copy propagation complete, I also decided to implement constant propagation. To do that, I just went to the case where one variable has the same value as another. In that situation you would typically just emit an id copy, but if the known value is a constant, you can write out a const instead.
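A small Python sketch of that decision point (assuming the ("const", type, value) tuple shape used elsewhere in this thread; not this poster's exact code):

```python
def replace_with_known(instr, row_value, canonical_var):
    """Rewrite an instruction whose value is already in the table:
    emit a const when the value is a known constant (constant propagation),
    otherwise an id copy of the canonical variable (copy propagation)."""
    if row_value[0] == "const":
        return {"op": "const", "dest": instr["dest"],
                "type": instr["type"], "value": row_value[2]}
    return {"op": "id", "dest": instr["dest"],
            "type": instr["type"], "args": [canonical_var]}

new = replace_with_known({"op": "add", "dest": "x", "type": "int", "args": ["a", "b"]},
                         ("const", "int", 3), "c")
assert new == {"op": "const", "dest": "x", "type": "int", "value": 3}
```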
I also implemented commutativity exploitation by simply checking whether an instruction was an add or a mul and, if it was, sorting the arguments. I didn't end up implementing constant folding because it was getting late, but I see how to: when writing the result of an arithmetic expression, check if both arguments are consts, and if they are, just do the math then and there and write the result as a const instead of writing the arithmetic expression.

Conclusion
I do believe my work deserves a Michelin Star, because I implemented LVN, got a bit more practice handling edge cases in intermediate languages, and went further and implemented a couple of the optimizations that LVN allows.
-
Dead code elimination wasn't bad. My code makes a pass to remove unused variables, then another pass to remove unused writes. I couldn't use brench to test that my DCE implementation reduced the number of lines, so I wrote my own shell script to run all the benchmark files with and without optimization, then check that the optimized code produces the same results without taking up more lines. I also manually compared the outputs to the DCE examples to make sure they matched. LVN took much longer, so I didn't manage to implement any of the additional features, and I ran into a number of problems along the way.

Overall, I think the hardest parts of this assignment were getting the shell script to work and handle arguments (which took a lot of Googling and trial and error), as well as wrapping my head around LVN. I think that I deserve a Michelin star for implementing LVN, along with a new tool for parsing the bril scripts and comparing the number of instructions run.
-
Here is the code: link

Implementing the trivial dead code elimination was straightforward, and I faced no difficulties with that. It was way trickier to make LVN work correctly: I spent quite a bit of time fixing my code for all the corner cases. For example, there were cases where I was wrongfully reusing already-computed values. My implementation takes care of copy propagation and takes commutativity into account.

I tested my implementation on all of the benchmarks using Brench; I found it to be quite convenient :) I made plots for the three largest benchmark sets: core, float, and mem. Sadly, gains in performance are not significant on average, and quite often my optimizations slow down the code, judging by the ratio of (old performance) / (optimized performance).

I routinely use Copilot autocompletion when writing code. It helps me to avoid retyping things like 'import matplotlib.pyplot as plt' over and over again. I did not use any text prompts for the assignment.
-
TDCE
I made sure to iterate until convergence! In fact, I factored out a general function for this.
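That general helper isn't shown, but the idea can be sketched in a few lines of Python (a hypothetical stand-in, comparing JSON serializations to detect the fixed point):

```python
import json

def until_convergence(transform, program):
    """Apply `transform` repeatedly until the program stops changing."""
    while True:
        before = json.dumps(program, sort_keys=True)
        program = transform(program)
        if json.dumps(program, sort_keys=True) == before:
            return program
```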
LVN
This part was far trickier and took me quite a while!

Evaluation
I wrote a script that graphs and computes statistics about the speedup (the ratio of dynamic instruction counts before and after optimization):
speedup on core benchmarks
speedup on memory benchmarks Phew, thankfully neither optimization slows programs down (min speedup = 1.0 for both)! They might even speed up programs in the average case (for instance, avg LVN speedup >= 1.1 on core benchmarks). Notably, certain long green bars drastically deviate from all orange and blue ones: this show LVN must be making meaningful optimizations beyond simple TDCE. |
Beta Was this translation helpful? Give feedback.
-
DCE
I implemented trivial dead code elimination with no problems. After implementing LVN and running it on some benchmarks, I kept outputting incorrect programs. I quickly realized that my trivial dead code elimination was deleting assignment instructions I needed in other blocks. To fix this, I implemented a different version that only deletes assignments where the variable is overwritten before being used as an argument. However, when I ran Brench, I saw no reduction in instructions. I thought about it for a bit and couldn't figure it out, so I peeked at the structure in the examples file and realized that the dead code elimination done there retains the same 'used' set over all the blocks, so it was more of a global analysis than I thought we were doing. I reimplemented it with this in mind, and it worked well.

LVN
I thought this was fun. I had to do some thinking about the best way to represent values. I didn't like the feeling of chasing errors, though; it made me think I didn't have a good grasp of the fine details. I implemented copy propagation, constant propagation (although I may have messed one of these up because they merged in my mind while I was implementing them), commutativity, and constant folding (for half the operators; I will come back and do the rest at some point).

Testing
I used Brench (it took a bit to figure out how, though). My program reduces the number of dynamic instructions for plenty of the benchmarks, and for some of them it has no effect. I didn't do any fancy analysis; I just looked at the Brench output for mine compared to Professor Sampson's, and it looked similar. Sorry for it being late. I really ran into a lot of problems due to my incorrect understanding of local analysis, which led to errors.
-
Implementation link

LVN
I decided to have my dead code elimination pass iterate over the whole program in order to avoid issues with a variable being declared in one block but used in a later block. My dead code elimination pass is fairly basic -- it just directly implements the algorithm discussed in class.

Testing
Testing with Brench is where I ran into issues. Unfortunately, as of writing this post I'm still unable to get Brench to output anything to the result csv. I've tried basically every change to the brench.toml I can think of, so I decided to do my writeup first and add my Brench results once I get those sorted out.

Michelin star: I did a good job with my implementation, but since I haven't tested on the full Brench suite I'm not fully convinced it actually works (since honestly I haven't done many "real program length" test cases). I also wasn't able to finish my LVN implementation to my satisfaction by implementing constant folding, so honestly I don't think I really deserve a Michelin star since I didn't meet the standards I set for myself for this assignment.
-
I implemented a super simple lvn pass. It does not do copy propagation or anything particularly fancy. It is also fairly messy. I also implemented a super simple dce optimization pass. It is also somewhat messy.

Correctness
Both passes appear to be correct on all core benchmark programs (see my Brench test file for details). To be blunt: this does not convince me that my passes are correct with respect to all of Bril's extensions or even all possible (core) Bril programs, but you might be convinced by this evidence . . .

Performance
To test that my passes will actually improve performance in some programs, I have two sample programs, which appear to be optimized as intended. Brench output confirms that my optimizations, as implemented and run over the entire core benchmark suite, are not as impressive as one would expect a more semantics-aware lvn pass + dce pass to produce.
-
Overview
I apologize again for posting this so late. Nevertheless, this was again an incredibly fun and exciting assignment. I thought Lesson 3 as a whole was pretty interesting since we consider optimization without dataflow. I recall in CS 4120 that we started covering optimization immediately with control flow graphs and dataflow analysis, and it surprisingly was really fun to put that aside and consider optimizations solely globally across a whole function and within basic blocks without regard for control flow. I was impressed by how elegant and simple the code was while still being incredibly useful. Dataflow is obviously more powerful, but it was freeing to design my own algorithms for these optimizations from scratch without trying to fit a formal framework.

On the implementation side, I've never really programmed in Python. I really enjoy TypeScript, but I decided to wait another week before diving into TypeScript with Kabir in L4 and take the opportunity to get more comfortable hacking together algorithms in Python. I surprisingly gained a lot more experience, mostly from debugging interesting language quirks. It turns out that I didn't know the different operations for sets and lists, and I ended up creating some wacky bugs from not realizing how Python was attempting to help me out by converting between data representations behind the scenes. I'll touch on some of this below.

Dead Code Elimination
Global dead code elimination was pretty straightforward; I simply followed the algorithm from class. Local (block-based) dead code elimination was actually challenging because I decided to complicate it by iterating through the list of instructions in a block in reverse order, rather than following our forwards traversal from class. In my head, this was more intuitive because by the time we reach an instruction, we will know with confidence whether we should delete it, since we've already considered all instructions that will be executed after it, whereas the forward traversal requires us to remember instructions and decide to delete them later. In practice, this ended up being challenging. Perhaps I was just really overwhelmed and tired, but it took me a long time to work through the algorithm and decide what to store, when to check uses vs. defs, and when to delete. I ultimately landed on storing a set of variables that have been defined but not used (going backwards). These variables are candidates for rendering earlier-in-execution instructions dead, because a later instruction makes them redundant. Thus, at each instruction, we first check if it's assigning to a variable that is assigned to later in execution, and if so, we immediately delete it. If not, we consider it as potentially an instruction that subsumes/overwrites earlier instructions and add its destination to the candidates. We then consider the arguments, because if this instruction writes to the same variable it reads from, it cannot render earlier definitions dead, since it is using their definition. After testing and playing around with this, it seems like it's somewhat more efficient than a forwards iteration of local dead code elimination. Since we immediately decide to delete an instruction when we encounter it, we can skip considering its uses, allowing us to more aggressively delete instructions we see later (earlier in execution). I'm not entirely sure about this though, and I'd like to reason about it more if I had more time.
Implementation was quite a pain because I didn't know Python as well as I thought I did. It turns out the way I was adding to sets meant that every character in a string was put in the set, so a definition of "notdone" deleted earlier definitions of "n". In the end, I realized I should just write simple, silly little for loops rather than trying to do fancy things in one line whose semantics I wasn't entirely sure about.

Local Value Numbering
Local value numbering was, as expected, more complex than I thought. I spent way too much time on local trivial dead code elimination, so I had a lot less time for this. I tried to keep the data structures simple, mostly just using dictionaries and tuples. It probably would've been better to actually define a nice data structure for the table, since it was a bit awkward dealing with variables that were read before being defined. There were also weird bugs with calls, and with Python treating False as equivalent to 0!

Testing
Testing was again surprisingly nontrivial! I first tried to set up turnt as I did in L2. Initially I had my programs take in a file path, which required me to put the bril json in a temporary file. I then tried to use bash process substitution, but that didn't work with turnt, so I had to tell it to explicitly use bash. I also used brench to test across all benchmarks in the repo. I did find that dead-branch didn't seem to work with brilirs: both the baseline and opt results were missing.
Reviewing the brench CSV file shows that a lot of benchmarks actually showed some improvement! I was quite surprised to see this since I didn't implement the full bonus points for LVN, but it was fun to see that common subexpression elimination was actually still helpful. Of course, the best part was that it was correct!

⭐️
I know this was really late, but I hope that my work is still considered for a Michelin star. I believe I went quite above and beyond with dead code elimination by implementing my own algorithm rather than the one presented to us, and I still completed local value numbering successfully.
-
The tasks for this lesson include implementing basic dead-code elimination (DCE) and, as the main event, implementing local value numbering (LVN). I ❤️ LVN!