Replies: 8 comments 20 replies
-
Another fun paper! I particularly enjoyed reading about the authors' deployment of the superoptimizer, though §6.2 went over my head. Superoptimizers benefit from small search spaces, and the target architecture, for independent reasons, has a small instruction set. This isn't quite a match made in heaven, however: certain early bifurcations of the code can prevent optimizations later on. I wonder if it would be feasible to sample a variety of options on this spectrum and pick the best one. On a totally separate and non-technical note, I wonder what we think of GA's radically sparse, super-efficient architecture. While it certainly is interesting to explore the extremes, is GA the answer to making computing environmentally sustainable? My knee-jerk reaction is that something that is more usable, even if less efficient, may see wider adoption. Similar, I guess, to the argument that 60% of us dropping beef is better than 10% of us going vegan. I'd love to discuss this either in class or separately!
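To make the "small search space" point concrete: here's a minimal sketch of what enumerative superoptimization looks like over a made-up three-instruction stack machine (my own toy ISA, not the GA144's actual instruction set). The search cost grows roughly as |ISA|^length, so a tiny instruction set directly shrinks the space.

```python
from itertools import product

# Toy stack machine with a hypothetical three-instruction ISA.
def run(prog, x):
    stack = [x]
    for op in prog:
        if op == "dup":
            stack.append(stack[-1])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "shl":          # shift left = multiply by 2
            stack.append(stack.pop() << 1)
    return stack[-1]

def superoptimize(spec, max_len, isa=("dup", "add", "shl")):
    """Return the shortest instruction sequence matching `spec` on test inputs."""
    tests = range(-4, 5)
    for n in range(1, max_len + 1):
        for prog in product(isa, repeat=n):
            try:
                if all(run(prog, x) == spec(x) for x in tests):
                    return prog
            except IndexError:     # stack underflow: illegal program, skip it
                continue
    return None

# x * 4 comes out as two shifts, shorter than any dup/add chain.
print(superoptimize(lambda x: 4 * x, max_len=3))   # -> ('shl', 'shl')
```

A real superoptimizer verifies candidates with a solver rather than a handful of test inputs, but the combinatorial blow-up (and why early structural choices can lock it out of later wins) is visible even at this scale.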
-
It is an interesting read on how to program a low-power manycore spatial architecture, the challenges involved, and how a synthesis-aided method can help. The whole problem is decomposed into 4 main subproblems: partitioning, layout, code separation, and code generation, with most of the effort going into the first and fourth (judging by the space devoted to them). For the Partitioner Synthesizer part (3.4.3), it is said that the SMT solver runs for multiple rounds, repeatedly lowering the upper bound on communication count until no solution can be found. I would assume that this is very time-consuming, and the results in Figure 11 kind of reflect that: the Partition time ranges from 36s to 527s or N/A. I think there could be a trade-off here: relax the communication-count constraint and get a faster partition. Also, the author claimed that
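The iterative tightening in 3.4.3 is, as I understand it, roughly the following loop (a sketch with a stand-in feasibility check; `check_feasible`, `fake_solver`, and the bound-update rule are my own stand-ins, not the paper's actual interface). Each round is one SMT query, so if a single query takes tens to hundreds of seconds, stopping the loop before UNSAT, i.e., relaxing the bound, is exactly that trade-off.

```python
def minimize_comm(check_feasible, start_bound):
    """Lower the communication-count upper bound until the solver says UNSAT,
    then return the last satisfiable (partition, comm_count) pair.

    check_feasible(bound) stands in for one SMT query: it returns a
    (partition, comm_count) pair if a partition exists under `bound`,
    or None when the query is unsatisfiable.
    """
    best = None
    bound = start_bound
    while (result := check_feasible(bound)) is not None:
        best = result
        bound = result[1] - 1       # tighten: demand strictly fewer messages
    return best                     # None if even start_bound was infeasible

def fake_solver(bound):
    TRUE_MIN = 7                    # hypothetical optimum for a toy program
    if bound >= TRUE_MIN:
        return (["block0@core1", "block1@core4"], TRUE_MIN)
    return None

print(minimize_comm(fake_solver, start_bound=100))
```

An early-exit condition (a round budget or a "good enough" bound) dropped into the `while` would give the faster-but-looser partitioner suggested above.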
As there are many spatial architectures targeting NN models, I'm wondering if there's related work that uses program synthesis to map NN applications onto these architectures.
-
The authors say:
This obviously motivates the paper, and I feel like this lack of programmability and burden on the programmer is in the opposite direction of what we want our programming languages to go. I know this is not directly related to the paper, but it's interesting to think about how low-power computing comes with trade-offs that reach this high an abstraction level (the programmer writing in a programming language). The authors also say
While they raise two real reasons why using classical compilers is problematic in this domain, I wonder how this drawback compares to using synthesis-aided compilers that do not scale to large programs.
-
I enjoyed the discussion in this paper about splitting the compilation/synthesis process into multiple components (partitioning, layout, code gen, etc.) and the potential trade-offs of this approach. Clearly, splitting the synthesis problem into multiple sub-problems is very beneficial for increasing the size of programs the compiler can handle. The way the compilation is split into sub-problems is a sort of human heuristic, as opposed to optimizing the entire problem together. This seems like a good trade-off: human-driven heuristics of this kind have been a core part of compilers for a long time, and we still get reasonably good performance, even if it isn't 100% optimal. That being said, the compilation time for these simple benchmarks is pretty brutal (several hours). No matter how easy to program or performant your architecture is, long compilation times are a huge hit to programmer productivity. I wonder how much more engineering effort would be required to get a compiler that runs in a reasonable amount of time. Even if a faster, more heuristic compiler generated designs that are 2-3x slower, it would help significantly with the adoption of these low-power spatial architectures.
-
The GreenArrays spatial processor immediately reminds me of the AI Engine on Xilinx's Versal ACAP platforms. Versal provides tens or hundreds of identical vector cores connected by a network-on-chip, and it also requires programmers to manually program each core and handle communication, i.e., one has to write a .cpp file for each core. I can imagine this partitioning, layout, and superoptimization approach being transferred to modern platforms, which could make spatial-architecture programming much easier. The other thing I like about this paper is the typing rules for partition checking. The typing rules enforce that operands and operations are in the same partition, so communication is checked by the type checker instead of being managed manually. Using typing rules to manage communication also gives some guarantees about the communication routing results.
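A toy version of that check might look like the following (my own invented AST and names, not Chlorophyll's actual typing rules): an operation whose operands carry different partition annotations is rejected unless the value is routed through an explicit send.

```python
class PartitionTypeError(Exception):
    pass

def check(expr, env):
    """Return the partition of `expr`; expressions are nested tuples:
    ('var', name) | ('op', lhs, rhs) | ('send', src_expr, dst_partition)
    """
    kind = expr[0]
    if kind == "var":
        return env[expr[1]]
    if kind == "send":              # explicit communication changes partition
        check(expr[1], env)         # source must itself type-check
        return expr[2]
    if kind == "op":
        p1, p2 = check(expr[1], env), check(expr[2], env)
        if p1 != p2:
            raise PartitionTypeError(f"operands live on {p1} and {p2}")
        return p1
    raise ValueError(kind)

env = {"x": 0, "y": 3}              # x lives on core 0, y on core 3
# x + y without communication is rejected...
try:
    check(("op", ("var", "x"), ("var", "y")), env)
except PartitionTypeError as e:
    print("rejected:", e)
# ...but routing y to core 0 first type-checks.
print(check(("op", ("var", "x"), ("send", ("var", "y"), 0)), env))   # -> 0
```

The nice property is the one noted above: every cross-partition data flow is forced through an explicit, checkable communication node, so the compiler (not the programmer) accounts for every message.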
-
This paper was really cool! Imma be honest, there were definitely some bits that went over my head, but this isn't really a region of computing I've ever thought about before. I was particularly interested in their evaluation numbers: they tout 65% slower than expert handwritten code as an amazing win, which was kind of jarring to read, but I guess this is because there really isn't another option right now, and like they mentioned, it took a whole summer to learn how to write the low-level code, let alone become an expert. It seems like this is really the first step in writing compilers for the GA144, and future work might be able to apply more optimizations, but this does leave the question of what these chips are for. I found this thread, but didn't really see a clear answer. Interestingly enough, you can have one of these chips for the low price of $20, which is pretty cool, but as far as I can tell it's a novelty you can play with and isn't really used anywhere, though this might not be true.
-
This paper's description of how to translate a high-level language to a low-level language for a minimalist architecture intrigued me. In particular, I found it interesting how the compiler automatically handles much of the complexity a programmer would face when writing code for the GA144. For example, partitioning, something that would likely be difficult and error-prone for a human programmer, is handled automatically. Likewise, sends and receives between cores are also generated automatically. For me, this really supports the claim that this compiler makes programming the GA144 much more productive, and also more correct. I'm also interested in whether the partition-type synthesis techniques could be used in other domains. The paper lists the example of streaming applications. I wonder if partition types could also be used to optimize matrix computations, for instance, splitting them into smaller subproblems across many cores and then combining the results using a send-receive style of communication.
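As a rough illustration of that idea (plain Python with a queue standing in for inter-core channels; nothing here is the paper's actual API), a block-row matrix-vector product can be split across "cores" and combined with send/receive-style messages:

```python
from queue import Queue

def core_worker(rows, x, out_q):
    """One 'core' computes its block of y = A @ x and sends the partial result."""
    partial = [sum(a * b for a, b in zip(row, x)) for row in rows]
    out_q.put(partial)              # 'send' to the combining core

def matvec_partitioned(A, x, n_cores=2):
    q = Queue()
    block = (len(A) + n_cores - 1) // n_cores
    blocks = [A[i:i + block] for i in range(0, len(A), block)]
    for rows in blocks:
        core_worker(rows, x, q)     # sequential stand-in for parallel cores
    y = []
    for _ in blocks:                # 'receive' each partial and combine in order
        y.extend(q.get())
    return y

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(matvec_partitioned(A, [1, 1]))    # -> [3, 7, 11, 15]
```

The interesting part for partition types would be checking statically that each block's rows, and the combine step, live on consistent partitions, which this sketch of course does by hand.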
-
The main idea presented in this paper is a compiler that allows programs for spatial architectures to be written in a high-level programming model. The compiler uses synthesis to generate programs for a spatial architecture, specifically the GreenArrays GA144. The key motivation behind this work is the complexity of writing highly optimized programs for such architectures, a task that requires a lot of expertise and hardware-specific knowledge. For example, the machine code of the GA144 is stack-based, which most developers are not familiar with. Chlorophyll lifts this complexity off developers by providing a high-level abstraction and automating program generation using synthesis. One core component of the compiler is the decomposition of compilation into four discrete subproblems: (1) partitioning, (2) layout, (3) code separation, and (4) code generation. Using synthesis, each of these problems can be addressed individually. While I am unfamiliar with this area, the potential 65% slowdown of the generated programs compared with human-crafted, optimized ones looks a little high. However, the payoff of democratizing programming for such architectures seems worth the potential performance degradation. It would also be interesting to explore how easily this abstraction extends to different architectures, and how much effort that would take.
-
This is the discussion thread for the paper:
I (Hongzheng Chen) will be the discussion leader. Please post your thoughts and questions here.