Replies: 8 comments 20 replies
-
Another fun paper! I particularly enjoyed reading about the authors' deployment of the superoptimizer, though §6.2 went over my head. Superoptimizers benefit from small search spaces, and the target architecture, for independent reasons, has a small instruction set. This isn't quite a match made in heaven, however: certain early bifurcations of the code can prevent optimizations later on. I wonder if it would be feasible to sample a variety of options on this spectrum and pick the best one. On a totally separate and non-technical note, I wonder what we think of GA's radically sparse, super-efficient architecture. While it certainly is interesting to explore the extremes, is GA the answer to making computing environmentally sustainable? My knee-jerk reaction is that something that is more usable, even if less efficient, may see wider adoption. Similar, I guess, to the argument that 60% of us dropping beef is better than 10% of us going vegan. I'd love to discuss this either in class or separately!
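To make the "small search space" point concrete: here's a minimal sketch of what enumerative superoptimization looks like over a made-up three-instruction stack machine (my own toy ISA, not the GA144's actual instruction set). The search cost grows roughly as |ISA|^length, so a tiny instruction set directly shrinks the space.

```python
from itertools import product

# Toy stack machine with a hypothetical three-instruction ISA.
def run(prog, x):
    stack = [x]
    for op in prog:
        if op == "dup":
            stack.append(stack[-1])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "shl":          # shift left = multiply by 2
            stack.append(stack.pop() << 1)
    return stack[-1]

def superoptimize(spec, max_len, isa=("dup", "add", "shl")):
    """Return the shortest instruction sequence matching `spec` on test inputs."""
    tests = range(-4, 5)
    for n in range(1, max_len + 1):
        for prog in product(isa, repeat=n):
            try:
                if all(run(prog, x) == spec(x) for x in tests):
                    return prog
            except IndexError:     # stack underflow: illegal program, skip it
                continue
    return None

# x * 4 comes out as two shifts, shorter than any dup/add chain.
print(superoptimize(lambda x: 4 * x, max_len=3))   # -> ('shl', 'shl')
```

A real superoptimizer verifies candidates with a solver rather than a handful of test inputs, but the combinatorial blow-up (and why early structural choices can lock it out of later wins) is visible even at this scale.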
-
It is an interesting read on how to program a low-power manycore spatial architecture, the challenges involved, and how a synthesis-aided method can help. The whole problem is decomposed into 4 main subproblems: partitioning, layout, code separation, and code generation, with most of the effort going into the first and fourth (judging by the space devoted to them). For the Partitioner Synthesizer part (3.4.3), it is said that the SMT solver runs for multiple rounds, repeatedly lowering the upper bound on communication count until no solution can be found. I would assume that this is very time-consuming, and the results in Figure 11 kind of reflect that: the Partition time ranges from 36s to 527s or N/A. I think there could be a trade-off here: relax the communication-count constraint and get a faster partition. Also, the author claimed that
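The iterative tightening in 3.4.3 is, as I understand it, roughly the following loop (a sketch with a stand-in feasibility check; `check_feasible`, `fake_solver`, and the bound-update rule are my own stand-ins, not the paper's actual interface). Each round is one SMT query, so if a single query takes tens to hundreds of seconds, stopping the loop before UNSAT, i.e., relaxing the bound, is exactly that trade-off.

```python
def minimize_comm(check_feasible, start_bound):
    """Lower the communication-count upper bound until the solver says UNSAT,
    then return the last satisfiable (partition, comm_count) pair.

    check_feasible(bound) stands in for one SMT query: it returns a
    (partition, comm_count) pair if a partition exists under `bound`,
    or None when the query is unsatisfiable.
    """
    best = None
    bound = start_bound
    while (result := check_feasible(bound)) is not None:
        best = result
        bound = result[1] - 1       # tighten: demand strictly fewer messages
    return best                     # None if even start_bound was infeasible

def fake_solver(bound):
    TRUE_MIN = 7                    # hypothetical optimum for a toy program
    if bound >= TRUE_MIN:
        return (["block0@core1", "block1@core4"], TRUE_MIN)
    return None

print(minimize_comm(fake_solver, start_bound=100))
```

An early-exit condition (a round budget or a "good enough" bound) dropped into the `while` would give the faster-but-looser partitioner suggested above.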
As there are many spatial architectures targeting NN models, I'm wondering if there's related work that uses program synthesis to map NN applications onto these architectures.
-
The authors say:
This obviously motivates the paper, and I feel like this lack of programmability and burden on the programmer is in the opposite direction of what we want our programming languages to go. I know this is not directly related to the paper, but it's interesting to think about how low-power computing comes with trade-offs that reach this high an abstraction level (the programmer writing in a programming language). The authors also say
While they raise two real reasons why using classical compilers is problematic in this domain, I wonder how this drawback compares to using synthesis-aided compilers that do not scale to large programs.
-
I enjoyed the discussion in this paper about splitting the compilation/synthesis process into multiple components (partitioning, layout, code gen, etc.) and the potential trade-offs of this approach. Clearly, splitting the synthesis problem into multiple sub-problems is very beneficial for increasing the size of programs the compiler can handle. The way the compilation is split into sub-problems is a sort of human heuristic, as opposed to optimizing the entire problem together. This seems like a good trade-off: human-driven heuristics of this kind have been a core part of compilers for a long time, and we still get reasonably good performance, even if it isn't 100% optimal. That being said, the compilation time for these simple benchmarks is pretty brutal (several hours). No matter how easy to program or performant your architecture is, long compilation times are a huge hit to programmer productivity. I wonder how much more engineering effort would be required to get a compiler that runs in a reasonable amount of time. Even if a faster, more heuristic compiler generated designs that are 2-3x slower, it would help significantly with the adoption of these low-power spatial architectures.
-
The GreenArrays spatial processor immediately reminds me of the AI Engine on Xilinx's Versal ACAP platforms. Versal provides tens or hundreds of identical vector cores connected by a network-on-chip, and it also requires programmers to manually program each core and handle communication, i.e., one has to write a .cpp file for each core. I can imagine this partitioning, layout, and superoptimization approach being transferred to modern platforms, which could make spatial-architecture programming much easier. The other thing I like about this paper is the typing rules for partition checking. The typing rules enforce that operands and operations are in the same partition, so communication is checked by the type checker instead of being managed manually. Using typing rules to manage communication also gives some guarantees about the communication routing results.
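A toy version of that check might look like the following (my own invented AST and names, not Chlorophyll's actual typing rules): an operation whose operands carry different partition annotations is rejected unless the value is routed through an explicit send.

```python
class PartitionTypeError(Exception):
    pass

def check(expr, env):
    """Return the partition of `expr`; expressions are nested tuples:
    ('var', name) | ('op', lhs, rhs) | ('send', src_expr, dst_partition)
    """
    kind = expr[0]
    if kind == "var":
        return env[expr[1]]
    if kind == "send":              # explicit communication changes partition
        check(expr[1], env)         # source must itself type-check
        return expr[2]
    if kind == "op":
        p1, p2 = check(expr[1], env), check(expr[2], env)
        if p1 != p2:
            raise PartitionTypeError(f"operands live on {p1} and {p2}")
        return p1
    raise ValueError(kind)

env = {"x": 0, "y": 3}              # x lives on core 0, y on core 3
# x + y without communication is rejected...
try:
    check(("op", ("var", "x"), ("var", "y")), env)
except PartitionTypeError as e:
    print("rejected:", e)
# ...but routing y to core 0 first type-checks.
print(check(("op", ("var", "x"), ("send", ("var", "y"), 0)), env))   # -> 0
```

The nice property is the one noted above: every cross-partition data flow is forced through an explicit, checkable communication node, so the compiler (not the programmer) accounts for every message.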
-
This paper was really cool! Imma be honest, there were definitely some bits that went over my head, but this isn't really a region of computing I've ever thought about before. I was particularly interested in their evaluation numbers: they tout 65% slower than expert handwritten code as an amazing win, which was kind of jarring to read, but I guess this is because there really isn't another option right now, and like they mentioned, it took a whole summer to learn how to write the low-level code, let alone become an expert. It seems like this is really the first step in writing compilers for the GA144, and future work might be able to apply more optimizations, but this does leave the question of what these chips are for. I found this thread, but didn't really see a clear answer. Interestingly enough, you can have one of these chips for the low price of $20, which is pretty cool, but as far as I can tell it's a novelty you can play with and isn't really used anywhere, though this might not be true.
-
This paper's description of how to translate a high-level language to a low-level language for a minimalist architecture intrigued me. In particular, I found it interesting how the compiler automatically handles much of the complexity a programmer would face when writing code for the GA144. For example, partitioning, something that would likely be difficult and error-prone for a human programmer, is handled automatically. Likewise, sends and receives between cores are also generated automatically. For me, this really supports the claim that this compiler makes programming the GA144 much more productive, and also more correct. I'm also interested in whether the partition-type synthesis techniques could be used in other domains. The paper lists the example of streaming applications. I wonder if partition types could also be used to optimize matrix computations, for instance, splitting them into smaller subproblems across many cores and then combining the results using a send-receive style of communication.
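As a rough illustration of that idea (plain Python with a queue standing in for inter-core channels; nothing here is the paper's actual API), a block-row matrix-vector product can be split across "cores" and combined with send/receive-style messages:

```python
from queue import Queue

def core_worker(rows, x, out_q):
    """One 'core' computes its block of y = A @ x and sends the partial result."""
    partial = [sum(a * b for a, b in zip(row, x)) for row in rows]
    out_q.put(partial)              # 'send' to the combining core

def matvec_partitioned(A, x, n_cores=2):
    q = Queue()
    block = (len(A) + n_cores - 1) // n_cores
    blocks = [A[i:i + block] for i in range(0, len(A), block)]
    for rows in blocks:
        core_worker(rows, x, q)     # sequential stand-in for parallel cores
    y = []
    for _ in blocks:                # 'receive' each partial and combine in order
        y.extend(q.get())
    return y

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(matvec_partitioned(A, [1, 1]))    # -> [3, 7, 11, 15]
```

The interesting part for partition types would be checking statically that each block's rows, and the combine step, live on consistent partitions, which this sketch of course does by hand.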
-
The main idea presented in this paper is a compiler that allows programs for spatial architectures to be written in a high-level programming model. The compiler uses synthesis to generate programs for a spatial architecture, specifically the GreenArrays GA144. The key motivation behind this work is the complexity of writing highly optimized programs for such architectures, a task that requires a lot of expertise and hardware-specific knowledge. For example, the machine code of the GA144 is stack-based, which most developers are not familiar with. Chlorophyll lifts this complexity off developers by providing a high-level abstraction and automating program generation using synthesis. One core component of the compiler is the decomposition of compilation into four discrete subproblems: (1) partitioning, (2) layout, (3) code separation, and (4) code generation. Using synthesis, each of these problems can be addressed individually. While I am unfamiliar with this area, the potential 65% slowdown of the generated programs compared with human-crafted, optimized ones looks a little high. However, the payoff of democratizing programming for such architectures seems worth the potential performance degradation. It would also be interesting to explore how easily this abstraction extends to different architectures, and how much effort that would take.
-
This is the discussion thread for the paper:
I (Hongzheng Chen) will be the discussion leader. Please post your thoughts and questions here.