-
Notifications
You must be signed in to change notification settings - Fork 0
Global Optimization
The following section describes the part, which maps the APT to the cluster and target architectures with respect to several global optimizations. The distinctive feature of the optimization tool is that the mapping and the global optimizations are generated by search search rather than rules. The according search algorithm is completely implemented in the Pattern-OPT sub-project, which consists of the following major parts:
The Pattern-Opt sub-project exposes several classes as basic abstraction of the mathematical concepts. This serves the purpose of a direct correspondence of the Java objects to the mathematical structures. Furthermore, several classes are important to control the granularity, on which the mapping is investigated, over hyperparameters. To this end, the following classes fall under this category:
- Team: The team class summarizes multiple cores belonging to the same cache group. Teams need not use all cores of this group but are exclusive, i.e., there is no other team for the same cache group. The Teams class provides static methods for generating and modifying teams. Important: The same team (Java object) may not be used in two different time steps, since modification of the later team changes the former one. In this case, two different java objects must be created referring to the same cache group in two different steps. Accordingly, the Teams class always clones the team objects.
- DataSplit: DataSplit objects are simply smaller units of ArrayData. For later data dependency analysis based on theses objects, there must not be two DataSplit objects referring to the same indices of an array. Therefore, the DataSplit objects are created initially and stored in a lookup table called DataSplitTable.
- PatternSplit: Basically, a PatternSplit wraps a PatternNode into a node of the FlatAPT, which is explained in the next section. The important use case is the implementation as a ParallelPatternSplit, which splits the parallel workload associated with a ParallelCallNode into smaller units. In detail, a parallel workload typically refers to an interval of iterations, which is split accordingly. A split typically exposes several dimensions, which correspond to multidimensional splits (stencil) or splits over time (recurrences).
- Mapping: The mapping class defines the objects of the search space. Basically, the class is a wrapper of the LinkedList over StepMapping objects, which is used in form of stacks.
- StepMapping: The StepMapping class defines the mapping of a single step, which assigns PatternSplit objects to teams. The class is a wrapper of a HashMap. Note that some PatternSplit implementations are assigned automatically by the induced serial context. In fact, only ParallelPatternSplit objects and IOSplits are assigned to teams. TODO (Adrian): Update according to your needs.
A more formal explanation of the underlying mathematical concepts and the search can be found in Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures.
The FlatAPT is an optimized APT, which eliminates unnecessary linearization of the provided execution order of PatternNodes. Basically, the FlatAPT is however a simple data structure with an advanced insertion method:
- Data structure: Every parallel algorithm can be written within a single context, i.e., in a single main function. While this is hard to be read by developers, this is optimal for static analysis. The FlatAPT is such a flattened version of an APT, where all sub-contexts induced by serial function calls are removed. To this end, the FlatAPT is a stack of different levels. Each level contains nodes, which may be executed in parallel. Subsequent levels contain nodes, which must be executed after another. The nodes of this FlatAPT are PatternSplit objects.
- Insertion: The FlatAPT must be constructed from the APT in a specific way in order to have a correct representation. At first, the PatternNode objects of the original APT must be inserted in-order, which is done in the FlatAPTGenerator class. At second, the PatternNode objects must be wrapped into PatternSplit objects (and split accordingly). At third, the new split objects must be inserted into the correct level. To this end, the splits are inserted in the last level and then bubble up until read-after-write data dependencies are encountered. For reasons of complexity, the data dependency analysis is evaluated on the basis of disjoint DataSplit objects.
A formal explanation of the algorithm can be found in Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures.
The costs package of the Pattern-OPT project defines the PerformanceModel interface, which is used to define the performance model to be optimized by the search algorithm in the MappingGenerator class. This implementation provides a simple way to modularize and exchange the performance model. In fact, the interface directly defines a Java representation of the Inter-Node Dataflow Efficiency criterion. However, the methods need to be interpreted in a specific way, since they are called in different contexts:
- A-priori: executionCosts, networkCosts (RooflineModel: bandwidth), communicationCosts (Additional latency) are a-priori costs, which are used as the coefficients of the ILP when determining an assignment of PatternSplit objects to teams.
- Posteriori: evaluate is a strict posterior method, which evaluates a StepMapping object.
The MappingGenerator class defines the logic to derive the mapping to the cluster. It therefore iterates over the steps of the FlatAPT and generates according StepMapping objects. The Java implementation is mainly analogous to the pseudo code of the original algorithm and the math can therefore be best understood by reading through Automatic Mapping of Parallel Pattern-Based Algorithms on Heterogeneous Architectures, which explains the pseudo code in more detail.
An important aspect for practical usage are however the choice of the hyperparameter, i.e., the granularity of analysis
- PatternSplitSize: A small value implies high granularity of analysis, which may however not be desirable for two reasons. At first, a small value may rapidly yield a runtime of several hours. Typically, a result can be obtained within minutes when the PatternSplitSize is chosen such that the number of PatternSplit objects per step does not exceed 10-20 splits. At second, the implemented roofline model provides an abstraction of the hardware on a high level. Accordingly, the analysis must be on a high level to be accurate, i.e., smaller granularity makes performance aspects not being modeled by the roofline model relevant. For instance, when pattern splits become smaller, the behavior of the memory controller and communication can not be modeled accurately by a bandwidth and latency parameter only. The optimization then underestimates the costs for parallelization and yields "over-parallelized" mappings. A good strategy is to lower the value gradually until such behavior can be observed. TODO: Implement this strategy to automatically find value.
- DataSplitSize: The parameter defines the number of consecutive indices of an array associated by a DataSplit object. A small value is always desirable as this allows to better capture performance improvements due to caching. For instance, when the paramter's value is equal to the size of the largest array, performance improvements because of cache blocking can not be recognized by the optimization. However, the execution time may heavily depend on the parameters for parallel patterns with strided and complex memory accesses like a stencil, since every access (not split) corresponds to an integer variable in each ILP. A rule of thumb may be to chose the value according to a 8th-16th of the average array size. Again, this value should be gradually scaled down. For algorithms of heavily varying array sizes, there is no good strategy developed so far.