This repository contains the work for the first assignment of the HPC course at the University of Trieste. The goal is to study how MPI processes interact and to run some benchmarks on the ORFEO data center (https://orfeo-documentation.readthedocs.io/en/latest/).
- Implement in C or C++ an MPI program using P processors on a ring (i.e. a simple 1D topology where each processor has a left and a right neighbour). The program should implement a stream of messages in both directions:
- As a first step, processor P sends a message ( msgleft = rank ) to its left neighbour (P-1) and receives from its right neighbour (P+1); it also sends another message ( msgright = -rank ) to P+1 and receives from P-1.
- It then iterates until all processors have received back their initial messages. At each iteration, each processor adds its rank to the received message if it comes from the left and subtracts it if it comes from the right. Both messages originating from a given processor P should carry a tag proportional to its rank (i.e. itag = P*10). The code should print to a file the following output:
I am process irank and I have received np messages. My final messages have tag itag and value msg-left,msg-right
where irank, np, itag and the msg values are parameters printed by the program. Make sure that your code produces the correct answer independently of the number of processes used. Time the program with the MPI_Wtime routine and produce a plot of the runtime as a function of P. If the total time is very small, you might need to repeat the iterations several times to get reasonable measurements. Model the network performance and discuss scalability in terms of the number of processors. A minimal sketch of the ring exchange is given below.
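The following is only one possible interpretation of the exercise, assuming blocking MPI_Sendrecv, tags that travel with the originating message, and P iterations so that every message returns to its origin; the output file name `ring_output.txt` is illustrative:

```c
/* Minimal sketch of the ring exchange, assuming blocking MPI_Sendrecv.
 * Conventions taken from the text above: msgleft = rank travels leftwards,
 * msgright = -rank travels rightwards, itag = 10 * rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* ring neighbours                */
    int right = (rank + 1) % size;
    int itag  = 10 * rank;

    int msgleft = rank, msgright = -rank;   /* messages injected by this rank */
    int tag_l = itag, tag_r = itag;         /* tags travel with the messages  */
    int from_right, from_left, nmessages = 0;
    MPI_Status st;

    double t0 = MPI_Wtime();
    for (int iter = 0; iter < size; iter++) {
        /* leftward stream: send to the left, receive from the right */
        MPI_Sendrecv(&msgleft, 1, MPI_INT, left, tag_l,
                     &from_right, 1, MPI_INT, right, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
        tag_l = st.MPI_TAG;                 /* keep the originator's tag      */
        /* rightward stream: send to the right, receive from the left */
        MPI_Sendrecv(&msgright, 1, MPI_INT, right, tag_r,
                     &from_left, 1, MPI_INT, left, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
        tag_r = st.MPI_TAG;
        nmessages += 2;
        msgleft  = from_right - rank;       /* came from the right: subtract  */
        msgright = from_left  + rank;       /* came from the left:  add       */
    }
    double elapsed = MPI_Wtime() - t0;

    /* hypothetical output file; concurrent appends kept simple on purpose */
    FILE *fp = fopen("ring_output.txt", "a");
    fprintf(fp, "I am process %d and I have received %d messages. "
                "My final messages have tag %d and value %d,%d\n",
            rank, nmessages, itag, msgleft, msgright);
    fclose(fp);

    if (rank == 0) printf("elapsed time: %g s\n", elapsed);
    MPI_Finalize();
    return 0;
}
```

Having every rank append to the same file is a simplification; gathering the per-rank lines on rank 0 before writing would be more robust.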
- Implement a simple 3D matrix-matrix addition in parallel using a 1D, 2D and 3D distribution of the data via MPI virtual topologies, and study its scalability in terms of communication within a single THIN node. Use only collective operations to communicate among MPI processes. The program should accept the matrix sizes as input, then allocate the matrices and initialize them with double-precision random numbers.
Model the network performance and try to identify the best distribution, given the topology of the node you are using, for the following sizes:
- 2400 x 100 x 100;
- 1200 x 200 x 100;
- 800 x 300 x 100;
Discuss the performance for the three domains in terms of the 1D/2D/3D distribution, keeping the number of processors constant at 24. Provide a table with all possible distributions for the 1D, 2D and 3D partitions and report the timings. A sketch of the topology setup is given below.
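A minimal sketch of the decomposition setup in C; the command-line handling, the default size and the assumption of evenly divisible block sizes are illustrative choices, and the actual distribution of the root's matrices (e.g. MPI_Scatterv / MPI_Gatherv) is omitted for brevity:

```c
/* Sketch of the 1D/2D/3D decomposition via an MPI Cartesian topology. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int ndims = (argc > 1) ? atoi(argv[1]) : 3;        /* 1, 2 or 3            */
    int N[3]  = {2400, 100, 100};                      /* default domain size  */
    for (int d = 0; d < 3 && argc > 2 + d; d++) N[d] = atoi(argv[2 + d]);

    int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
    MPI_Dims_create(size, ndims, dims);                /* e.g. 24 -> 4 x 3 x 2 */
    for (int d = ndims; d < 3; d++) dims[d] = 1;       /* unused dims stay 1   */

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, periods, 0, &cart);

    int coords[3] = {0, 0, 0};                         /* grid position, used  */
    MPI_Cart_coords(cart, rank, ndims, coords);        /* to carve local block */

    int local[3];                                      /* local block extents  */
    for (int d = 0; d < 3; d++)
        local[d] = N[d] / dims[d];                     /* assumes divisibility */

    size_t n = (size_t)local[0] * local[1] * local[2];
    double *A = malloc(n * sizeof *A);
    double *B = malloc(n * sizeof *B);
    double *C = malloc(n * sizeof *C);
    srand(rank + 1);
    for (size_t i = 0; i < n; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
    }

    double t0 = MPI_Wtime();
    for (size_t i = 0; i < n; i++) C[i] = A[i] + B[i]; /* local addition       */
    double t = MPI_Wtime() - t0, tmax;
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, cart);

    if (rank == 0)
        printf("ndims = %d, dims = %d x %d x %d, max local time = %g s\n",
               ndims, dims[0], dims[1], dims[2], tmax);

    free(A); free(B); free(C);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

With 24 processes, enumerating all admissible values of `dims` for ndims = 1, 2, 3 gives the rows of the requested timing table.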
- Use the Intel MPI Benchmarks to estimate latency and bandwidth for all available combinations of topologies and networks on the ORFEO computational nodes. Use both the IntelMPI and the latest OpenMPI libraries available on ORFEO. Report numbers/graphs and fit the data obtained against the simple communication model discussed in class. Compare the estimated latency and bandwidth parameters against those obtained from a least-squares fit (a fitting sketch is given after the sample output below). Provide a csv file for each measurement with the following format and provide a graph (pdf and/or jpeg or any image format you like):
#header_line 1: command line used
#header_line 2: list of nodes involved
#header_line 3: lambda, bandwidth computed by fitting the data
#header: #bytes #repetitions t[usec] Mbytes/sec t[usec] (computed) Mbytes/sec (computed)
0 1000 0.20 0.00
1 1000 0.20 5.10
2 1000 0.19 10.74
4 1000 0.19 21.55
8 1000 0.19 42.33
16 1000 0.19 84.66
32 1000 0.23 138.12
64 1000 0.23 277.57
128 1000 0.31 419.05
256 1000 0.34 760.29
512 1000 0.39 1329.67
1024 1000 0.50 2048.23
2048 1000 0.75 2716.02
4096 1000 1.13 3628.17
8192 1000 1.88 4360.87
16384 1000 2.98 5499.05
32768 1000 4.85 6760.47
65536 640 8.36 7835.13
131072 320 15.14 8657.80
262144 160 12.64 20742.31
524288 80 22.30 23508.26
1048576 40 50.52 20754.20
2097152 20 138.86 15102.29
4194304 10 308.95 13575.88
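For reference, the two "computed" columns in the header above are interpreted here through the simple linear point-to-point model, with λ the latency, B the asymptotic bandwidth and n the message size:

$$T(n) = \lambda + \frac{n}{B}, \qquad B_{\mathrm{eff}}(n) = \frac{n}{T(n)} = \frac{n}{\lambda + n/B}$$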
Discuss in detail the performance obtained. In particular, discuss the differences in performance between the networks (Infiniband vs Gigabit Ethernet) and compare the different MPI implementations. Check whether THIN and GPU nodes behave the same way.
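A minimal sketch of the least-squares fit of λ and B, assuming the linear model above; it reads two whitespace-separated numbers per line (message size in bytes, time in μs) from standard input, so the relevant columns have to be extracted from the csv first:

```c
/* Sketch: least-squares fit of t(n) = lambda + n/B to (message size, time)
 * pairs, e.g. the "#bytes" and "t[usec]" columns of an IMB PingPong run. */
#include <stdio.h>

int main(void) {
    double n, t, Sn = 0, St = 0, Snn = 0, Snt = 0;
    long   N = 0;

    while (scanf("%lf %lf", &n, &t) == 2) {
        Sn  += n;      St  += t;
        Snn += n * n;  Snt += n * t;
        N++;
    }
    if (N < 2) { fprintf(stderr, "need at least two samples\n"); return 1; }

    /* closed-form simple linear regression: t = lambda + invB * n */
    double invB   = (N * Snt - Sn * St) / (N * Snn - Sn * Sn);
    double lambda = (St - invB * Sn) / N;

    printf("lambda    = %g usec\n", lambda);
    /* 1/invB is in bytes/usec, i.e. roughly Mbytes/sec (the exact Mbyte
       convention depends on the IMB version) */
    printf("bandwidth = %g Mbytes/sec\n", 1.0 / invB);
    return 0;
}
```

Assuming the whitespace-separated layout of the sample above, something like `awk '!/^#/ {print $1, $3}' pingpong.csv | ./fit` would feed it the first and third columns while skipping the header lines.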
The application used in this part is the Jacobi solver already discussed in class.
Steps to do:
- Compile and run the code on a single processor of a THIN and a GPU node to estimate the serial time on a single core.
- Run it on 4/8/12 processes within the same node, pinning the MPI processes within the same socket and across the two sockets.
- Identify for the two cases the right latency and bandwidth to use in the performance model.
- Report the results and check whether the scalability fits the expected behaviour of the model discussed in class (see the model sketch after this list).
- Run it on 12/24/48 processes using two THIN nodes.
- Identify for this case the right latency and bandwidth to use in the performance model.
- Report the results and check whether the scalability fits the expected behaviour of the model discussed in class.
- Repeat the previous experiment on a GPU node where hyperthreading is enabled.
- Identify for this case the right latency and bandwidth to use in the performance model.
- Report the results and check whether the scalability fits the expected behaviour of the model discussed in class.
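As a reference for the model checks above, a common form for the expected parallel time of a halo-exchange Jacobi sweep is sketched below; this is only an assumption about the class model, with c(P) the number of halo messages per iteration and M(P) their size in bytes, both fixed by the chosen domain decomposition:

$$T(P) \approx \frac{T_{\mathrm{serial}}}{P} + c(P)\left(\lambda + \frac{M(P)}{B}\right)$$

with λ and B taken from the IMB measurements that match the actual placement of the processes (same socket, across sockets, or across nodes).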