Calculate Q matrix using Chinese Remainder Theorem, FLINT and BLAS libraries #142

Merged: 55 commits merged into master from bigint-syrk-blas on Mar 15, 2024

Conversation

@vasdommes (Collaborator) commented on Oct 10, 2023

TODO:

(NB: Do not merge to the 2.6 release!)

Description

We calculate the Q matrix using the Chinese Remainder Theorem (implemented by the FLINT library) and BLAS routines. The same idea is used in the FLINT library for bigint matrices, see mul_blas.c. We reuse some FLINT code for our purposes.

The details and implementation are discussed in https://github.com/davidsd/sdpb/blob/master/src/sdp_solve/SDP_Solver/run/bigint_syrk/Readme.md

Brief description:

Suppose that we want to calculate Q = P^T P, where P is a BigFloat matrix. Instead of (slow) BigFloat arithmetic, we can proceed as follows (a minimal standalone sketch follows this list):

  • Convert P to an integer matrix: normalize the columns of P and multiply everything by 2^N, where N is the BigFloat precision.
  • Calculate the residues of P modulo a set of (small enough) primes.
  • Calculate the squares of these residue matrices, Q_i = P_i^T P_i, using BLAS, namely the cblas_dsyrk() function.
  • Restore Q from the residues Q_i using the Chinese Remainder Theorem, and remove the normalization.
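A minimal standalone sketch of this pipeline (an illustration only, not the actual SDPB code: two hard-coded primes and a naive two-prime CRT reconstruction instead of FLINT, and an integer P, so the normalization/2^N step is omitted):

```cpp
// Sketch: Q = P^T P computed exactly via residues + cblas_dsyrk + CRT.
// The primes are small enough that k*p^2 < 2^53, so dsyrk is exact in doubles.
#include <cblas.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Modular inverse of a mod p (p prime), via the extended Euclidean algorithm.
int64_t inv_mod(int64_t a, int64_t p) {
  int64_t t = 0, new_t = 1, r = p, new_r = a % p;
  while (new_r != 0) {
    int64_t q = r / new_r;
    int64_t tmp = t - q * new_t; t = new_t; new_t = tmp;
    tmp = r - q * new_r; r = new_r; new_r = tmp;
  }
  return t < 0 ? t + p : t;
}

int main() {
  const int k = 3, n = 2;                // P is k x n, Q = P^T P is n x n
  const int64_t p[2] = {2039, 2053};     // entries of Q must stay below p[0]*p[1]
  const std::vector<int64_t> P = {1, 2,  // row-major k x n
                                  3, 4,
                                  5, 6};
  std::vector<double> Q_res[2];
  for (int m = 0; m < 2; ++m) {
    // Residues of P modulo p[m], stored as doubles for BLAS
    std::vector<double> P_res(k * n);
    for (int i = 0; i < k * n; ++i)
      P_res[i] = double((P[i] % p[m] + p[m]) % p[m]);
    Q_res[m].assign(n * n, 0.0);
    // Q_m := P_m^T P_m (upper triangle); exact, since no value exceeds 2^53
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasTrans, n, k,
                1.0, P_res.data(), n, 0.0, Q_res[m].data(), n);
  }
  // CRT: combine residues mod p[0] and p[1] into a value mod p[0]*p[1]
  const int64_t p0_inv = inv_mod(p[0] % p[1], p[1]);
  for (int i = 0; i < n; ++i)
    for (int j = i; j < n; ++j) {
      int64_t r0 = int64_t(Q_res[0][i * n + j]) % p[0];
      int64_t r1 = int64_t(Q_res[1][i * n + j]) % p[1];
      int64_t d = ((r1 - r0) % p[1] + p[1]) % p[1];
      std::printf("Q[%d][%d] = %lld\n", i, j,
                  (long long)(r0 + p[0] * (d * p0_inv % p[1])));  // 35, 44, 56
    }
}
```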

Since BLAS is highly optimized, this method proves to be much faster than BigFloat arithmetic: according to our preliminary (and rather imprecise) benchmarks, bigint_syrk_blas() is faster than El::Syrk() by a factor of ~10-20x.
It also reduces RAM usage (only one copy of Q is stored on each node) and lets us remove the now-unnecessary --procGranularity option from SDPB.

@vasdommes requested a review from davidsd on October 10, 2023 04:38
@vasdommes self-assigned this on Oct 10, 2023
@vasdommes force-pushed the bigint-syrk-blas branch 2 times, most recently from fa92aba to 17426eb on October 17, 2023 05:39
@vasdommes (Collaborator, Author) commented:

I performed timing runs for both the old and the new algorithm on the Expanse cluster, for the GNY model.
Discussion in Slack: https://bootstrapcollab.slack.com/archives/CG450R9QX/p1699498297583039

The first two plots show the duration of one SDPB solver step for the old and the new algorithm.

  • First plot: nmax=8, 1 node, 128 cores. With the new algorithm, SDPB is 1.5x faster.
  • Second plot: nmax=14. With the new algorithm, SDPB is 1.8x faster on 3 nodes (384 cores). The old algorithm runs out of memory on 2 nodes, while the new one works there and makes SDPB 2x faster than the old SDPB running on 3 nodes.

Note that these numbers reflect the overall speedup of the program.
If we compare the matrix multiplication itself, the BLAS calls are up to 10x faster than El::Syrk in the old algorithm (e.g., for nmax=14 on 3 nodes: El::Syrk takes 62s, the BLAS calls 6.5s).

Different parts of the solver step shown in the plots:

  • Compute local Q (old algorithm): each process calculates its local Q_i.
  • Compute Q on node (new algorithm): all processes on a node work together to calculate their contribution to Q. The new algorithm turns out to be faster, thanks to BLAS optimizations.
  • Reduce-scatter Q: accumulate the contributions to Q from all processes/all nodes. Here the new algorithm is faster too, since it keeps only one copy of Q per node instead of one copy per process. For a single node, no reduce-scatter is required at all, since Q is already assembled in the shared memory window! For several nodes, synchronization is still needed, but becomes faster as well (the shared-window layout is sketched below).
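The per-node layout can be sketched as follows (a hypothetical standalone example, not the SDPB code): all ranks of a node write into a single shared allocation, so there is no per-rank copy of Q and no intra-node reduce-scatter.

```cpp
// Sketch: one copy of Q per node via an MPI shared memory window.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  // Group ranks by node; this communicator plays the role of comm_shared_mem.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  const int n = 1024;  // pretend Q is n x n
  // Rank 0 owns the single per-node allocation; other ranks allocate 0 bytes.
  MPI_Win win;
  double *q;
  MPI_Aint bytes = (node_rank == 0) ? (MPI_Aint)n * n * sizeof(double) : 0;
  MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node_comm,
                          &q, &win);
  MPI_Aint size;
  int disp_unit;
  MPI_Win_shared_query(win, 0, &size, &disp_unit, &q);  // rank 0's block

  MPI_Win_fence(0, win);
  // Each rank fills its own band of rows directly in the shared window.
  for (int i = node_rank; i < n; i += node_size)
    for (int j = 0; j < n; ++j)
      q[(MPI_Aint)i * n + j] = 0.0;  // real code would write dsyrk results here
  MPI_Win_fence(0, win);

  if (node_rank == 0)
    std::printf("%d ranks share one %d x %d window\n", node_size, n, n);
  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
}
```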

  • The third plot shows the different steps of Compute Q on node (normalize, compute residues, call BLAS, etc.) for nmax=8 nodes=1, nmax=14 nodes=2, and nmax=14 nodes=3.

For nmax=8, "normalize P" takes more time than BLAS. But this time is mostly spent not on calculating column norms, but on waiting for other MPI processes (note that this step requires global synchronization). This is not surprising, since we cannot achieve perfect load balancing for a small problem (190 SDP blocks) on a large number of cores (128).
For larger problems (nmax=14), BLAS is the longest step. This is good news: we spend most of the time in highly efficient, heavily optimized code.

[Plots: solver step durations for nmax=8 (1 node) and nmax=14 (2 and 3 nodes), and the per-step breakdown of Compute Q on node.]

…mainder Theorem

See BigInt_Shared_Memory_Syrk_Context.bigint_syrk_blas()

This allows us to compute Q := P^T P, where P is a DistMatrix of big integers.

Algorithm:
- Calculate the residues P_i of the matrix P modulo a set of primes (using the FLINT library)
- Convert the residues to doubles and calculate the residue squares Q_i = P_i^T P_i via cblas_dsyrk
- Convert the result back to integers and restore Q from the Q_i using the Chinese Remainder Theorem

The function will be used instead of El::Syrk to calculate the Q matrix (compute_Q_group).

Added dependencies: OpenBLAS, FLINT
According to preliminary (and rather imprecise) benchmarks, bigint_syrk() is faster than El::Syrk by a factor of ~10-20x
…atrices Q_IJ = P_I^T P_J (modulo some prime).

Currently, Q is not split, and the behavior of bigint_syrk_blas() remains the same.
TODO: split Q and use proper job scheduling based on job costs.
The cost of a job equals the number of output matrix elements to be calculated; this is accurate for naive multiplication (and for big matrices, where all the time is spent on number crunching).

TODO: one big BLAS job can be significantly faster than many small jobs.
Ideally we should account for this by adding an extra overhead term, but its correct value is hard to estimate.

Currently, an infinitesimal per-job overhead is already accounted for in LPT_scheduling(), where the priority queue of ranks is sorted by (cost, num_jobs).
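For illustration, a minimal sketch of LPT scheduling with that (cost, num_jobs) tie-break (hypothetical standalone code; the real implementation is LPT_scheduling()):

```cpp
// LPT: hand out jobs from most to least expensive, always to the rank with
// the smallest (total_cost, num_jobs); the num_jobs tie-break models a tiny
// per-job overhead.
#include <algorithm>
#include <cstdio>
#include <queue>
#include <tuple>
#include <vector>

int main() {
  std::vector<long> job_costs = {9, 7, 7, 5, 4, 4, 2, 1};  // e.g. #elements of Q_i
  const int num_ranks = 3;

  // Min-heap of (total_cost, num_jobs, rank)
  using Load = std::tuple<long, int, int>;
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
  for (int r = 0; r < num_ranks; ++r)
    ranks.emplace(0L, 0, r);

  std::sort(job_costs.rbegin(), job_costs.rend());
  std::vector<std::vector<long>> schedule(num_ranks);
  for (long cost : job_costs) {
    auto [total, njobs, r] = ranks.top();
    ranks.pop();
    schedule[r].push_back(cost);
    ranks.emplace(total + cost, njobs + 1, r);
  }
  for (int r = 0; r < num_ranks; ++r) {
    long total = 0;
    for (long c : schedule[r]) total += c;
    std::printf("rank %d: %zu jobs, total cost %ld\n",
                r, schedule[r].size(), total);
  }
}
```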
…orithm, with minimal_split_factor()

+ add tests for different split factors to calculate_matrix_square.test.cxx

Algorithm:
- Uniformly distribute as many primes as we can, without splitting Q.
- For the remaining primes, split Q so that each rank gets at most one extra job.

Consider, for example, num_primes=73 and num_ranks=10:
- Create one BLAS syrk job for each of the first 70 primes (7 jobs per rank).
- For each of the remaining 3 primes, split Q into a 2x2 block matrix.
  This yields 2*3 syrk jobs plus 3 gemm jobs, i.e. 9 jobs to be assigned to 9 ranks.

TODO: this could be suboptimal; it is probably better to split Q into more parts and give, e.g., one big job to one rank and two small jobs to another.
We can calculate max_cost for several split factors starting from minimal_split_factor(), and choose the optimal one:
1) Distribute as many primes as we can uniformly, without splitting Q.
2) Try to distribute the remaining primes by splitting Q, checking the split factors in [min_split_factor, min_split_factor + 5).
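The job-count arithmetic behind this can be sketched as follows (hypothetical standalone code; the real logic lives in create_blas_jobs_schedule.cxx):

```cpp
// For the example above: num_primes=73, num_ranks=10.
#include <cstdio>

int main() {
  const int num_primes = 73, num_ranks = 10;

  // Step 1: distribute as many primes as possible uniformly, without splitting Q.
  const int unsplit = (num_primes / num_ranks) * num_ranks;  // 70
  const int remaining = num_primes - unsplit;                // 3

  // Step 2: split Q into f x f blocks for the remaining primes. Each prime then
  // yields f syrk jobs (diagonal blocks Q_II) plus f*(f-1)/2 gemm jobs
  // (off-diagonal blocks Q_IJ; Q is symmetric, so one triangle suffices).
  for (int f = 2; f < 2 + 5; ++f) {
    const int syrk = remaining * f;
    const int gemm = remaining * f * (f - 1) / 2;
    std::printf("split_factor=%d: %d syrk + %d gemm = %d jobs for %d ranks\n",
                f, syrk, gemm, syrk + gemm, num_ranks);
  }
  // split_factor=2 gives 6 syrk + 3 gemm = 9 jobs, matching the example above.
}
```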

See details in create_blas_jobs_schedule.cxx and bigint_syrk/Readme.md
fixes waf configure for Expanse
…nding.

If one rank throws an exception and another one doesn't, the ranks that reach the window fence will hang on it forever.
Thus, we disable this fence and assume that the program will abort (a minimal illustration follows the conflict list below).
NB: if the exception is caught after that and the program continues working, it will probably hang at the next synchronization point!
# Conflicts:
#	src/outer_limits/compute_optimal/compute_optimal.cxx
#	src/sdpb/solve.cxx
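A minimal illustration of this failure mode and the chosen reaction (hypothetical standalone code, not the SDPB implementation; by design, running it aborts):

```cpp
// MPI_Win_fence is collective: if one rank throws and never reaches the fence,
// the other ranks block in it forever. Aborting the whole job is the only
// safe reaction.
#include <mpi.h>
#include <stdexcept>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Win win;
  double *buf;
  MPI_Win_allocate(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &buf, &win);
  try {
    if (rank == 0)
      throw std::runtime_error("rank 0 failed");  // e.g. an allocation error
    MPI_Win_fence(0, win);  // ranks 1..N-1 wait here for rank 0, forever
  } catch (const std::exception &) {
    // Do not continue to the next synchronization point: it would hang.
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Win_free(&win);
  MPI_Finalize();
}
```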
vasdommes added 12 commits March 7, 2024 14:00
Support optional suffixes:
100 or 100B -> 100 bytes
100K or 100KB -> 102400 bytes
100M or 100MB -> 104857600 bytes
100G or 100GB -> 107374182400 bytes
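A sketch of this parsing (a hypothetical standalone helper; SDPB's actual option handling may differ):

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Parse sizes like "100", "100B", "100K", "100KB", "100M", "100MB", "100G", "100GB".
uint64_t parse_bytes(const std::string &s) {
  size_t pos = 0;
  uint64_t value = std::stoull(s, &pos);
  std::string suffix = s.substr(pos);
  if (!suffix.empty() && (suffix.back() == 'B' || suffix.back() == 'b'))
    suffix.pop_back();  // "100KB" is the same as "100K"
  if (suffix.empty())
    return value;       // plain bytes
  switch (suffix[0]) {
    case 'K': case 'k': return value << 10;  // 100K -> 102400
    case 'M': case 'm': return value << 20;  // 100M -> 104857600
    case 'G': case 'g': return value << 30;  // 100G -> 107374182400
    default: throw std::invalid_argument("unknown size suffix: " + s);
  }
}

int main() {
  std::printf("%llu %llu\n", (unsigned long long)parse_bytes("100KB"),
              (unsigned long long)parse_bytes("100G"));
}
```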
Previously, we detected the syrk case by checking I==J.
This does not work when we are multiplying two different matrices, C_IJ := A_I^T B_J
(this will happen once we split the Q window and multiply different vertical bands of P).
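A sketch of the resulting dispatch (hypothetical standalone code; here the "same block" test uses pointer identity, which is just one way to implement the check):

```cpp
#include <cblas.h>
#include <cstdio>
#include <vector>

// A is k x n and B is k x m (row-major vertical bands of P); C is n x m.
// A diagonal block (same band twice) is a syrk; an off-diagonal block is a gemm.
void multiply_block(const double *A, const double *B, double *C,
                    int k, int n, int m) {
  if (A == B)  // same band: Q_II := A^T A, upper triangle only
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasTrans, n, k,
                1.0, A, n, 0.0, C, n);
  else         // different bands: Q_IJ := A^T B, full rectangle
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, n, m, k,
                1.0, A, n, B, m, 0.0, C, m);
}

int main() {
  const int k = 2, n = 2, m = 2;
  std::vector<double> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(n * m, 0.0);
  multiply_block(A.data(), A.data(), C.data(), k, n, n);  // syrk path
  multiply_block(A.data(), B.data(), C.data(), k, n, m);  // gemm path
  std::printf("C[0][0] = %g\n", C[0]);  // A^T B: 1*5 + 3*7 = 26
}
```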
… blocks for each MPI group, refactor compute_block_residues()

Fixes #203 bigint-syrk-blas: add --maxSharedMemory option to limit MPI shared memory window sizes
TODO: currently it fails if the limit is too small. We should split the input and output windows instead.
…ory limit

In unit tests, we test two cases:
- no memory limit (no P window splitting)
- a memory limit ensuring that only 3 rows fit into the P window.
See calculate_matrix_square.test.cxx.

In end-to-end.test.cxx, we set --maxSharedMemory=1M for two realistic cases, thus enforcing split factors 4 and 6.
In the other cases, the limit is not set.

TODO: update Readme.md
TODO: also split the Q (output) window, if necessary.
…culating total_size

The result is different when input_window_split_factor > 1.
…e --maxSharedMemory limit

TODO: also update bigint_syrk/Readme.md

Changed two end-to-end tests: set a low --maxSharedMemory to enforce Q window splitting.
In unit tests, we set different shared memory limits, so that Q = P^T P is calculated without splitting, with splitting of the P window only, or with splitting of both the P and Q windows.

Also supported both uplo=UPPER and uplo=LOWER for syrk.
Fixed reduce_scatter(): the old version always synchronized only the upper half, but for the off-diagonal blocks Q_IJ we need to synchronize everything.
Fix #207 bigint-syrk-blas: account for MPI shared memory limits (by splitting shared windows)
@vasdommes marked this pull request as ready for review on March 9, 2024 07:13
…o check bigint_syrk_blas behaviour (in particular, reduce-scatter).

We create virtual "nodes" consisting of 1/2/3 ranks each, and pass the node communicator as comm_shared_mem to BigInt_Shared_Memory_Syrk_Context.

NB: for all tests to pass, you should run unit_tests on 6 ranks (or 12, 18, 24, etc.):
mpirun -n 6 build/unit_tests
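The virtual-node setup can be sketched like this (hypothetical standalone code; the group communicator stands in for the real per-node communicator, which is fine as long as all test ranks run on one physical node):

```cpp
// For a 6-rank run: group ranks into virtual "nodes" of sizes 1, 2 and 3.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Ranks {0} -> node 0, {1,2} -> node 1, {3,4,5} -> node 2.
  const int color = (rank == 0) ? 0 : (rank <= 2 ? 1 : 2);

  MPI_Comm virtual_node_comm;  // passed as comm_shared_mem in the tests
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &virtual_node_comm);

  int node_rank, node_size;
  MPI_Comm_rank(virtual_node_comm, &node_rank);
  MPI_Comm_size(virtual_node_comm, &node_size);
  std::printf("world rank %d -> virtual node %d (rank %d of %d)\n",
              rank, color, node_rank, node_size);

  MPI_Comm_free(&virtual_node_comm);
  MPI_Finalize();
}
```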
For some reason, this didn't arise in production, but it caused a segfault when running the new unit tests with the memory sanitizer.
Tests the reduce_scatter() function for DistMatrix, used in bigint-syrk-blas in the multi-node case.
@vasdommes merged commit 236679f into master on Mar 15, 2024
2 checks passed
@vasdommes deleted the bigint-syrk-blas branch on March 15, 2024 01:22