Calculate Q matrix using Chinese Remainder Theorem, FLINT and BLAS libraries #142

Merged: 55 commits merged into master from bigint-syrk-blas on Mar 15, 2024

Conversation

@vasdommes (Collaborator) commented on Oct 10, 2023

TODO:

(NB: Do not merge to the 2.6 release!)

Description

We calculate the Q matrix using the Chinese Remainder Theorem (implemented by the FLINT library) and BLAS routines. The same idea is used in the FLINT library for bigint matrices, see mul_blas.c. We reuse some FLINT code for our purposes.

The details and implementation are discussed in https://github.com/davidsd/sdpb/blob/master/src/sdp_solve/SDP_Solver/run/bigint_syrk/Readme.md

Brief description:

Suppose that we want to calculate Q = P^T P, where P is a BigFloat matrix. Instead of (slow) BigFloat arithmetic, we can proceed as follows (a minimal standalone sketch follows this list):

  • Convert P to an integer matrix: normalize the columns of P and multiply everything by 2^N, where N is the BigFloat precision.
  • Calculate the residues of P modulo a set of (small enough) primes.
  • Calculate the squares of these residue matrices, Q_i = P_i^T P_i, using BLAS, namely the cblas_dsyrk() function.
  • Restore Q from the residues Q_i using the Chinese Remainder Theorem, and remove the normalization.
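A minimal standalone sketch of this pipeline (an illustration only, not the actual SDPB code: two hard-coded primes and a naive two-prime CRT reconstruction instead of FLINT, and an integer P, so the normalization/2^N step is omitted):

```cpp
// Sketch: Q = P^T P computed exactly via residues + cblas_dsyrk + CRT.
// The primes are small enough that k*p^2 < 2^53, so dsyrk is exact in doubles.
#include <cblas.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Modular inverse of a mod p (p prime), via the extended Euclidean algorithm.
int64_t inv_mod(int64_t a, int64_t p) {
  int64_t t = 0, new_t = 1, r = p, new_r = a % p;
  while (new_r != 0) {
    int64_t q = r / new_r;
    int64_t tmp = t - q * new_t; t = new_t; new_t = tmp;
    tmp = r - q * new_r; r = new_r; new_r = tmp;
  }
  return t < 0 ? t + p : t;
}

int main() {
  const int k = 3, n = 2;                // P is k x n, Q = P^T P is n x n
  const int64_t p[2] = {2039, 2053};     // entries of Q must stay below p[0]*p[1]
  const std::vector<int64_t> P = {1, 2,  // row-major k x n
                                  3, 4,
                                  5, 6};
  std::vector<double> Q_res[2];
  for (int m = 0; m < 2; ++m) {
    // Residues of P modulo p[m], stored as doubles for BLAS
    std::vector<double> P_res(k * n);
    for (int i = 0; i < k * n; ++i)
      P_res[i] = double((P[i] % p[m] + p[m]) % p[m]);
    Q_res[m].assign(n * n, 0.0);
    // Q_m := P_m^T P_m (upper triangle); exact, since no value exceeds 2^53
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasTrans, n, k,
                1.0, P_res.data(), n, 0.0, Q_res[m].data(), n);
  }
  // CRT: combine residues mod p[0] and p[1] into a value mod p[0]*p[1]
  const int64_t p0_inv = inv_mod(p[0] % p[1], p[1]);
  for (int i = 0; i < n; ++i)
    for (int j = i; j < n; ++j) {
      int64_t r0 = int64_t(Q_res[0][i * n + j]) % p[0];
      int64_t r1 = int64_t(Q_res[1][i * n + j]) % p[1];
      int64_t d = ((r1 - r0) % p[1] + p[1]) % p[1];
      std::printf("Q[%d][%d] = %lld\n", i, j,
                  (long long)(r0 + p[0] * (d * p0_inv % p[1])));  // 35, 44, 56
    }
}
```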

Since BLAS is highly optimized, this method proves to be much faster than BigFloat arithmetic: according to our preliminary (and rather imprecise) benchmarks, bigint_syrk_blas() is faster than El::Syrk() by a factor of ~10-20x.
It also reduces RAM usage (only one copy of Q is stored on each node) and lets us remove the now-unnecessary --procGranularity option from SDPB.

@vasdommes requested a review from davidsd on October 10, 2023 04:38
@vasdommes self-assigned this on Oct 10, 2023
@vasdommes force-pushed the bigint-syrk-blas branch 2 times, most recently from fa92aba to 17426eb on October 17, 2023 05:39
@vasdommes (Collaborator, Author) commented:

I performed timing runs for both the old and the new algorithm on the Expanse cluster, for the GNY model.
Discussion in Slack: https://bootstrapcollab.slack.com/archives/CG450R9QX/p1699498297583039

The first two plots show the duration of one SDPB solver step for the old and the new algorithm.

  • First plot: nmax=8, 1 node, 128 cores. With the new algorithm, SDPB is 1.5x faster.
  • Second plot: nmax=14. With the new algorithm, SDPB is 1.8x faster on 3 nodes (384 cores). The old algorithm runs out of memory on 2 nodes, while the new one works there and makes SDPB 2x faster than the old SDPB running on 3 nodes.

Note that these numbers reflect the overall speedup of the program.
If we compare the matrix multiplication itself, the BLAS calls are up to 10x faster than El::Syrk in the old algorithm (e.g., for nmax=14 on 3 nodes: El::Syrk takes 62s, the BLAS calls 6.5s).

Different parts of the solver step shown in the plots:

  • Compute local Q (old algorithm): each process calculates its local Q_i.
  • Compute Q on node (new algorithm): all processes on a node work together to calculate their contribution to Q. The new algorithm turns out to be faster, thanks to BLAS optimizations.
  • Reduce-scatter Q: accumulate the contributions to Q from all processes/all nodes. Here the new algorithm is faster too, since it keeps only one copy of Q per node instead of one copy per process. For a single node, no reduce-scatter is required at all, since Q is already assembled in the shared memory window! For several nodes, synchronization is still needed, but becomes faster as well (the shared-window layout is sketched below).
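The per-node layout can be sketched as follows (a hypothetical standalone example, not the SDPB code): all ranks of a node write into a single shared allocation, so there is no per-rank copy of Q and no intra-node reduce-scatter.

```cpp
// Sketch: one copy of Q per node via an MPI shared memory window.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  // Group ranks by node; this communicator plays the role of comm_shared_mem.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                      &node_comm);
  int node_rank, node_size;
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_size(node_comm, &node_size);

  const int n = 1024;  // pretend Q is n x n
  // Rank 0 owns the single per-node allocation; other ranks allocate 0 bytes.
  MPI_Win win;
  double *q;
  MPI_Aint bytes = (node_rank == 0) ? (MPI_Aint)n * n * sizeof(double) : 0;
  MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node_comm,
                          &q, &win);
  MPI_Aint size;
  int disp_unit;
  MPI_Win_shared_query(win, 0, &size, &disp_unit, &q);  // rank 0's block

  MPI_Win_fence(0, win);
  // Each rank fills its own band of rows directly in the shared window.
  for (int i = node_rank; i < n; i += node_size)
    for (int j = 0; j < n; ++j)
      q[(MPI_Aint)i * n + j] = 0.0;  // real code would write dsyrk results here
  MPI_Win_fence(0, win);

  if (node_rank == 0)
    std::printf("%d ranks share one %d x %d window\n", node_size, n, n);
  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
}
```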

  • The third plot shows the different steps of Compute Q on node (normalize, compute residues, call BLAS, etc.) for nmax=8 nodes=1, nmax=14 nodes=2, and nmax=14 nodes=3.

For nmax=8, "normalize P" takes more time than BLAS. But this time is mostly spent not on calculating column norms, but on waiting for other MPI processes (note that this step requires global synchronization). This is not surprising, since we cannot achieve perfect load balancing for a small problem (190 SDP blocks) on a large number of cores (128).
For larger problems (nmax=14), BLAS is the longest step. This is good news: we spend most of the time in highly efficient, heavily optimized code.

[Plots: solver step durations for nmax=8 (1 node) and nmax=14 (2 and 3 nodes), and the per-step breakdown of Compute Q on node.]

…mainder Theorem

See BigInt_Shared_Memory_Syrk_Context.bigint_syrk_blas()

This allows us to compute Q := P^T P, where P is a DistMatrix of big integers.

Algorithm:
- Calculate the residues P_i of the matrix P modulo a set of primes (using the FLINT library)
- Convert the residues to doubles and calculate the residue squares Q_i = P_i^T P_i via cblas_dsyrk
- Convert the result back to integers and restore Q from the Q_i using the Chinese Remainder Theorem

The function will be used instead of El::Syrk to calculate the Q matrix (compute_Q_group).

Added dependencies: OpenBLAS, FLINT
According to preliminary (and rather imprecise) benchmarks, bigint_syrk() is faster than El::Syrk by a factor of ~10-20x
…atrices Q_IJ = P_I^T P_J (modulo some prime).

Currently, Q is not split, and the behavior of bigint_syrk_blas() remains the same.
TODO: split Q and use proper job scheduling based on job costs.
The cost of a job equals the number of output matrix elements to be calculated; this is accurate for naive multiplication (and for big matrices, where all the time is spent on number crunching).

TODO: one big BLAS job can be significantly faster than many small jobs.
Ideally we should account for this by adding an extra overhead term, but its correct value is hard to estimate.

Currently, an infinitesimal per-job overhead is already accounted for in LPT_scheduling(), where the priority queue of ranks is sorted by (cost, num_jobs).
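For illustration, a minimal sketch of LPT scheduling with that (cost, num_jobs) tie-break (hypothetical standalone code; the real implementation is LPT_scheduling()):

```cpp
// LPT: hand out jobs from most to least expensive, always to the rank with
// the smallest (total_cost, num_jobs); the num_jobs tie-break models a tiny
// per-job overhead.
#include <algorithm>
#include <cstdio>
#include <queue>
#include <tuple>
#include <vector>

int main() {
  std::vector<long> job_costs = {9, 7, 7, 5, 4, 4, 2, 1};  // e.g. #elements of Q_i
  const int num_ranks = 3;

  // Min-heap of (total_cost, num_jobs, rank)
  using Load = std::tuple<long, int, int>;
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
  for (int r = 0; r < num_ranks; ++r)
    ranks.emplace(0L, 0, r);

  std::sort(job_costs.rbegin(), job_costs.rend());
  std::vector<std::vector<long>> schedule(num_ranks);
  for (long cost : job_costs) {
    auto [total, njobs, r] = ranks.top();
    ranks.pop();
    schedule[r].push_back(cost);
    ranks.emplace(total + cost, njobs + 1, r);
  }
  for (int r = 0; r < num_ranks; ++r) {
    long total = 0;
    for (long c : schedule[r]) total += c;
    std::printf("rank %d: %zu jobs, total cost %ld\n",
                r, schedule[r].size(), total);
  }
}
```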
…orithm, with minimal_split_factor()

+ add tests for different split factors to calculate_matrix_square.test.cxx

Algorithm:
- Uniformly distribute as many primes as we can, without splitting Q.
- For the remaining primes, split Q so that each rank gets at most one extra job.

Consider, for example, num_primes=73 and num_ranks=10:
- Create one BLAS syrk job for each of the first 70 primes (7 jobs per rank).
- For each of the remaining 3 primes, split Q into a 2x2 block matrix.
  This yields 2*3 syrk jobs plus 3 gemm jobs, i.e. 9 jobs to be assigned to 9 ranks.

TODO: this could be suboptimal; it is probably better to split Q into more parts and give, e.g., one big job to one rank and two small jobs to another.
We can calculate max_cost for several split factors starting from minimal_split_factor(), and choose the optimal one:
1) Distribute as many primes as we can uniformly, without splitting Q.
2) Try to distribute the remaining primes by splitting Q, checking the split factors in [min_split_factor, min_split_factor + 5).
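The job-count arithmetic behind this can be sketched as follows (hypothetical standalone code; the real logic lives in create_blas_jobs_schedule.cxx):

```cpp
// For the example above: num_primes=73, num_ranks=10.
#include <cstdio>

int main() {
  const int num_primes = 73, num_ranks = 10;

  // Step 1: distribute as many primes as possible uniformly, without splitting Q.
  const int unsplit = (num_primes / num_ranks) * num_ranks;  // 70
  const int remaining = num_primes - unsplit;                // 3

  // Step 2: split Q into f x f blocks for the remaining primes. Each prime then
  // yields f syrk jobs (diagonal blocks Q_II) plus f*(f-1)/2 gemm jobs
  // (off-diagonal blocks Q_IJ; Q is symmetric, so one triangle suffices).
  for (int f = 2; f < 2 + 5; ++f) {
    const int syrk = remaining * f;
    const int gemm = remaining * f * (f - 1) / 2;
    std::printf("split_factor=%d: %d syrk + %d gemm = %d jobs for %d ranks\n",
                f, syrk, gemm, syrk + gemm, num_ranks);
  }
  // split_factor=2 gives 6 syrk + 3 gemm = 9 jobs, matching the example above.
}
```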

See details in create_blas_jobs_schedule.cxx and bigint_syrk/Readme.md
fixes waf configure for Expanse
…nding.

If one rank throws an exception and another one doesn't, the ranks that reach the window fence will hang on it forever.
Thus, we disable this fence and assume that the program will abort (a minimal illustration follows the conflict list below).
NB: if the exception is caught after that and the program continues working, it will probably hang at the next synchronization point!
# Conflicts:
#	src/outer_limits/compute_optimal/compute_optimal.cxx
#	src/sdpb/solve.cxx
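A minimal illustration of this failure mode and the chosen reaction (hypothetical standalone code, not the SDPB implementation; by design, running it aborts):

```cpp
// MPI_Win_fence is collective: if one rank throws and never reaches the fence,
// the other ranks block in it forever. Aborting the whole job is the only
// safe reaction.
#include <mpi.h>
#include <stdexcept>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Win win;
  double *buf;
  MPI_Win_allocate(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &buf, &win);
  try {
    if (rank == 0)
      throw std::runtime_error("rank 0 failed");  // e.g. an allocation error
    MPI_Win_fence(0, win);  // ranks 1..N-1 wait here for rank 0, forever
  } catch (const std::exception &) {
    // Do not continue to the next synchronization point: it would hang.
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  MPI_Win_free(&win);
  MPI_Finalize();
}
```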
vasdommes added 12 commits March 7, 2024 14:00
Support optional suffixes:
100 or 100B -> 100 bytes
100K or 100KB -> 102400 bytes
100M or 100MB -> 104857600 bytes
100G or 100GB -> 107374182400 bytes
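A sketch of this parsing (a hypothetical standalone helper; SDPB's actual option handling may differ):

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>
#include <string>

// Parse sizes like "100", "100B", "100K", "100KB", "100M", "100MB", "100G", "100GB".
uint64_t parse_bytes(const std::string &s) {
  size_t pos = 0;
  uint64_t value = std::stoull(s, &pos);
  std::string suffix = s.substr(pos);
  if (!suffix.empty() && (suffix.back() == 'B' || suffix.back() == 'b'))
    suffix.pop_back();  // "100KB" is the same as "100K"
  if (suffix.empty())
    return value;       // plain bytes
  switch (suffix[0]) {
    case 'K': case 'k': return value << 10;  // 100K -> 102400
    case 'M': case 'm': return value << 20;  // 100M -> 104857600
    case 'G': case 'g': return value << 30;  // 100G -> 107374182400
    default: throw std::invalid_argument("unknown size suffix: " + s);
  }
}

int main() {
  std::printf("%llu %llu\n", (unsigned long long)parse_bytes("100KB"),
              (unsigned long long)parse_bytes("100G"));
}
```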
Previously, we detected the syrk case by checking I==J.
This does not work when we are multiplying two different matrices, C_IJ := A_I^T B_J
(this will happen once we split the Q window and multiply different vertical bands of P).
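A sketch of the resulting dispatch (hypothetical standalone code; here the "same block" test uses pointer identity, which is just one way to implement the check):

```cpp
#include <cblas.h>
#include <cstdio>
#include <vector>

// A is k x n and B is k x m (row-major vertical bands of P); C is n x m.
// A diagonal block (same band twice) is a syrk; an off-diagonal block is a gemm.
void multiply_block(const double *A, const double *B, double *C,
                    int k, int n, int m) {
  if (A == B)  // same band: Q_II := A^T A, upper triangle only
    cblas_dsyrk(CblasRowMajor, CblasUpper, CblasTrans, n, k,
                1.0, A, n, 0.0, C, n);
  else         // different bands: Q_IJ := A^T B, full rectangle
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, n, m, k,
                1.0, A, n, B, m, 0.0, C, m);
}

int main() {
  const int k = 2, n = 2, m = 2;
  std::vector<double> A = {1, 2, 3, 4}, B = {5, 6, 7, 8}, C(n * m, 0.0);
  multiply_block(A.data(), A.data(), C.data(), k, n, n);  // syrk path
  multiply_block(A.data(), B.data(), C.data(), k, n, m);  // gemm path
  std::printf("C[0][0] = %g\n", C[0]);  // A^T B: 1*5 + 3*7 = 26
}
```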
… blocks for each MPI group, refactor compute_block_residues()

Fixes #203 bigint-syrk-blas: add --maxSharedMemory option to limit MPI shared memory window sizes
TODO: currently it fails if the limit is too small. We should split the input and output windows instead.
…ory limit

In unit tests, we test two cases:
- no memory limit (no P window splitting)
- a memory limit ensuring that only 3 rows fit into the P window.
See calculate_matrix_square.test.cxx.

In end-to-end.test.cxx, we set --maxSharedMemory=1M for two realistic cases, thus enforcing split factors 4 and 6.
In the other cases, the limit is not set.

TODO: update Readme.md
TODO: also split the Q (output) window, if necessary.
…culating total_size

The result is different when input_window_split_factor > 1.
…e --maxSharedMemory limit

TODO: also update bigint_syrk/Readme.md

Changed two end-to-end tests: set a low --maxSharedMemory to enforce Q window splitting.
In unit tests, we set different shared memory limits, so that Q = P^T P is calculated without splitting, with splitting of the P window only, or with splitting of both the P and Q windows.

Also supported both uplo=UPPER and uplo=LOWER for syrk.
Fixed reduce_scatter(): the old version always synchronized only the upper half, but for the off-diagonal blocks Q_IJ we need to synchronize everything.
Fix #207 bigint-syrk-blas: account for MPI shared memory limits (by splitting shared windows)
@vasdommes marked this pull request as ready for review on March 9, 2024 07:13
…o check bigint_syrk_blas behaviour (in particular, reduce-scatter).

We create virtual "nodes" consisting of 1/2/3 ranks each, and pass the node communicator as comm_shared_mem to BigInt_Shared_Memory_Syrk_Context.

NB: for all tests to pass, you should run unit_tests on 6 ranks (or 12, 18, 24, etc.):
mpirun -n 6 build/unit_tests
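The virtual-node setup can be sketched like this (hypothetical standalone code; the group communicator stands in for the real per-node communicator, which is fine as long as all test ranks run on one physical node):

```cpp
// For a 6-rank run: group ranks into virtual "nodes" of sizes 1, 2 and 3.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Ranks {0} -> node 0, {1,2} -> node 1, {3,4,5} -> node 2.
  const int color = (rank == 0) ? 0 : (rank <= 2 ? 1 : 2);

  MPI_Comm virtual_node_comm;  // passed as comm_shared_mem in the tests
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &virtual_node_comm);

  int node_rank, node_size;
  MPI_Comm_rank(virtual_node_comm, &node_rank);
  MPI_Comm_size(virtual_node_comm, &node_size);
  std::printf("world rank %d -> virtual node %d (rank %d of %d)\n",
              rank, color, node_rank, node_size);

  MPI_Comm_free(&virtual_node_comm);
  MPI_Finalize();
}
```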
For some reason, this didn't arise in production, but it caused a segfault when running the new unit tests with the memory sanitizer.
Tests the reduce_scatter() function for DistMatrix, used in bigint-syrk-blas in the multi-node case.
@vasdommes merged commit 236679f into master on Mar 15, 2024
2 checks passed
@vasdommes deleted the bigint-syrk-blas branch on March 15, 2024 01:22