Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement batched serial getrf #2331

Open
wants to merge 16 commits into
base: develop
Choose a base branch
from

Conversation

yasahi-hpc
Copy link
Contributor

@yasahi-hpc yasahi-hpc commented Sep 10, 2024

This PR implements getrf function.

Following files are added:

  1. KokkosBatched_Getrf_Serial_Impl.hpp: Internal interfaces with implementation details
  2. KokkosBatched_Getrf.hpp: APIs
  3. Test_Batched_SerialGetrf.hpp: Unit tests for that

Detailed description

It computes an LU factorization of a real general M-by-N matrix A using partial pivoting with row interchanges.
Here, the matrix has the following shape.

  • A: (batch_count, m, n)
    On entry, the M-by-N matrix to be factored. On exit, the factors L and U from the factorization
    A = P * L * U; the unit diagonal elements of L are not stored.
  • IPIV: (batch_count, min(m, n))
    The pivot indices; for 0 <= i < min(M,N), row i of the matrix was interchanged with row IPIV(i).

Parallelization would be made in the following manner. This is efficient only when
A is given in LayoutLeft for GPUs and LayoutRight for CPUs (parallelized over batch direction).

Kokkos::parallel_for('getrf', 
    Kokkos::RangePolicy<execution_space> policy(0, n),
    [=](const int k) {
        auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL());
        auto ipiv = Kokkos::subview(_ipiv, k, Kokkos::ALL());

        KokkosBatched::SerialGetrf<AlgoTagType>::invoke(aa, ipiv);
    });

Tests

  1. Make a random matrix from random A and factorize it into LU with ipiv. Reconstruct L and U from LU. Then permute A by ipiv and confirm LU == A.
  2. Simple and small analytical test, i.e. choose A as follows to confirm LU == A.
A = [[1. 0. 0. 0.],
     [0. 1. 0. 0.],
     [0. 0. 1. 0.],
     [0. 0. 0. 1.]]
LU = [[1. 0. 0. 0.],
      [0. 1. 0. 0.],
      [0. 0. 1. 0.],
      [0. 0. 0. 1.]]
piv = [0 1 2 3]

Important remarks (edited 30/Oct)

  1. We have different implementations for CPUs and GPUs. For CPUs, we employ the original implementation with recursion. For GPUs, we use a static stack to achieve recursive calls. Device level recursion did not work appropriately.
  2. We need to add helper functions Laswp and Iamax. These are also used for getrs and gbtrf.
  3. In the current implementation, the maximum A size is 4096 x 4096 (=2^12) due to the limit of the stack size. In order to avoid this limit, we need to ask user to prepare a temporal buffer to achieve a stack.
  4. Currently, we call CPU version if View is accessible from host. This does not work appropriately if we execute the function on GPUs with USM. We call recursive version on Host and stuck version on Device by using KOKKOS_IF_ON_HOST and KOKKOS_IF_ON_DEVICE.
  5. For the complex matrix, we need to implement the complex versions of trsm and gemm first.
  6. Using gesv with dynamic_pivoting (getrf + getrs) shows comparable or better performance for most matrix sizes compared to gesv with static_pivoting on NVIDIA and AMD GPUs

@yasahi-hpc yasahi-hpc marked this pull request as draft September 10, 2024 07:44
@kokkos-devops-admin
Copy link

Status Flag 'Pre-Test Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO INSPECTION HAS BEEN PERFORMED ON THIS PULL REQUEST! - This PR must be inspected by setting label 'AT: PRE-TEST INSPECTED'.

@yasahi-hpc yasahi-hpc changed the title Implement batched serial getrf implement batched serial getrf Sep 10, 2024
@yasahi-hpc yasahi-hpc force-pushed the implement-batched-serial-getrf branch from 689fba1 to 2ebf151 Compare October 30, 2024 03:28
@yasahi-hpc yasahi-hpc marked this pull request as ready for review October 30, 2024 05:43
@yasahi-hpc yasahi-hpc force-pushed the implement-batched-serial-getrf branch 2 times, most recently from f74e10d to d7e1eee Compare December 6, 2024 15:43
@yasahi-hpc yasahi-hpc force-pushed the implement-batched-serial-getrf branch from d7e1eee to bb53361 Compare December 19, 2024 08:04
#include <KokkosBatched_Gemm_Decl.hpp>
#include "Test_Batched_DenseUtils.hpp"

using namespace KokkosBatched;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these headers get included into a big test file, I'd rather not pollute the global namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I have fixed accordingly.

@yasahi-hpc yasahi-hpc force-pushed the implement-batched-serial-getrf branch from bb53361 to a4424cd Compare January 7, 2025 04:11
@yasahi-hpc
Copy link
Contributor Author

@cwpearson Thank you for the review. I have updated based on your review. Further reviews are highly appreciated

@yasahi-hpc yasahi-hpc requested a review from cwpearson January 7, 2025 05:10
@yasahi-hpc yasahi-hpc mentioned this pull request Jan 7, 2025
6 tasks
Copy link
Contributor

@lucbv lucbv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few points to look at but good overall, will run testing on this

// Use recursive code
auto n1 = Kokkos::min(m, n) / 2;

// [ A00 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity it should be A0 here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the confusion. Here, A0 is equivalent to

[ A00 ]
[ --- ]
[ A10 ]

I have added the concrete explanation in the docstrings of KokkosBatched_Getrf.hpp. For consistency, I changed the matrix notation in a pythonic manner, i.e.,

A0 = [[A00],
      [A10]]

The original is coming from the lapack notation
https://www.netlib.org/lapack/explore-html-3.6.1/dd/d9a/group__double_g_ecomputational_gabdd3af29e9f6bbaf4b352341a1e8b464.html#gabdd3af29e9f6bbaf4b352341a1e8b464


if (info == 0 && iinfo > 0) info = iinfo;

// [ A01 ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too, the comment seems wrong since you have not extracted the subviews for A01 and A11 yet

TEST_F(TestCategory, test_batched_getrf_float) {
using algo_tag_type = typename KokkosBatched::Algo::Getrf::Unblocked;

test_batched_getrf<TestDevice, float, algo_tag_type>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it would be nice to have a couple analytical tests for this. A simple way to create an analytical test is to fill both L and U, then multiply them and eventually figure out the pivots. That would give a good level of extra confidence in the work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
I added a little more complicated analytical test where we know all of the matrices that satisfy
PA = LU

Copy link
Contributor

@cwpearson cwpearson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once @lucbv is happy this is fine with me. Thanks!

Yuuichi Asahi added 6 commits January 8, 2025 01:42
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Yuuichi Asahi added 10 commits January 8, 2025 01:42
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
…l.hpp

Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
Signed-off-by: Yuuichi Asahi <y.asahi@nr.titech.ac.jp>
@yasahi-hpc yasahi-hpc force-pushed the implement-batched-serial-getrf branch from a4424cd to 65d6fb4 Compare January 8, 2025 05:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AT2-CI-APPROVAL Approve CI to run at SNL feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants