
Add multiple parallel AddMatMat #119

Closed

danpovey opened this issue Sep 3, 2015 · 1 comment

danpovey commented Sep 3, 2015

@freewym, this is for you, although @naxingyu may have an interest in it too.

Please look over #47 to understand the background for this (the Convolutional component). That pull request is for nnet2, but there is a similar set of code in nnet1, with a separate pull request (you can search for that). The original reason we wanted to upgrade to the cuBLAS v2 API was that batched (parallel) matrix multiplication is not available in the v1 API. Now that you've (nearly) finished that task, you can help us add this batched matrix multiplication.

The current AddMatMat has the signature

void Matrix::AddMatMat(const Real alpha,
                       const MatrixBase &A, MatrixTransposeType transA,
                       const MatrixBase &B, MatrixTransposeType transB,
                       const Real beta);
I'd like you to add a batched AddMatMat function that is a wrapper for cuBLAS's gemmBatched function. This will later be used in the convolutional component. Of course this will require test code.
The function signature and documentation should be the following.
/**
   @brief This function executes multiple matrix multiplications, doing them
          in parallel using cuBLAS's gemmBatched if we are using a GPU.
          Vectors A, B and C must have the same length; for each i, this
          function executes the matrix operation
          C[i] = alpha A[i] B[i] + beta C[i].

   @param [in]     alpha    The constant alpha in the equation
                            "C[i] = alpha A[i] B[i] + beta C[i]".
   @param [in,out] C        A vector of pointers to matrices; all elements must
                            have the same num-rows, num-cols and stride.  The
                            matrices must point to distinct regions of GPU
                            memory, or results are undefined.  Ownership of the
                            pointers is retained by the caller.
   @param [in]     A        A vector of pointers to matrices; all elements must
                            have the same num-rows, num-cols and stride.
                            Ownership of the pointers is retained by the caller.
   @param [in]     trans_a  Indicates whether we should use the transpose of
                            A[i] in the equation: if trans_a == kTrans,
                            transpose(A[i]) appears in place of A[i].
   @param [in]     B        A vector of pointers to matrices; all elements must
                            have the same num-rows, num-cols and stride.
                            Ownership of the pointers is retained by the caller.
   @param [in]     trans_b  Indicates whether we should use the transpose of
                            B[i] in the equation: if trans_b == kTrans,
                            transpose(B[i]) appears in place of B[i].
   @param [in]     beta     The constant beta in the equation
                            "C[i] = alpha A[i] B[i] + beta C[i]".
*/
template <class Real>
void AddMatMatBatched(const Real alpha,
                      const std::vector<CuSubMatrix<Real>* > &C,
                      const std::vector<const CuSubMatrix<Real>* > &A,
                      MatrixTransposeType trans_a,
                      const std::vector<const CuSubMatrix<Real>* > &B,
                      MatrixTransposeType trans_b,
                      const Real beta);
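
For concreteness, here is a rough sketch (float case only, and just a sketch, not the final implementation) of how the CUDA branch might wrap cublasSgemmBatched. The AddMatMatBatchedSketch name, the CuArray-based pointer copy, and the GetCublasHandle() call are placeholders; what matters is that cuBLAS is column-major while our matrices are row-major, so we swap the A and B arguments, computing C[i]^T = op(B[i])^T op(A[i])^T:

// Rough sketch, float case only.  Kaldi matrices are row-major but cuBLAS is
// column-major, so we compute C[i]^T = op(B[i])^T op(A[i])^T by swapping the
// A and B arguments and their transpose flags.
void AddMatMatBatchedSketch(const float alpha,
                            const std::vector<CuSubMatrix<float>* > &C,
                            const std::vector<const CuSubMatrix<float>* > &A,
                            MatrixTransposeType trans_a,
                            const std::vector<const CuSubMatrix<float>* > &B,
                            MatrixTransposeType trans_b,
                            const float beta) {
  KALDI_ASSERT(A.size() == B.size() && B.size() == C.size());
  int batch_count = static_cast<int>(C.size());
  if (batch_count == 0) return;

  // gemmBatched wants device arrays of device pointers, so we gather the
  // pointers on the host and copy them to the GPU (CuArray does the copy).
  std::vector<const float*> a_ptrs(batch_count), b_ptrs(batch_count);
  std::vector<float*> c_ptrs(batch_count);
  for (int i = 0; i < batch_count; i++) {
    a_ptrs[i] = A[i]->Data();
    b_ptrs[i] = B[i]->Data();
    c_ptrs[i] = C[i]->Data();
  }
  CuArray<const float*> a_dev(a_ptrs), b_dev(b_ptrs);
  CuArray<float*> c_dev(c_ptrs);

  // All elements of each vector share num-rows, num-cols and stride, so
  // element 0 is representative.  Dimensions are those of the transposed
  // (column-major) problem.
  int m = C[0]->NumCols(), n = C[0]->NumRows(),
      k = (trans_b == kTrans ? B[0]->NumCols() : B[0]->NumRows());

  // CU_SAFE_CALL (or whatever error-checking macro we settle on) wraps the
  // cuBLAS v2 call; GetCublasHandle() stands in for however we obtain the
  // handle.
  CU_SAFE_CALL(cublasSgemmBatched(
      GetCublasHandle(),
      (trans_b == kTrans ? CUBLAS_OP_T : CUBLAS_OP_N),
      (trans_a == kTrans ? CUBLAS_OP_T : CUBLAS_OP_N),
      m, n, k, &alpha,
      b_dev.Data(), B[0]->Stride(),
      a_dev.Data(), A[0]->Stride(),
      &beta,
      c_dev.Data(), C[0]->Stride(),
      batch_count));
}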

Note: we have to pass vectors of pointers, even though it is inconvenient from a memory-management perspective, because CuSubMatrix doesn't have an operator=, so we can't easily create a vector of CuSubMatrix directly. Also, we would normally prefer to pass CuMatrixBase in situations like this, but that would create difficulties when deleting the memory (since an object can't safely be deleted through a base-class pointer unless the base class has a virtual destructor). It's OK; we can always create a CuSubMatrix that's identical to any given matrix.
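
To make the memory-management pattern concrete, a caller might do something like the following (a hypothetical example; the dimensions are made up, and it assumes ColRange(), which returns a CuSubMatrix viewing a range of columns):

// Hypothetical usage: slice three big matrices into num_blocks column blocks
// and multiply the corresponding blocks in one batched call, the way the
// convolutional component might.
int32 num_blocks = 4, rows = 256, inner = 128, block_cols = 64;
CuMatrix<BaseFloat> a_mat(rows, inner * num_blocks),
    b_mat(inner, block_cols * num_blocks),
    c_mat(rows, block_cols * num_blocks);
std::vector<const CuSubMatrix<BaseFloat>* > a, b;
std::vector<CuSubMatrix<BaseFloat>* > c;
for (int32 i = 0; i < num_blocks; i++) {
  a.push_back(new CuSubMatrix<BaseFloat>(a_mat.ColRange(i * inner, inner)));
  b.push_back(new CuSubMatrix<BaseFloat>(
      b_mat.ColRange(i * block_cols, block_cols)));
  c.push_back(new CuSubMatrix<BaseFloat>(
      c_mat.ColRange(i * block_cols, block_cols)));
}
AddMatMatBatched(BaseFloat(1.0), c, a, kNoTrans, b, kNoTrans, BaseFloat(0.0));
// Ownership of the pointers stays with the caller, so it must free the
// CuSubMatrix wrappers (the underlying matrices a_mat, b_mat and c_mat own
// the actual GPU memory).
for (int32 i = 0; i < num_blocks; i++) {
  delete a[i]; delete b[i]; delete c[i];
}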

Please make sure your test code does not have memory leaks; you can run valgrind or cuda-memcheck on it.
Also, it would be very helpful if you could add some speed tests, so we can see whether the batched matrix multiplication is worthwhile for various matrix sizes.
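
Something like the following would do for a speed test (again just a sketch, using our Timer class; the cudaDeviceSynchronize() calls are needed so we measure the actual GPU work rather than just the kernel-launch time):

// Sketch of a speed test: time a loop of ordinary AddMatMat calls against a
// single batched call, for a given matrix size and batch count.
template <typename Real>
static void TimeAddMatMatBatched(int32 dim, int32 batch_count) {
  std::vector<CuMatrix<Real>* > ma(batch_count), mb(batch_count),
      mc(batch_count);
  std::vector<const CuSubMatrix<Real>* > a, b;
  std::vector<CuSubMatrix<Real>* > c;
  for (int32 i = 0; i < batch_count; i++) {
    ma[i] = new CuMatrix<Real>(dim, dim);
    mb[i] = new CuMatrix<Real>(dim, dim);
    mc[i] = new CuMatrix<Real>(dim, dim);
    ma[i]->SetRandn();
    mb[i]->SetRandn();
    a.push_back(new CuSubMatrix<Real>(ma[i]->Range(0, dim, 0, dim)));
    b.push_back(new CuSubMatrix<Real>(mb[i]->Range(0, dim, 0, dim)));
    c.push_back(new CuSubMatrix<Real>(mc[i]->Range(0, dim, 0, dim)));
  }
  Timer timer;
  for (int32 i = 0; i < batch_count; i++)
    mc[i]->AddMatMat(Real(1.0), *ma[i], kNoTrans, *mb[i], kNoTrans, Real(0.0));
  cudaDeviceSynchronize();  // wait for the queued GPU work to finish
  double looped_sec = timer.Elapsed();
  timer.Reset();
  AddMatMatBatched(Real(1.0), c, a, kNoTrans, b, kNoTrans, Real(0.0));
  cudaDeviceSynchronize();
  double batched_sec = timer.Elapsed();
  KALDI_LOG << "dim = " << dim << ", batch-count = " << batch_count
            << ": looped " << looped_sec << "s, batched " << batched_sec << "s";
  for (int32 i = 0; i < batch_count; i++) {
    delete a[i]; delete b[i]; delete c[i];
    delete ma[i]; delete mb[i]; delete mc[i];
  }
}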

danpovey commented

Closing since the work is done. Thanks, @freewym.
