[SPARK-4409][MLlib] Additional Linear Algebra Utils #3319

Closed
wants to merge 19 commits into from

Contributor

@brkyvz brkyvz commented Nov 17, 2014

This PR adds a small set of local matrix manipulation and generation methods that will help in developing algorithms on top of BlockMatrix (SPARK-3974), such as Randomized SVD and Multi Model Training (SPARK-1486).
The proposed methods are:

For Matrix

  • map: applies a given function to every value in the matrix. Produces a new matrix.
  • update: applies a given function to every value in the matrix. Modifies the matrix in place.
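
The difference between the two can be sketched as follows. This is a minimal illustration, not the code in this PR: `SimpleDense` is a hypothetical stand-in class, and the column-major layout is an assumption chosen to mirror how dense matrices are commonly stored.

```scala
// Hypothetical minimal dense matrix with a column-major backing array.
// It is NOT the MLlib DenseMatrix; it only illustrates map vs. update.
final class SimpleDense(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  require(values.length == numRows * numCols, "values must hold numRows * numCols entries")

  // Entry (i, j) lives at offset i + numRows * j in column-major order.
  def apply(i: Int, j: Int): Double = values(i + numRows * j)

  // map: allocates a new backing array and leaves this matrix untouched.
  def map(f: Double => Double): SimpleDense =
    new SimpleDense(numRows, numCols, values.map(f))

  // update: overwrites the backing array in place and returns this matrix.
  def update(f: Double => Double): SimpleDense = {
    var k = 0
    while (k < values.length) {
      values(k) = f(values(k))
      k += 1
    }
    this
  }
}

object MapVsUpdate {
  def main(args: Array[String]): Unit = {
    val a = new SimpleDense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
    val b = a.map(_ * 2.0) // a is unchanged, b is a fresh matrix
    a.update(_ + 1.0)      // a itself is modified
    println(b.values.mkString(", "))
    println(a.values.mkString(", "))
  }
}
```

The in-place variant avoids an allocation, which matters for large matrices, at the cost of mutating state that other references may share.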

Factory methods for DenseMatrix:

  • *zeros: Generate a matrix consisting of zeros
  • *ones: Generate a matrix consisting of ones
  • *eye: Generate an identity matrix
  • *rand: Generate a matrix consisting of i.i.d. uniform random numbers
  • *randn: Generate a matrix consisting of i.i.d. Gaussian random numbers
  • *diag: Generate a diagonal matrix from a supplied vector
    *These methods already exist as factory methods for Matrices. However, whenever a DenseMatrix is required, you constantly have to add .asInstanceOf[DenseMatrix], which makes the code "dirtier". I propose moving these functions to the factory methods for DenseMatrix, where the output will be a DenseMatrix; the factory methods for Matrices will then call these functions directly and return a generic Matrix.
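
The typing argument above can be sketched with stand-in types. Everything here is hypothetical (`Mat`, `Dense`, and `Mats` are illustrative names, not the MLlib classes); the sketch only shows how a concrete factory returning the concrete type lets a generic factory delegate to it, so that callers never need a cast.

```scala
// Hypothetical stand-ins for a generic matrix trait and its dense implementation.
sealed trait Mat {
  def numRows: Int
  def numCols: Int
}

final class Dense(val numRows: Int, val numCols: Int, val values: Array[Double]) extends Mat

object Dense {
  // Each factory returns the concrete type Dense, so no cast is needed by callers.
  def zeros(numRows: Int, numCols: Int): Dense =
    new Dense(numRows, numCols, new Array[Double](numRows * numCols))

  def ones(numRows: Int, numCols: Int): Dense =
    new Dense(numRows, numCols, Array.fill(numRows * numCols)(1.0))

  // Builds on zeros and writes the diagonal (column-major offset i + n * i).
  def diag(d: Array[Double]): Dense = {
    val n = d.length
    val m = zeros(n, n)
    var i = 0
    while (i < n) {
      m.values(i + n * i) = d(i)
      i += 1
    }
    m
  }

  def eye(n: Int): Dense = diag(Array.fill(n)(1.0))

  def rand(numRows: Int, numCols: Int, rng: java.util.Random): Dense =
    new Dense(numRows, numCols, Array.fill(numRows * numCols)(rng.nextDouble()))
}

object Mats {
  // The generic factories delegate and only widen the static return type.
  def zeros(numRows: Int, numCols: Int): Mat = Dense.zeros(numRows, numCols)
  def eye(n: Int): Mat = Dense.eye(n)
}
```

With this layout, `val id: Dense = Dense.eye(3)` type-checks directly, while `Mats.eye(3)` still serves code that only needs the generic trait.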

Factory methods for SparseMatrix:

  • speye: Identity matrix in sparse format. Saves a ton of memory when dimensions are large, especially in Multi Model Training, where each row requires being multiplied by a scalar.
  • sprand: Generate a sparse matrix with a given density consisting of i.i.d. uniform random numbers.
  • sprandn: Generate a sparse matrix with a given density consisting of i.i.d. Gaussian random numbers.
  • diag: Generate a diagonal matrix from a supplied vector. It is memory efficient because it stores only the diagonal. Again, very helpful in Multi Model Training.
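
To make the memory argument concrete, here is a minimal sketch of a sparse identity and diagonal matrix in compressed sparse column (CSC) form. The `Csc` class and its field names are assumptions for illustration, not the SparseMatrix code in this PR. An n x n diagonal needs n values, n row indices, and n + 1 column pointers, instead of n * n doubles.

```scala
// Hypothetical CSC container: column j owns entries colPtrs(j) until colPtrs(j + 1).
final class Csc(
    val numRows: Int,
    val numCols: Int,
    val colPtrs: Array[Int],
    val rowIndices: Array[Int],
    val values: Array[Double]) {

  // Looks up entry (i, j) by scanning the stored entries of column j.
  def apply(i: Int, j: Int): Double =
    (colPtrs(j) until colPtrs(j + 1))
      .find(k => rowIndices(k) == i)
      .map(k => values(k))
      .getOrElse(0.0)
}

object Csc {
  // Diagonal matrix: column j holds exactly one entry, at row j.
  // Note that zeros in d are stored explicitly in this simple sketch.
  def diag(d: Array[Double]): Csc = {
    val n = d.length
    new Csc(n, n, Array.range(0, n + 1), Array.range(0, n), d.clone())
  }

  // Sparse identity is the diagonal matrix of all ones.
  def speye(n: Int): Csc = diag(Array.fill(n)(1.0))
}
```

For example, `Csc.speye(1000000)` stores about three million numbers, whereas a dense identity of that size would need 10^12 doubles.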

Factory methods for Matrices:

  • Include all the factory methods given above, but return a generic Matrix rather than SparseMatrix or DenseMatrix.
  • horzCat: Horizontally concatenate matrices to form one larger matrix. Very useful in both Multi Model Training, and for the repartitioning of BlockMatrix.
  • vertCat: Vertically concatenate matrices to form one larger matrix. Very useful for the repartitioning of BlockMatrix.
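
The two concatenations can be sketched on a column-major dense layout. This is an illustrative sketch under stated assumptions, not the implementation in this PR: `ColMajor` and `Concat` are hypothetical names, and only dense inputs are handled.

```scala
// Hypothetical column-major dense matrix used only for this sketch.
final class ColMajor(val numRows: Int, val numCols: Int, val values: Array[Double]) {
  require(values.length == numRows * numCols, "values must hold numRows * numCols entries")
  def apply(i: Int, j: Int): Double = values(i + numRows * j)
}

object Concat {
  // Horizontal: all inputs share a row count. In column-major order the
  // backing arrays can simply be appended one after another.
  def horzCat(ms: Seq[ColMajor]): ColMajor = {
    require(ms.nonEmpty, "need at least one matrix")
    val numRows = ms.head.numRows
    require(ms.forall(_.numRows == numRows), "row counts must match")
    val values = ms.iterator.flatMap(_.values.iterator).toArray
    new ColMajor(numRows, ms.map(_.numCols).sum, values)
  }

  // Vertical: all inputs share a column count. Each output column is made of
  // the corresponding input columns stacked on top of each other.
  def vertCat(ms: Seq[ColMajor]): ColMajor = {
    require(ms.nonEmpty, "need at least one matrix")
    val numCols = ms.head.numCols
    require(ms.forall(_.numCols == numCols), "column counts must match")
    val numRows = ms.map(_.numRows).sum
    val out = new Array[Double](numRows * numCols)
    var rowOffset = 0
    for (m <- ms) {
      for (j <- 0 until numCols; i <- 0 until m.numRows) {
        out(rowOffset + i + numRows * j) = m(i, j)
      }
      rowOffset += m.numRows
    }
    new ColMajor(numRows, numCols, out)
  }
}
```

Horizontal concatenation is a plain copy in this layout, while vertical concatenation has to interleave the inputs column by column; a row-major layout would reverse that trade-off.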

The names of these methods were taken from MATLAB.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23485 has started for PR 3319 at commit 94d7ae9.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23485 has finished for PR 3319 at commit 94d7ae9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23485/
Test FAILed.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23492 has started for PR 3319 at commit d662f9d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 17, 2014

Test build #23492 has finished for PR 3319 at commit d662f9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23492/
Test PASSed.

@mengxr
Contributor

mengxr commented Nov 25, 2014

@brkyvz Two comments on the API:

  1. For the APIs we provide, could you add a Java test suite and verify that all methods work in Java?
  2. horzCat and vertCat are not MATLAB operators, nor NumPy's. Maybe we should rename them to hstack and vstack, which are at least known by NumPy users.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23863 has started for PR 3319 at commit c75f3cd.

  • This patch merges cleanly.

@brkyvz
Contributor Author

brkyvz commented Nov 26, 2014

@mengxr:
Thanks for the feedback. Added the Java tests!
horzcat and vertcat are in fact MATLAB methods:
http://www.mathworks.com/help/matlab/ref/horzcat.html
http://www.mathworks.com/help/matlab/ref/vertcat.html
They are the underlying methods that are called when someone writes
A = [A1 A2; A3 A4];
I felt the naming was more intuitive because it is like strcat: you are concatenating matrices either horizontally or vertically. I'd be happy to change them to hstack and vstack, but horzcat sounds more intuitive to me (maybe I'm biased, because I used to use it more).
Your call :)

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23863 has finished for PR 3319 at commit c75f3cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23863/
Test PASSed.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23900 has started for PR 3319 at commit a8120d2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 26, 2014

Test build #23900 has finished for PR 3319 at commit a8120d2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23900/
Test PASSed.

@mengxr
Contributor

mengxr commented Nov 26, 2014

@brkyvz I didn't know MATLAB has horzcat and vertcat along with [A, B] or [A; B]. I'm okay with adopting method names from MATLAB. I hope there are no copyright issues. (I don't see any special statement from Octave.)

If we want to use MATLAB operators, maybe we should also stick to lowercase method names.

@brkyvz
Contributor Author

brkyvz commented Nov 26, 2014

I checked MATLAB's webpage and didn't see any copyright mentions for the method names. It's best to triple check though. Since NumPy and SciPy share method names with MATLAB, I don't expect there to be problems.
With the last commit I made the method names lowercase :)

import breeze.linalg.{Matrix => BM, DenseMatrix => BDM, CSCMatrix => BSM}

import java.util.{Random, Arrays}
import scala.collection.mutable.ArrayBuffer
Contributor

organize imports

}
j += 1
}
while (numCols > lastCol) {
Contributor

This check is not necessary. At the end of the while (j < numCols) loop, j = numCols + 1. So it is colPtrs(j) = nnz.
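
For context, the role of the final column pointer can be sketched as follows. This is not the code under review: `ColPtrs.build` is a hypothetical helper that derives CSC column pointers from the column index of each nonzero entry, using a count followed by a prefix sum. With this construction, empty trailing columns need no separate loop, and the last pointer always equals nnz.

```scala
object ColPtrs {
  // colIndices holds the column index of each nonzero entry, in any order.
  // Returns an array of length numCols + 1 where column j owns the entries
  // colPtrs(j) until colPtrs(j + 1).
  def build(numCols: Int, colIndices: Array[Int]): Array[Int] = {
    val colPtrs = new Array[Int](numCols + 1)
    // Count the entries of column j in slot j + 1.
    colIndices.foreach { j =>
      require(j >= 0 && j < numCols, s"column index $j out of range")
      colPtrs(j + 1) += 1
    }
    // A running sum turns the counts into start offsets.
    var j = 0
    while (j < numCols) {
      colPtrs(j + 1) += colPtrs(j)
      j += 1
    }
    colPtrs
  }
}
```

For three nonzeros in columns 0, 0, and 2 of a five-column matrix, the trailing empty columns 3 and 4 both start and end at offset 3, which is nnz.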

@SparkQA

SparkQA commented Dec 20, 2014

Test build #24668 has started for PR 3319 at commit 10a63a6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 20, 2014

Test build #24668 has finished for PR 3319 at commit 10a63a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24668/
Test PASSed.

@SparkQA

SparkQA commented Dec 24, 2014

Test build #24774 has started for PR 3319 at commit 04c4829.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 24, 2014

Test build #24775 has started for PR 3319 at commit b0354f6.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Dec 24, 2014

Test build #24774 has finished for PR 3319 at commit 04c4829.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24774/
Test FAILed.

@SparkQA

SparkQA commented Dec 24, 2014

Test build #24775 has finished for PR 3319 at commit b0354f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24775/
Test PASSed.

@mengxr
Contributor

mengxr commented Dec 29, 2014

LGTM. Merged into master. Thanks!!

@asfgit asfgit closed this in 02b55de Dec 29, 2014
@brkyvz brkyvz deleted the SPARK-4409 branch January 30, 2015 20:49
4 participants