
WIP: Sparse CSR Matrices #7029

Closed
wants to merge 2 commits into from

Conversation

tanmaykm
Member

This would introduce a sparse CSR matrix data type.

  • SparseMatrixCSC and SparseMatrixCSR implement CompressedSparseMatrix
  • CompressedSparseMatrix is an AbstractSparseMatrix, parameterized with a CompressedSparseType (CSC or CSR)
  • the existing CSC matrix methods were made generic wherever possible, so they can be used on any CompressedSparseMatrix, by working on the abstraction of compressed and uncompressed dimensions

All the linalg methods and a few matrix methods are yet to be implemented for CSR. This does not break any existing APIs. In brief, the API changes are:

  • There is a SparseMatrixCSR constructor, similar to the one for CSC.
  • The sparse constructor and methods like spzeros and sprand can now optionally take CSC or CSR as the first parameter (CSC remains the default).
  • The sparse constructor now has a variant that takes a sparse store (CompressedSparseStore or TripletSparseStore) specifying the data. This seemed like a good, uniform way to overload the sparse constructor for different data formats.
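To make the compressed/uncompressed-dimension abstraction concrete, here is a minimal, self-contained sketch (in plain Python rather than the PR's Julia code, with made-up names): both formats keep the same three arrays, and only which dimension is compressed differs.

```python
# Illustrative sketch of compressed sparse storage (not the PR's actual code).
# Both CSC and CSR keep three arrays; only the meaning of the compressed
# dimension differs: columns for CSC, rows for CSR.

def dense_to_compressed(A, compressed_dim):
    """Compress a dense matrix (list of rows) along 'col' (CSC) or 'row' (CSR)."""
    m, n = len(A), len(A[0])
    outer = n if compressed_dim == "col" else m      # compressed dimension
    inner = m if compressed_dim == "col" else n      # uncompressed dimension
    indptr, indval, nzval = [0], [], []
    for j in range(outer):
        for i in range(inner):
            v = A[i][j] if compressed_dim == "col" else A[j][i]
            if v != 0:
                indval.append(i)
                nzval.append(v)
        indptr.append(len(nzval))
    return indptr, indval, nzval

A = [[1, 0, 2],
     [0, 3, 0]]
print(dense_to_compressed(A, "col"))  # CSC: ([0, 1, 2, 3], [0, 1, 0], [1, 3, 2])
print(dense_to_compressed(A, "row"))  # CSR: ([0, 2, 3], [0, 2, 1], [1, 2, 3])
```

Most algorithms written against this abstraction never need to know which dimension is compressed, which is how the PR avoids duplicating the CSC code for CSR.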

Would appreciate feedback on whether this is in the right direction.

@tanmaykm tanmaykm added this to the 0.4 milestone May 29, 2014
@JeffBezanson
Sponsor Member

I also look forward to having COO matrices so we can handle extremely sparse data.

@ViralBShah
Member

This overhaul should also make it possible to add new storage formats. We want to do N-d COO next, and it will be quite interesting as one may want to use the data structure for non-numeric data too.
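As a rough illustration of what N-d COO (triplet) storage looks like (a hypothetical Python sketch, not a proposed design): one coordinate tuple per stored entry, which generalizes naturally beyond two dimensions and to non-numeric payloads.

```python
# Hypothetical sketch of N-d COO (triplet) storage. Names are made up for
# illustration; this is not a design proposal.

class COO:
    def __init__(self, shape):
        self.shape = shape
        self.coords = []   # one index tuple per stored entry
        self.values = []   # parallel list of stored values

    def setindex(self, index, value):
        self.coords.append(index)
        self.values.append(value)

    def getindex(self, index):
        # Linear scan for clarity; a real implementation would sort or hash.
        for c, v in zip(self.coords, self.values):
            if c == index:
                return v
        return 0

t = COO((2, 3, 4))            # a 2x3x4 sparse 3-d array
t.setindex((1, 2, 0), "hi")   # non-numeric payloads work too
print(t.getindex((1, 2, 0)))  # -> hi
```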

@StefanKarpinski
Sponsor Member

This is going to require some careful abstraction design – we don't want to have 3-4 full but different versions of all the sparse matrix logic and code. If we can get the abstractions right, this is where Julia can really shine.

@simonbyrne
Contributor

Also worth thinking about how this plays with #6837.

@tkelman
Contributor

tkelman commented May 29, 2014

@simonbyrne as I mentioned in that issue, having both CSR and CSC means sparse transpose can be a very cheap reinterpret-as-opposite-type operation. The current re-sorting loop for transpose should be reappropriated as the conversion implementation.
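The reinterpret observation can be demonstrated concretely (illustrative Python with hand-rolled accessors, not Julia code): the three CSC arrays of a matrix, read unchanged through CSR access rules, yield the transpose.

```python
# Sketch of "transpose is a reinterpret": the CSC arrays of A, read as CSR
# arrays with the dimensions swapped, describe A's transpose with no data
# movement. Pure-Python illustration with hand-built arrays.

def csc_get(indptr, rowval, nzval, i, j):
    """Read A[i, j] from CSC storage (scan column j for row i)."""
    for k in range(indptr[j], indptr[j + 1]):
        if rowval[k] == i:
            return nzval[k]
    return 0

def csr_get(indptr, colval, nzval, i, j):
    """Read A[i, j] from CSR storage (scan row i for column j)."""
    for k in range(indptr[i], indptr[i + 1]):
        if colval[k] == j:
            return nzval[k]
    return 0

# A = [[1, 0, 2], [0, 3, 0]] (2x3) in CSC form:
indptr, indval, nzval = [0, 1, 2, 3], [0, 1, 0], [1, 3, 2]

# Reading the same three arrays under CSR rules gives A transposed:
for i in range(2):
    for j in range(3):
        assert csc_get(indptr, indval, nzval, i, j) == \
               csr_get(indptr, indval, nzval, j, i)
print("CSC arrays reinterpreted as CSR give the transpose")
```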

@simonbyrne
Contributor

@tkelman True, but for complex types we would also need a mechanism to conjugate. I've been thinking that perhaps the best solution is a shallow Conjugate type (the same problem arises with Tridiagonals).
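The shallow Conjugate idea amounts to a zero-copy wrapper that conjugates on access rather than eagerly; a minimal sketch (illustrative Python, with a dense stand-in for the sparse matrix):

```python
# Sketch of a shallow lazy-conjugate wrapper: no data is copied or modified;
# conjugation happens at element access time. Dense stand-in for illustration.

class Conjugate:
    def __init__(self, data):
        self.data = data  # underlying matrix, stored unconjugated

    def __getitem__(self, ij):
        i, j = ij
        return self.data[i][j].conjugate()  # conjugate lazily, on read

A = [[1 + 2j, 0], [0, 3 - 1j]]
C = Conjugate(A)
print(C[0, 0])  # -> (1-2j)
```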

@ViralBShah
Member

@StefanKarpinski This is actually not adding all the code twice, but a refactor that essentially gives us CSR for a net addition of 200 lines. The COO implementation will have to be done completely from the ground up. As we add that, we will probably end up refactoring further. For now, I am pretty happy with this.

Cc: @lindahua @mlubin

@lindahua
Contributor

I may look at the code more carefully later, when I get some time.

It would be useful to have a benchmark suite to ensure there is no notable performance regression.

@ViralBShah
Member

We already have a few sparse matrix perf tests that we can check. Of course, we should certainly add more.

@ViralBShah
Member

@tkelman @simonbyrne I am not fully convinced about making the transpose a no-op, because that just translates the cost elsewhere. For the matrix multiply and divide cases, we already optimize away the transpose, and everywhere else, I feel it is best to keep the performance model transparent. Anyways, I am willing to try this out, but perhaps this discussion is best continued in #6837 and this PR can just focus on adding CSR functionality.

@tkelman
Contributor

tkelman commented May 30, 2014

I am not fully convinced about making the transpose a no-op, because that just translates the cost elsewhere. For the matrix multiply and divide cases, we already optimize away the transpose, and everywhere else, I feel it is best to keep the performance model transparent.

Last I checked, that was not the case. There don't appear to be any specialized spmm implementations for the different combinations of transposing one or the other input; these currently exist only for spmv. Right now spmm has to do a rather expensive double transpose at the end to put the indices of each column in sorted order. In some cases, especially when one or both inputs are transposed, it would be cheaper to transpose and swap the order of the inputs, and then only transpose the output once. It also looks like mixed-type spmm is not implemented yet in this PR; that would be valuable to have.

Julia's sparse code is currently much less capable than SciPy's, and its performance is inferior. There are certain choices made in SciPy, like allowing unsorted indices to persist until an operation is performed that needs them to be sorted, where I agree you are mostly just deferring operations until later. But there are quite a few cases, with transpose in particular, where adding some cleverness can absolutely remove big performance penalties Julia is paying right now due to extra sorting and bookkeeping that it doesn't have to do.
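For context, the double transpose mentioned here exploits the fact that a counting-sort transpose of compressed storage emits indices in sorted order, so transposing twice sorts a matrix whose indices are unsorted. A sketch of that pass (illustrative Python, not Julia's actual implementation):

```python
# Counting-sort transpose of CSC storage: one pass counts entries per output
# column, a prefix sum turns counts into offsets, and a scatter pass emits the
# transposed entries with row indices in sorted order per column. Applying it
# twice therefore sorts an unsorted-indices matrix.

def csc_transpose(m, n, indptr, rowval, nzval):
    """Transpose an m x n CSC matrix; output row indices come out sorted."""
    t_indptr = [0] * (m + 2)
    for r in rowval:                      # count entries per output column
        t_indptr[r + 2] += 1
    for i in range(2, m + 2):
        t_indptr[i] += t_indptr[i - 1]    # prefix sums (shifted by one slot)
    t_rowval = [0] * len(rowval)
    t_nzval = [0] * len(nzval)
    for j in range(n):                    # scatter each entry to its slot
        for k in range(indptr[j], indptr[j + 1]):
            dest = t_indptr[rowval[k] + 1]
            t_rowval[dest] = j
            t_nzval[dest] = nzval[k]
            t_indptr[rowval[k] + 1] += 1
    return t_indptr[: m + 1], t_rowval, t_nzval

# 2x2 matrix [[1, 2], [3, 4]] with column 0's rows stored out of order:
indptr, rowval, nzval = [0, 2, 4], [1, 0, 0, 1], [3, 1, 2, 4]
p, r, v = csc_transpose(2, 2, *csc_transpose(2, 2, indptr, rowval, nzval))
print(p, r, v)  # -> [0, 2, 4] [0, 1, 0, 1] [1, 3, 2, 4]
```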

@ViralBShah
Member

It would be good to have some performance comparisons with SciPy to see where we lack; it would be great if you could file issues. We can certainly relax the requirement that indices be sorted if there is a net gain in doing so. Handling transpose in SPMM is something we should add. Mixed-type operations are not yet implemented, but I believe @tanmaykm was planning to add them in a future PR, and it would be great if he can weigh in.

Often people write their codes expecting the matrix to be in a particular format - like working on columns or rows. Automatically changing the storage can be unexpected if someone has carefully written code expecting a particular format. If the user wants a format change, then that can be done explicitly rather than by doing a transpose.

Changing the behaviour of transpose and fixing other performance deficiencies are two separate things, I believe. In the short term, you are right that changing the transpose behaviour will give some performance gains, but once we address other deficiencies, we may very well not want the changed behaviour.

@tkelman
Contributor

tkelman commented May 30, 2014

I'm vaguely remembering a discussion on the mailing list from a few months ago about spmm, the result of which was me adding another spmm performance test (the existing one was sparse(ones(n,n))*sparse(ones(n,n)), not the greatest).

If code requires CSC or CSR, it can easily typeassert or convert.

You can't fix transpose being expensive by default if there are no other operations happening around it to optimize against. Shouldn't we make the fast version the default choice? If we're considering making dense transposes and slices into views, I don't see why sparse should be deprived of the same performance-vs-complexity discussion.

But sure, these can all be split out into separate PR discussions and don't all have to be solved at once.

@tanmaykm
Member Author

Mixed-type operations can certainly be added. We did not have transpose switch the format by default because it would break a lot of existing code, with no easy way to deprecate things. It would probably be easier to make these decisions once we have a more complete implementation across the sparse formats and appropriate abstractions for packages to migrate to.

@tkelman
Contributor

tkelman commented May 30, 2014

@tanmaykm that makes sense, would want things to be a bit more complete before making that kind of change.

@andreasnoack
Member

I still think that it is natural to discuss this in connection with #6837. As pointed out by @tkelman, a Transpose type effectively defines a row-major matrix from a column-major matrix, so a separate CSR type shouldn't be necessary. All the methods written for the CSR type would still be needed, but they would be written for Transposed{CSC} instead. Hence I think we should decide whether we want a Transpose type or row-major arrays; having both seems unnecessary. I'd vote for handling this through the Transpose type because it also makes handling conjugation before multiplication and division easier.
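The Transposed{CSC} idea is essentially a zero-copy view that swaps indices on access; a minimal sketch of the wrapper (illustrative Python, with a dense stand-in for the CSC matrix):

```python
# Sketch of a Transposed wrapper: a zero-copy view whose element access swaps
# the indices, turning a column-major matrix into a row-major one lazily.

class Transposed:
    def __init__(self, parent):
        self.parent = parent          # no data is copied

    def __getitem__(self, ij):
        i, j = ij
        return self.parent[j, i]      # swap the indices on access

class Matrix:                      # dense stand-in for a CSC matrix
    def __init__(self, rows):
        self.rows = rows
    def __getitem__(self, ij):
        i, j = ij
        return self.rows[i][j]

A = Matrix([[1, 2], [3, 4]])
At = Transposed(A)
print(At[0, 1], At[1, 0])  # -> 3 2
```

The open question in the thread is whether methods should dispatch on such a wrapper type or on a first-class SparseMatrixCSR; functionally the two are interchangeable.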

@ViralBShah
Member

The need for row-major here comes from application needs; I am sure others will chime in. It has been discussed in a few places now, and SciPy also has it. We can still have a Transpose type, but I'd like to tread carefully there. I am not fully convinced, but neither am I completely against it.

@tkelman
Contributor

tkelman commented May 30, 2014

Is there any difference in functionality between Transposed{CSC} and CSR, or is it just a naming/philosophy question? It is a little funny how a) the state of Julia's current support, and b) application biases, lead to row-major dense arrays feeling redundant and a lot of work, but CSR sparse feeling like a more familiar pattern than a transpose type and worth making first-class.

Any applications that are more natural in row-major form (@tknopp has mentioned Kaczmarz, also parallel spmv a la https://github.com/madeleineudell/ParallelSparseMatMul.jl) can presumably also be done in terms of Transposed{CSC}, just might require a bit more thought during development to keep the extra abstraction level in mind.

@tknopp
Contributor

tknopp commented May 30, 2014

Here is the issue #5932 where row-major applications were discussed. Kaczmarz's algorithm is indeed something that is currently hard to express in Julia. I went down the road of doing the transposition in my head, yielding code that feels just wrong, but is still better than code that is completely slow.
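For context on why Kaczmarz wants row-major storage: each step projects the iterate onto one row's hyperplane a_i . x = b_i, so the inner loop needs the nonzeros of a single row, which CSR stores contiguously and CSC does not. A sketch of one sweep (illustrative Python over assumed CSR arrays, not the poster's code):

```python
# One Kaczmarz sweep over a CSR matrix: for each row i, project the iterate x
# onto the hyperplane a_i . x = b_i. Row i's nonzeros are the contiguous
# slice indptr[i]:indptr[i+1], which is why CSR suits this access pattern.

def kaczmarz_sweep(indptr, colval, nzval, b, x):
    """One in-place Kaczmarz sweep over all rows of a CSR matrix."""
    m = len(indptr) - 1
    for i in range(m):
        lo, hi = indptr[i], indptr[i + 1]          # contiguous row slice
        dot = sum(nzval[k] * x[colval[k]] for k in range(lo, hi))
        norm2 = sum(v * v for v in nzval[lo:hi])
        if norm2 == 0:
            continue                               # skip empty rows
        alpha = (b[i] - dot) / norm2
        for k in range(lo, hi):
            x[colval[k]] += alpha * nzval[k]       # project onto the hyperplane
    return x

# 2x2 identity in CSR; the system I x = [5, 7] converges in one sweep:
x = kaczmarz_sweep([0, 1, 2], [0, 1], [1.0, 1.0], [5.0, 7.0], [0.0, 0.0])
print(x)  # -> [5.0, 7.0]
```

With CSC storage, extracting row i instead requires a scan across every column, which is the mismatch being described.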

@ViralBShah
Member

@tknopp Could you try this PR? It does not have the matmul implemented yet for CSR though. Once we accept the basic design, we can add more functionality.

@tknopp
Contributor

tknopp commented May 30, 2014

I can have a look, although I have to say that in my application I use both sparse and dense matrices with the Kaczmarz algorithm, so this PR solves only part of the issue.

SparseMatrixCSC(int(m), int(n), colptr, rowval, nzval)
indptr{Tv,Ti}(S::SparseMatrixCSC{Tv,Ti}) = S.colptr
indval{Tv,Ti}(S::SparseMatrixCSC{Tv,Ti}) = S.rowval
indptr!{Tv,Ti}(S::SparseMatrixCSC{Tv,Ti}, colptr::Vector{Ti}) = (S.colptr = colptr)
Member

This seems like an unusual syntax for setting a member to a given value.

Member Author

Perhaps getindptr and setindptr! would be better names?

Member

Yes, those names sound better to me.

@mlubin
Member

mlubin commented May 30, 2014

It's inevitable that we'll have to support a variety of sparse matrix formats, there just isn't one that is the best in all applications. This seems like a step in the right direction. I've made some comments inline. It's also very important to check for any performance regressions.

@ViralBShah
Member

The goal is not to support every possible sparse matrix data structure in Julia, but to have 2-3 of the common ones that support most use cases. Also, we should make it easy to add a package with a new sparse matrix data structure.

@lindahua
Contributor

A question about the type hierarchy design:

Generally, an abstract type is useful when it is extensible, meaning that people can write other subtypes in addition to the several ones provided in the Base. Typically, an abstract type is associated with a set of interfaces that subtypes should implement.

I am not very sure that this is the case for CompressedSparseMatrix. In addition to SparseMatrixCSC and SparseMatrixCSR, what else could be implemented as a subtype of it?

If the motivation is that SparseMatrixCSC and SparseMatrixCSR share some code, then the following suffices:

typealias CompressedSparseMatrix Union(SparseMatrixCSC, SparseMatrixCSR)

@tanmaykm
Member Author

@lindahua One reason for CompressedSparseMatrix was to enable type dispatch on Tv, Ti and Ts. I am not sure whether that is possible with unions.

@tanmaykm
Member Author

Thanks. I shall incorporate some of the suggestions and rebase.

@tanmaykm
Member Author

Rebased and incorporated some of the comments. The perf test results comparing this PR with master is here: https://gist.github.com/tanmaykm/4ee2ea00c468bbacc65a

I've updated the PR description with the changes, and shall try making CompressedSparseMatrix a union next.

@tanmaykm
Member Author

CompressedSparseMatrix is now a Union instead of an abstract type. Also added some performance tweaks for getindex.

@ViralBShah
Member

I think it would be nice to get this PR updated to master. There are lots of people who want CSR sparse matrices. The main thing would be to benchmark things so that there is no performance regression for existing sparse matrix code. Over time, after the basic support is in, we can work on interoperability of CSR and CSC matrices.

@ViralBShah ViralBShah modified the milestones: 0.4, 0.4-projects Aug 23, 2014
@jiahao jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f Compare October 11, 2014 22:06
@ViralBShah ViralBShah removed this from the 0.4-projects milestone Feb 14, 2015
@ViralBShah ViralBShah closed this Feb 19, 2015
@ViralBShah
Member

I am closing this, since we really should do this in a package.

@tkelman
Contributor

tkelman commented Mar 3, 2015

@ViralBShah @tanmaykm should we create a https://github.com/JuliaSparse/SparseMatrices.jl package for working on alternate sparse formats, CSR, COO, etc? I have some work where I'll want to use an array-of-sparse-columns format, might be good to put that part somewhere it could be more broadly used.

@ViralBShah
Member

Good idea. Let's do it. We should change the interfaces in base to make it easier to add more formats as packages.

@sglyon
Contributor

sglyon commented Aug 14, 2015

What's the status on this?

I realized that I could really use a CSR format for an application I am working on.

@mlubin
Member

mlubin commented Aug 14, 2015

Doesn't look like the code was ever moved into a package.

@sglyon
Contributor

sglyon commented Aug 14, 2015

I'm happy to pull the code out into a package.

Any objections?

@tkelman
Contributor

tkelman commented Aug 14, 2015

Nope. Just sent you an invite to join JuliaSparse. Good to do it there for easier collaboration.

@sglyon
Contributor

sglyon commented Aug 15, 2015

Great, thanks. I'll make a SparseCSR.jl over there and start factoring the code out of this PR.

@sglyon
Contributor

sglyon commented Aug 15, 2015

So I'm looking at this now. Given how many changes were made to base here, I don't see a clean way to provide CSR in a package without duplicating a lot of what we have for CSC in base.

We might consider paring down this PR so that all the generalizations to the algorithms that "made room" for CSR remain in base, but the actual implementation of CSR gets farmed out to another package. With the right abstraction we could "make room" for COO, BSR, or others to be implemented in other packages.

Has anyone put serious thought into how this abstraction might look?

@tkelman
Contributor

tkelman commented Aug 15, 2015

I've thought a fair amount about the abstraction, and a direction I've started looking into has been basing more of the internal implementation on an eachstored iteration protocol for sparse matrices. So far my prototypes are really immature, the performance isn't very good and it's nowhere near being base-worthy. And this has more to do with the internals of how the various operations are implemented, rather than the user-facing API for interacting with different formats.
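As a sketch of what such an eachstored protocol might look like (hypothetical Python, not the prototype described above): operations are written against a generic stream of stored (i, j, value) entries, so one implementation serves any format that can produce the stream.

```python
# Hypothetical sketch of an "eachstored" iteration protocol: each storage
# format provides an iterator over its stored (i, j, value) entries, and
# generic operations are written against that iterator alone.

def eachstored_csc(n, indptr, rowval, nzval):
    """Yield (i, j, v) for every stored entry of an n-column CSC matrix."""
    for j in range(n):
        for k in range(indptr[j], indptr[j + 1]):
            yield rowval[k], j, nzval[k]

def sparse_sum(stored):
    """A format-agnostic operation written purely against the protocol."""
    return sum(v for _, _, v in stored)

# A = [[1, 0, 2], [0, 3, 0]] as CSC:
print(sparse_sum(eachstored_csc(3, [0, 1, 2, 3], [0, 1, 0], [1, 3, 2])))  # -> 6
```

A CSR or COO matrix would supply its own eachstored iterator, and sparse_sum (and anything else written this way) would work unchanged.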

While a new more minimal reorganization PR could be started to make the base interfaces more easily extensible, I don't think we have time to get anything significant into base before 0.4, considering we're about to declare a feature freeze. So if you're in a hurry to build some functionality you can use, the most immediate approach would be to just live with duplicating code at first.

For design adjustments to base for 0.5 or later, we could start thinking about what we'd like the design to look like. That early design brainstorming probably doesn't have to happen on this repo's issue tracker or as PRs just yet.

@sglyon
Contributor

sglyon commented Aug 15, 2015

That's great. Maybe we can start working from what you have already started.

Totally agree that we shouldn't try to get anything related to this into 0.4 -- bring on the feature freeze! I actually think the array-themed 0.5 is a good target to aim for, to get at least parts of this in.

Also agree that we should have design discussions elsewhere. Should we create a new repo in JuliaSparse and have it serve as a playground/discussion area as we think this through?

@tkelman
Contributor

tkelman commented Aug 15, 2015

Sure. We could either make it a "meta" repo just for discussion like some other organizations have, or a stub package with the intent of growing it into something functional. Since I think the abstraction-design goal is more ambitious than just extending to CSR, might want a more generic name like SparseMatrices.jl or SparseArrays.jl (would possibly conflict with #12323) or SparseIterators.jl or ...
