Correlation sparse array is very slow #7788

Closed
paulanalyst opened this issue Jul 30, 2014 · 25 comments · Fixed by #22735
Labels
help wanted Indicates that a maintainer wants help on an issue or pull request performance Must go faster sparse Sparse arrays

Comments

@paulanalyst

Correlation on a sparse array is very slow, and a dense array runs out of memory when we have 30,000 columns. How can this be computed quickly?

julia> I=int32((rand(10^7)*9999999).+1);

julia> J=int32((rand(10^7)*29999).+1);

julia> V=int8((rand(10^7)*9).+1);

julia> D=sparse(I,J,V);

julia> @time cor(D[:,1:30]);
elapsed time: 23.806328476 seconds (2458875228 bytes allocated, 0.14% gc time)

julia> @time cor(full(D[:,1:30]));
elapsed time: 4.494099126 seconds (2732042496 bytes allocated, 5.31% gc time)

Paul

@ivarne
Member

ivarne commented Jul 30, 2014

@ViralBShah ViralBShah added this to the 0.4 milestone Jul 30, 2014
@paulanalyst
Author

Running mean(S, 1), cov, and similar column-wise functions on sparse arrays is also very slow.

@ViralBShah
Member

We will fix all of these as soon as 0.3 is released. Thanks for reporting.

@paulanalyst
Author

Correlations and cov are computed wonderfully on dense matrices; all 8 cores always run at 100% on my Win7 machine. It would be great if it worked the same way for sparse: on all cores at full power.
Paul

On 2014-08-01 14:45, Viral B. Shah wrote:

We will fix all of these as soon as 0.3 is released. Thanks for reporting.



@paulanalyst
Author

When might this work?
If it does not, how can I compute cov of matrix D so it finishes in this lifetime? ;)

I=int32((rand(10^7)*9999999).+1);
J=int32((rand(10^7)*29999).+1);
V=int8((rand(10^7)*9).+1);
D=sparse(I,J,V);
C=cov(D)
Paul

@ViralBShah
Member

I suspect that just implementing the At_mul_B and friends for sparse should suffice. Can you try for smaller problems? Basically you have to do A'*A at some point, and you can't do any faster than that. Once we have multi-threading and such, we can get a few more speedups.
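For reference, a minimal sketch of that formulation (sparse_cov is only an illustrative name, not a Base function), assuming X is a SparseMatrixCSC with floating-point values (for the Int8-valued D above you would convert first, e.g. X = 1.0 * D, to avoid overflow in the products):

function sparse_cov(X)
    m, n = size(X)
    mu  = full(sum(X, 1)) / m    # 1-by-n row of column means
    XtX = full(X' * X)           # n-by-n Gram matrix; the product itself stays sparse
    return (XtX - m * (mu' * mu)) / (m - 1)
end

Only the n-by-n result and the 1-by-n mean vector are densified here; the m-by-n data never is.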

@paulanalyst
Author

Big thanks, that is a short way to get there ;)

@ViralBShah
Member

@lindahua can you help here?

@lindahua
Contributor

I think one may just need to implement At_mul_B and friends

@paulanalyst
Author

It is also very slow when we compute a sparse mean, etc.:
I=int32((rand(10^7)*9999999).+1);
J=int32((rand(10^7)*29999).+1);
V=int8((rand(10^7)*9).+1);
D=sparse(I,J,V);
mean(D,1)
looooong time...
etc...

@lindahua
Contributor

Reduction along dimensions has not been specially optimized for sparse matrices.

That should not be too difficult, though, since we only have to consider matrices here rather than arrays of arbitrary dimensions.
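As a rough illustration of what such a specialization could look like (colsums is a hypothetical name, not something in Base), a column-wise sum only needs one pass over the CSC storage:

function colsums(S::SparseMatrixCSC)
    n = size(S, 2)
    out = zeros(1, n)
    for j = 1:n
        # stored entries of column j live in nzval[colptr[j] : colptr[j+1]-1]
        for k = S.colptr[j]:(S.colptr[j+1] - 1)
            out[j] += S.nzval[k]
        end
    end
    return out
end

Because the zeros are never visited, the cost is proportional to nnz(S) plus the number of columns rather than to m*n.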

@paulanalyst
Author

Dear all, what about sparse statistics?
Here is a simple example:
@time E=mean(D,1)
shows nothing after 15 minutes, but it is possible in 0.8 seconds with a simple loop:

julia> k,l=size(D)
(6000000,30000)

julia> E=zeros(l)';

julia> @time for i=1:l
E[i]=mean(D[:,i])
#if mod(i,1000).==0 println(i);end;
end;
elapsed time: 0.838322868 seconds (794730416 bytes allocated, 48.18% gc time)

It is probably similar for cor, var, cov, etc.

Paul

@paulanalyst
Author

Please make these functions run in parallel.

@andreasnoack
Member

@paulanalyst As usual: what is your versioninfo()? Also what is nnz(D)? On latest master, I get

julia> A = sprandn(6000000,30000,0.00003);

julia> @time mean(A, 1);
elapsed time: 0.13834035 seconds (166 MB allocated, 9.94% gc time in 7 pauses with 0 full sweep)

@paulanalyst
Author

julia> versioninfo()
Julia Version 0.3.5
Commit a05f87b* (2015-01-08 22:33 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3

julia> nnz(D)
99625891
Paul

@nalimilan
Member

I can confirm that on 0.3.5 @andreasnoack's example takes ages.

@andreasnoack
Member

Yes. A faster mean for sparse matrices was introduced in bde4e65 so it is not available on 0.3.x. However, the implementation is very simple, i.e. sum(A,1)/size(A,1), so you could use this method instead while waiting for 0.4.

@paulanalyst
Author

OK, on 0.4.0 mean is fast, as shown below. But for var I have been waiting many minutes...
A fresh approach to technical computing
Documentation: http://docs.julialang.org
Type "help()" for help.

Version 0.4.0-dev+2438 (2015-01-03 12:36 UTC)
Commit b0d94dd (27 days old master)
x86_64-w64-mingw32

julia> @time E=mean(D, 1)
elapsed time: 6.035410891 seconds (1558239044 bytes allocated, 18.71% gc time)
1x30070 Array{Float64,2}:
0.0 0.0904202 0.0072963 0.00103694 0.00430618 1.68553e-7 3.37105e-7 3.37105e-7 1.68553e-7

julia> @time E=var(D, 1)
minutes....
Paul

@paulanalyst
Author

sum(D,1)/size(D,1) on 0.3.5 takes twice as long as mean(D,1) on 0.4.0: 11 vs 6 seconds on my array.

julia> @time sum(D,1)/size(D,1)
elapsed time: 11.550475452 seconds (1555463276 bytes allocated, 13.28% gc time)
1x30070 Array{Float64,2}:
0.0 0.0904202 0.0072963 0.00103694 0.00430618 1.68553e-7 3.37105e-7 3.37105e-7 1.68553e-7
Paul

@andreasnoack
Member

I think the reason sum in 0.4 is faster is our new garbage collector, because I think the implementation is the same.

The var reduction is slightly more complicated and not implemented efficiently for sparse matrices yet, but you can probably benefit from the trick we discussed last time, i.e. by using the E(XX') - EX(EX)' formulation.
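Applied column-wise, that trick amounts to E[X.^2] - (E[X]).^2 per column. A minimal sketch (colvar is just an illustrative name), again assuming a floating-point sparse matrix:

function colvar(X)
    m   = size(X, 1)
    mu  = full(sum(X, 1)) / m        # E[X] per column
    ex2 = full(sum(X .* X, 1)) / m   # E[X.^2] per column; X .* X stays sparse
    return (ex2 - mu .* mu) * (m / (m - 1))   # rescale to the unbiased estimator
end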

@ViralBShah
Member

We should backport the mean.

@paulanalyst
Author

@andreasnoack
E(XX') is a vector...

I use X'X - E(X)'*E(X)
Paul

@andreasnoack
Member

Then you won't get the right result.

Anyway, I'm talking about the method in the issue where you thought there was a rounding error, but it was the wrong formula.

ViralBShah added a commit that referenced this issue Feb 1, 2015
@ViralBShah
Member

#10536 should have fixed this one.

@ViralBShah ViralBShah modified the milestones: 0.4, 0.4.1 Mar 19, 2015
@simonster simonster reopened this Mar 19, 2015
@simonster
Member

Unfortunately it doesn't look like it does, since cor is not based on mapreduce/mapreducedim.
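In the meantime, a correlation matrix can be assembled from the covariance sketch earlier in this thread by normalizing with the column standard deviations (sparse_cor, like sparse_cov above, is only an illustrative name):

function sparse_cor(X)
    C = sparse_cov(X)                          # dense n-by-n covariance from the sketch above
    s = [sqrt(C[i, i]) for i = 1:size(C, 1)]   # column standard deviations
    return C ./ (s * s')                       # normalize covariances to correlations
end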

@JeffBezanson JeffBezanson modified the milestones: 0.4.x, 0.4.0 Jun 2, 2015
@ViralBShah ViralBShah modified the milestones: 0.4.x, 0.5.0 Oct 1, 2015
@IainNZ IainNZ added the help wanted Indicates that a maintainer wants help on an issue or pull request label Oct 1, 2015
@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 0.5.0 Mar 9, 2016
@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed help wanted Indicates that a maintainer wants help on an issue or pull request labels Oct 27, 2016