
Improve Shogun PCA #3048

Closed
amoudgl opened this issue Mar 7, 2016 · 24 comments

@amoudgl
Contributor

amoudgl commented Mar 7, 2016

This IPython notebook compares Shogun PCA with my naive Python implementation of PCA and with other toolkits such as scikit-learn and Matplotlib. As per the results in the notebook, Shogun performs equally well in terms of speed.

The result of Shogun PCA is also accurate, except for a few minor issues that should be fixed so that the output matches the other toolkits:

  1. In the notebook above, I am using the same plot function for all toolkits, but Shogun gives a different plot for the new projected points obtained by PCA. The principal components in the output matrix are stacked in the opposite order.
  2. The apply_to_feature_matrix method currently gives scaled output by default, whereas the other toolkits do not (see the sketch below for what this scaling means in scikit-learn terms).
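
For context on item 2: scikit-learn exposes this kind of scaling explicitly through its `whiten` flag, and unscaled output is its default. A minimal sketch on synthetic data, for illustration only (this is not Shogun code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 10)  # 100 samples, 10 features

# Default: the projection keeps each component's original variance.
unscaled = PCA(n_components=2).fit_transform(X)

# whiten=True rescales every component to unit variance -- "scaled"
# output, analogous to what the issue describes for Shogun's default.
scaled = PCA(n_components=2, whiten=True).fit_transform(X)

print(unscaled.std(axis=0))  # component-wise spread of the projection
print(scaled.std(axis=0))    # roughly [1.0, 1.0]
```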
@karlnapf
Member

karlnapf commented Mar 7, 2016

Thanks for that! In its current state it is not that useful, but we can easily turn it into something that is.

  • This is a speed comparison. While the plots you made are nice, they are somewhat redundant given the PCA notebook we already have.
  • To compare results, simply compare the norms of the eigenvector differences and of the projected features (see the sketch after this list).
  • No need for the naive Python implementation and Matplotlib; the former is good to do so that you understand the method, but not to include in the notebook.
  • The datasets are way too small to tell us anything about the speed. I suggest you make a table where you measure the performance for datasets with varying dimension (D) and number of datapoints (N): N = (small, medium, large), D = (small, medium, large), and all combinations. Use a synthetic dataset for now.
  • PCA can be done in two ways: via eigendecomposition of the covariance matrix and via SVD of the data matrix. Shogun picks one automatically unless you specify it by hand. This has drastic implications for the runtime.
  • Memory is another very important topic here. Can you profile that?
  • If you find things that are annoying about the Shogun interface, or default settings (like scaling) that you don't like, please send a PR to fix them.
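
A minimal sketch of such a norm check, assuming the Shogun eigenvectors have already been pulled into a NumPy array (`shogun_components` is a placeholder here, filled with the scikit-learn result only so the snippet runs; eigenvector signs are arbitrary, so compare up to sign):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)

sk_components = PCA(n_components=2).fit(X).components_  # shape (2, 10)

# Placeholder: replace with the eigenvectors returned by Shogun's PCA.
shogun_components = sk_components.copy()

# Eigenvectors are only defined up to sign, so align signs first.
for i in range(len(sk_components)):
    if np.dot(sk_components[i], shogun_components[i]) < 0:
        shogun_components[i] *= -1

# The norm of the difference should be ~0 if the toolkits agree.
print(np.linalg.norm(sk_components - shogun_components))
```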

Looking forward to the update

@amoudgl
Contributor Author

amoudgl commented Mar 8, 2016

@karlnapf I ran the tests as suggested and the following are the results I obtained:

N - Number of samples
D - Number of features

Time analysis:

| D    | N     | Shogun  | scikit-learn |
|------|-------|---------|--------------|
| 10   | 100   | 0.00031 | 0.00046      |
| 10   | 1000  | 0.00043 | 0.00102      |
| 10   | 10000 | 0.0074  | 0.0272       |
| 100  | 100   | 0.0021  | 0.0195       |
| 100  | 1000  | 0.023   | 0.148        |
| 100  | 10000 | 0.081   | 0.756        |
| 500  | 100   | 0.293   | 0.077        |
| 500  | 1000  | 0.261   | 1.852        |
| 500  | 10000 | 1.487   | 15.303       |
| 1000 | 100   | 0.36    | 0.159        |
| 1000 | 1000  | 2.1005  | 8.491        |
| 1000 | 10000 | 6.819   | 62.383       |
| 2000 | 100   | 1.639   | 0.264        |

Memory analysis:
This is the memory analysis of PCA for Shogun and scikit-learn. In all cases, Shogun performs much better than scikit-learn.

  • The above are the results of Shogun in AUTO mode. I tried the other modes too and they give results as expected, i.e. when N > D, EVD performs better, otherwise SVD.
  • I plotted the samples in the resulting projected feature space and the results are accurate, except that the samples are scaled in Shogun (this can also be seen in the notebook). I reduce the number of features to 2 in all the above cases.
  • Shogun performs well when the number of samples is large, but when D > N, Shogun is a bit slower than scikit-learn.

The Python script for the time and memory analysis can be viewed here.
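
(The linked script is the authoritative version; for readers without the link, the timing side looks roughly like the sketch below. scikit-learn only, on synthetic data; the Shogun call is analogous but depends on the installed interface, so it is omitted here.)

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# A few (D, N) combinations from the table above.
cases = [(10, 100), (10, 1000), (10, 10000), (100, 1000), (500, 1000)]

rng = np.random.RandomState(0)
for d, n in cases:
    X = rng.randn(n, d)                   # N samples, D features
    start = time.time()
    PCA(n_components=2).fit_transform(X)  # target dimension 2, as above
    print("D=%d N=%d: %.5f s" % (d, n, time.time() - start))
```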

@vigsterkr
Member

thanks a lot for the analysis!

i find it a bit weird/interesting & worth investigating why SVD is so slow.
especially these two examples are worth looking into:

| D   | N    | Shogun | scikit-learn |
|-----|------|--------|--------------|
| 500 | 100  | 0.293  | 0.077        |
| 500 | 1000 | 0.261  | 1.852        |

as even though there are 10 times fewer vectors, the runtime is almost the same...

@amoudgl
Contributor Author

amoudgl commented Mar 9, 2016

Hi @vigsterkr, I am going through the source code to identify this problem. But I have a silly doubt which is a bit off-topic. It is as follows:

I commented out one method, apply_to_feature_matrix, of class CPCA in PCA.cpp and PCA.h, just to test changes using the python modular interface. I tried to rebuild the python modular interface with the command sudo make python_modular_src in the build folder. It rebuilt the interface, but when I call this method (apply_to_feature_matrix) in python again, it doesn't raise any error, as it may be using the previously built libraries. If I do sudo make in the build folder, it starts building all the modules again, which takes time. Also, I couldn't find any option in the Makefile to rebuild only the preprocessor modules like PCA.

So, what should be the general approach to testing the python modular interface after making a change in a source file, say PCA.cpp? Can you please help? Please correct me if I am wrong somewhere.

@vigsterkr
Member

first of all thanks for the benchmark, but it would be much better to do it as part of:
https://github.com/zoq/benchmarks

this framework already has support for PCA benchmarking. this way the results are easily reproducible, and it's part of a framework.

@vigsterkr
Member

regarding the changes: if you did make install you'll have to do it again...

@karlnapf
Member

This is great work! In particular that you included the memory benchmarks.

Indeed, doing it as part of a benchmark system would be best -- especially since then we can document it and potentially include it in a (new, updated) overview paper of Shogun.

In terms of speed:

  • Shogun in AUTO mode always chooses the mode that gives the lower complexity (in theory).
  • SVD in eigen3 is known to be slow, but more stable. This is not a problem in my eyes, as the difference will always be marginal (and it indeed is in the results). We could add this to linalg and use the same SVD call as scikit does (I suspect LAPACK). But that is not important for differences of this size. A small sketch below shows the two routes give the same result.
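
To make the two routes concrete, here is a small NumPy sketch (synthetic data) showing that eigendecomposition of the covariance matrix and SVD of the centered data matrix recover the same principal direction up to sign:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
Xc = X - X.mean(axis=0)                  # center the data

# Route 1: eigendecomposition of the D x D covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
evd_top = eigvecs[:, -1]                 # direction of largest variance

# Route 2: SVD of the centered N x D data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_top = Vt[0]                          # top right singular vector

# Identical up to sign; this distance should be ~0.
print(min(np.linalg.norm(evd_top - svd_top),
          np.linalg.norm(evd_top + svd_top)))
```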

Finally, the PCA code was tuned during GSoC 2014. This is why it is very fast and reliable. The next step here would be to include these results in a maintainable way so that we can reproduce them without much hassle. If you do this, it could also serve as an example of how to do it. Do you want to go ahead with this?

Once again, this is great, in particular that we are an order of magnitude faster than scikit.

@amoudgl
Contributor Author

amoudgl commented Mar 10, 2016

@karlnapf Thanks for your feedback. I'd be happy to go ahead with this. I wrote a python script that can reproduce these results. How should I include these results in a maintainable way? Should I make a new notebook that compares the two toolkits and illustrates time and memory analysis?

@vigsterkr
Member

@abhinavmoudgil95 see the benchmarking framework referred to above... that's the preferred way.

@karlnapf
Member

@zoq or @rcurtin might have comments

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

@karlnapf @vigsterkr I am using the referred framework for PCA benchmarking, but I am not able to get the output for Shogun. scikit-learn finished all datasets in about 50 minutes, but Shogun has been stuck on the last dataset for about 4 hours and the script is still running! I let the script run for about 5 hours last night but didn't get any output, so I stopped the process and tried again today. I'll let you know if I get any results.

@rcurtin

rcurtin commented Mar 11, 2016

Take a look at the config.yaml and see the block for shogun:

https://github.com/zoq/benchmarks/blob/master/config.yaml#L1381

There are a lot of methods and a lot of datasets being run there. You might want to try instead just running it for PCA (if you aren't already doing that), which you could do like this:

$ make BLOCK=shogun METHODBLOCK=PCA

More information here:
https://github.com/zoq/benchmarks/blob/master/README.md#benchmarking-a-single-libary-and-a-single-method

I hope that helps. It's also possible that you might want to remove a few datasets from the configuration so that it runs faster (some are very large).

@zoq
Contributor

zoq commented Mar 11, 2016

Even if it runs for hours it should, at least, print some output to standard out. Can you show us the command that you used to start the benchmark?

Btw. you can limit the execution time for each benchmark using the timeout parameter.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

@rcurtin Yes you are correct. To test only PCA, I used the following command:
make run LOG=true BLOCK=shogun,scikit METHODBLOCK=PCA

@zoq The script is still running. Since I wanted to save the output in the database, I set LOG=true, so I am not expecting any results on stdout. This is what I have gotten on stdout so far.

I think I'll remove this dataset from configuration file if I don't get output in an hour or so.

@rcurtin

rcurtin commented Mar 11, 2016

Yeah, the yearpredictionmsd dataset is pretty large; 500k+ points in I think 90 dimensions. So I'm not too surprised it's taking a long time.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

Well, let me guess. From what I know, PCA has time complexity O(np^2 + p^3). This dataset has 515345 instances and 90 features, so by the above complexity the total number of operations required is 4175023500. Assuming the machine carries out, on average, 10^7 operations per second, shouldn't it produce output in around 7 minutes? Correct me if I am wrong somewhere.
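
(For reference, the arithmetic in that estimate checks out; whether 10^7 operations per second is a realistic constant for the hardware is a separate question.)

```python
n, p = 515345, 90            # instances, features
ops = n * p**2 + p**3        # operation count from O(n*p^2 + p^3)
print(ops)                   # 4175023500
print(ops / 1e7 / 60)        # ~7.0 minutes at 10^7 operations/second
```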

@rcurtin

rcurtin commented Mar 11, 2016

Big-O notation leaves out constant factors, so you can use it to talk about how the runtime scales as you increase the dataset size, but you can't use it to predict how long a particular implementation will take. Also, the Shogun PCA implementation may use SVD on the data, or it may compute the covariance of the dataset and eigendecompose that, depending on the properties of the data.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

Yes, ignoring constants is not strictly correct, but we can get a rough estimate, right? In PCA, we don't do many extra calculations (i.e. calculations that are constant and independent of N and D) other than the eigendecomposition, the covariance matrix computation, and the matrix multiplication to get the new dataset.
Shogun uses SVD when the number of dimensions (D) is greater than the number of instances (N). Here, it'll use eigendecomposition of the covariance matrix. Earlier, when I did the benchmarking, Shogun was much faster than scikit-learn when N > D (results in this thread above). scikit-learn finished all the datasets in around 45 minutes, so I assumed Shogun would take much less time.

@karlnapf
Member

With big datasets, there are all sorts of weird effects happening on computers due to the way memory is managed. This can reveal further problems we have in the Shogun implementation.

For now, I suggest you start with a single smallish dataset to keep things simple, and then increase the size.

@karlnapf
Member

thanks @rcurtin @zoq for getting back!

@karlnapf
Member

#3054

@karlnapf
Member

#2991

@karlnapf
Member

any updates here?

@karlnapf
Member

Might be nice to explore this further in a GSoC entrance task, especially implementing a faster (LAPACK) SVD.
