
Improve Shogun PCA #3048

Closed
amoudgl opened this issue Mar 7, 2016 · 24 comments

@amoudgl
Contributor

amoudgl commented Mar 7, 2016

This IPython notebook compares Shogun PCA with my naive Python implementation of PCA and with other toolkits such as scikit-learn and Matplotlib. As per the results in the notebook, Shogun performs equally well in terms of speed.

The result of Shogun PCA is also accurate, except for a few minor issues that should be fixed so that the output matches the other toolkits:

  1. In the notebook above, I am using the same plot function for all toolkits, but Shogun gives a different plot for the new projected points obtained by PCA. The principal components in the output matrix are stacked in the opposite order.
  2. The apply_to_feature_matrix method currently gives scaled output by default, whereas the other toolkits do not (see the sketch below for what this scaling means in scikit-learn terms).
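
For context on item 2: scikit-learn exposes this kind of scaling explicitly through its `whiten` flag, and unscaled output is its default. A minimal sketch on synthetic data, for illustration only (this is not Shogun code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 10)  # 100 samples, 10 features

# Default: the projection keeps each component's original variance.
unscaled = PCA(n_components=2).fit_transform(X)

# whiten=True rescales every component to unit variance -- "scaled"
# output, analogous to what the issue describes for Shogun's default.
scaled = PCA(n_components=2, whiten=True).fit_transform(X)

print(unscaled.std(axis=0))  # component-wise spread of the projection
print(scaled.std(axis=0))    # roughly [1.0, 1.0]
```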
@karlnapf
Member

karlnapf commented Mar 7, 2016

Thanks for that! In its current state it is not that useful, but we can easily turn it into something that is.

  • This is a speed comparison. While the plots you made are nice, they are somewhat redundant given the PCA notebook we already have.
  • To compare results, simply compare the norms of the eigenvector differences and of the projected features (see the sketch after this list).
  • No need for the naive Python implementation and Matplotlib; the former is good to do so that you understand the method, but not to include in the notebook.
  • The datasets are way too small to tell us anything about the speed. I suggest you make a table where you measure the performance for datasets with varying dimension (D) and number of datapoints (N): N = (small, medium, large), D = (small, medium, large), and all combinations. Use a synthetic dataset for now.
  • PCA can be done in two ways: via eigendecomposition of the covariance matrix and via SVD of the data matrix. Shogun picks one automatically unless you specify it by hand. This has drastic implications for the runtime.
  • Memory is another very important topic here. Can you profile that?
  • If you find things that are annoying about the Shogun interface, or default settings (like scaling) that you don't like, please send a PR to fix them.
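
A minimal sketch of such a norm check, assuming the Shogun eigenvectors have already been pulled into a NumPy array (`shogun_components` is a placeholder here, filled with the scikit-learn result only so the snippet runs; eigenvector signs are arbitrary, so compare up to sign):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(1000, 10)

sk_components = PCA(n_components=2).fit(X).components_  # shape (2, 10)

# Placeholder: replace with the eigenvectors returned by Shogun's PCA.
shogun_components = sk_components.copy()

# Eigenvectors are only defined up to sign, so align signs first.
for i in range(len(sk_components)):
    if np.dot(sk_components[i], shogun_components[i]) < 0:
        shogun_components[i] *= -1

# The norm of the difference should be ~0 if the toolkits agree.
print(np.linalg.norm(sk_components - shogun_components))
```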

Looking forward to the update

@amoudgl
Contributor Author

amoudgl commented Mar 8, 2016

@karlnapf I ran the tests as suggested and the following are the results I obtained:

N - Number of samples
D - Number of features

Time analysis:

| D    | N     | Shogun  | scikit-learn |
|------|-------|---------|--------------|
| 10   | 100   | 0.00031 | 0.00046      |
| 10   | 1000  | 0.00043 | 0.00102      |
| 10   | 10000 | 0.0074  | 0.0272       |
| 100  | 100   | 0.0021  | 0.0195       |
| 100  | 1000  | 0.023   | 0.148        |
| 100  | 10000 | 0.081   | 0.756        |
| 500  | 100   | 0.293   | 0.077        |
| 500  | 1000  | 0.261   | 1.852        |
| 500  | 10000 | 1.487   | 15.303       |
| 1000 | 100   | 0.36    | 0.159        |
| 1000 | 1000  | 2.1005  | 8.491        |
| 1000 | 10000 | 6.819   | 62.383       |
| 2000 | 100   | 1.639   | 0.264        |

Memory analysis:
This is the memory analysis of PCA for Shogun and scikit-learn. In all cases, Shogun performs much better than scikit-learn.

  • The above are the results of Shogun in AUTO mode. I tried the other modes too and they give results as expected, i.e. when N > D, EVD performs better, otherwise SVD.
  • I plotted the samples in the resulting projected feature space and the results are accurate, except that the samples are scaled in Shogun (this can also be seen in the notebook). I reduce the number of features to 2 in all the above cases.
  • Shogun performs well when the number of samples is large, but when D > N, Shogun is a bit slower than scikit-learn.

The Python script for the time and memory analysis can be viewed here.
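
(The linked script is the authoritative version; for readers without the link, the timing side looks roughly like the sketch below. scikit-learn only, on synthetic data; the Shogun call is analogous but depends on the installed interface, so it is omitted here.)

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# A few (D, N) combinations from the table above.
cases = [(10, 100), (10, 1000), (10, 10000), (100, 1000), (500, 1000)]

rng = np.random.RandomState(0)
for d, n in cases:
    X = rng.randn(n, d)                   # N samples, D features
    start = time.time()
    PCA(n_components=2).fit_transform(X)  # target dimension 2, as above
    print("D=%d N=%d: %.5f s" % (d, n, time.time() - start))
```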

@vigsterkr
Member

thanks a lot for the analysis!

i find it a bit weird/interesting & worth investigating why SVD is so slow.
especially these two examples are worth looking into:

| D   | N    | Shogun | scikit-learn |
|-----|------|--------|--------------|
| 500 | 100  | 0.293  | 0.077        |
| 500 | 1000 | 0.261  | 1.852        |

as even though there are 10 times fewer vectors, the runtime is almost the same...

@amoudgl
Contributor Author

amoudgl commented Mar 9, 2016

Hi @vigsterkr, I am going through the source code to identify this problem. But I have a silly doubt which is a bit off-topic. It is as follows:

I commented out one method, apply_to_feature_matrix, of class CPCA in PCA.cpp and PCA.h, just to test changes using the python modular interface. I tried to rebuild the python modular interface with the command sudo make python_modular_src in the build folder. It rebuilt the interface, but when I call this method (apply_to_feature_matrix) in python again, it doesn't raise any error, as it may be using the previously built libraries. If I do sudo make in the build folder, it starts building all the modules again, which takes time. Also, I couldn't find any option in the Makefile to rebuild only the preprocessor modules like PCA.

So, what should be the general approach to testing the python modular interface after making a change in a source file, say PCA.cpp? Can you please help? Please correct me if I am wrong somewhere.

@vigsterkr
Member

first of all thanks for the benchmark, but it would be much better to do it as part of:
https://github.com/zoq/benchmarks

this framework already has support for PCA benchmarking. this way the results are easily reproducible, and it's part of a framework.

@vigsterkr
Member

regarding the changes: if you did make install you'll have to do it again...

@karlnapf
Member

This is great work! In particular that you included the memory benchmarks.

Indeed, doing it as part of a benchmark system would be best -- especially since then we can document it and potentially include it in a (new, updated) overview paper of Shogun.

In terms of speed:

  • Shogun in AUTO mode always chooses the mode that gives the lower complexity (in theory).
  • SVD in eigen3 is known to be slow, but more stable. This is not a problem in my eyes, as the difference will always be marginal (and it indeed is in the results). We could add this to linalg and use the same SVD call as scikit does (I suspect LAPACK). But that is not important for differences of this size. A small sketch below shows the two routes give the same result.
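
To make the two routes concrete, here is a small NumPy sketch (synthetic data) showing that eigendecomposition of the covariance matrix and SVD of the centered data matrix recover the same principal direction up to sign:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
Xc = X - X.mean(axis=0)                  # center the data

# Route 1: eigendecomposition of the D x D covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
evd_top = eigvecs[:, -1]                 # direction of largest variance

# Route 2: SVD of the centered N x D data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_top = Vt[0]                          # top right singular vector

# Identical up to sign; this distance should be ~0.
print(min(np.linalg.norm(evd_top - svd_top),
          np.linalg.norm(evd_top + svd_top)))
```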

Finally, the PCA code was tuned during GSoC 2014. This is why it is very fast and reliable. The next step here would be to include these results in a maintainable way so that we can reproduce them without much hassle. If you do this, it could also serve as an example of how to do it. Do you want to go ahead with this?

Once again, this is great, in particular that we are an order of magnitude faster than scikit.

@amoudgl
Contributor Author

amoudgl commented Mar 10, 2016

@karlnapf Thanks for your feedback. I'd be happy to go ahead with this. I wrote a python script that can reproduce these results. How should I include these results in a maintainable way? Should I make a new notebook that compares the two toolkits and illustrates time and memory analysis?

@vigsterkr
Member

@abhinavmoudgil95 see the benchmarking framework referred to above... that's the preferred way.

@karlnapf
Member

@zoq or @rcurtin might have comments

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

@karlnapf @vigsterkr I am using the referred framework for PCA benchmarking, but I am not able to get the output for Shogun. scikit-learn finished all datasets in about 50 minutes, but Shogun has been stuck on the last dataset for about 4 hours and the script is still running! I let the script run for about 5 hours last night but didn't get any output, so I stopped the process and tried again today. I'll let you know if I get any results.

@rcurtin

rcurtin commented Mar 11, 2016

Take a look at the config.yaml and see the block for shogun:

https://github.com/zoq/benchmarks/blob/master/config.yaml#L1381

There are a lot of methods and a lot of datasets being run there. You might want to try instead just running it for PCA (if you aren't already doing that), which you could do like this:

$ make BLOCK=shogun METHODBLOCK=PCA

More information here:
https://github.com/zoq/benchmarks/blob/master/README.md#benchmarking-a-single-libary-and-a-single-method

I hope that helps. It's also possible that you might want to remove a few datasets from the configuration so that it runs faster (some are very large).

@zoq
Contributor

zoq commented Mar 11, 2016

Even if it runs for hours it should, at least, print some output to standard out. Can you show us the command that you used to start the benchmark?

Btw. you can limit the execution time for each benchmark using the timeout parameter.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

@rcurtin Yes you are correct. To test only PCA, I used the following command:
make run LOG=true BLOCK=shogun,scikit METHODBLOCK=PCA

@zoq The script is still running. Since I wanted to save the output in the database, I set LOG=true, so I am not expecting any results on stdout. This is what I have gotten on stdout so far.

I think I'll remove this dataset from configuration file if I don't get output in an hour or so.

@rcurtin

rcurtin commented Mar 11, 2016

Yeah, the yearpredictionmsd dataset is pretty large; 500k+ points in I think 90 dimensions. So I'm not too surprised it's taking a long time.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

Well, let me guess. From what I know, PCA has time complexity O(np^2 + p^3). This dataset has 515345 instances and 90 features, so by the above complexity the total number of operations required is 4175023500. Assuming the machine carries out, on average, 10^7 operations per second, shouldn't it produce output in around 7 minutes? Correct me if I am wrong somewhere.
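
(For reference, the arithmetic in that estimate checks out; whether 10^7 operations per second is a realistic constant for the hardware is a separate question.)

```python
n, p = 515345, 90            # instances, features
ops = n * p**2 + p**3        # operation count from O(n*p^2 + p^3)
print(ops)                   # 4175023500
print(ops / 1e7 / 60)        # ~7.0 minutes at 10^7 operations/second
```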

@rcurtin

rcurtin commented Mar 11, 2016

Big-O notation leaves out constant factors, so you can use it to talk about how the runtime scales as you increase the dataset size, but you can't use it to predict how long a particular implementation will take. Also, the Shogun PCA implementation may use SVD on the data, or it may compute the covariance of the dataset and eigendecompose that, depending on the properties of the data.

@amoudgl
Contributor Author

amoudgl commented Mar 11, 2016

Yes, ignoring constants is not strictly correct, but we can get a rough estimate, right? In PCA, we don't do many extra calculations (i.e. calculations that are constant and independent of N and D) other than the eigendecomposition, the covariance matrix computation, and the matrix multiplication to get the new dataset.
Shogun uses SVD when the number of dimensions (D) is greater than the number of instances (N). Here, it'll use eigendecomposition of the covariance matrix. Earlier, when I did the benchmarking, Shogun was much faster than scikit-learn when N > D (results in this thread above). scikit-learn finished all the datasets in around 45 minutes, so I assumed Shogun would take much less time.

@karlnapf
Member

With big datasets, there are all sorts of weird effects happening on computers due to the way memory is managed. This can reveal further problems we have in the Shogun implementation.

For now, I suggest you start with a single smallish dataset to keep things simple, and then increase the size.

@karlnapf
Member

thanks @rcurtin @zoq for getting back!

@karlnapf
Member

#3054

@karlnapf
Member

#2991

@karlnapf
Member

any updates here?

@karlnapf
Member

Might be nice to explore this further in a GSoC entrance task, especially implementing a faster (LAPACK) SVD.
