Improve Shogun PCA #3048
Thanks for that! In this state it is not that useful, but we can easily turn it into something that is.
Looking forward to the update |
@karlnapf I did the tests as suggested and the following are the results I obtained (N = number of samples):
Time analysis: [plot]
Memory analysis: [plot]
The Python script for the time and memory analysis can be viewed here. |
thanks a lot for the analysis! i find it a bit weird/interesting & worth investigating why SVD is so slow:
even though there are 10 times fewer vectors, the runtime is almost the same... |
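A quick way to probe the SVD question outside Shogun is to time the two standard PCA routes in NumPy; a minimal sketch on synthetic data (the shapes below are placeholders, not the benchmark's):

```python
import time
import numpy as np

# Synthetic stand-in for the benchmark data: N samples, D dimensions
N, D = 10000, 100
X = np.random.randn(N, D)
Xc = X - X.mean(axis=0)  # PCA works on centered data

t0 = time.time()
np.linalg.svd(Xc, full_matrices=False)  # SVD route, roughly O(N * D^2)
t_svd = time.time() - t0

t0 = time.time()
C = Xc.T.dot(Xc) / (N - 1)              # D x D covariance, O(N * D^2)
np.linalg.eigh(C)                       # symmetric eigensolve, O(D^3)
t_eig = time.time() - t0

print("SVD: %.3fs, eig(cov): %.3fs" % (t_svd, t_eig))
```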
Hi @vigsterkr, I am going through the source code to identify this problem. But I have a silly doubt which is a bit off topic. It is as follows: I commented out one method. So, what is the general approach to testing in the Python modular interface after making a change in a source file, say PCA.cpp? Can you please help? Please correct me if I am wrong somewhere. |
first of all, thanks for the benchmark, but it would be much better to do it as part of the benchmarking framework: it already has support for PCA benchmarking. This way the results are easily reproducible, and the benchmark becomes part of a framework. |
regarding the changes: if you did |
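As a sketch of such a round-trip test, assuming Shogun was rebuilt and reinstalled with the Python modular interface enabled (the data and dimensions below are placeholders):

```python
import numpy as np
from modshogun import RealFeatures, PCA

# Shogun's RealFeatures expect a D x N matrix (features in rows)
data = np.random.randn(5, 100)
feats = RealFeatures(data)

pca = PCA()
pca.set_target_dim(2)    # keep the top two principal components
pca.init(feats)          # trains the preprocessor: exercises PCA.cpp
projected = pca.apply_to_feature_matrix(feats)
print(projected.shape)   # expected: (2, 100)
```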
This is great work! In particular that you included the memory benchmarks. Indeed, doing it as part of a benchmark system would be best -- especially since then we can document it and potentially include it in a (new, updated) overview paper of Shogun. In terms of speed:
Finally, the PCA code was tuned during GSoC 2014. This is why it is very fast and reliable. The next step here would be to include these results in a maintainable way so that we can reproduce them without much hassle. If you do this, it could also serve as an example of how to do it. Do you want to go ahead with this? Once again, this is great, in particular that we are an order of magnitude faster than scikit. |
@abhinavmoudgil95 see the referenced benchmarking framework; that's the preferred way. |
@karlnapf @vigsterkr I am using the referenced framework for PCA benchmarking, but I am not able to get the output for Shogun. Scikit finished all datasets in about 50 minutes, but Shogun has been stuck on the last dataset for about 4 hours and the script is still running! I let the script run for about 5 hours last night but didn't get any output, so I stopped the process and tried again today. I'll let you know if I get any results. |
Take a look at https://github.com/zoq/benchmarks/blob/master/config.yaml#L1381. There are a lot of methods and a lot of datasets being run there. You might want to try running it just for PCA (if you aren't already doing that), which you could do like this:
More information here: I hope that helps. It's also possible that you might want to remove a few datasets from the configuration so that it runs faster (some are very large). |
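For reference, a hypothetical invocation restricting the run to PCA (the `make run` target and the `BLOCK`/`METHODBLOCK` variables are assumptions; check the framework's README for the exact syntax):

```
make run BLOCK=shogun,scikit METHODBLOCK=PCA
```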
Even if it runs for hours it should, at least, print some output to standard output. Btw. you can limit the execution time for each benchmark using the timeout parameter. |
@rcurtin Yes, you are correct. To test only PCA, I used the following command: @zoq The script is still running. Since I wanted to save the output in the database, I set the relevant option. I think I'll remove this dataset from the configuration file if I don't get output in an hour or so. |
Yeah, the yearpredictionmsd dataset is pretty large; 500k+ points in I think 90 dimensions. So I'm not too surprised it's taking a long time. |
Well, let me guess. From what I know, PCA has time complexity of O(D^2 * N + D^3) (covariance computation plus eigendecomposition). |
Big-O notation leaves out constant factors, so you can use it to talk about how the runtime scales as you increase the dataset size, but you can't use it to predict how long a particular implementation will take. I think also the Shogun PCA implementation may either use SVD on the data or calculate the covariance of the dataset and eigendecompose that, depending on the properties of the data. |
Yes, ignoring constants means we won't get exact timings, but we can get a rough estimate, right? In PCA we don't do much extra work (i.e., computations that are constant and independent of N and D) beyond the eigendecomposition, the covariance matrix computation, and the matrix multiplication that produces the new dataset. |
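To put rough numbers on that for the yearpredictionmsd case (shape taken from the comment above; constant factors ignored, as discussed):

```python
# Rough operation counts for covariance-based PCA on yearpredictionmsd
N, D = 515000, 90     # approximate shape of the dataset
cov_ops = N * D * D   # forming the D x D covariance matrix
eig_ops = D ** 3      # eigendecomposition of the covariance
print("covariance: ~%.1e ops, eigensolve: ~%.1e ops" % (cov_ops, eig_ops))
# The covariance step dominates (~4.2e9 vs ~7.3e5), so for fixed D the
# runtime should grow roughly linearly in N.
```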
With big datasets, there are all sorts of weird effects on computers due to the way memory is managed. This can reveal further problems we have in the Shogun implementation. For now, I suggest that you start with a single smallish dataset to keep things simple, and then increase its size. |
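A minimal sketch of that approach, using scikit-learn as a stand-in (swap in the Shogun call to profile it instead):

```python
import time
import numpy as np
from sklearn.decomposition import PCA

# Fix D, grow N, and watch how the runtime scales
D = 50
for N in [1000, 10000, 100000, 500000]:
    X = np.random.randn(N, D)
    t0 = time.time()
    PCA(n_components=10).fit(X)
    print("N=%7d: %.3fs" % (N, time.time() - t0))
```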
any updates here? |
Might be nice to explore this further in a GSoC entrance task, especially doing a faster (LAPACK) SVD |
This IPython notebook compares Shogun PCA with my naive Python implementation of PCA and with other toolkits like scikit-learn and Matplotlib. As per the results in the notebook, Shogun performs equally well in terms of speed.
The result of Shogun PCA is also accurate, except for a few minor issues that should be fixed to match the output of other toolkits: the apply_to_feature_matrix method currently gives scaled output by default, whereas other toolkits don't do so.
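A hypothetical check for the scaling difference described above, assuming a build with the Python modular interface installed (the reference projection applies no extra scaling):

```python
import numpy as np
from modshogun import RealFeatures, PCA

# D x N data, centered along each feature row
data = np.random.randn(3, 50)
centered = data - data.mean(axis=1, keepdims=True)

pca = PCA()
pca.set_target_dim(2)
pca.init(RealFeatures(data))
shogun_proj = pca.apply_to_feature_matrix(RealFeatures(data))

# Reference: project onto the top eigenvectors of the covariance, unscaled
w, V = np.linalg.eigh(np.cov(centered))
ref_proj = V[:, -2:].T.dot(centered)

# If Shogun scales/whitens by default, the per-component variances differ
print(np.var(shogun_proj, axis=1))
print(np.var(ref_proj, axis=1))
```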