
Improve a Shogun algorithm #2991

Open · karlnapf opened this issue Feb 17, 2016 · 34 comments

@karlnapf
Member

Entrance task for the GSoC project "the usual suspects", see here. Also a good entrance task for any other GSoC project.

Many of Shogun's algorithms have problems, especially the more basic ones, and we might not even be aware of all of them. This entrance task is to:

  1. Pick a simple ML algorithm (see below).
  2. Write a script that benchmarks Shogun against scikit-learn or MLPack on a few simple cases (see the sketch after this list). This should address multiple aspects: correctness, speed, memory consumption, robustness, and ease of use.
  3. If Shogun is significantly worse than a competing implementation:
  • Identify bottlenecks in the code. These can be of a statistical nature, implementation problems, bugs, etc.
  • Fix them. We can help you here.
  • Give the code a clean-up (we like easy-to-read code, even if it might not seem like that). Things to consider might be:
    • Avoid calling LAPACK/Eigen directly; use the linalg interface instead (it can be extended for this).
    • Get rid of the old pointer-based feature vector representations and use SGVector and friends instead.
    • ...
  • If you are already touching the code, why not give the interface documentation a bit of love: write it where it doesn't exist, fix typos, make it clearer.
  • Unit test your changes.

Some candidates to start with.
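
To make step 2 concrete, here is a minimal benchmark sketch in Python. It is a sketch only: it assumes the classic modshogun bindings expose RealFeatures, RegressionLabels and LeastSquaresRegression (adjust the names to your build), and it measures just correctness and wall-clock time on one synthetic problem.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from modshogun import RealFeatures, RegressionLabels, LeastSquaresRegression

rng = np.random.RandomState(0)
N, D = 5000, 50
X = rng.randn(N, D)
y = X.dot(rng.randn(D)) + 0.1 * rng.randn(N)

# scikit-learn: one sample per row
t0 = time.time()
skl = LinearRegression().fit(X, y)
t_skl = time.time() - t0

# Shogun: RealFeatures expects one sample per *column*, hence the transpose
feats = RealFeatures(X.T)
t0 = time.time()
sg = LeastSquaresRegression(feats, RegressionLabels(y))
sg.train()
t_sg = time.time() - t0

pred_skl = skl.predict(X)
pred_sg = sg.apply_regression(feats).get_labels()
print("sklearn: %.3fs  RSS=%.4f" % (t_skl, np.sum((y - pred_skl) ** 2)))
print("shogun:  %.3fs  RSS=%.4f" % (t_sg, np.sum((y - pred_sg) ** 2)))
```

A real entrance-task script would repeat this over several problem sizes and also track memory and robustness, as listed above.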

@ibrahim5253

@karlnapf can you please direct me to a specific but very basic issue/bug that I can start with? I found #2987 a bit involved.

@karlnapf
Member Author

Yes, you might want to get rid of the LAPACK calls in KRR.

These can be replaced with Eigen3 calls.

There are also no unit tests here.

In addition, if you feel brave, you might want to add a linear solve to our linalg library. @lambday can give hints here.

@youssef-emad
Contributor

@karlnapf I made this notebook comparing Least Squares Linear Regression in sklearn and in Shogun.

As demonstrated in the notebook, Shogun has no problem with speed, but there is a problem with accuracy, especially when the data is shifted away from zero (e.g. the third example).

The reason is that the solution provided by Shogun's LeastSquareRegression is a linear model with bias 0. As the train_machine method shows, the bias is not taken into consideration and has to be set manually through set_bias().

This can be fixed easily by adding the bias as an additional feature whose value is one across all observations (see the sketch below).
I searched the CFeatures class but didn't find a way to add a new feature to an existing feature set.
So one implementation would be to get the feature matrix from the input CFeatures, copy it into a larger array together with the bias feature, and turn that back into CFeatures, but I think this implementation is a little bit dirty.
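
For illustration, here is a minimal NumPy sketch of the constant-one-feature trick (plain least squares, not Shogun's internals):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) + 5.0           # data shifted away from zero
y = X.dot([1.0, -2.0, 0.5]) + 3.0     # true bias of 3.0

# Append a constant-one column so the solver fits the bias as one more weight
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("weights:", w[:-1], "bias:", w[-1])  # bias recovered as ~3.0
```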

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

Nice one, feel free to send a patch with corrections

  • In linear regression, the bias term can be computed from the data means (b = mean(y) - dot(w, mean(X))). You can compute and add it explicitly (see the sketch below). This is a great catch and should be fixed; Shogun should definitely take care of this by default.
  • You could get rid of the ugly LAPACK call cblas_daxpy:
    • either replace it with a linalg solve (for that you would have to add one first, see the Readme),
    • or use Eigen3 for the solve.

Finally, could you test this on a larger dataset in higher dimensions? It can be synthetic, as this is just for speed. Both N and D should be in [small, medium, large], where

  • small: few hundred
  • medium: 2000
  • large: >10000

Make sure the Shogun implementation only ever has one matrix in memory at any point in time (e.g. 10000x10000). You can profile the memory usage as well.
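
A small NumPy sketch of the explicit bias computation mentioned above (centre the data, solve, then recover the bias from the means; this mirrors the standard closed-form treatment, not Shogun's actual code):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(1000, 5) + 2.0
y = X.dot(rng.randn(5)) + 7.0         # true bias of 7.0

x_mean, y_mean = X.mean(axis=0), y.mean()
w, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)
bias = y_mean - x_mean.dot(w)         # bias falls out of the data means
print("bias:", bias)                  # ~7.0
```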

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

BTW the docs state that the bias is set to 0. You can double-check them as well, and clean them up if there are problems.
But even if the docs mention the 0 bias, we should still change the default behaviour.

@youssef-emad
Contributor

@karlnapf I got it. I will push the last edits for the pull request and start working on this issue.

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

Can you also check whether the algorithm works when a preprocessor is attached to the features (see the sketch below)?
There are, for example, zero-mean preprocessors.
BTW, thinking about the bias: we should have a boolean flag in the constructor that allows turning the bias off (e.g. when the mean was already removed), with the default being on.
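
A hedged sketch of what such a check could look like with the classic Python bindings. The preprocessor class PruneVarSubMean and the add_preprocessor/apply_preprocessor calls follow the legacy examples; names may differ across Shogun versions, so treat this as an assumption to verify:

```python
import numpy as np
from modshogun import RealFeatures, PruneVarSubMean

X = np.random.randn(3, 100) + 10.0   # 3 features x 100 samples (one sample per column)
feats = RealFeatures(X)

# Attach a zero-mean preprocessor; training on these features should then
# match training on pre-centred data plus an explicit bias term.
preproc = PruneVarSubMean()
preproc.init(feats)
feats.add_preprocessor(preproc)
feats.apply_preprocessor()
print(feats.get_feature_matrix().mean(axis=1))  # per-feature means, ~0
```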

@karlnapf
Member Author

karlnapf commented Mar 2, 2016

@youssef-emad Could we move this discussion to its own issue? You can just open one if you have updates...

@youssef-emad
Contributor

@karlnapf I don't have updates yet. I got a little busy with school, but I'll start working on it tomorrow.

@Anjan1729

@karlnapf, I am not that experienced in ML. Can you give me something I could start with?

@karlnapf
Member Author

karlnapf commented Mar 4, 2016

@Anjan1729 there is a long list above. Pick any of the ones you like. Easy algorithms are preferred; what about LDA?

@amoudgl
Contributor

amoudgl commented Mar 7, 2016

Hi @karlnapf

I created an IPython notebook which compares Shogun's PCA with other toolkits like scikit-learn and matplotlib, and with my naive Python implementation of PCA. Shogun's PCA does equally well in terms of speed, but the matrix returned by its get_transformation_matrix() method is scaled differently compared to the other standard toolkits. I am studying the Shogun PCA source code to identify bottlenecks and will come up with an update soon. Please give suggestions, if any.
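
One hedged way to narrow down such a scaling discrepancy in plain NumPy/sklearn (not Shogun's code): if both libraries return unit-norm eigenvectors, the components should agree up to sign; a constant per-component ratio instead points to a different normalisation convention (e.g. scaling by the eigenvalues, as whitening does).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 4) @ rng.randn(4, 4)

pca = PCA(n_components=2).fit(X)
V_skl = pca.components_             # rows are unit-norm principal directions

# Naive PCA: eigenvectors of the covariance matrix, sorted by eigenvalue
C = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(C)
V_naive = eigvec[:, ::-1][:, :2].T  # top-2 components as rows

# Entries of +/-1 mean only the sign differs; any other constant per row
# indicates a scaling convention mismatch between the implementations
print(V_skl / V_naive)
```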

@karlnapf
Member Author

karlnapf commented Mar 7, 2016

Hi @abhinavagarwalla Great! Can you open a separate issue for this so we can discuss it there? We should keep this thread clean of discussions.

@sunil-sangwan

Hello @karlnapf
I want to contribute to improving an algorithm, but I like to work in Python. Can you advise me where to start?

@vigsterkr
Member

@shark-s Shogun is a C++ library that also has a Python interface. If you want to work on Shogun, you should get familiar with C++.

@sunil-sangwan

Thanks for the reply.
I also know C++. But can I build models or test algorithms in Python, or do I have to do it in C++?

@vigsterkr
Member

@shark-s you can test in Python, but write it only in C++.

@youssef-emad
Contributor

@karlnapf I made this notebook as a comparison between Shogun and sklearn on multiple regression models (Linear Ridge Regression, Lasso, LARS, KRR).
I used 4 datasets of different sizes and dimensions:
Dataset 1: size 400, dimensions 13
Dataset 2: size 700, dimensions 1
Dataset 3: size 17000, dimensions 20
Dataset 4: size 20000, dimensions 20

Observations:
1- Linear Ridge Regression and KRR: Shogun is faster with approximately the same RSS.

2- Lasso: Shogun is faster, but something seems wrong, as no weight was set to zero and the resulting weights were close to those obtained with Ridge Regression.

3- LARS and Lasso: The notebook kernel crashed on the third dataset. I think this happened because the dataset needed a very large number of iterations to converge. sklearn handles this by setting a default maximum number of iterations of 1000.

As I understand from the docs, to get a Lasso regression model we can use LARS with the parameter lasso=True, but the resulting weights are the same. Did I get it wrong?
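
For reference, a small sklearn-only sketch of the expected behaviour (LARS versus its lasso variant; with the lasso modification and a non-trivial penalty, some coefficients should be driven exactly to zero). The alpha value here is an arbitrary illustration:

```python
import numpy as np
from sklearn.linear_model import Lars, LassoLars

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only the first 3 features carry signal
y = X[:, :3].dot([2.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

lars = Lars().fit(X, y)
lasso = LassoLars(alpha=0.05, max_iter=1000).fit(X, y)

print("LARS nonzero weights:     ", np.sum(lars.coef_ != 0))
print("LassoLars nonzero weights:", np.sum(lasso.coef_ != 0))  # expect ~3
```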

@vigsterkr
Member

@youssef-emad great work! We would really like to have a more maintainable and reproducible way of doing benchmarking, so it'd be great if you could port this idea to this framework:
https://github.com/zoq/benchmarks

In its current form this is just a snapshot of each library, and it'll be hard to re-run against newer releases. You'll also see that there are many other libraries we should include in the comparison.

@youssef-emad
Contributor

@vigsterkr yeah sure, I just have an assignment to deliver in a few hours, so I'll finish it and then check this out 😃

@karlnapf
Member Author

Thanks, very nice work. @vigsterkr is right: these notebooks can only be a first step to see what is going on. Eventually we want scripts that do this in the benchmark platform, so that we can re-run them. But of course the notebooks are a nice way of exploring.

A few words on the benchmarks:

  • Aim for benchmarks that take at least a few seconds to run. Otherwise we only observe the noise of the Python interpreter being fired up; we want actual runtime. One or two short ones are OK to test this overhead, but in general what counts is performance on longer runs.
  • N is large enough, but it can be larger (see the first point). What about larger D?
  • Easy problems usually run faster than hard ones. Try to create harder regression problems: higher dimensions, more noise. Using N=20000, D=3 and linear functions does not represent an interesting problem (see the sketch after this list).
  • Make sure the regulariser is set to the same value in both Shogun and sklearn.
  • If you find bugs (like a crash or a wrong result), always report them in an issue and give a way to reproduce them.
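
A hedged sketch of generating a harder synthetic problem and matching the regulariser across libraries. It assumes the classic LinearRidgeRegression constructor taking the regulariser tau as its first argument; whether sklearn's alpha maps one-to-one onto Shogun's tau depends on each library's cost-function convention, so verify that mapping before trusting any comparison:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from modshogun import RealFeatures, RegressionLabels, LinearRidgeRegression

# Harder problem: high-dimensional and noisy
X, y = make_regression(n_samples=20000, n_features=500, noise=10.0, random_state=0)

tau = 1.0
skl = Ridge(alpha=tau, fit_intercept=False).fit(X, y)

feats = RealFeatures(X.T)   # Shogun: one sample per column
sg = LinearRidgeRegression(tau, feats, RegressionLabels(y))
sg.train()

print("max weight difference:", np.max(np.abs(skl.coef_ - sg.get_w())))
```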

@karlnapf
Member Author

Good luck with the assignment :)

@youssef-emad
Contributor

@karlnapf @vigsterkr I added 4 new benchmarks and made a pull request to the benchmarks repo.

I also made a notebook to compare ease of use and accuracy between Shogun and sklearn for 4 different classifiers (Naive Bayes, KNN, QDA, Logistic Regression) and to check large datasets with larger dimensions.
It seems Shogun has a problem with large dimensions and a large number of classes.

Note: sorry for the delay, I was stuck with some academic commitments.

@karlnapf
Member Author

Great that this finally happened! For the benchmarks, I guess we need to compare against something, e.g. sklearn. As far as I can see, the comparison so far only happens in your notebook?

@karlnapf
Member Author

In the notebook, it would be great if you could label the things you print; otherwise it takes me some time to parse.

@karlnapf
Member Author

Comments:

  • Use larger datasets: timings at microsecond scale have very little meaning, as Python takes longer to start up than the actual computations take to run.
  • You should always make sure that the options are set in such a way that the algorithms do exactly the same thing.

@karlnapf
Member Author

Looks like you identified problems in Shogun's results. These should definitely be investigated; very alarming. I suggest that you add benchmark scripts for sklearn as well. Also, let's isolate one of the algorithms and try to understand why Shogun's results are so different.
Very nice catch btw :)!
But I don't really agree with your conclusion... we'd better find out what's going on there.

@youssef-emad
Contributor

@karlnapf I'll add the benchmarks for sklearn and I'll try to investigate what's going on.

@karlnapf
Member Author

Start with one of the algos maybe. KNN or so

@youssef-emad
Contributor

@karlnapf I found out what was going on. I made a horrible mistake while transforming data to Shogun's format. I fixed this embarrassing mistake and added a visual comparison of decision boundaries for the datasets with 2 features. Check the updated notebook.

New observations:
1- KNN and Naive Bayes: accuracy and decision boundaries are identical for Shogun and sklearn on all datasets.
2- QDA and Logistic Regression: accuracy and decision boundaries are approximately identical on the first 2 datasets, but on the third and fourth datasets (more classes and more dimensions) Shogun's accuracy seems to be lower. I'll try to investigate that.

I also checked the options and ensured that the same options are applied to both Shogun and sklearn.
Final note: sorry again for that mistake.
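
For anyone hitting the same transformation pitfall: Shogun's dense features hold one sample per column, while sklearn expects one sample per row, so a forgotten transpose silently trains on scrambled data. A minimal illustration (class name as in the classic modshogun bindings):

```python
import numpy as np
from modshogun import RealFeatures

X = np.arange(6, dtype=np.float64).reshape(3, 2)  # 3 samples x 2 features (sklearn layout)

wrong = RealFeatures(X)    # Shogun sees 2 samples with 3 features each
right = RealFeatures(X.T)  # transpose first: 3 samples with 2 features each

print(wrong.get_num_vectors(), right.get_num_vectors())  # 2 vs 3
```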

@youssef-emad
Contributor

@karlnapf I investigated QDA, found an issue and solved it. The issue was that the accuracy of Shogun's QDA dropped strongly as soon as the number of features exceeded 2. That happened because of an unnecessary transpose of the rotations matrix; removing this line brought Shogun's accuracy up to the same level as sklearn's. Check this notebook for results before and after the modification.
I also made a PR with this simple change.
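
As a side note on why a stray transpose of a rotation matrix matters: for a non-symmetric matrix R, projecting with R versus R.T gives different coordinates in the rotated space, while norms are preserved either way, which makes this class of bug easy to miss. A tiny NumPy illustration (generic linear algebra, not Shogun's QDA code):

```python
import numpy as np

rng = np.random.RandomState(0)
R, _ = np.linalg.qr(rng.randn(4, 4))  # R is the orthogonal (rotation-like) factor
x = rng.randn(4)

print(np.allclose(R @ x, R.T @ x))    # False: R and R.T rotate differently
print(np.linalg.norm(R @ x), np.linalg.norm(R.T @ x))  # equal norms hide the bug
```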

@Hephaestus12
Contributor

Hi, are there any other regression algorithms left to be analyzed?
As far as I can see, Least Squares Regression, Linear Ridge Regression, Lasso, LARS, and KRR already have scripts and notebooks benchmarking them against sklearn.

If not regression, what other algorithms can I work on?

@Hephaestus12
Contributor

@karlnapf is this issue still relevant? Which algorithms should I pick to analyze next?

@karlnapf
Member Author

I am pretty sure there are algorithms left. For example, you could investigate where Shogun is slower than other libraries using the benchmark framework mentioned above.
