
Improve a Shogun algorithm #2991

Open · karlnapf opened this issue Feb 17, 2016 · 34 comments

@karlnapf
Member

Entrance task for the GSoC project "the usual suspects", see here. Also a good entrance task for any other GSoC project.

Many of Shogun's algorithms have problems, especially the more basic ones, and we might not even be aware of all of them. This entrance task is to:

  1. Pick a simple ML algorithm (see below).
  2. Write a script that benchmarks Shogun against scikit-learn or MLPack on a few simple cases (see the sketch after this list). This should address multiple aspects: correctness, speed, memory consumption, robustness, and ease of use.
  3. If Shogun is significantly worse than a competing implementation:
  • Identify bottlenecks in the code. These can be of a statistical nature, implementation problems, bugs, etc.
  • Fix them. We can help you here.
  • Give the code a clean-up (we like easy-to-read code, even if it might not seem like that). Things to consider might be:
    • Avoid calling LAPACK/Eigen directly; use the linalg interface instead (it can be extended for this).
    • Get rid of the old pointer-based feature vector representations and use SGVector and friends instead.
    • ...
  • If you are already touching the code, why not give the interface documentation a bit of love: write it where it doesn't exist, fix typos, make it clearer.
  • Unit test your changes.

Some candidates to start with.
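
To make step 2 concrete, here is a minimal benchmark sketch in Python. It is a sketch only: it assumes the classic modshogun bindings expose RealFeatures, RegressionLabels and LeastSquaresRegression (adjust the names to your build), and it measures just correctness and wall-clock time on one synthetic problem.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from modshogun import RealFeatures, RegressionLabels, LeastSquaresRegression

rng = np.random.RandomState(0)
N, D = 5000, 50
X = rng.randn(N, D)
y = X.dot(rng.randn(D)) + 0.1 * rng.randn(N)

# scikit-learn: one sample per row
t0 = time.time()
skl = LinearRegression().fit(X, y)
t_skl = time.time() - t0

# Shogun: RealFeatures expects one sample per *column*, hence the transpose
feats = RealFeatures(X.T)
t0 = time.time()
sg = LeastSquaresRegression(feats, RegressionLabels(y))
sg.train()
t_sg = time.time() - t0

pred_skl = skl.predict(X)
pred_sg = sg.apply_regression(feats).get_labels()
print("sklearn: %.3fs  RSS=%.4f" % (t_skl, np.sum((y - pred_skl) ** 2)))
print("shogun:  %.3fs  RSS=%.4f" % (t_sg, np.sum((y - pred_sg) ** 2)))
```

A real entrance-task script would repeat this over several problem sizes and also track memory and robustness, as listed above.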

@ibrahim5253

@karlnapf can you please direct me to a specific but very basic issue/bug that I can start with? I found #2987 a bit involved.

@karlnapf
Member Author

Yes, you might want to get rid of the LAPACK calls in KRR.

These can be replaced with Eigen3 calls.

There are also no unit tests here.

In addition, if you feel brave, you might want to add a linear solve to our linalg library. @lambday can give hints here.

@youssef-emad
Contributor

@karlnapf I made this notebook comparing Least Squares Linear Regression in sklearn and in Shogun.

As demonstrated in the notebook, Shogun has no problem with speed, but there is a problem with accuracy, especially when the data is shifted away from zero (e.g. the third example).

The reason is that the solution provided by Shogun's LeastSquareRegression is a linear model with bias 0. As the train_machine method shows, the bias is not taken into consideration and has to be set manually through set_bias().

This can be fixed easily by adding the bias as an additional feature whose value is one across all observations (see the sketch below).
I searched the CFeatures class but didn't find a way to add a new feature to an existing feature set.
So one implementation would be to get the feature matrix from the input CFeatures, copy it into a larger array together with the bias feature, and turn that back into CFeatures, but I think this implementation is a little bit dirty.
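
For illustration, here is a minimal NumPy sketch of the constant-one-feature trick (plain least squares, not Shogun's internals):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 3) + 5.0           # data shifted away from zero
y = X.dot([1.0, -2.0, 0.5]) + 3.0     # true bias of 3.0

# Append a constant-one column so the solver fits the bias as one more weight
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("weights:", w[:-1], "bias:", w[-1])  # bias recovered as ~3.0
```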

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

Nice one, feel free to send a patch with corrections

  • In linear regression, the bias term can be computed from the data means (b = mean(y) - dot(w, mean(X))). You can compute and add it explicitly (see the sketch below). This is a great catch and should be fixed; Shogun should definitely take care of this by default.
  • You could get rid of the ugly LAPACK call cblas_daxpy:
    • either replace it with a linalg solve (for that you would have to add one first, see the Readme),
    • or use Eigen3 for the solve.

Finally, could you test this on a larger dataset in higher dimensions? It can be synthetic, as this is just for speed. Both N and D should be in [small, medium, large], where

  • small: few hundred
  • medium: 2000
  • large: >10000

Make sure the Shogun implementation only ever has one matrix in memory at any point in time (e.g. 10000x10000). You can profile the memory usage as well.
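
A small NumPy sketch of the explicit bias computation mentioned above (centre the data, solve, then recover the bias from the means; this mirrors the standard closed-form treatment, not Shogun's actual code):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(1000, 5) + 2.0
y = X.dot(rng.randn(5)) + 7.0         # true bias of 7.0

x_mean, y_mean = X.mean(axis=0), y.mean()
w, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)
bias = y_mean - x_mean.dot(w)         # bias falls out of the data means
print("bias:", bias)                  # ~7.0
```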

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

BTW the docs state that the bias is set to 0. You can double-check them as well, and clean them up if there are problems.
But even if the docs mention the 0 bias, we should still change the default behaviour.

@youssef-emad
Contributor

@karlnapf I got it. I will push the last edits for the pull request and start working on this issue.

@karlnapf
Member Author

karlnapf commented Mar 1, 2016

Can you also check whether the algorithm works when a preprocessor is attached to the features (see the sketch below)?
There are, for example, zero-mean preprocessors.
BTW, thinking about the bias: we should have a boolean flag in the constructor that allows turning the bias off (e.g. when the mean was already removed), with the default being on.
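
A hedged sketch of what such a check could look like with the classic Python bindings. The preprocessor class PruneVarSubMean and the add_preprocessor/apply_preprocessor calls follow the legacy examples; names may differ across Shogun versions, so treat this as an assumption to verify:

```python
import numpy as np
from modshogun import RealFeatures, PruneVarSubMean

X = np.random.randn(3, 100) + 10.0   # 3 features x 100 samples (one sample per column)
feats = RealFeatures(X)

# Attach a zero-mean preprocessor; training on these features should then
# match training on pre-centred data plus an explicit bias term.
preproc = PruneVarSubMean()
preproc.init(feats)
feats.add_preprocessor(preproc)
feats.apply_preprocessor()
print(feats.get_feature_matrix().mean(axis=1))  # per-feature means, ~0
```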

@karlnapf
Member Author

karlnapf commented Mar 2, 2016

@youssef-emad Could we move this discussion to its own issue? You can just open one if you have updates...

@youssef-emad
Contributor

@karlnapf I don't have updates yet. I got a little busy with school, but I'll start working on it tomorrow.

@Anjan1729

@karlnapf, I am not that experienced in ML. Can you give me something I could start with?

@karlnapf
Member Author

karlnapf commented Mar 4, 2016

@Anjan1729 there is a long list above. Pick any of the ones you like. Easy algorithms are preferred; what about LDA?

@amoudgl
Contributor

amoudgl commented Mar 7, 2016

Hi @karlnapf

I created an IPython notebook which compares Shogun's PCA with other toolkits like scikit-learn and matplotlib, and with my naive Python implementation of PCA. Shogun's PCA does equally well in terms of speed, but the matrix returned by its get_transformation_matrix() method is scaled differently compared to the other standard toolkits. I am studying the Shogun PCA source code to identify bottlenecks and will come up with an update soon. Please give suggestions, if any.
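
One hedged way to narrow down such a scaling discrepancy in plain NumPy/sklearn (not Shogun's code): if both libraries return unit-norm eigenvectors, the components should agree up to sign; a constant per-component ratio instead points to a different normalisation convention (e.g. scaling by the eigenvalues, as whitening does).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 4) @ rng.randn(4, 4)

pca = PCA(n_components=2).fit(X)
V_skl = pca.components_             # rows are unit-norm principal directions

# Naive PCA: eigenvectors of the covariance matrix, sorted by eigenvalue
C = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(C)
V_naive = eigvec[:, ::-1][:, :2].T  # top-2 components as rows

# Entries of +/-1 mean only the sign differs; any other constant per row
# indicates a scaling convention mismatch between the implementations
print(V_skl / V_naive)
```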

@karlnapf
Member Author

karlnapf commented Mar 7, 2016

Hi @abhinavagarwalla Great! Can you open a separate issue for this so we can discuss it there? We should keep this thread clean of discussions.

@sunil-sangwan

Hello @karlnapf
I want to contribute to improving an algorithm, but I like to work in Python. Can you advise me where to start?

@vigsterkr
Member

@shark-s Shogun is a C++ library that also has a Python interface. If you want to work on Shogun, you should get familiar with C++.

@sunil-sangwan

Thanks for the reply.
I also know C++. But can I build models or test algorithms in Python, or do I have to do it in C++?

@vigsterkr
Member

@shark-s you can test in Python, but write it only in C++.

@youssef-emad
Contributor

@karlnapf I made this notebook as a comparison between Shogun and sklearn on multiple regression models (Linear Ridge Regression, Lasso, LARS, KRR).
I used 4 datasets of different sizes and dimensions:
Dataset 1: size 400, dimensions 13
Dataset 2: size 700, dimensions 1
Dataset 3: size 17000, dimensions 20
Dataset 4: size 20000, dimensions 20

Observations:
1- Linear Ridge Regression and KRR: Shogun is faster with approximately the same RSS.

2- Lasso: Shogun is faster, but something seems wrong, as no weight was set to zero and the resulting weights were close to those obtained with Ridge Regression.

3- LARS and Lasso: The notebook kernel crashed on the third dataset. I think this happened because the dataset needed a very large number of iterations to converge. sklearn handles this by setting a default maximum number of iterations of 1000.

As I understand from the docs, to get a Lasso regression model we can use LARS with the parameter lasso=True, but the resulting weights are the same. Did I get it wrong?
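
For reference, a small sklearn-only sketch of the expected behaviour (LARS versus its lasso variant; with the lasso modification and a non-trivial penalty, some coefficients should be driven exactly to zero). The alpha value here is an arbitrary illustration:

```python
import numpy as np
from sklearn.linear_model import Lars, LassoLars

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only the first 3 features carry signal
y = X[:, :3].dot([2.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

lars = Lars().fit(X, y)
lasso = LassoLars(alpha=0.05, max_iter=1000).fit(X, y)

print("LARS nonzero weights:     ", np.sum(lars.coef_ != 0))
print("LassoLars nonzero weights:", np.sum(lasso.coef_ != 0))  # expect ~3
```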

@vigsterkr
Member

@youssef-emad great work! We would really like to have a more maintainable and reproducible way of doing benchmarking, so it'd be great if you could port this idea to this framework:
https://github.com/zoq/benchmarks

In its current form this is just a snapshot of each library, and it'll be hard to re-run against newer releases. You'll also see that there are many other libraries we should include in the comparison.

@youssef-emad
Contributor

@vigsterkr yeah sure, I just have an assignment to deliver in a few hours, so I'll finish it and then check this out 😃

@karlnapf
Member Author

Thanks, very nice work. @vigsterkr is right: these notebooks can only be a first step to see what is going on. Eventually we want scripts that do this in the benchmark platform, so that we can re-run them. But of course the notebooks are a nice way of exploring.

A few words on the benchmarks:

  • Aim for benchmarks that take at least a few seconds to run. Otherwise we only observe the noise of the Python interpreter being fired up; we want actual runtime. One or two short ones are OK to test this overhead, but in general what counts is performance on longer runs.
  • N is large enough, but it can be larger (see the first point). What about larger D?
  • Easy problems usually run faster than hard ones. Try to create harder regression problems: higher dimensions, more noise. Using N=20000, D=3 and linear functions does not represent an interesting problem (see the sketch after this list).
  • Make sure the regulariser is set to the same value in both Shogun and sklearn.
  • If you find bugs (like a crash or a wrong result), always report them in an issue and give a way to reproduce them.
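
A hedged sketch of generating a harder synthetic problem and matching the regulariser across libraries. It assumes the classic LinearRidgeRegression constructor taking the regulariser tau as its first argument; whether sklearn's alpha maps one-to-one onto Shogun's tau depends on each library's cost-function convention, so verify that mapping before trusting any comparison:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from modshogun import RealFeatures, RegressionLabels, LinearRidgeRegression

# Harder problem: high-dimensional and noisy
X, y = make_regression(n_samples=20000, n_features=500, noise=10.0, random_state=0)

tau = 1.0
skl = Ridge(alpha=tau, fit_intercept=False).fit(X, y)

feats = RealFeatures(X.T)   # Shogun: one sample per column
sg = LinearRidgeRegression(tau, feats, RegressionLabels(y))
sg.train()

print("max weight difference:", np.max(np.abs(skl.coef_ - sg.get_w())))
```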

@karlnapf
Member Author

Good luck with the assignment :)

@youssef-emad
Contributor

@karlnapf @vigsterkr I added 4 new benchmarks and made a pull request to the benchmarks repo.

I also made a notebook to compare ease of use and accuracy between Shogun and sklearn for 4 different classifiers (Naive Bayes, KNN, QDA, Logistic Regression) and to check large datasets with larger dimensions.
It seems Shogun has a problem with large dimensions and a large number of classes.

Note: sorry for the delay, I was stuck with some academic commitments.

@karlnapf
Member Author

Great that this finally happened! For the benchmarks, I guess we need to compare against something, e.g. sklearn. As far as I can see, the comparison so far only happens in your notebook?

@karlnapf
Member Author

In the notebook, it would be great if you could label the things you print; otherwise it takes me some time to parse.

@karlnapf
Member Author

Comments:

  • Use larger datasets: timings at microsecond scale have very little meaning, as Python takes longer to start up than the actual computations take to run.
  • You should always make sure that the options are set in such a way that the algorithms do exactly the same thing.

@karlnapf
Member Author

Looks like you identified problems in Shogun's results. These should definitely be investigated; very alarming. I suggest that you add benchmark scripts for sklearn as well. Also, let's isolate one of the algorithms and try to understand why Shogun's results are so different.
Very nice catch btw :)!
But I don't really agree with your conclusion... we'd better find out what's going on there.

@youssef-emad
Contributor

@karlnapf I'll add the benchmarks for sklearn and I'll try to investigate what's going on.

@karlnapf
Member Author

Start with one of the algos maybe. KNN or so

@youssef-emad
Contributor

@karlnapf I found out what was going on. I made a horrible mistake while transforming data to Shogun's format. I fixed this embarrassing mistake and added a visual comparison of decision boundaries for the datasets with 2 features. Check the updated notebook.

New observations:
1- KNN and Naive Bayes: accuracy and decision boundaries are identical for Shogun and sklearn on all datasets.
2- QDA and Logistic Regression: accuracy and decision boundaries are approximately identical on the first 2 datasets, but on the third and fourth datasets (more classes and more dimensions) Shogun's accuracy seems to be lower. I'll try to investigate that.

I also checked the options and ensured that the same options are applied to both Shogun and sklearn.
Final note: sorry again for that mistake.
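
For anyone hitting the same transformation pitfall: Shogun's dense features hold one sample per column, while sklearn expects one sample per row, so a forgotten transpose silently trains on scrambled data. A minimal illustration (class name as in the classic modshogun bindings):

```python
import numpy as np
from modshogun import RealFeatures

X = np.arange(6, dtype=np.float64).reshape(3, 2)  # 3 samples x 2 features (sklearn layout)

wrong = RealFeatures(X)    # Shogun sees 2 samples with 3 features each
right = RealFeatures(X.T)  # transpose first: 3 samples with 2 features each

print(wrong.get_num_vectors(), right.get_num_vectors())  # 2 vs 3
```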

@youssef-emad
Contributor

@karlnapf I investigated QDA, found an issue and solved it. The issue was that the accuracy of Shogun's QDA dropped strongly as soon as the number of features exceeded 2. That happened because of an unnecessary transpose of the rotations matrix; removing this line brought Shogun's accuracy up to the same level as sklearn's. Check this notebook for results before and after the modification.
I also made a PR with this simple change.
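
As a side note on why a stray transpose of a rotation matrix matters: for a non-symmetric matrix R, projecting with R versus R.T gives different coordinates in the rotated space, while norms are preserved either way, which makes this class of bug easy to miss. A tiny NumPy illustration (generic linear algebra, not Shogun's QDA code):

```python
import numpy as np

rng = np.random.RandomState(0)
R, _ = np.linalg.qr(rng.randn(4, 4))  # R is the orthogonal (rotation-like) factor
x = rng.randn(4)

print(np.allclose(R @ x, R.T @ x))    # False: R and R.T rotate differently
print(np.linalg.norm(R @ x), np.linalg.norm(R.T @ x))  # equal norms hide the bug
```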

@Hephaestus12
Contributor

Hi, are there any other regression algorithms left to be analyzed?
As far as I can see, Least Squares Regression, Linear Ridge Regression, Lasso, LARS, and KRR already have scripts and notebooks benchmarking them against sklearn.

If not regression, what other algorithms can I work on?

@Hephaestus12
Contributor

@karlnapf is this issue still relevant? Which algorithms should I pick to analyze next?

@karlnapf
Member Author

I am pretty sure there are algorithms left. For example, you could investigate where Shogun is slower than other libraries using the benchmark framework mentioned above.
