Improve a Shogun algorithm #2991
@karlnapf I made this notebook for a comparison between Least Squares Linear Regression in sklearn and in Shogun. As demonstrated in the notebook, Shogun has no problem with speed, but there is a problem with accuracy, especially when the data is shifted away from zero (e.g. the third example). The reason is that the solution provided by Shogun's LeastSquareRegression is a linear method with bias 0: as appears in the train_machine method, the bias is not taken into consideration and has to be set manually through set_bias(). This can be fixed easily by adding the bias as an additional feature whose value is one for every observation.
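A minimal numpy sketch of the workaround described above (appending a constant feature of ones so that the learned weight for it plays the role of the bias); data and variable names are illustrative, not from the notebook:

```python
import numpy as np

# Synthetic data shifted away from zero, where a forced zero bias hurts accuracy
rng = np.random.RandomState(0)
X = rng.randn(200, 3) + 5.0                 # observations in rows
y = X @ np.array([1.5, -2.0, 0.5]) + 10.0   # true bias of 10

# Append a constant feature of ones; its learned weight is the bias
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Ordinary least squares on the augmented matrix
w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
weights, bias = w[:-1], w[-1]
print(weights, bias)  # bias should come out close to 10
```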
Nice one, feel free to send a patch with corrections
Finally, could you test this on a larger dataset in higher dimensions? It can be synthetic, as this is just for speed. Both N and D should be in [small, medium, large] where
Make sure the Shogun implementation at any point in time only has one matrix in memory (e.g. a 10000x10000 one). You can profile the memory usage as well.
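One hedged way to do the memory check from Python is the standard library's tracemalloc; the workload below is a stand-in, and the N, D values are placeholders to scale up:

```python
import tracemalloc
import numpy as np

N, D = 10000, 1000  # scale toward 10000x10000 as memory allows
X = np.random.randn(N, D)
y = np.random.randn(N)

tracemalloc.start()
# ... fit the Shogun (or sklearn) model under test here ...
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # stand-in workload
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Caveat: tracemalloc only sees allocations routed through Python's
# tracked allocators (numpy arrays are tracked in recent numpy versions);
# Shogun's C++-side allocations need an external profiler, e.g. valgrind massif.
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```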
BTW the docs state that the bias is set to 0. You can double check them as well, and clean up if there are problems.
@karlnapf I got it. I will push the last edits for the pull request and start working on this issue.
Can you also check whether the algorithm works when a preprocessor is attached to the features?
@youssef-emad Could we move the discussion into its own issue? You can just open one if you have updates...
@karlnapf I don't have updates yet. I got a little bit busy with school but I'll start working on it by tomorrow.
@karlnapf, I am not that experienced in ML. Can you give me something I could start with?
@Anjan1729 there is a long list above. Pick any of the ones you like. Easy algorithms preferred, what about LDA?
Hi @karlnapf I created an IPython notebook which compares Shogun PCA with other toolkits like scikit-learn and matplotlib, and with my naive Python implementation of PCA. Shogun PCA performs equally well in terms of speed, but the matrix returned by Shogun's get_transformation_matrix() method is scaled differently compared to the other standard toolkits. I am studying the Shogun PCA source code to identify any bottlenecks and will soon come up with an update. Please give suggestions, if any.
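To make the scaling question concrete, here is a hedged sketch comparing sklearn's components with a naive eigendecomposition PCA like the one mentioned above; a differing convention (e.g. eigenvectors multiplied by singular values rather than unit-norm) would explain a scaled transformation matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 4)

# sklearn: components_ are unit-norm eigenvectors (one per row)
pca = PCA(n_components=2).fit(X)

# Naive PCA: eigenvectors of the covariance matrix (one per column)
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evecs = evecs[:, ::-1][:, :2]  # sort by descending eigenvalue

# Both conventions here give unit-norm directions; a toolkit whose
# transformation matrix has non-unit column norms is scaling them.
print(np.linalg.norm(pca.components_, axis=1))  # ~[1., 1.]
print(np.linalg.norm(evecs, axis=0))            # ~[1., 1.]
```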
Hi @abhinavagarwalla Great! Can you open a separate issue for this and we discuss there? We should keep this thread clean of discussions.
Hello @karlnapf
@shark-s Shogun is a C++ library that has a Python interface (as well). If you want to work on Shogun you should get familiar with C++.
Thanks for the reply.
@shark-s You can test in Python but write it only in C++.
@karlnapf I made this notebook as a comparison between Shogun and sklearn on multiple regression models (Linear Ridge Regression, Lasso, LARS, KRR). Observations:
2- Lasso: Shogun is faster, but it seems like something is wrong, as no weight was set to zero and the resulting weights were close to those achieved using Ridge Regression.
3- LARS and Lasso: The notebook kernel crashed on the third dataset. I think this happened because the dataset needed a very large number of iterations to converge. This is handled in sklearn by setting a default maximum number of iterations of 1000. As I understand from the docs, to get a Lasso regression model we can use LARS with parameter
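For reference, a small sketch of the expected Lasso behavior in sklearn (some coefficients driven exactly to zero, unlike Ridge), using sklearn's default max_iter=1000 that the observation above mentions; the data and alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# Only the first three features carry signal
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=1000).fit(X, y)  # 1000 is sklearn's default

print(np.sum(ridge.coef_ == 0))  # typically 0: Ridge shrinks but rarely zeros
print(np.sum(lasso.coef_ == 0))  # several exact zeros expected
```

If a Lasso implementation produces weights indistinguishable from Ridge's and no exact zeros, that is a strong hint the L1 path is not actually being used.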
@youssef-emad great work! We would really like to have a more maintainable and reproducible way of doing benchmarking, so it'd be great if you could port this idea to this framework: in its current form this is just a snapshot of each library, and it'll be hard to re-run against newer releases. As you'll see, there are also many other libraries we should include in the comparison.
@vigsterkr yeah sure, I just have an assignment to deliver in a few hours, so I'll finish it and check this out 😃
Thanks, very nice work. @vigsterkr is right, these notebooks can only be a first step to see what is going on. Eventually we want the scripts to do this in the benchmark platform -- so that we can re-run them. But of course the notebooks are a nice way of exploring. A few words on the benchmarks:
Good luck with the assignment :)
@karlnapf @vigsterkr I added 4 new benchmarks and made a pull request at the benchmarks repo. I also made a notebook to compare the ease of use and accuracy between Shogun and sklearn for 4 different classifiers (Naive Bayes, KNN, QDA, Logistic Regression) and to check larger datasets with higher dimensions. Note: Sorry for the delay, I was stuck with some academic commitments.
Great that this finally happened! For the benchmarks, I guess we need to compare against something, e.g. sklearn. As far as I can see, the comparison so far only happens in your notebook?
In the notebook, it would be great if you could label the things you print; otherwise it takes me some time to parse.
Comments:
Looks like you identified problems in Shogun's results. These should definitely be investigated; very alarming. I suggest that you put up benchmark scripts for sklearn as well. Also, let's isolate one of the algorithms and try to understand why Shogun's results are so different.
@karlnapf I'll add the benchmarks for sklearn and I'll try to investigate what's going on.
Start with one of the algos maybe. KNN or so.
@karlnapf I found out what was going on. I made a horrible mistake while transforming data to Shogun's format. I fixed this embarrassing mistake and added a visual comparison of decision boundaries for datasets with 2 features. Check the updated notebook. New observations: I also checked the options and ensured that the same options are applied to both Shogun and sklearn.
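A data-format mistake like the one above is easy to make because Shogun stores feature matrices column-wise (one example per column) while sklearn expects one example per row. A minimal sketch of the conversion, assuming the classic modshogun Python bindings; the import path differs across Shogun versions (newer releases use `import shogun`):

```python
import numpy as np
# Assumption: classic Python bindings; adjust the import for your Shogun version
from modshogun import RealFeatures

X = np.random.randn(100, 5)   # sklearn convention: rows are examples
# Shogun convention: columns are examples, so transpose before wrapping.
# Forgetting this transpose silently trains on garbage features.
feats = RealFeatures(np.ascontiguousarray(X.T))
```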
@karlnapf I investigated QDA, found an issue and solved it. The issue was that the accuracy of Shogun's QDA drops sharply once the number of features exceeds 2. That happened because of an unnecessary transpose of the rotations matrix. Removing this line brought Shogun's accuracy up to the same level as sklearn's. Check this notebook for results before and after the modification.
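A small numpy illustration of why a stray transpose on a rotation matrix matters; this mirrors the kind of per-class whitening QDA performs, not Shogun's actual code. With Sigma = U diag(s) U^T, the whitened residual is diag(s^-1/2) U^T (x - mu), and substituting U for U^T changes the distance whenever U is not symmetric:

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(4, 4)
Sigma = A @ A.T + np.eye(4)      # a class covariance matrix
mu, x = rng.randn(4), rng.randn(4)

s, U = np.linalg.eigh(Sigma)     # Sigma = U @ diag(s) @ U.T
d = x - mu

# Correct Mahalanobis distance via whitening
z = (U.T @ d) / np.sqrt(s)
correct = z @ z

# With the rotation accidentally transposed (U instead of U.T)
z_bad = (U @ d) / np.sqrt(s)
wrong = z_bad @ z_bad

print(correct, wrong)                  # differ unless U is symmetric
print(d @ np.linalg.solve(Sigma, d))   # reference value, equals `correct`
```

With 2 features the rotation is close to a simple sign/angle flip, so the damage can stay small; in higher dimensions the distances, and hence the class decisions, go wrong, matching the observed accuracy drop.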
Hi, are there any other regression algorithms left to be analyzed? If not regression, what other algorithms can I work on?
@karlnapf is this issue still relevant? Which algorithms should I pick to analyze next?
I am pretty sure there are algorithms left. For example, you could investigate where Shogun is slower than other libraries using the benchmark library mentioned above.
Entrance task for the GSoC project "the usual suspects", see here. Also a good entrance task for any other GSoC project.
Many of Shogun's algorithms have problems, especially the more basic ones. We might not even know about this. This entrance task is to pick an algorithm, compare it against other implementations, and clean up its code: port it to the linalg interface (can be extended through this) and use SGVector and friends instead. Some candidates to start with: