
GSoC_2018_project_efficient_ml

Ryan Curtin edited this page Jan 21, 2018 · 8 revisions

Fundamental Machine Learning Algorithms III: Finding the bad guys

... continuing from last year

We are continuing the highly popular project of the last years: the aim is to improve our implementations of fundamental ML algorithms. As this year's focus is on user experience with Shogun, we focus on finding the bad guys. Who are the bad guys? Those are implementations of algorithms in Shogun that are embarrassingly bad in at least one of: runtime, memory efficiency, code style, API, documentation ... we don't want to embarrass ourselves ;)

While we don't need Shogun to be the fastest/best/prettiest library in all tasks, it at least should not suck. This project is about identifying and fixing all those "bad guys".

Mentors

Difficulty & Requirements

Medium to difficult. You need to dig into existing code, and you will need:

  • ML Algorithms in C++
  • Re-factoring existing code / design patterns
  • Knowledge of basic ML
  • Basic Linear Algebra, Shogun's linalg framework
  • Experience with other ML toolkits (preferably Python, such as scikit-learn, or C++, such as mlpack)
  • Desirable: Experience with the benchmarking system

Details

Here are some examples of what topics should be covered.

Runtime

Have a look at benchmark comparisons of Shogun with other libraries at mlpack's benchmarking framework. You will notice that sometimes Shogun does quite well, like for KMeans:

dataset           mlpy      scikit   shogun   weka       mlpack
corel-histogram   3.59s     0.73s    1.11s    19.43s     1.92s
mnist             119.83s   46.13s   16.02s   1558.07s   61.35s

On the other hand, there are situations that are less than optimal, like linear regression, where Shogun fails on both datasets:

dataset   mlpy      scikit   shogun    weka     mlpack
arcene    failure   0.24s    failure   3.16s    0.42s
cosExp    0.13s     0.08s    failure   17.42s   0.13s

Another one is linear ridge regression, where Shogun is extremely slow:

dataset   scikit   shogun
webpage   1.94s    >9000s

Again, we don't need Shogun to be the fastest candidate everywhere. We just don't want it to be the slowest by far.
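To get a feel for how entries like the ones in the tables above come about, here is a minimal timing sketch in plain Python. It is not mlpack's benchmarking framework, and the names time_run and format_cell are made up for this illustration. It only shows the idea: run a callable a few times, keep the best wall-clock time, and record "failure" if it raises.

```python
import time


def time_run(fn, repeats=3):
    """Return the best wall-clock time of fn() in seconds, or "failure".

    Hypothetical helper for illustration; the real benchmark framework
    has its own configuration and timeout handling.
    """
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            fn()
        except Exception:
            # Any exception is reported the way the tables report a crash.
            return "failure"
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


def format_cell(result):
    """Format a result as a table cell: seconds with two decimals, or the failure marker."""
    return result if isinstance(result, str) else f"{result:.2f}s"


# Usage: two stand-in "implementations" of the same task.
def slow_sum():
    total = 0
    for i in range(100_000):
        total += i
    return total


def fast_sum():
    return sum(range(100_000))


for name, fn in [("slow", slow_sum), ("fast", fast_sum)]:
    print(name, format_cell(time_run(fn)))
```

Taking the best of several runs reduces noise from other processes; a real comparison would also fix the dataset, the parameters, and the hardware.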

Awkward API

Example: have a look at GMM. It has three train methods, awkward methods like get_nth_mean, multiple methods to apply it (::cluster, ::get_likelihood_example), etc. A first step would be to give the methods sensible names, or to remove them (we have tags, so there is no need for getters/setters anymore). Next, GMM is nothing but an unsupervised learning algorithm, so it should support the standard interface (fit, predict) rather than offer its own methods. Finally, GMM is also a distribution that can be sampled from, so it should be possible to turn it into an API that supports sampling.
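As a rough picture of the target shape described above (one fit, one predict, plus sampling), here is a toy 1-D Gaussian mixture in plain Python. This is a sketch, not Shogun's GMM and not a proposal for its exact signatures: the class name ToyGMM1D, its parameters, and the quantile-based initialisation are all assumptions made for the example.

```python
import math
import random


class ToyGMM1D:
    """Hypothetical 1-D Gaussian mixture with a small, uniform interface."""

    def __init__(self, n_components=2, n_iter=50, seed=0):
        self.n_components = n_components
        self.n_iter = n_iter
        self._rng = random.Random(seed)
        self.weights, self.means, self.variances = [], [], []

    @staticmethod
    def _pdf(x, mean, var):
        return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

    def _responsibilities(self, x):
        # Posterior probability of each component given the point x.
        dens = [w * self._pdf(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.variances)]
        total = sum(dens) or 1e-300  # guard against underflow to zero
        return [d / total for d in dens]

    def fit(self, data):
        """Estimate weights, means, and variances with plain EM."""
        k = self.n_components
        data = sorted(data)
        n = len(data)
        # Initialise the means at evenly spaced quantiles of the data.
        self.means = [data[(2 * j + 1) * n // (2 * k)] for j in range(k)]
        centre = sum(data) / n
        spread = sum((x - centre) ** 2 for x in data) / n or 1.0
        self.variances = [spread] * k
        self.weights = [1.0 / k] * k
        for _ in range(self.n_iter):
            resp = [self._responsibilities(x) for x in data]  # E-step
            for j in range(k):                                # M-step
                nj = sum(r[j] for r in resp) or 1e-12
                self.weights[j] = nj / n
                self.means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - self.means[j]) ** 2
                          for r, x in zip(resp, data)) / nj
                self.variances[j] = max(var, 1e-6)  # keep variances positive
        return self

    def predict(self, data):
        """Return the index of the most likely component for each point."""
        labels = []
        for x in data:
            r = self._responsibilities(x)
            labels.append(r.index(max(r)))
        return labels

    def sample(self, n):
        """Draw n points from the fitted mixture."""
        comps = self._rng.choices(range(self.n_components),
                                  weights=self.weights, k=n)
        return [self._rng.gauss(self.means[j], math.sqrt(self.variances[j]))
                for j in comps]


# Usage: one way to train, one way to apply, one way to sample.
model = ToyGMM1D(n_components=2).fit([-0.5, 0.0, 0.5, 9.5, 10.0, 10.5])
print(model.predict([0.2, 9.8]))
print(model.sample(3))
```

The point is the surface area: the fitted parameters are plain attributes, so nothing like a per-index getter is needed, and a user who knows fit/predict from another toolkit can use the class without reading its documentation first.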

We actually wrote some API desiderata for the user experience project, which overlaps with this project in terms of API. Think of it as a pipeline: you identify a bad API and describe how it should look instead, the person working on the user experience project implements the basics that make your changes possible, and then you change the algorithm.

Documentation issues

Some bad examples:

You get the point...

First steps

  • Increase coverage of Shogun in the benchmark framework. Ideally all algorithms in the framework should be populated with Shogun results
  • Make a priority list of algorithms where Shogun doesn't do well: runtime & memory
  • Make a list of badly documented or undocumented algorithm classes (missing @brief, one-sentence docs)
  • Make a list of algorithms with awkward API
  • Take a single instance and work on it until things are better.
  • Whenever you touch the internals, make sure to also polish: linalg usage, API, class design
  • Work on a one-by-one basis
  • Whenever you improve something, make sure to provide a "before-after" comparison.
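The list of undocumented classes from the steps above can be bootstrapped with a small script. The sketch below scans C++ header text for class definitions whose preceding Doxygen comment is absent or lacks a brief tag. It is a rough heuristic under stated assumptions: the regular expression ignores export macros, and it also matches "enum class" and classes mentioned inside comments. The class names in SAMPLE_HEADER are invented for the example.

```python
import re

# An optional /** ... */ block, an optional template line, then a class
# definition (a "{" is required, so forward declarations are skipped).
CLASS_RE = re.compile(
    r"(?P<doc>/\*\*(?:(?!\*/).)*\*/\s*)?"
    r"(?:template\s*<[^>]*>\s*)?"
    r"\bclass\s+(?P<name>\w+)\s*(?::[^{;]*)?\{",
    re.DOTALL,
)


def undocumented_classes(header_text):
    """Return names of classes with no Doxygen comment or no brief tag."""
    missing = []
    for match in CLASS_RE.finditer(header_text):
        doc = match.group("doc") or ""
        # Doxygen accepts both @brief and \brief.
        if "@brief" not in doc and "\\brief" not in doc:
            missing.append(match.group("name"))
    return missing


# Hypothetical header content, not taken from the Shogun sources.
SAMPLE_HEADER = """
/** @brief A well documented model. */
class CGoodModel : public CBase
{
};

/** Some words but no brief tag. */
class CNoBrief
{
};

class CBare {
};

class CForward;
"""

print(undocumented_classes(SAMPLE_HEADER))
```

Running something like this over all headers gives a first candidate list; one-sentence docs that do carry a brief tag still need a human pass.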

Why this is cool

This project offers the chance to learn about many fundamental ML algorithms from a practical perspective, with a focus on usability and efficiency. As we want to start with important algorithms first, it is likely that many people will use (and appreciate) code that you wrote.

Useful resources
