Using C++11s Distributions in Shogun #1998

Closed
cameo54321 opened this issue Mar 14, 2014 · 8 comments

Comments

@cameo54321

The recent addition in shogun for a probability distribution was:
http://www.shogun-toolbox.org/doc/en/3.0.0/classshogun_1_1CProbabilityDistribution.html
with implementation
http://www.shogun-toolbox.org/doc/en/3.0.0/classshogun_1_1CGaussianDistribution.html

The above implementation uses the SIMD-oriented Fast Mersenne Twister (SFMT) pseudorandom number generator for random number generation, and Eigen3 to turn these samples into draws from the Gaussian distribution.

The C++11 pseudo-random number library (as pointed out by @vigsterkr) has many built-in distributions and random number generators which can be used for generating distributions in Shogun. So instead of implementing each distribution ourselves, can we use C++11's distributions?
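
For illustration, a minimal sketch of what sampling with the C++11 `<random>` header looks like. The helper name `sample_normal` is hypothetical and not part of Shogun; only `std::mt19937` and `std::normal_distribution` come from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (not a Shogun function): draw n samples from
// N(mean, stddev^2) using only the C++11 standard library.
std::vector<double> sample_normal(std::size_t n, double mean, double stddev,
                                  std::uint32_t seed)
{
    assert(stddev > 0.0);
    std::mt19937 engine(seed);                            // standard Mersenne Twister engine
    std::normal_distribution<double> dist(mean, stddev);  // built-in distribution
    std::vector<double> out(n);
    for (double& x : out)
        x = dist(engine);                                 // one draw per element
    return out;
}
```

One caveat worth keeping in mind: the standard fixes the output sequence of the engines, but not the algorithms used by the distributions, so the same seed may produce different samples with different standard library implementations.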

What if the classes for issue #1929 were written with C++11's distributions in mind instead of (or along with) SFMT? What would be a good direction to achieve that?

@karlnapf @vigsterkr Comments please?

@karlnapf
Member

I agree, we should try to use as much C++11 as possible. This is well-tested and reliable code and we don't even add a new dependency.

So we can do most of the sampling using C++11 methods. For some multivariate distributions, we will need some amount of linear algebra operations. Those should be done against our Shogun internal (soon to exist) framework, see #1930 and #1973. Or, since the latter interface is not yet complete, use Eigen3 (as the multivariate Gaussian does).

In addition, and maybe more important than sampling, we need to be able to evaluate the pdf/cdf of those densities. AFAIK this is not supported by C++11. Also, one can do lots of things inefficiently or even wrong. An example is again the Gaussian: the Cholesky decomposition of the covariance should only be computed once in the beginning (if not already specified) so that evaluating the pdf of samples does not have to do it again. For many distributions, evaluating the pdf of many points at once comes at the same cost as evaluating a single point (see again the Gaussian). Another interesting feature would be to compute quantiles of a given number of points (see https://github.com/karlnapf/kameleon-mcmc).
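
The Cholesky point can be sketched roughly as follows. This is an illustration only: `GaussianSketch` is a hypothetical class, not Shogun's CGaussianDistribution, and it uses a hand-written dense Cholesky on a row-major `std::vector` instead of Eigen3 so that the example has no dependencies. The covariance is factorized once in the constructor, and every later log-pdf evaluation only needs a forward substitution.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical class for illustration; not Shogun's CGaussianDistribution.
class GaussianSketch
{
public:
    // cov: d x d symmetric positive-definite matrix, row-major.
    GaussianSketch(std::vector<double> mean, const std::vector<double>& cov)
        : m_mean(std::move(mean)), m_dim(m_mean.size()),
          m_chol(m_dim * m_dim, 0.0), m_log_norm(0.0)
    {
        assert(cov.size() == m_dim * m_dim);
        double log_det = 0.0;
        // Cholesky factorization cov = L * L^T, done exactly once.
        for (std::size_t i = 0; i < m_dim; ++i)
            for (std::size_t j = 0; j <= i; ++j)
            {
                double sum = cov[i * m_dim + j];
                for (std::size_t k = 0; k < j; ++k)
                    sum -= m_chol[i * m_dim + k] * m_chol[j * m_dim + k];
                if (i == j)
                {
                    assert(sum > 0.0);         // fails if cov is not positive definite
                    m_chol[i * m_dim + i] = std::sqrt(sum);
                    log_det += std::log(sum);  // equals 2 * log(L_ii)
                }
                else
                    m_chol[i * m_dim + j] = sum / m_chol[j * m_dim + j];
            }
        const double kLogTwoPi = 1.8378770664093453;  // log(2 * pi)
        m_log_norm = -0.5 * (static_cast<double>(m_dim) * kLogTwoPi + log_det);
    }

    // Log-density at one point: solve L z = x - mean, then use |z|^2.
    double log_pdf(const std::vector<double>& x) const
    {
        assert(x.size() == m_dim);
        std::vector<double> z(m_dim);
        double quad = 0.0;
        for (std::size_t i = 0; i < m_dim; ++i)
        {
            double s = x[i] - m_mean[i];
            for (std::size_t k = 0; k < i; ++k)
                s -= m_chol[i * m_dim + k] * z[k];
            z[i] = s / m_chol[i * m_dim + i];
            quad += z[i] * z[i];
        }
        return m_log_norm - 0.5 * quad;
    }

    // Many points at once, all reusing the cached factor and normalizer.
    std::vector<double> log_pdf_batch(const std::vector<std::vector<double>>& xs) const
    {
        std::vector<double> out;
        out.reserve(xs.size());
        for (const auto& x : xs)
            out.push_back(log_pdf(x));
        return out;
    }

private:
    std::vector<double> m_mean;
    std::size_t m_dim;
    std::vector<double> m_chol;  // lower-triangular factor L, row-major
    double m_log_norm;           // cached -0.5 * (d * log(2*pi) + log det(cov))
};
```

With this layout, the O(d^3) factorization cost is paid once per distribution object, and each evaluated point costs O(d^2).
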
For univariate distributions, pdf/cdf functions usually depend on table lookups or numerical integration (see for example CStatistics::gamma_cdf). Currently, we borrow a couple of those implementations from ALGLIB. Apart from the problem that this is GPL, integrating such code is not the best idea, both because of the overhead it creates and because nobody understands the code well enough to fix bugs or make changes. I would much rather depend on http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html which is mature, tested, etc. We can make the dependency optional since most Shogun packages don't need complicated distributions. External libs should always be used in a way that keeps our dependency minimal.
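
To make the gap concrete: the C++11 `<random>` distributions only generate samples and expose no pdf or cdf members. For the normal distribution the cdf has a closed form in terms of `std::erfc` from `<cmath>`, so a sketch is short. The function names below are hypothetical. Distributions such as the gamma need incomplete gamma functions, which C++11 does not provide, and that is where a library backend would come in.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free functions for illustration; not Shogun or Boost API.

// Log-density of N(mean, stddev^2) at x.
double normal_log_pdf(double x, double mean, double stddev)
{
    assert(stddev > 0.0);
    const double kHalfLogTwoPi = 0.9189385332046727;  // 0.5 * log(2 * pi)
    const double z = (x - mean) / stddev;
    return -0.5 * z * z - std::log(stddev) - kHalfLogTwoPi;
}

// Cumulative distribution function of N(mean, stddev^2) at x,
// written with erfc to stay accurate in the lower tail.
double normal_cdf(double x, double mean, double stddev)
{
    assert(stddev > 0.0);
    const double z = (x - mean) / stddev;
    return 0.5 * std::erfc(-z / std::sqrt(2.0));
}
```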

Stan can serve as

  • inspiration for how to represent distributions
  • a source of code we could borrow for complicated PDFs
  • inspiration for auto-diff (which is out of scope for the mcmc GSoC project though)

@karlnapf
Member

Final point: We want a unified interface in Shogun, so merge all the existing probability classes into one

  • maximum likelihood learning
  • some other methods, see CDistribution

@vigsterkr
Member

regarding SFMT: no... we've just added SFMT. before even suggesting something like that, it'd be better to investigate the performance between SFMT/dSFMT and c++11

@karlnapf
Member

yeah very good point! In fact, let's just stay with the current one and only add things from c++11 if there is no existing implementation. This is not really about re-doing normal numbers but about interfaces of probability distributions

@vigsterkr
Member

yeah we had a discussion about this with @cameo54321 on IRC. i've tried to suggest that it would be good to check out other libraries (e.g. stan, boost) to get some good ideas how our probability distrib API should look...

@karlnapf
Member

Yeah, so just to summarise a few things

  • This is first of all about API aka how to represent distributions in Shogun. Unified interface!
  • This might involve interfaces for creating random numbers. This task's priority is not on this. We just should keep things open here. It should be noted that c++11 could possibly help us there, but we don't know. Backends have to be investigated. We currently have a working one, so no need to touch this as long as we don't need to.
  • Way more important are the (log) pdf/cdf functions. We need a way of getting rid of this ugly ALGLIB GPL code in CStatistics. Those things are messy (lookup tables, integration, etc) and important (small mistakes fuck up things) and therefore have to be tested properly. Stealing code and integrating it into Shogun with copy/paste (the current state) is not what we want there. This is where I suggest a possible Boost backend. This is independent of the interface; there could be multiple implementations. Boost is the only serious one that I know of, and I am open to any suggestions here. But I really don't want any re-implementations of such things; it's way too much work, much harder than generating random numbers. But again, this is first about interfaces.

Please see the kameleon-mcmc distribution framework on my github. This is in the direction of what we want.
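
As a rough illustration of the "interface first, backend second" idea summarised above, here is one possible shape. Everything in it is an assumption made for the example: the class names `UnivariateDistribution` and `ExponentialSketch` are hypothetical, they are not taken from Shogun or kameleon-mcmc, and fixing the engine type to `std::mt19937` is a simplification (the thread leaves the random number backend open).

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <random>

// Hypothetical unified interface: sampling and (log) pdf/cdf in one place.
class UnivariateDistribution
{
public:
    virtual ~UnivariateDistribution() {}
    virtual double sample(std::mt19937& engine) const = 0;  // engine type is an assumption
    virtual double log_pdf(double x) const = 0;
    virtual double cdf(double x) const = 0;
    // Default pdf derived from log_pdf, so subclasses implement only one of them.
    double pdf(double x) const { return std::exp(log_pdf(x)); }
};

// One concrete distribution. Sampling is delegated to the C++11 library,
// while pdf/cdf are closed-form here; harder cases could delegate to another backend.
class ExponentialSketch : public UnivariateDistribution
{
public:
    explicit ExponentialSketch(double rate) : m_rate(rate) { assert(rate > 0.0); }

    double sample(std::mt19937& engine) const override
    {
        std::exponential_distribution<double> dist(m_rate);
        return dist(engine);
    }

    double log_pdf(double x) const override
    {
        if (x < 0.0)
            return -std::numeric_limits<double>::infinity();  // zero density below 0
        return std::log(m_rate) - m_rate * x;
    }

    double cdf(double x) const override
    {
        return x < 0.0 ? 0.0 : -std::expm1(-m_rate * x);  // 1 - exp(-rate * x)
    }

private:
    double m_rate;
};
```

Callers would hold a pointer or reference to `UnivariateDistribution` and never see which backend computes the densities, which is the property the summary asks for.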

@vigsterkr
Member

feature/refactor-random is on this path :)

@karlnapf
Member

karlnapf commented Jul 9, 2017

FYI: ALGLIB already got pushed out a while ago (it was GPL code, replaced with cdflib)
