Skip to content

Latest commit

 

History

History
54 lines (40 loc) · 4.4 KB

README.md

File metadata and controls

54 lines (40 loc) · 4.4 KB

Symbolic Regression 101

About simple ML models

Let's consider Linear Regression to illustrate a few shortcomings of traditional (in this case linear) ML methods. While the number of terms in the equetion is derived from the dimension of our input data, the structure of this equation is fixed and known in advance

In Python we would end up with a decision

    X = # get data to predict on
    W = [1.0000000000000002, 1.9999999999999991]
    b = 3.0000000000000018

    predictions =  b + sum(X[i] * wi for i, wi in enumerate(W))

In this scenario we learned the coefficients, but the overall structure will never change

  1. we multiply every input feature by its corresponding coefficient
  2. we aggregate resulting values using a sum function
  3. we add the bias

Of course we have access to a variety of more complex, non-linear models out there, but they are usually harder to explain ...

Where does Symbolic Regression stand then ?

Symbolic Regression aims at learning not only the (potential) coefficients but first and foremost the structure of the model. The resulting relaxed model consists in polynomial, built by composition from

  • our input variables, eg. Age and Height
  • a set of operator at hand, eg. addition, substraction, multiplication and division

Symbolic Regression runs Genetic Programming in order to produce the N bests polynomials for your data.

What pros can you expect compared to other ML methods ?

  • built-in explainability along with non-linearity : if you understand all the operators you provide the algorithm with, you understand all potential polynomials derived from these operators
  • built-in feature selection & model-size reduction : when passing existing traits to a new offspring, we take the polynomial complexity in consideration (eq. to the model size in a Minimum Description Length setting). We then build a skyline based on both performance and complexity, and only consider models on this skyline. This process efficiently avoids "bloating" (complex individuals will not mate on the next round), controls overfitting (the simplest the polynomial, the more likely it does generalize to new data) and embeds explainability in the learning process
  • easier transfer learning : it's fairly easy to extract the list of learned polynomials from a Symbolic Regression model and instantiates a new one with this list, to solve a new but similar problem

How is Symkit different from other Symbolic Regression packages like gplearn ?

At each round in the genetic programming process, we need to evaluate the entire population of polynomials on the training data. If you consider a usual population size of 1000 and 500 rounds until what we assume to be convergence, this can quickly become very expensive to run on an off-the-shelf laptop, even for a medium-sized dataset.
Taking a step back, we can do better. At each round, indivuals are likely to share sub-expressions (they may share a common parent, and potentially grand-parents ...). In sympy this is known as a Common Subexpression Elimination, but it's essentialy a graph optimization technique. Extracting the common subparts to all our expressions, we can precompute results and inline them back into our original expressions ! This saves an enormous amount a CPU and RAM !!

Why using polars in the context of Symkit ?

The polars engine not only submit computation to multiple threads, but applies graph optimization before running your expressions. The only extra step we need is to convert a sympy expr to a polars expr (see core.py )

Symkit

TODO

  • add unary operators like sin and cos [X]
  • try an online version with RiverML [ ]
  • pareto front optimisation [X]
    • multi-objective optimisation (eg. return vs risk ) ? [ ]
  • SymbolicRegression for symbolic regression on timeseries [ ]
  • scikit-learn tree structure into a sympy expression (that can be translated into a polars expression later on)