Skip to content

Latest commit

 

History

History
59 lines (57 loc) · 3.36 KB

README.md

File metadata and controls

59 lines (57 loc) · 3.36 KB

GP-EMaL

Genetic Programming for Explainable Manifold Learning

This is our method for producing interpretable trees for nonlinear dimensionality reduction (GP for NLDR). To do this, we extend an existing multi-objective method, GPMaLMO. GPMaLMO optimises NLDR embeddings while minimising embedding dimensionality. We changed the objective function to minimise tree complexity instead of dimensionality. To measure tree complexity, we performed a literature survey for existing GP tree complexity metrics and implemented and compared a few of the most promising ones. A method that minimised the semantic complexity of the tree worked the best. We took this metric and extended it to be flexible, allowing the user to specify how the semantic complexity is defined depending on what is interpretable in the context of their dataset, knowledge and methodology. We also modify the tree complexity metric to explicitly optimise for small and symmetrical trees. The result is a flexible, efficient method which suits the subjective nature of interpretability. The new method is applied on all the same datasets as the original algorithm GPMaLMO, and the constructed datasets give similar classification accuracy with much more interpretable trees and minimal added compute cost.

Usage

(from the src/ directory):
To run on iris.data (in /src/datasets dir) for 10 generations, using our functional metric as our seconday objective to minimise (alongside neighbourhood structure embedding loss)

make run DATASET=iris GENS=10 DIM=2 DIR=datasets/

To clean up output files in the directory:

make clean

To run in parallel:

jobs=10
dataset=wine
gens=1000
for i in {1..$jobs}; do make run DATASET=$dataset GENS=$gens OBJ=functional DIR="./datasets/" & done

A main feature of GPEMaL is specifying the operations in your function set and their associated costs. These are specified by the following command line arguments (grabbed by argparse):

# /src/gptools/util.py line 64-67
parser.add_argument("-fs", "--funcset", help="function set", 
    type=str, default="vadd,vsub,vmul,vdiv,max,min,relu,sigmoid,abs")
parser.add_argument("-oc", "--opcosts", help="node operation costs",
    type=str, default="sum,sum,prod,prod,exp,exp,exp,exp,exp")

The function set operation specification is straightforward, but let us analyse the scaling of these costs. (Here $L_{i}$ is the complexity of the $i$ th node's left child subtree and $R_{i}$ is the right child subtree complexity)

Cost argument Complexity Cost Scaling ($n$ average distance from leaf node / average subtree height)
"sum" $L_{i} + R_{i}$ $\mathcal{O}(n)$
"prod" max( $L_{i}\times R_{i},L_{i},R_{i}$ ) $\mathcal{O}(n^{2})$
"exp" $\exp(L_{i} + R_{i})$ $\mathcal{O}(2^{n})$

Note:

  • Datasets used in the paper are in datasets/
  • Add your own datasets in csv format, with a header line:
    Header: classPosition,#features,#classes,seperator. e.g.
    classLast,1024,20,comma (from COIL20.data)
  • Most GP parameters are configured in gpmalmo/rundata.py