Selected papers and, where available, the corresponding code from our review paper "A Survey of Neural Trees" [arXiv Version]
If you find a missing paper or a possible mistake in our survey, please feel free to email me or open a pull request here. I am more than glad to receive your advice. Thanks!
Neural trees (NTs) refer to a school of methods that combine neural networks (NNs) and decision trees (DTs), for which we present a comprehensive review in this survey. Our focus is to identify how these approaches enhance model interpretability and to suggest possible solutions to the remaining challenges. Besides, we discuss other considerations, such as conditional computation, and promising directions in this field, in the hope of advancing the practice of NTs.
New work on NTs and changes to this survey will be announced here.
- Add "ProGReST: Prototypical Graph Regression Soft Trees for Molecular Property Prediction" [Paper], an interesting work that combines prototype learning, soft decision trees and Graph Neural Networks for compound property prediction.
- Add "Large-Scale Category Structure Aware Image Categorization" [Paper] & "Leveraging Class Hierarchies with Metric-Guided Prototype Learning" [Paper], two methods incorporating the class hierarchy into the regulariser of the general objective function in section 3. Although a regulariser is certainly feasible in utilize the class hierarchy, it is curiously rare in practice and these two methods are the only two examples we know of.
- Add "Distillation Decision Tree" [Paper], which interprets the black-box model by distilling its knowledge into decision trees, belonging to section 2.2.
If you find this survey useful for your research, please consider citing:
```bibtex
@article{li2022survey,
  title={A Survey of Neural Trees},
  author={Li, Haoling and Song, Jie and Xue, Mengqi and Zhang, Haofei and Ye, Jingwen and Cheng, Lechao and Song, Mingli},
  journal={arXiv preprint arXiv:2209.03415},
  year={2022}
}
```
Thanks!
In this survey, we present a thorough taxonomy for NTs that expresses the gradual integration and co-evolution of NNs and DTs.
These approaches are the very first to combine NNs and DTs. They employ DTs as auxiliaries for NNs, or use NNs as tools to improve the design of DTs. In these approaches, "neural" and "tree" are separated: one of the NN and the DT accomplishes the specific task, while the other acts as its assistant or interpreter. This line of work dates back to the early 1990s, when DTs were used to provide structural priors for NNs or to extract rules from trained NNs. These are perceived as implicit combinations of NNs and DTs, because the NN and the DT each still operate within their own paradigm and no hybrid model is produced.
Adopt DTs to approximate the target concept of NNs in terms of logical descriptions, i.e., use DTs to design NNs with structural priors.
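A minimal sketch of the underlying DT-to-NN mapping (ours, not code from any listed paper): layer 1 mirrors the split tests of a toy two-split tree, layer 2 ANDs the conditions along each root-to-leaf path, layer 3 ORs the leaves of each class, and the resulting network can then be fine-tuned by gradient descent. The steepness `k` and the toy tree are arbitrary choices.

```python
# A hand-mapped toy tree -> NN (ours, for illustration only).
# Toy tree: if x0 > 2.0 -> class 1; else if x1 > 0.5 -> class 1; else class 0.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(2, 2), nn.Sigmoid(),  # layer 1: one unit per split test
    nn.Linear(2, 3), nn.Sigmoid(),  # layer 2: one unit per root-to-leaf path (AND)
    nn.Linear(3, 2),                # layer 3: one unit per class (OR over leaves)
)
with torch.no_grad():
    k = 10.0  # steepness of the soft thresholds (arbitrary)
    net[0].weight.copy_(torch.tensor([[k, 0.0], [0.0, k]]))
    net[0].bias.copy_(torch.tensor([-2.0 * k, -0.5 * k]))  # units fire for x0 > 2, x1 > 0.5
    # Paths: (x0 > 2), (x0 <= 2 AND x1 > 0.5), (x0 <= 2 AND x1 <= 0.5).
    net[2].weight.copy_(torch.tensor([[k, 0.0], [-k, k], [-k, -k]]))
    net[2].bias.copy_(torch.tensor([-0.5 * k, -0.5 * k, 0.5 * k]))
    # Class 0 <- path 3; class 1 <- paths 1 and 2.
    net[4].weight.copy_(torch.tensor([[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]))
    net[4].bias.zero_()

assert net(torch.tensor([[3.0, 0.0]])).argmax().item() == 1  # x0 > 2 -> class 1
```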
- Entropy net: the first practice
An example of the original entropy net:
- ID3-based algorithm that converts DTs into hidden layers:
- "A machine learning method for generation of a neural network architecture: a continuous ID3 algorithm", IEEE Transactions on Neural Networks, 1992
- K.J. Cios, N. Liu
- [Paper]
- "A machine learning method for generation of a neural network architecture: a continuous ID3 algorithm", IEEE Transactions on Neural Networks, 1992
- Soft thresholding of entropy net:
- "Neural implementation of tree classifiers", IEEE Transactions on Systems, Man, and Cybernetics, 1995
- I.K. Sethi
- [Paper]
- "Neural implementation of tree classifiers", IEEE Transactions on Systems, Man, and Cybernetics, 1995
- Oblique trees version of entropy net:
- A knowledge acquisition system concerned with the forms of knowledge that DTs and NNs prefer:
- "Implementation and refinement of decision trees using neural networks for hybrid knowledge acquisition", Artificial Intelligence in Engineering, 1995
- Katsuhiko Tsujino, Shogo Nishida
- [Paper]
- "Implementation and refinement of decision trees using neural networks for hybrid knowledge acquisition", Artificial Intelligence in Engineering, 1995
- Create a disjunctive normal form formula for each class and reformulate the first layer of entropy net:
- "Initializing neural networks using decision trees", published by Rutgers University, 1990
- Arunava Banerjee
- [Paper]
- "Initialization of neural networks by means of decision trees", Knowledge-Based Systems, 1995
- Irena Ivanova, Miroslav Kubat
- [Paper]
- "On mapping decision trees and neural networks", Knowledge-Based Systems, 1999
- R. Setiono, W.K. Leow
- [Paper]
- "Initializing neural networks using decision trees", published by Rutgers University, 1990
Apply DTs to interpret trained NNs by approximating the network (mostly in terms of the input-output mapping) and mimicking the decision boundaries implicitly learned by the hidden layers.
Treat NNs as black boxes and extract input-output rules directly from them, regardless of their intermediate layers.
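A minimal sketch of this pedagogical recipe (ours, TREPAN-like in spirit; the MLP, data, and tree depth are placeholders): query the trained network as an oracle to relabel (and optionally augment) inputs, then induce a DT on the oracle-labelled data.

```python
# Pedagogical rule extraction: the trained NN is queried as a black-box oracle
# and a decision tree is fit to its input-output behaviour.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
oracle = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                       random_state=0).fit(X, y)

# Labels come from the NN, not from the ground truth, so the tree mimics the
# network's decision boundary; extra jittered samples probe it more densely.
X_query = np.vstack([X, X + 0.05 * np.random.randn(*X.shape)])
y_oracle = oracle.predict(X_query)

surrogate = DecisionTreeClassifier(max_depth=4).fit(X_query, y_oracle)
print("fidelity to the NN:", surrogate.score(X, oracle.predict(X)))
print(export_text(surrogate))
```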
A description of pedagogical techniques:
- TREPAN: the first practice of DT-based interpretation
- Remove insignificant input neurons from the trained network before inducing the DT
- "Decision Tree Extraction using Trained Neural Network", SMARTGREENS, 2020
- Nikola Vasilev et al.
- [Paper]
- "Decision Tree Extraction using Trained Neural Network", SMARTGREENS, 2020
- Turn the m-of-n type splits into standard splitting tests such as those in C4.5
- "Decision Tree Extraction from Trained Neural Networks", AAAI, 2004
- Darren Dancey et al. (CART tree)
- [Paper]
- "Converting a trained neural network to a decision tree DecText - decision tree extractor", Dissertation, 2000
- Olcay Boz (propose discretization algorithms to handle continuous features)
- [Paper]
- "Extracting decision trees from trained neural networks", Pattern Recognition, 2020
- Olcay Boz
- [Paper]
- "NeC4.5: neural ensemble based C4.5", IEEE Transactions on Knowledge and Data Engineering, 2004
- Zhi-Hua Zhou, Yuan Jiang (use NN ensembles)
- [Paper]
- "Extracting decision trees from trained neural networks", Pattern Recognition, 1999
- R. Krishnan et al. (adapt genetic algorithms to generate artificial examples)
- [Paper]
- "Decision Tree Extraction from Trained Neural Networks", AAAI, 2004
- Generalize the TREPAN algorithm
- "An investigation of TREPAN utilising a continuous oracle model", Data Analysis Techniques and Strategies, 2011
- William A Young et al. (develop DTs derived from continuous-based models)
- [Paper]
- "Trepan-Plus: An Extension of a Decision Tree Extraction Algorithm Utilizing Artificial Neural Networks", Intelligent Engineering Systems Through Artificial Neural Networks, 2007
- M Rangwala et al. (extend TREPAN to multi-class regression problems)
- [Paper]
- "Extracting fuzzy symbolic representation from artificial neural networks", NAFIPS, 1999
- Maciej Faifer et al. (use fuzzy representation during the tree-induction process)
- [Paper]
- "DT+GNN: A Fully Explainable Graph Neural Network using Decision Trees", arXiv, 2022
- Peter Müller et al. (incorporate DTs into graph neural networks)
- [Paper]
- "An investigation of TREPAN utilising a continuous oracle model", Data Analysis Techniques and Strategies, 2011
- NNs perform significance analysis to select attributes
- "ANN-DT: an algorithm for extraction of decision trees from artificial neural networks", IEEE Transactions on Neural Networks, 1999
- Gregor PJ Schmitz et al.
- [Paper]
- "ANN-DT: an algorithm for extraction of decision trees from artificial neural networks", IEEE Transactions on Neural Networks, 1999
Decompositional techniques concern units in the hidden layers. They extract rules from the trained NN at the level of individual neurons, thus gaining insight into their inner structures.
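A minimal sketch of the decompositional idea (ours, loosely in the spirit of CRED; the mean-based binarization threshold is a crude choice): describe each hidden unit by a small DT over the inputs and the output by a DT over the binarized hidden states, which can then be composed into input-level rules (composition omitted here).

```python
# Decompositional extraction: rules are built at the level of individual
# hidden units rather than from the input-output mapping alone.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0).fit(X, y)

# Hidden activations of the single (ReLU) hidden layer, thresholded to binary states.
H = np.maximum(X @ net.coefs_[0] + net.intercepts_[0], 0.0)
H_bin = (H > H.mean(axis=0)).astype(int)

# (1) a tree from discretized hidden states to the network's predictions,
# (2) one tree per hidden unit from the inputs to that unit's on/off state.
output_tree = DecisionTreeClassifier(max_depth=3).fit(H_bin, net.predict(X))
unit_trees = [DecisionTreeClassifier(max_depth=3).fit(X, H_bin[:, j])
              for j in range(H_bin.shape[1])]
```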
- Extract axis-aligned trees
- "Rule extraction from neural networks via decision tree induction", IJCNN, 2001
- Makoto Sato, Hiroshi Tsukimoto (CRED algorithm that deals with NNs with one hidden layer)
- [Paper]
- "DeepRED – Rule Extraction from Deep Neural Networks", International Conference on Discovery Science, 2016
- Jan Ruben Zilke, Eneldo Loza Mencía, Frederik Janssen (Extend CRED algorithm by deriving intermediate rules for every additional hidden layer)
- [Paper]
- "Eclectic rule extraction from Neural Networks using aggregated Decision Trees", ICECE, 2012
- MD. Ridwan Al Iqbal (Simplify rules by logical simplification)
- [Paper]
- "Rule extraction from neural networks via decision tree induction", IJCNN, 2001
- Extract oblique and multivariate trees
- "NeuroLinear: From neural networks to oblique decision rules", Neurocomputing, 1997
- Rudy Setiono, Huan Liu (oblique trees)
- [Paper]
- "Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees", 2020
- Duy T Nguyen, Kathryn E Kasmarik, Hussein A Abbass (multivariate trees)
- [Paper]
- "NeuroLinear: From neural networks to oblique decision rules", Neurocomputing, 1997
Typically, eclectic techniques model the relationship between the last hidden layer and the outputs, then infer the input-output relationship from the magnitudes of the weights in the NN. However, few of them are based on DTs. Therefore, this survey also includes approaches that extract partial knowledge contained in the hidden layers and do not require their approximation target to be the input-output relationship.
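A minimal sketch of an eclectic-style extraction (ours): fit a DT from the last hidden layer's activations, rather than the raw inputs, to the network's predictions.

```python
# Eclectic-style extraction: approximate the mapping from the last hidden
# layer (not the raw inputs) to the network's predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=500, random_state=0).fit(X, y)

def last_hidden(net, X):
    """Forward pass up to (and including) the last hidden layer."""
    a = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        a = np.maximum(a @ W + b, 0.0)  # ReLU hidden layers
    return a

tree = DecisionTreeClassifier(max_depth=3).fit(last_hidden(net, X), net.predict(X))
```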
- Extended C-Net algorithm
- "Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees", arXiv, 2020
- Duy T Nguyen, Kathryn E Kasmarik, Hussein A Abbass (a typical eclectic technique)
- [Paper]
- "Towards Interpretable ANNs: An Exact Transformation to Multi-Class Multivariate Decision Trees", arXiv, 2020
- Generalized eclectic techniques
- "Global Model Interpretation Via Recursive Partitioning", HPCC/SmartCity/DSS, 2018
- "Interpreting CNNs via Decision Trees", CVPR, 2019
- Quanshi Zhang et al. (encodes all potential decision modes of the CNN in a coarse-to-fine manner)
- [Paper]
These approaches train NNs that resemble compact DTs through crafted regularization terms, so that they can be well-approximated by small DTs.
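A minimal sketch of the quantity these regularizers target, assuming a scikit-learn-style `net_predict` callable (our name): the average decision-path length (APL) of a DT fit to the network's current behaviour. Wu et al. make this penalty differentiable by training a surrogate network to predict APL from the NN's parameters; the surrogate is omitted here, so this version is a non-differentiable probe.

```python
# Tree regularization (sketch): penalize the average decision-path length of
# a DT fit to the network's current input-output behaviour.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def average_path_length(net_predict, X):
    tree = DecisionTreeClassifier(max_depth=8).fit(X, net_predict(X))
    indicator = tree.decision_path(X)    # sparse (n_samples, n_nodes) matrix
    return indicator.sum(axis=1).mean()  # mean number of nodes visited per sample

# Inside a training loop one would use:  loss = task_loss + lam * APL_surrogate
```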
- Tree regularization
- "Beyond Sparsity: Tree Regularization of Deep Models for Interpretability", AAAI, 2018
- "Regional Tree Regularization for Interpretability in Deep Neural Networks", AAAI, 2020
- Mike Wu et al. (extended work of tree regularization that encourages a deep model to be approximated by several separate DTs specific to pre-defined regions of the input space)
- [Paper]
- L1-orthogonal regularization
- "Enhancing Decision Tree Based Interpretation of Deep Neural Networks through L1-Orthogonal Regularization", ICMLA, 2019
- Nina Schaaf, Marco Huber, Johannes Maucher (so that they do not need a surrogate network to estimate the tree loss)
- [Paper]
- "Enhancing Decision Tree Based Interpretation of Deep Neural Networks through L1-Orthogonal Regularization", ICMLA, 2019
Here, NNs are specialized for designing more reasonable DTs and do not benefit from them; this direction is rare in practice.
- Induce DTs from more valid data filtered by the trained NNs
- "An integrated approach of neural network and decision tree to classification", ICMLC, 2005
- Xiao-Ye Wang, Xiu-Xia Liang, Ji-Zhou Sun
- [Paper]
- "An integrated approach of neural network and decision tree to classification", ICMLC, 2005
DT algorithms embody two key ideas: "decision" and "tree". We view "decision" as the decision branches implemented by routing functions, and "tree" as the model topology that identifies a class hierarchy. This part of the survey covers NNs that draw on part of the ideas of DTs, either the class hierarchy or the decision branches, but not both. These approaches are considered a halfway and implicit integration of NNs and DTs: they borrow some inherent ideas from DTs for NNs, instead of designing a NN that works like a DT. However, our taxonomy is only concerned with the former, i.e., NNs leveraging class hierarchies. These approaches cannot implement the class hierarchy in the network structure due to the absence of decision branches, so they resort to incorporating the hierarchical relations into the NN directly. There are three lines of research, corresponding to different strategies of employing class hierarchies: hierarchical architectures, hierarchical loss functions, and label-embedding-based methods.
Overview of NNs exploiting class hierarchies:
Functions and notations in the figure are adopted from Bertinetto et al.:
These methods attempt to incorporate the class hierarchy into the architecture of the classifier, so that the networks are designed to be branched and each branch is tasked to identify the concept abstraction at one specific level of the class hierarchy.
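As a toy illustration (ours; names like `BranchedClassifier` are made up), a shared trunk feeds one head per hierarchy level, and training sums one cross-entropy term per level.

```python
# A hierarchy-aware architecture: a shared trunk with one head per level of
# the class hierarchy (the coarse/fine split is illustrative).
import torch
import torch.nn as nn

class BranchedClassifier(nn.Module):
    def __init__(self, in_dim, n_coarse, n_fine):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.coarse_head = nn.Linear(64, n_coarse)  # level-1 concepts
        self.fine_head = nn.Linear(64, n_fine)      # level-2 concepts

    def forward(self, x):
        h = self.trunk(x)
        return self.coarse_head(h), self.fine_head(h)

model = BranchedClassifier(in_dim=32, n_coarse=3, n_fine=10)
coarse_logits, fine_logits = model(torch.randn(8, 32))
# Training sums a cross-entropy term per level:
# loss = ce(coarse_logits, y_coarse) + ce(fine_logits, y_fine)
```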
- Hierarchical multi-label classification using local neural networks
- "Hierarchical multi-label classification using local neural networks", Journal of Computer and System Sciences, 2014
- Ricardo Cerri, Rodrigo C Barros, André CPLF De Carvalho
- [Paper]
- "Reduction strategies for hierarchical multi-label classification in protein function prediction", BMC Bioinformatics, 2016
- Ricardo Cerri et al.
- [Paper]
- "HD-CNN: Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition", ICCV, 2015
- "Your "Flamingo" is My "Bird": Fine-Grained, or Not", CVPR, 2021
- "An integrated approach of neural network and decision tree to classification", ICMLC, 2014
- Branch the network after a shared trunk
- "Learning to Make Better Mistakes: Semantics-aware Visual Food Recognition", ACM international conference on Multimedia, 2016
- Hui Wu et al. (a single network backbone shared by multiple fully-connected layers)
- [Paper]
- "Do Convolutional Neural Networks Learn Class Hierarchy?", TVCG, 2017
- Alsallakh Bilal et al. (deep CNN with branches at intermediate layers to fit the coarser-grained labels)
- [Paper]
- "Fine-Grained Representation Learning and Recognition by Exploiting Hierarchical Semantic Embedding", ACM international conference on Multimedia, 2018
- Tianshui Chen et al. (introduce an attention mechanism to incorporate the coarse-grained results for learning finer-grained features)
- [Paper]
- "Learning to Make Better Mistakes: Semantics-aware Visual Food Recognition", ACM international conference on Multimedia, 2016
- Interactions between different branches
- "Label Hierarchy Transition: Modeling Class Hierarchies to Enhance Deep Classifiers", arXiv, 2021
- Renzhen Wang et al. (propose label hierarchy transition matrices whose column vectors represent the conditional label distributions of classes between two adjacent hierarchies)
- [Paper]
- "Label Relation Graphs Enhanced Hierarchical Residual Network for Hierarchical Multi-Granularity Classification", CVPR, 2022
- "Label Hierarchy Transition: Modeling Class Hierarchies to Enhance Deep Classifiers", arXiv, 2021
Incorporate the class hierarchy into loss functions that exploit the underlying hierarchical relationships and produce predictions coherent with the pre-defined class hierarchy.
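One common recipe, sketched below under our own simplifications in the spirit of the soft-label variant of Bertinetto et al.: derive soft labels from a hierarchy distance matrix `D` (illustrative values here), so that confusing classes far apart in the hierarchy costs more than confusing nearby ones.

```python
# A hierarchy-aware loss via soft labels: wrong classes far from the true
# class in the hierarchy receive less soft-label mass.
import torch
import torch.nn.functional as F

# D[i, j]: illustrative tree distance (e.g. height of the lowest common
# ancestor) between classes i and j; any hierarchy metric can be plugged in.
D = torch.tensor([[0., 1., 2.],
                  [1., 0., 2.],
                  [2., 2., 0.]])

def hierarchical_loss(logits, target, beta=2.0):
    soft = torch.softmax(-beta * D[target], dim=-1)  # one soft label per sample
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

loss = hierarchical_loss(torch.randn(4, 3), torch.tensor([0, 2, 1, 0]))
```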
- Predict at each grain separately, i.e., first take a model pre-trained on one level, then tune it using labels from other levels
- Utilize the hierarchical constraints directly
- "Learning hierarchical similarity metrics", CVPR, 2012
- Nakul Verma et al. (probabilistic nearest-neighbor classification based framework)
- [Paper]
- "Discriminative Transfer Learning with Tree-based Priors", NIPS, 2013
- Nitish Srivastava, Russ R Salakhutdinov (DNN with hierarchical priors over the parameters of the classification layer)
- [Paper]
- "Large-Scale Object Classification Using Label Relation Graphs", ECCV, 2018
- "Making Better Mistakes: Leveraging Class Hierarchies With Deep Networks", CVPR, 2020
- Luca Bertinetto et al. (incorporate class hierarchy into the cross-entropy loss)
- [Paper]
- "Learning hierarchical similarity metrics", CVPR, 2012
These approaches aim to encode the class hierarchy into embeddings whose relative locations or possible interactions represent the semantic relationships.
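A minimal sketch of the dot-product flavour of this idea (our toy similarity matrix, not from any listed paper): factor a hierarchy-derived class similarity matrix so that embedding dot products reproduce it, then train the network to regress onto the embedding of the ground-truth class.

```python
# Hierarchy-derived label embeddings: factor a class similarity matrix S so
# that embeddings e_i satisfy e_i . e_j ~= S[i, j].
import numpy as np

S = np.array([[1.0, 0.8, 0.1],   # illustrative similarity from the tree
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
vals, vecs = np.linalg.eigh(S)
vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
E = vecs * np.sqrt(vals)         # rows of E are the class embeddings
assert np.allclose(E @ E.T, S, atol=1e-6)

# At test time: predict the class whose embedding is nearest the network output.
```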
- DeViSE method that utilizes unannotated Wikipedia text
- Embeddings whose pair-wise dot products correspond to semantic similarity between classes
- Employ the entailment cones to learn order-preserving embeddings
- Label embeddings in zero-shot classification
Neural Decision Trees (NDTs) are hybrid NN models that implement both the class hierarchy and the decision branches. Their core idea is to exploit NNs in the tree design by making the routing functions differentiable, thus allowing optimization with gradient descent-based methods.
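The mechanic is easiest to see in code. Below is a minimal depth-2 soft tree of our own (in the spirit of soft/fuzzy DTs such as Frosst & Hinton's): each internal node is a sigmoid router, path probabilities multiply along the tree, and the output is a differentiable mixture over leaf distributions.

```python
# Core NDT mechanic: sigmoid routers give each sample a probability of going
# left/right, so the prediction is a differentiable mixture over leaves and
# the whole tree trains by gradient descent.
import torch
import torch.nn as nn

class SoftTreeDepth2(nn.Module):
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.routers = nn.ModuleList([nn.Linear(in_dim, 1) for _ in range(3)])
        self.leaves = nn.Parameter(torch.zeros(4, n_classes))  # leaf logits

    def forward(self, x):
        p0 = torch.sigmoid(self.routers[0](x))  # root: P(go left)
        p1 = torch.sigmoid(self.routers[1](x))  # left child
        p2 = torch.sigmoid(self.routers[2](x))  # right child
        path = torch.cat([p0 * p1, p0 * (1 - p1),
                          (1 - p0) * p2, (1 - p0) * (1 - p2)], dim=1)
        return path @ torch.softmax(self.leaves, dim=-1)  # mixture over leaves

tree = SoftTreeDepth2(in_dim=16, n_classes=5)
probs = tree(torch.randn(8, 16))  # (8, 5); rows sum to 1
```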
If an NDT satisfies this property, each internal node is assigned a specific and intermediate classification task. This "divide-and-conquer" strategy and the stepwise inference process make the model more interpretable, because each node in the tree is responsible for a specific sub-task and distinguishable from the other nodes.
Data-driven methods employ data-dependent heuristics to perform local optimization. The resultant tree leads to a recursive partitioning of the input space X through a cascade of tests, and defines the output of each leaf in terms of the examples falling within it.
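A minimal sketch of this greedy recipe (ours; a `LogisticRegression` stands in for the small per-node NN, and the routing rule is a crude heuristic).

```python
# Greedy, data-driven NDT induction: fit a small router on the examples
# reaching each node, partition them, and recurse; each leaf stores the label
# distribution of the examples falling within it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def grow(X, y, n_classes, depth=0, max_depth=3, min_size=20):
    hist = np.bincount(y, minlength=n_classes) / len(y)
    if depth == max_depth or len(y) < min_size or len(np.unique(y)) == 1:
        return {"leaf": hist}
    router = LogisticRegression(max_iter=200).fit(X, y)  # stand-in for a tiny NN
    left = router.predict_proba(X)[:, 0] > 0.5           # crude routing heuristic
    if left.all() or not left.any():                     # degenerate split -> leaf
        return {"leaf": hist}
    return {"router": router,
            "left": grow(X[left], y[left], n_classes, depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], n_classes, depth + 1, max_depth, min_size)}

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
root = grow(X, y, n_classes=3)
```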
A comparison between NDTs according to whether they implement a class hierarchy and whether they are data-driven:
- Informativeness-related splitting functions with NN-based routers
- "Neural trees-using neural nets in a tree classifier structure", ICASSP, 1991
- J-E Stromberg, Jalel Zrida, Alf Isaksson (train a small NN at each internal node, which singles out one unique class that gains the most class purity)
- [Paper]
- "Evolutionary design of neural network tree-integration of decision tree, neural network and GA", CEC, 2001
- Qiangfu Zhao (use genetic algorithms to maximize the information gain ratio at each internal node)
- [Paper]
- "Hybrid decision tree", Knowledge-Based Systems, 2002
- Zhi-Hua Zhou, Zhao-Qian Chen (Divide the instance space into ordered attributes and unordered attributes)
- [Paper]
- "Classification trees with neural network feature extraction", CVPR, 1992
- Heng Guo, Saul B Gelfand (find two clusters that minimize a Gini impurity criterion, then find a good split through back-propagation)
- [Paper]
- "Neural trees-using neural nets in a tree classifier structure", ICASSP, 1991
- Decrease the classification error at each node to be extended
- "Growing and pruning neural tree networks", IEEE Transactions on Computers, 1993
- A. Sakar et al.
- [Paper]
- "Growing and pruning neural tree networks", IEEE Transactions on Computers, 1993
- NDTs that grow during competitive learning
- "Competitive neural trees for pattern classification", IEEE Transactions on Neural Networks, 1998
- Sven Behnke, Nicolaos B Karayiannis
- [Paper]
- "Competitive neural trees for pattern classification", IEEE Transactions on Neural Networks, 1998
- Map examples into the label embedding space and predict using an NDT
- Apply data-driven architectures to fuzzy NDTs
- "Globally optimal fuzzy decision trees for classification and regression", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999
- Alberto Suárez, James F Lutsko (superimpose a fuzzy structure over the skeleton of a CART tree and introduce a global optimization algorithm)
- [Paper]
- "Budding Trees", ICPR, 2014
- "Convolutional Decision Trees for Feature Learning and Segmentation", GCPR, 2014
- Dmitry Laptev, Joachim M Buhmann (applies fuzzy NDTs to image segmentation by extracting the most informative and interpretable features)
- [Paper]
- "Neural Decision Trees", arXiv, 2017
- Randall Balestriero (map similar inputs to the same hash value)
- [Paper]
- "Neuro-fuzzy decision trees", Neural Systems, 2006
- Rajen B Bhatt, M Gopal (employs fuzzy ID3 algorithm for DT-renovation)
- [Paper]
- "Globally optimal fuzzy decision trees for classification and regression", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999
- Incremental learning
- "Hybrid decision tree", Knowledge-Based Systems, 2022
- Zhi-Hua Zhou, Zhao-Qian Chen (two example-incremental tasks, one hypothesis-driven constructive induction mechanism)
- [Paper]
- "Soft decision trees", ICPR, 2012
- Ozan Irsoy, Olcay Taner Yıldız, Ethem Alpaydın (sigmoid-based FDT whose splits are made by checking if there is an improvement over the validation set)
- [Paper]
- "A Neural Tree with Partial Incremental Learning Capability", ICMLC, 2007
- Mu-Chun Su, Hsu-Hsun Lo (choose a target class at each internal node and train a small NN to separate the positive patterns from negative ones)
- [Paper]
- "Hybrid decision tree", Knowledge-Based Systems, 2022
A bigot NDT tends to have a pre-defined structure and determines its leaves (pure classes or fixed class distributions) by priors or algorithms, so as to induce the class hierarchy.
- Bigot NDTs with pure leaves
- "Structure-driven induction of decision tree classifiers through neural learning", Pattern Recognition, 1997
- Ishwar K Sethi et al. (randomly initialized leaf classes)
- [Paper]
- "NBDT: Neural-Backed Decision Trees", ICCV, 2020
- "Structure-driven induction of decision tree classifiers through neural learning", Pattern Recognition, 1997
- Bigot NDTs with determined leaf distributions
- "Distilling a Neural Network Into a Soft Decision Tree", AI*IA, 2017
- "Policy-gradient Methods for Decision Trees", ESANN, 2016
- Aurélia Léon, Ludovic Denoyer (use Monte Carlo approximation to estimate the gradient of the objective function)
- [Paper]
- "Deep Neural Decision Forests", ICCV, 2015
- "Deep Regression Forests for Age Estimation", CVPR, 2018
- "Neural Decision Forests for Semantic Image Labelling", CVPR, 2014
- Samuel Rota Bulo, Peter Kontschieder (decision forests for semantic image labelling)
- [Paper]
- "Neural Prototype Trees for Interpretable Fine-Grained Image Recognition", CVPR, 2021
- "Learn decision trees with deep visual primitives", SSRN, 2022
- Mengqi Xue et al.
- [Paper]
- Bigot NDTs for knowledge distillation
- "Tree-Like Decision Distillation", CVPR, 2021
- Jie Song et al. (layer-wise dissect the decision process of a DNN)
- [Paper]
- "Tree-Like Branching Network for Multi-class Classification", LNNS, 2021
- Mengqi Xue, Jie Song, Li Sun, Mingli Song (mine the underlying category relationships from a trained teacher network and determine the appropriate layers on which specialized branches grow)
- [Paper]
- "Distilling a Neural Network Into a Soft Decision Tree", AI*IA, 2017
- "TNT: An Interpretable Tree-Network-Tree Learning Framework using Knowledge Distillation", Entropy, 2020
- Jiawei Li et al. (transfer knowledge between tree models and DNNs)
- [Paper]
- "KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation", arXiv, 2021
- Mengqi Xue et al.
- [Paper]
- "Tree-Like Decision Distillation", CVPR, 2021
NDTs without class hierarchies impose few constraints on themselves and perform arbitrary predictions at the leaves. Their leaf nodes are usually classifiers rather than determined classes or distributions.
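The building block is sketched below (our toy code): a softmax gate spreads responsibility over expert classifiers and the output is the weighted mixture; an HME nests such blocks so that every "expert" is itself a gated mixture.

```python
# One-level mixture of experts (the HME building block): a softmax gate
# assigns soft responsibility over expert classifiers, and the output is the
# responsibility-weighted mixture of expert predictions.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim, n_classes, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, n_classes) for _ in range(n_experts)])

    def forward(self, x):
        g = torch.softmax(self.gate(x), dim=-1)             # (B, E)
        out = torch.stack([torch.softmax(e(x), dim=-1)
                           for e in self.experts], dim=1)   # (B, E, C)
        return (g.unsqueeze(-1) * out).sum(dim=1)           # (B, C)

moe = MixtureOfExperts(in_dim=16, n_classes=5)
probs = moe(torch.randn(8, 16))
```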
- Hierarchical Mixtures of Experts (HME)
- "Hierarchical Mixtures of Experts and the EM Algorithm", Neural computation, 1994
- Michael I Jordan, Robert A Jacobs (the original HME, a tree-structured model for regression and classification)
- [Paper]
- "Classification using hierarchical mixtures of experts", NNSP, 1994
- Steve R Waterhouse, Anthony J Robinson (each leaf expert is non-linear and performs multi-way classification)
- [Paper]
- "Bayesian Hierarchical Mixtures of Experts", arXiv, 2012
- Christopher M Bishop, Naonori Ueda, Steve Waterhouse (Bayesian treatments of the HME model to prevent the severe overfitting caused by maximum likelihood)
- [Paper]
- "Hierarchical Mixtures of Experts and the EM Algorithm", Neural computation, 1994
- Generalized HMEs in advanced frameworks
- "Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization", CVPR, 2020
- "NDT: Neual Decision Tree Towards Fully Functioned Neural Graph", arXiv, 2017
- Han Xiao (reformulate the non-differentiable information gain in the form of the Dirac symbol and approximate it as a continuous function)
- [Paper]
- "Decision Forests, Convolutional Networks and the Models in-Between", arXiv, 2016
- Yani Ioannou et al. (hybrid model between decision forests and convolutional networks)
- [Paper]
- "Deep Neural Decision Trees", arXiv, 2018
- "ViT-NeT: Interpretable Vision Transformers with Neural Tree Decoder", ICML, 2022
- Expert NDTs with an architecture search phase