The DFNET R package can be installed using devtools:

```r
install.packages("devtools")
devtools::install_github("pievos101/DFNET")
```
See our examples using synthetic data sets or real-world cancer data.
Generally speaking, DFNET follows a four-step process:

- Preparing the input data (graph and features).
- Training the forest.
- Finding useful decision trees.
- Using these trees for evaluation.
DFNET expects an `igraph::igraph` and a 2D or 3D feature array, as well as a target vector with the same number of rows as the array. The vertex names of the graph should match the column names of the array. When in doubt, use `launder` or related functions to prepare the input data.
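As a minimal sketch of the expected input shapes, here is a toy three-gene graph with matching random feature data. All names and dimensions below are illustrative assumptions, not part of the DFNET API:

```r
library(igraph)

# Toy graph over three features; the vertex names must match the
# feature column names below.
graph <- graph_from_literal(geneA - geneB - geneC)

# 2D feature array: 10 samples (rows) x 3 features (columns).
features <- matrix(
    rnorm(30), nrow = 10, ncol = 3,
    dimnames = list(NULL, c("geneA", "geneB", "geneC"))
)

# Binary target with one entry per sample (row).
target <- rbinom(10, 1, 0.5)

# The alignment DFNET relies on.
stopifnot(
    setequal(V(graph)$name, colnames(features)),
    length(target) == nrow(features)
)
```

In the 3D case mentioned above, the additional dimension would hold the different data modalities; the row (sample) and column (feature) conventions stay the same.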
Once you have your graph and features, you can train your forest like so:

```r
forest <- train(,
    graph, features, target,
    ...
)
```
If you have a pre-trained forest, you can use that for training as well:

```r
forest <- train(forest,
    graph, features, target,
    ...
)
```
Since DFNET performs greedy optimization, the last generation of trees is the best according to the provided test metric. DFNET provides overrides for the standard R methods `head` and `tail`, which return generations of trees.
```r
# Get the selected modules
last_gen <- tail(forest, 1)
tree_imp <- attr(last_gen, "last.performance")
```
Note that performance metrics for earlier generations are not kept. Several importance scores can be derived from these metrics.
```r
e_imp <- edge_importance(graph, last_gen$trees, tree_imp)
f_imp <- feature_importance(last_gen, features)
m_imp <- module_importance(
    graph,
    last_gen$modules,
    e_imp,
    tree_imp
)
```
The module importance is particularly useful for feature selection, as it combines the importance of edges within a module with the overall accuracy of the decision tree. You can use it to order decision trees or simply extract the best one.
```r
best <- which.max(as.numeric(m_imp[, "total"]))
best.tree <- last_gen$trees[[best]]

by_importance <- order(m_imp[, "total"], decreasing = TRUE)
last_gen$trees[by_importance]
```
DFNET provides an override for the `predict` method, which functions much like ranger's.
```r
# Predict using the best decision tree
pred_best <- predict(best.tree, test_data)$predictions

# Predict using all detected modules
pred_all <- predict(last_gen, test_data)$predictions
```
You can use ModelMetrics to evaluate accuracy, precision, recall, or other performance metrics.
```r
ModelMetrics::auc(pred_best, test_target)
ModelMetrics::auc(pred_all, test_target)
```
Now, let's check the performance of that module on the independent test data set. We compare the results with the performance of all selected trees.

```r
# Prepare the test data: tag each feature name with its omics type
colnames(mRNA_test) <- paste(colnames(mRNA_test), "$", "mRNA", sep = "")
colnames(Methy_test) <- paste(colnames(Methy_test), "$", "Methy", sep = "")
DATA_test <- as.data.frame(cbind(mRNA_test, Methy_test))
```
```r
# Predict using the best decision tree
pred_best <- predict(best_DT, DATA_test)$predictions

# Predict using all detected modules
pred_all <- predict(last_gen, DATA_test)$predictions

pred_best
pred_all

# Check the performance of the predictions
ModelMetrics::auc(pred_best, target[test_ids])
ModelMetrics::auc(pred_all, target[test_ids])
```
Finally, we provide an extension to compute tree-based SHAP values via treeshap.

```r
library(treeshap)

forest_unified <- dfnet.unify(last_gen$trees, test_data)
forest_shap <- treeshap(forest_unified, test_data)
```
If you use DFNET in your research, please cite:

```bibtex
@article{pfeifer2022multi,
  title={Multi-omics disease module detection with an explainable Greedy Decision Forest},
  author={Pfeifer, Bastian and Baniecki, Hubert and Saranti, Anna and Biecek, Przemyslaw and Holzinger, Andreas},
  journal={Scientific Reports},
  volume={12},
  number={1},
  pages={1--15},
  year={2022},
  publisher={Nature Publishing Group}
}
```