This repository presents a benchmark of counterfactual generation algorithms in terms of the following metrics (a sketch of how the distance-based metrics can be computed follows the list):

- **Coverage (validity)**: how many factual instances are converted into valid counterfactuals?
- **Sparsity**: how many features are left unchanged?
- **L2 distance**: how far are the counterfactuals from the factual data?
- **Mean Absolute Deviation (MAD)**: how different are the counterfactuals from the factual data, accounting for each feature's variation?
- **Mahalanobis distance (MD)**: how different are the counterfactuals from the factual data, accounting for the data distribution?
- **Time**: how long does it take to generate a counterfactual?
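For concreteness, the sketch below shows one plausible way to compute the distance-based metrics for a single factual/counterfactual pair. This is a minimal illustration under our own assumptions (NumPy arrays, MAD estimated from the training data), not the benchmark's exact implementation:

```python
import numpy as np

def distance_metrics(factual, counterfactual, X_train):
    """Sketch of the distance-based metrics for one factual/counterfactual pair.

    factual, counterfactual: 1-D arrays with the same feature order.
    X_train: 2-D array of factual data, used to estimate per-feature MAD
    and the covariance matrix for the Mahalanobis distance.
    """
    diff = counterfactual - factual

    # Sparsity: fraction of features left unchanged.
    sparsity = np.mean(diff == 0)

    # L2 distance between factual and counterfactual.
    l2 = np.linalg.norm(diff)

    # Mean Absolute Deviation: absolute changes scaled by each feature's
    # median absolute deviation in the factual data.
    mad = np.median(np.abs(X_train - np.median(X_train, axis=0)), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # avoid division by zero
    mad_distance = np.mean(np.abs(diff) / mad)

    # Mahalanobis distance: accounts for feature scales and correlations.
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))

    return {"sparsity": sparsity, "L2": l2, "MAD": mad_distance, "MD": mahalanobis}
```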
To reproduce the experiments, follow the instructions in the CounterfactualBenchmark repository.
All experiments consider a confidence level of 95%.
Why we use rankings instead of the raw metric values:
Most metrics cannot be compared directly because each algorithm has a different coverage. For example, if one algorithm creates only a single counterfactual with a sparsity of 90%, we cannot say it is better than another algorithm that creates 1,000 counterfactuals with a sparsity of 88%. Ranking accounts for such cases and therefore gives a better picture of the algorithms' performance.
The rankings below were created with Friedman's test, which evaluates the null hypothesis that all algorithms perform equally, followed by Nemenyi's post-hoc test, which evaluates the significance of the pairwise differences between algorithms. The highlighted results are statistically significant.
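A minimal sketch of this ranking procedure, using SciPy's Friedman test and the Nemenyi post-hoc test from the scikit-posthocs package (the scores and column names here are illustrative, not the benchmark's actual data):

```python
import numpy as np
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Illustrative scores: one row per evaluated instance, one column per algorithm.
rng = np.random.default_rng(0)
scores = pd.DataFrame(
    rng.random((30, 3)), columns=["cfnow_greedy", "dice", "growingspheres"]
)

# Friedman's test: null hypothesis that all algorithms perform equally.
stat, p_value = friedmanchisquare(*[scores[c] for c in scores.columns])
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")

# Nemenyi post-hoc test: pairwise significance between algorithms,
# evaluated at a 95% confidence level (alpha = 0.05).
pairwise_p = sp.posthoc_nemenyi_friedman(scores)
print(pairwise_p)

# Average rank per algorithm (lower is better), as reported in the tables.
print(scores.rank(axis=1).mean())
```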
| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 7.55 | 7.56 | 6.10 | 🥇4.45 | 🥇4.45 | 6.24 | 8.42 | 7.43 | 8.39 | 8.54 | 🥇4.45 | 🥇4.45 | 3925 |
| sparsity | 7.65 | 7.82 | 8.55 | 4.20 | 🥇3.78 | 5.90 | 9.17 | 6.10 | 7.99 | 7.76 | 5.51 | 🥇3.58 | 3925 |
| L2 | 6.68 | 6.89 | 6.81 | 🥇3.34 | 3.81 | 8.22 | 7.07 | 7.75 | 8.70 | 9.24 | 4.92 | 4.56 | 3925 |
| MAD | 7.39 | 7.63 | 7.63 | 3.42 | 🥇3.05 | 7.52 | 8.58 | 7.86 | 8.65 | 8.49 | 4.30 | 3.47 | 3925 |
| MD | 6.91 | 7.00 | 6.79 | 🥇3.56 | 🥇3.54 | 8.16 | 7.37 | 7.76 | 8.70 | 9.40 | 4.62 | 4.18 | 3925 |
| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 8.39 | 8.79 | 5.93 | 🥇4.19 | 🥇4.19 | 🥇4.19 | 10.19 | 7.26 | 6.27 | 10.19 | 🥇4.19 | 🥇4.19 | 1327 |
| sparsity | 8.20 | 8.74 | 8.19 | 🥇3.31 | 🥇3.37 | 5.97 | 10.19 | 6.84 | 5.91 | 10.19 | 🥇3.69 | 🥇3.38 | 1327 |
| L2 | 8.20 | 8.74 | 8.19 | 🥇3.31 | 🥇3.37 | 5.97 | 10.19 | 6.84 | 5.91 | 10.19 | 🥇3.69 | 🥇3.38 | 1327 |
| MAD | 8.49 | 9.29 | 7.85 | 🥇3.10 | 🥇3.16 | 5.56 | 9.96 | 7.81 | 6.18 | 9.96 | 🥇3.46 | 🥇3.16 | 1327 |
| MD | 8.10 | 8.71 | 8.20 | 🥇3.43 | 🥇3.34 | 5.88 | 10.19 | 6.92 | 5.93 | 10.19 | 🥇3.77 | 🥇3.34 | 1327 |
| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 6.64 | 6.64 | 6.93 | 🥇5.09 | 🥇5.09 | 6.78 | 6.12 | 8.14 | 9.42 | 6.97 | 🥇5.09 | 🥇5.09 | 1598 |
| sparsity | 7.10 | 7.14 | 9.61 | 5.20 | 4.49 | 4.57 | 7.96 | 6.07 | 8.77 | 5.25 | 7.87 | 🥇3.97 | 1598 |
| L2 | 4.83 | 4.84 | 5.54 | 4.28 | 4.72 | 9.75 | 🥇2.80 | 9.19 | 10.49 | 8.64 | 6.59 | 6.34 | 1598 |
| MAD | 6.02 | 6.06 | 8.41 | 🥇3.38 | 🥇3.44 | 8.17 | 6.83 | 8.02 | 10.15 | 7.03 | 6.02 | 4.47 | 1598 |
| MD | 5.32 | 5.34 | 5.70 | 4.02 | 4.21 | 9.68 | 🥇3.54 | 8.89 | 10.45 | 8.86 | 6.33 | 5.66 | 1598 |
| metric | alibi_nograd | alibi | cadex | cfnow_random | cfnow_greedy | dice | growingspheres | synas | lore | sedc | cfnow_random_simple | cfnow_greedy_simple | N |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| validity | 7.89 | 7.38 | 5.00 | 🥇3.75 | 🥇3.75 | 8.07 | 9.75 | 6.51 | 9.56 | 8.84 | 🥇3.75 | 🥇3.75 | 1000 |
| sparsity | 7.81 | 7.67 | 7.31 | 3.76 | 🥇3.19 | 7.93 | 9.75 | 5.16 | 9.50 | 8.55 | 4.15 | 🥇3.22 | 1000 |
| L2 | 7.64 | 7.70 | 7.01 | 🥇1.89 | 2.95 | 8.75 | 9.75 | 6.67 | 9.56 | 8.95 | 3.88 | 3.26 | 1000 |
| MAD | 8.11 | 7.96 | 6.11 | 3.92 | 🥇2.27 | 9.06 | 9.54 | 7.67 | 9.53 | 8.88 | 🥇2.67 | 🥇2.27 | 1000 |
| MD | 7.86 | 7.39 | 6.68 | 🥇3.01 | 🥇2.75 | 8.74 | 9.75 | 7.07 | 9.58 | 9.23 | 🥇3.02 | 🥇2.93 | 1000 |
The results below consider only valid counterfactuals, i.e., counterfactuals that (1) have a different predicted class from the factual instance and (2) respect binary and one-hot encoding rules.
Time spent (in seconds) to generate a counterfactual explanation
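As a reference, here is a minimal sketch of the validity check described above. The model object, feature layout, and one-hot group list are illustrative assumptions, not part of the benchmark's API:

```python
import numpy as np

def is_valid_counterfactual(model, factual, counterfactual, binary_idx, onehot_groups):
    """Illustrative validity check (not the benchmark's actual implementation).

    binary_idx: indices of binary features.
    onehot_groups: list of index lists, one per one-hot encoded feature.
    """
    # (1) The counterfactual must flip the predicted class.
    if model.predict([factual])[0] == model.predict([counterfactual])[0]:
        return False

    cf = np.asarray(counterfactual)

    # (2a) Binary features must stay in {0, 1}.
    if not np.isin(cf[binary_idx], [0, 1]).all():
        return False

    # (2b) Each one-hot group must have exactly one active category.
    for group in onehot_groups:
        if np.sum(cf[group]) != 1:
            return False

    return True
```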
If you used this package in your experiments, here is the reference paper to cite:
```bibtex
@Article{app11167274,
  AUTHOR = {de Oliveira, Raphael Mazzine Barbosa and Martens, David},
  TITLE = {A Framework and Benchmarking Study for Counterfactual Generating Methods on Tabular Data},
  JOURNAL = {Applied Sciences},
  VOLUME = {11},
  YEAR = {2021},
  NUMBER = {16},
  ARTICLE-NUMBER = {7274},
  URL = {https://www.mdpi.com/2076-3417/11/16/7274},
  ISSN = {2076-3417},
  DOI = {10.3390/app11167274}
}
```