[MRG] Implement predict_proba() for OutputCodeClassifier #25148
base: main
Conversation
We will also need an entry in the changelog for the latest failure to go away. We can start by putting this entry in the doc/whats_new/v1.3.rst file.
sklearn/multiclass.py (outdated diff)
# Inverse distance weighting:
eps = 1e-16
proba = (1.0 / (dist + eps)) / np.sum(1.0 / (dist + eps), axis=1)[:, np.newaxis]
return proba
Do you have any scientific literature linked to this approach to expressing a probability? I would have naively thought to use a softmax instead of a linear mapping.
I assume that we could invert the distance because we should have an upper bound on the distance, since we deal with binary codes (I did not look at the algorithm yet).
Turning a distance into a probability-like value is basically spatial interpolation. IDW was described, among other methods, in the classical 'A two-dimensional interpolation function for irregularly-spaced data'.
Softmax is an option, of course. I considered a selectable method, but an extra predict_proba() argument looked like a bad idea to me.
Practically, softmax decays somewhat faster, but that's about it. Both ways yield non-negative scores which sum to 1. Either method can come slightly ahead of the other depending on the data in terms of ROC AUC, AP and the calibration curve (the average difference in metrics is within 0.01).
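To make the comparison concrete, here is a minimal sketch of the two mappings on made-up distances (the dist array is arbitrary and only stands in for the per-class Euclidean distances that predict_proba() would compute):

import numpy as np

# Hypothetical distances of two samples to three class code words.
dist = np.array([[0.4, 1.2, 2.0],
                 [1.0, 1.0, 3.0]])

# Inverse distance weighting (the mapping used in this PR).
eps = 1e-16
inv = 1.0 / (dist + eps)
proba_idw = inv / inv.sum(axis=1, keepdims=True)

# Softmax over negated distances (the alternative discussed above).
shifted = -dist - (-dist).max(axis=1, keepdims=True)
proba_softmax = np.exp(shifted)
proba_softmax /= proba_softmax.sum(axis=1, keepdims=True)

# Both give non-negative rows summing to 1, and both rank classes by distance.
assert np.allclose(proba_idw.sum(axis=1), 1.0)
assert np.allclose(proba_softmax.sum(axis=1), 1.0)
assert (proba_idw.argmax(axis=1) == dist.argmin(axis=1)).all()
assert (proba_softmax.argmax(axis=1) == dist.argmin(axis=1)).all()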
@@ -704,6 +706,19 @@ def test_ecoc_fit_predict():
    assert len(ecoc.estimators_) == n_classes * 2


def test_ecoc_predict_proba():
We should try binary and multiclass here.
sklearn/tests/test_multiclass.py (outdated diff)
# Test that the scores sum to 1
assert_almost_equal(np.sum(proba, axis=1), np.ones(proba.shape[0]))

# Regression test for the new proba-based predict against the old one
I don't think we need this. Basically, this non-regression test only checks a "general" case; if something goes wrong, it would be on a corner case.
At least we now have the case where y_prob.argmax(axis=1) == y_pred, which is something that we intend to achieve in the common tests.
Looking at OneVsRestClassifier(), it only tests the output shape (which is reasonable, I guess) and the said y_prob.argmax(axis=1) == y_pred equality. The latter seems redundant, since predict() now basically does nothing but argmax(axis=1).
As an alternative, ensuring the shortest distance yields the highest probability for a few cases comes to mind:
- code size much smaller than the number of classes;
- code size much bigger than the number of classes;
- all equal distances.
Should we include those (or, perhaps, other possible cases)? A rough sketch of such a test follows below.
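For reference, here is a rough sketch of what such a test could look like (names, parameters and datasets are illustrative only; predict_proba() is the method proposed in this PR):

import numpy as np
import pytest

from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier


# code sizes chosen so the code book is much shorter / much longer than n_classes
@pytest.mark.parametrize("code_size", [0.5, 4.0])
@pytest.mark.parametrize("n_classes", [2, 6])
def test_ecoc_predict_proba_sketch(code_size, n_classes):
    X, y = make_classification(
        n_samples=200, n_informative=6, n_classes=n_classes, random_state=0
    )
    ecoc = OutputCodeClassifier(
        estimator=DecisionTreeClassifier(max_depth=3, random_state=0),
        code_size=code_size,
        random_state=0,
    ).fit(X, y)

    proba = ecoc.predict_proba(X)
    assert proba.shape == (X.shape[0], n_classes)
    assert np.allclose(proba.sum(axis=1), 1.0)
    # the class with the shortest distance must get the highest probability,
    # i.e. the argmax must agree with predict()
    assert np.array_equal(ecoc.classes_[proba.argmax(axis=1)], ecoc.predict(X))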
Adding the “needs triage” label to discuss the inclusion during the triage meeting.
We discussed the implementation during the triage meeting. Basically, we are not against adding predict_proba(). In this regard, I think that there are several bugs to correct before going further here.
The original paper [1] uses an L1 distance (city-block). I have to check [2] to be sure that there is no extension or theory behind using the L2 distance here. At least, L1 is more intuitive to me than L2 when computing distances between binary codes.
A second potential bug is linked to the binary classifier predictions. The original paper states that they concatenate the probabilities of each binary classifier, while in our code we first try to concatenate the predictions that we get through decision_function.
[1] Dietterich, T. G., & Bakiri, G. (1994). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.
[2] James, G., & Hastie, T. (1998). The error coding method and PICTs. Journal of Computational and Graphical Statistics, 7(3), 377-387.
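To illustrate the distance point, a minimal sketch of swapping the metric on the existing code path (the arrays are made up; euclidean_distances is what the current implementation uses, while "manhattan" would correspond to the city-block distance of [1]):

import numpy as np
from sklearn.metrics import pairwise_distances

# Made-up concatenated binary predictions for two samples and a
# 3-class code book with four code bits.
Y = np.array([[0.9, 0.2, 0.7, 0.1],
              [0.1, 0.8, 0.4, 0.6]])
code_book = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [1, 1, 0, 0]])

dist_l2 = pairwise_distances(Y, code_book, metric="euclidean")   # current behaviour
dist_l1 = pairwise_distances(Y, code_book, metric="manhattan")   # city-block, as in [1]

# Class assignment under each metric.
print(dist_l2.argmin(axis=1), dist_l1.argmin(axis=1))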
I did not finish yet, but for the probability/confidence score I have the following feeling.
I wanted to understand exactly what was in the Zadrozny et al. paper, and I made a barely tested implementation. I am putting up the code that I could come up with. The naming is terrible, but it is related to the formula in the paper. I saw that we mention the paper in the calibration section. I assume that we used the non-iterative formulation proposed by Hastie et al., which holds only in the one-vs-rest case.

# %%
from sklearn.datasets import make_classification

n_classes = 3
X, y = make_classification(
    n_samples=1_000,
    n_features=4,
    n_clusters_per_class=1,
    n_classes=n_classes,
    random_state=0,
)
# %%
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

model = OutputCodeClassifier(
    estimator=DecisionTreeClassifier(max_depth=2, random_state=0),
    code_size=2,
    random_state=0,
).fit(X, y)
# %%
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning

# n_samples should be computed during training and stored
n_samples, n_estimators = X.shape[0], len(model.estimators_)
rng = np.random.default_rng(0)
# row k of code_book_ is the binary code word of class k;
# column j tells which classes are on the "positive" side of binary learner j
indicator = model.code_book_.astype(bool)
# r[i, j]: probability from binary learner j that sample i is on its positive side
r = np.transpose([
    est.predict_proba(X[:n_samples])[:, 1]
    for est in model.estimators_
])
not_r = 1 - r
# initialization: random class probabilities, normalized to sum to 1 per sample
p_hat = rng.uniform(size=(n_samples, n_classes))
p_hat /= p_hat.sum(axis=1)[:, np.newaxis]
r_hat = np.zeros_like(r)
max_iter, n_iter, tol, err = 100, 0, 1e-3, np.inf
for _ in range(max_iter):
    # compute r_hat: probability mass currently assigned to the positive side
    # of each binary problem
    for learner_idx in range(n_estimators):
        r_hat[:, learner_idx] = p_hat[:, indicator[:, learner_idx]].sum(axis=1)
    not_r_hat = 1 - r_hat
    # estimate p_hat
    p_hat_previous = p_hat.copy()
    for klass in range(n_classes):
        klass_indicator = indicator[klass]
        numerator = n_samples * (
            r[:, klass_indicator].sum(axis=1) +
            not_r[:, ~klass_indicator].sum(axis=1)
        )
        denominator = n_samples * (
            r_hat[:, klass_indicator].sum(axis=1) +
            not_r_hat[:, ~klass_indicator].sum(axis=1)
        )
        p_hat[:, klass] *= numerator / denominator
    p_hat /= p_hat.sum(axis=1)[:, np.newaxis]
    n_iter += 1
    err = np.linalg.norm(p_hat_previous - p_hat, np.inf)
    if err < tol:
        break
else:
    warnings.warn(
        "Did not converge, decrease tol or increase max_iter.",
        ConvergenceWarning,
    )
# %%
np.mean(p_hat.argmax(axis=1) == model.predict(X))
As far as I can see, it tries to derive the model scores using the individual estimators' scores only. However, changing the distance calculation method can change the predictions (hence your proposal to switch from Euclidean to Manhattan), while the individual scores remain the same. So, logically, there is no way to ensure that the argmax of the proba equals the argmin of the distance unless the distance calculation is included in the scoring. Am I missing something here?
I would look at the theory in the paper cited above, in section 4. I have not yet had time to investigate more closely.
So I intend to solve the couple of bugs that we had: #25217. From the article above, we can ensure the equivalence by setting [...].
I've run some tests against #25217: the non-iterative estimator suggested in the same chapter (i.e., basically [...]).
Reference Issues/PRs
None found, although it was briefly mentioned in #389.
What does this implement/fix? Explain your changes.
In its present state, sklearn.multiclass.OutputCodeClassifier() is only able to predict(). Having access to the underlying class scores/probabilities is required for a good part of the validation metrics. Those can also be used for model calibration and improved decision making (though that might be considered redundant for the ECOC classifier use cases).
What does it do / how does it work?
This PR implements (faux) probability estimates from the same Euclidean distances between the base estimators' outputs and the code book that predict() currently relies on, converting them with inverse distance weighting. The distance calculation is moved to the new predict_proba() method, and predict() is adjusted accordingly.
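For illustration, a minimal usage sketch under this PR (the dataset and base estimator are arbitrary; predict_proba() is the method added here):

from sklearn.datasets import make_classification
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5, random_state=0)
ecoc = OutputCodeClassifier(
    estimator=DecisionTreeClassifier(max_depth=3, random_state=0),
    code_size=2,
    random_state=0,
).fit(X, y)

proba = ecoc.predict_proba(X)  # shape (n_samples, n_classes), rows sum to 1
y_pred = ecoc.predict(X)       # now simply the argmax over those rows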
Any other comments?
Possible cons: the probabilistic interpretation of said distances might be questionable. However, lots of other classification algorithms that do implement predict_proba() aren't guaranteed to have well-calibrated output either.
QA: a non-regression test ensuring the predict() results are the same as before is provided, as well as a sanity test that the class scores sum to 1. predict_proba() appears to yield sane OvR ROC and PR curves, as well as calibration curves of an acceptable shape.
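As an example of the kind of calibration check mentioned above, continuing the usage snippet earlier in this description (CalibrationDisplay is a standard scikit-learn utility; the class index 0 is arbitrary):

import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay

# One-vs-rest calibration curve for a single class of the ECOC model above.
klass = ecoc.classes_[0]
CalibrationDisplay.from_predictions(
    (y == klass).astype(int),  # binarized ground truth for this class
    proba[:, 0],               # predicted probability for this class
    n_bins=10,
)
plt.show()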