Add an example with binary classification #1
Conversation
'logisticregression__C': np.logspace(0.0001, 1, 10),
}

with warnings.catch_warnings(action="ignore"):
If this were a production setting, why would you ignore the warnings? Wouldn't we want these to appear in our logs? These warnings may give us a hint when something breaks down completely.
Yeah, I should have mentioned that I was worried about the readability of the notebook, as the same (low-priority) warning was printed repeatedly across the CV folds.
So this is not a production setting; this is an exploratory notebook setting :)
But I can try to fix the root cause of this warning.
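A middle ground between silencing everything and letting the notebook drown in repeats is to filter only the one warning category that was inspected. A minimal sketch, assuming the noisy warning is something like a `UserWarning` (the actual category raised by scikit-learn here is an assumption; `noisy_fit` is a made-up stand-in for the CV loop):

```python
import warnings


def noisy_fit():
    # Stand-in for a model fit that emits the same low-priority
    # warning on every CV split.
    warnings.warn("low-priority convergence hint", UserWarning)
    return "fitted"


# Silence only the category we have looked at; any other warning
# (e.g. a new one hinting that something broke) still surfaces.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_fit()

print(result)
```

This keeps the notebook readable while staying closer to the "don't hide signals" concern raised above.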
y_proba = named_results["y_proba"][:, 0]

CalibrationDisplay.from_predictions(
I have not implemented it yet, but when I looked at this guide I got the impression that we might also be able to do this:
disp = CalibrationDisplay(...)
Then we could store the disp onto the mander, and from a template we might do something like:
<sklearn-display disp='@mander.disp'/>
How would that feel for you?
Sure, that's an interesting option! To play the devil's advocate here, should we generalize this and try using e.g. matplotlib figures instead?
<matplotlib fig='@mander.my_calibration_fig'>
Ideally, I'd like frequently used plots to be built into Mandr, so that I don't need to write CalibrationDisplay(...) myself. How do you think we could achieve that with Mandr?
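A toy sketch of the flow discussed above, with a plain dict standing in for the mander and naive string substitution standing in for the template engine. Every name here is hypothetical (`mander`, the `@mander.*` reference syntax, the `<matplotlib fig=...>` tag); this only illustrates the store-then-reference idea, not any real Mandr API:

```python
# Hypothetical: the mander as a plain store of named artifacts.
# In practice this would hold a real matplotlib Figure or sklearn display.
mander = {"my_calibration_fig": "<Figure: calibration curve>"}

template = "<matplotlib fig='@mander.my_calibration_fig'/>"


def render(template: str, store: dict) -> str:
    # Naive substitution: replace each '@mander.<key>' reference
    # with the stored artifact's representation.
    out = template
    for key, value in store.items():
        out = out.replace(f"@mander.{key}", str(value))
    return out


print(render(template, mander))
```

The design question then becomes whether the template layer resolves sklearn displays specially (`<sklearn-display>`) or only generic matplotlib figures, as debated above.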
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_predict
brrrrr :)
I see that you don't use it, so this is fine :)
Haha, good catch, I need to remove it.
A good follow-up on this example would be to add confidence intervals on the ROC curve, and error bars on the metrics. What would be the best way to achieve uncertainty quantification here?
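One common answer to the uncertainty question above is a bootstrap over the cross-validated predictions: resample (label, prediction) pairs with replacement, recompute the metric each time, and take percentiles. A minimal stdlib-only sketch with a toy metric (accuracy stands in for AUC, and the data is made up):

```python
import random

random.seed(0)

# Toy cross-validated labels and predictions.
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0] * 10
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0] * 10


def accuracy(yt, yp):
    return sum(t == p for t, p in zip(yt, yp)) / len(yt)


n = len(y_true)
scores = []
for _ in range(1000):
    # Resample pairs with replacement and score the resample.
    idx = [random.randrange(n) for _ in range(n)]
    scores.append(accuracy([y_true[i] for i in idx],
                           [y_pred[i] for i in idx]))

scores.sort()
# Rough 95% interval from the 2.5th and 97.5th percentiles.
lo, hi = scores[int(0.025 * len(scores))], scores[int(0.975 * len(scores))]
print(f"accuracy = {accuracy(y_true, y_pred):.2f}, 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

The same loop works for ROC AUC or the cost gain; for per-threshold bands on the ROC curve itself, one would interpolate each bootstrapped curve onto a common FPR grid before taking percentiles.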
quick pass, I'm still reading :)
tnr = tn / (tn + fp)
return 1 - tnr
Same, but more straightforward:
return fp / (fp + tn)
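A quick sanity check that the suggested one-liner matches the original two-step version, on a few arbitrary confusion-matrix counts:

```python
def fpr_via_tnr(tn, fp):
    # Original version: compute the true negative rate, then complement it.
    tnr = tn / (tn + fp)
    return 1 - tnr


def fpr_direct(tn, fp):
    # Suggested version: the false positive rate directly.
    return fp / (fp + tn)


for tn, fp in [(90, 10), (1, 99), (50, 50)]:
    assert abs(fpr_via_tnr(tn, fp) - fpr_direct(tn, fp)) < 1e-12
print("both formulations agree")
```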
#
# The goal of this example, based on [an existing scikit-learn example](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cost_sensitive_learning.html), is to showcase various baseline models and evaluation metrics for binary classification. The use case we are dealing with is predicting the acceptance of candidates applying for a loan. The application can either be marked as good or bad, hence the binary nature of the task.
#
# We start with loading the dataset from openML. Note that in a real-life setting, we would probably have to join and aggregate multiple tables from multiple sources before landing a neat dataframe ready to use for ML. We would also need to worry about data quality, feature selection, missing values, etc. If we wanted to deploy this model into some production system, the availability of our features at prediction time would also have to be checked.
Suggested change:
# We start with loading the dataset from OpenML. Note that in a real-life setting, we would probably have to join and aggregate multiple tables from multiple sources before landing a neat dataframe ready to use for ML. We would also need to worry about data quality, feature selection, missing values, etc. If we wanted to deploy this model into some production system, the availability of our features at prediction time would also have to be checked.
# We then add our utility function "cost gain". Unlike the previous metrics, which are generic to the binary classification setting, this utility function is specific to the problem at hand and must be carefully considered.
#
# Here, we set this utility function using a cost matrix, which indicates how much error costs, and how much correct predictions yield. This is where a strong intuition of the use-case is needed. We considered the coefficients to be fixed, but note that we could also pass variables from our dataframe (e.g. a false positive could be proportional to the amount of the loan).
Suggested change:
# Here, we set this utility function using a cost matrix, which indicates how much errors cost, and how much correct predictions yield. This is where a strong intuition of the use-case is needed. We considered the coefficients to be fixed, but note that we could also pass variables from our dataframe (e.g. a false positive could be proportional to the amount of the loan).
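The cost-matrix idea can be made concrete with a small sketch, using the coefficients the example states later (correct predictions yield 0, false negatives -5, false positives -1; the function name and toy data are made up for illustration):

```python
def cost_gain(y_true, y_pred, gain_tp=0, gain_tn=0, gain_fn=-5, gain_fp=-1):
    # Sum the gain of each prediction according to the cost matrix.
    total = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            total += gain_tp
        elif t == 0 and p == 0:
            total += gain_tn
        elif t == 1 and p == 0:
            total += gain_fn  # missed positive: the costly error
        else:
            total += gain_fp  # false alarm: the cheap error
    return total


y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(cost_gain(y_true, y_pred))  # one FN (-5) and one FP (-1): -6
```

One way to see the threshold intuition mentioned in the example: with these coefficients, predicting negative has expected cost 5p while predicting positive has expected cost 1-p, so the break-even probability is p = 1/6, roughly 0.17, indeed closer to 0 than to 0.5.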
"is where a strong intuition of the use-case is needed"
It's more than intuition. It's knowledge of the domain, of the application, of business constraints, ...
#
# In this example, correct classifications yield 0, false negatives yield -5 and false positives yield -1. We can already understand that penalizing false negatives 5x more than false positives will put the classification threshold closer to 0 than to 0.5.
What's the unit? I think these numbers should be motivated with concrete examples.
Thanks for your comments, I'm closing this PR in favor of #24, because it deals with a more concrete use case.
Here is a first quick example that doesn't use Mandr for now. It extends one of the scikit-learn examples introducing the TunedThresholdClassifierCV.
It displays some plots a user might want to see for this task, although many are missing (especially considering the non-existent EDA part). It notably features plots with multiple models, which could be a good candidate for the aggregation feature of Mandr (something MLflow doesn't have).
Feedback on the methodology (or anything else really) is more than welcome.