
[WIP] Add the binary fraud example #24

Merged 16 commits into main on Aug 23, 2024
Conversation

@Vincent-Maladiere (Member) commented Jul 5, 2024

Related issues
#25

What does this PR implement?

  • This PR proposes to add an e-commerce fraud binary classification use case as an example. It currently doesn't use mandr.
  • I uploaded the dataset because it is fairly small (25 MB) and under an open-access license.
  • I tried to keep the eda.ipynb notebook reasonably short by putting much of the plotting logic in a dedicated eda_plot.py. You can view the notebook at the following link: https://github.com/probabl-ai/mandr/blob/add_binary_fraud_example/examples/fraud/eda.ipynb. To ease reviews, I also added its nbconvert script version, eda.py.
  • This PR is still WIP – I will add at least one notebook for modeling, but I'd be happy to get feedback on the EDA section!

@adrinjalali (Contributor) left a comment

a few points I noticed:

  • upload dataset to openml and fetch from there
  • add dependencies as "doc" dependencies
  • remove the notebook and only leave the .py file
  • clean up the .py file so it matches the examples in scikit-learn's example galleries
  • plots require a bit more explanation beforehand of what we're looking at
  • the plots need to be importable from mandr.examples.fraud.eda_plots probably?

It seems this is, for now, an exploration of the data rather than modeling. We can then split the example into two files / steps: exploring the data, and modeling the fraud problem.

@Vincent-Maladiere (Member, Author) commented Jul 15, 2024

Hey Adrin! Thank you for this review. It took me some time to properly document the EDA notebook. I addressed most of your suggestions, but a few TODOs remain:

upload dataset to openml and fetch from there

I've tried multiple times on different days but can't upload a dataset to OpenML. I receive a 502 proxy error on every attempt.

(Screenshot, 2024-07-10)

I don't know where to give feedback, because their org repo looks rather inactive. WDYT?

I would be happy to try that again, but I left the CSV in the PR for now.


remove the notebook and only leave the .py file
clean up the .py file so it matches the examples in scikit-learn's example galleries

We don't have a doc-building process on Mandr for now. Do you think we should create one in another PR and then apply the changes you suggest, before merging this current PR?

@tuscland (Member)

It is a good idea to start having a doc CI process, even if incomplete or bad-looking. You are right that we should do this first in order to apply Adrin's advice. Could you please send a PR?

@adrinjalali (Contributor)

I haven't looked at the updates yet, but one high-level comment: this doc is pretty good at giving context, and we would need it for good documentation. It might also be a good idea, though, to focus on the notebook that comes after this one, taking the data and working on the parts that would actually use mandr.

And as for the docs, we should have a Sphinx + sphinx-gallery CI, and we can use Read the Docs to start with for an easy docs deployment.

@tuscland (Member)

Could we please make sure we don't commit the dataset to the git repo?

@adrinjalali (Contributor)

Could we please make sure we don't commit the dataset to the git repo?

Yes, that's being discussed as well.

Regarding the OpenML issue, have you tried asking them on their Slack @Vincent-Maladiere? They're usually responsive.

@Vincent-Maladiere (Member, Author)

OK, Sphinx + Read the Docs sounds great, I'll create a PR for that. I'll contact the OpenML team on Slack too.

You could argue that Mandr could also be useful during EDA to contextualise the dataset and display info next to the modeling part later, but I get your point. The modeling notebook is almost done and just needs a little more documentation.

@Vincent-Maladiere (Member, Author) commented Jul 19, 2024

The modeling notebook is now ready for review! Link for visualization.

TODO:

@rouk1 (Contributor) commented Jul 23, 2024

Just my 2 cents: example-related code should live outside the main mandr folder. Bonus: you may avoid linting issues if it's in another folder :)

Vincent-Maladiere added a commit that referenced this pull request Jul 25, 2024
Reference: #59

Since this repo is private, we don't currently use readthedocs to host
the example gallery. We can build it locally instead.
I added an example from scikit-learn temporarily, which needs to be
removed in a subsequent PR like
#8 or
#24
@Vincent-Maladiere (Member, Author)

Just my 2 cents: example-related code should live outside the main mandr folder. Bonus: you may avoid linting issues if it's in another folder :)

Right, I don't have a strong opinion on this. Let's keep it in the example folder then. WDYT @adrinjalali?

@Vincent-Maladiere (Member, Author) commented Jul 26, 2024

  1. So this PR is nearly done; I'm still waiting for Joaquin on the OpenML Slack to get back to me.
  2. The error in the CI is due to the typo checker running on the dataset CSV, which will be fixed when I upload it to OpenML.
  3. I added a CI doc builder, which uploads an artifact we can download and view locally. That way, we don't need to build the examples locally, which will ease reviews. To download the example gallery, go to the Summary of the GitHub Action "Build documentation", scroll down, and click the download button. Then, unzip the folder and open build/index.html.
  4. One caveat is that my current modeling example takes a while to complete (30 min). We clearly don't want to wait that long for the CI, so we need to either a) simplify the modeling example and/or b) not trigger the doc-building CI on every PR.

WDYT?

(Screenshot, 2024-07-26)

The thumbnails of the two examples: (Screenshot, 2024-07-26)

@Vincent-Maladiere (Member, Author)

So, I've contacted a member of OpenML for the third time, who said fixing the server was on his TODO list. Our first interaction was 3 weeks ago. I would like to move this PR forward and eventually put the datasets on their platform once they fix it.

In the meantime, I'd appreciate a more thorough review of the content of the notebooks. Note that I already introduced them to the skrub team since the modeling could be simplified with further development of skrub.

Ping @glemaitre, @ogrisel @koaning

@koaning (Contributor) commented Aug 8, 2024

I guess my main observation here is that we might not be able to trust the labels. If there is an indication of fraud I will gladly trust it ... but what about the non-fraud labels? Do we really know for sure that these have all been verified by a human? I will gladly hear it if this is out of scope for this challenge, but I felt compelled to at least make this point.

)
target_encoder.set_output(transform="pandas")

row_aggregate = FunctionTransformer(row_aggregate_post_target_encoder)
Contributor:

Shouldn't this be a stateful operation? The aggregations that are learned at train time, shouldn't they be used at eval time? Or am I misinterpreting the code here?

Member Author:

This part is pretty tricky to read, I agree: I use a TargetEncoder first, which is stateful, followed by the row-wise aggregation you mentioned, which is stateless.

# downloads encoders from HuggingFace. Since we don't fine-tune it, this is a stateless
# operation.

from sentence_transformers import SentenceTransformer
Contributor:

Why not embetter ;)?

The benefit is that the sentence transformer will now just behave like a normal sklearn estimator.

Contributor:

Another benefit is that the model is only loaded once. In this implementation it is loaded every time the inference is called, which can be a bunch of overhead.
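One common way to address this is to cache the loader so the expensive construction runs only once per process. A minimal sketch of the pattern, using a stand-in loader instead of an actual SentenceTransformer so the example stays self-contained and offline:

```python
from functools import lru_cache

load_count = []  # track how many times the expensive load actually runs


@lru_cache(maxsize=1)
def load_model():
    # Stand-in for the heavy step, e.g. SentenceTransformer("all-MiniLM-L6-v2").
    load_count.append(1)
    return object()  # placeholder for the loaded encoder


def encode(texts):
    model = load_model()  # cached: the heavy load runs only on the first call
    # placeholder for model.encode(texts)
    return [hash(t) for t in texts]


encode(["order #1"])
encode(["order #2"])
```

With this pattern, repeated inference calls no longer pay the model-loading overhead the reviewer mentions.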

Member Author:

Haha, well I wanted to stick to the strict minimum, but I can definitely use embetter, no problem here.

# **Can we compare where these two models disagree?**


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
Contributor:

This is a nice touch! It can serve as a stepping stone to dive more into the data quality!
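For readers following along, here is a hedged sketch of what such a disagreement helper could look like, assuming `results` maps model names to their test-set predictions (the PR's actual data structures may differ):

```python
import numpy as np
import pandas as pd


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
    """Return the test rows where the two models' predictions differ."""
    pred_1 = np.asarray(results[model_name_1])
    pred_2 = np.asarray(results[model_name_2])
    mask = pred_1 != pred_2
    disagreements = X_test.loc[mask].copy()
    disagreements["y_true"] = np.asarray(y_test)[mask]
    disagreements[model_name_1] = pred_1[mask]
    disagreements[model_name_2] = pred_2[mask]
    return disagreements


# Illustrative data: two hypothetical models, "hgbt" and "logreg".
X_test = pd.DataFrame({"amount": [10.0, 250.0, 42.0]})
y_test = [0, 1, 0]
results = {"hgbt": [0, 1, 1], "logreg": [0, 0, 1]}
rows = get_disagreeing_between("hgbt", "logreg", X_test, y_test, results)
```

Inspecting the disagreeing rows alongside `y_true` is exactly the stepping stone into data quality the reviewer describes: systematic disagreement on certain rows often points at noisy or untrustworthy labels.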

@koaning (Contributor) left a comment

I was only able to give it a glance, but I found a few points that might be worth mentioning.

@Vincent-Maladiere (Member, Author) commented Aug 21, 2024

Hey Vinnie, thank you for the feedback! I'll iterate on that.

Also note that I made a very concise demo which has already been merged. We can improve this, and I'm curious to know what you think: https://github.com/probabl-ai/mandr/blob/main/notebooks/skrub_demo.py

@tuscland (Member)

@Vincent-Maladiere it would be nice if we could close this PR by the end of the month; do you think that is possible?

@Vincent-Maladiere (Member, Author)

Yes, I will give it a last touch today and then it will be ready for merging. The only issue is that we don't have a dedicated place to store / host / render the gallery artifact, so these two notebooks will be tricky to preview. That can be addressed later, though.

@tuscland (Member)

As long as they can be rendered by generating the documentation, it is fine for now, I guess.

@Vincent-Maladiere (Member, Author)

So I gave embetter a spin, and I have to say I struggled a bit. Here is some feedback:

  • The documentation of SentenceEncoder is incomplete; I wanted to know what I can pass to fit as X (e.g. a series works but a dataframe doesn't). I eventually understood I needed to grab the column with ColumnGrabber for dataframes, but that's a little annoying compared to a simple function in a FunctionTransformer.
  • Unrelated, but the docstring example of ContrastiveLearner is the same as SbertLearner's.
  • The get_feature_names_out method is missing from the SentenceEncoder class, therefore I can't use set_output(transform="pandas").

So I will stick with my simpler solution for now.
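For reference, the "simple function in a FunctionTransformer" alternative described above can be sketched like this (the column name and data are illustrative, not the PR's actual code):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def grab_text_column(df):
    # Stateless column selection on a dataframe, playing the role of
    # embetter's ColumnGrabber before a downstream text encoder.
    return df["text"]


grab = FunctionTransformer(grab_text_column)

df = pd.DataFrame({
    "text": ["wire transfer to new payee", "regular grocery order"],
    "amount": [980.0, 23.5],
})
texts = grab.fit_transform(df)  # a Series of raw texts, ready for an encoder
```

Placed at the head of a pipeline, this step hands the encoder a plain series of strings, which sidesteps the dataframe-input issue mentioned in the first bullet.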

@Vincent-Maladiere (Member, Author)

I replaced the CSVs with a parquet file, reducing the size 10x (from 25 MB to 2.5 MB).

@Vincent-Maladiere (Member, Author)

I guess my main observation here is that we might not be able to trust the labels. If there is an indication of fraud I will gladly trust it ... but what about the non-fraud labels? Do we really know for sure that these have all been verified by a human? I will gladly hear it if this is out of scope for this challenge, but I felt compelled to at least make this point.

@koaning I forgot to react to that, but yeah, I 100% agree. How trustworthy the labels are is a critical question, which we unfortunately can't answer here.

@Vincent-Maladiere (Member, Author)

@tuscland Let's wait until the CI is green (hopefully), and then this PR will be ready to merge.

@tuscland tuscland merged commit 2b38b71 into main Aug 23, 2024
2 checks passed
@tuscland tuscland deleted the add_binary_fraud_example branch August 23, 2024 07:28
thomass-dev pushed a commit that referenced this pull request Dec 2, 2024
thomass-dev pushed a commit that referenced this pull request Dec 2, 2024
5 participants