
[WIP] Add the binary fraud example #24

Merged 16 commits into main on Aug 23, 2024
Conversation

@Vincent-Maladiere (Member) commented Jul 5, 2024

Related issues
#25

What does this PR implement?

  • This PR proposes to add an e-commerce fraud binary classification use case as an example. It currently doesn't use mandr.
  • I uploaded the dataset because it is fairly small (25 MB) and under an open-access license.
  • I tried to keep the eda.ipynb notebook reasonably short by putting much of the plotting logic in a dedicated eda_plot.py. You can view the notebook at the following link: https://github.com/probabl-ai/mandr/blob/add_binary_fraud_example/examples/fraud/eda.ipynb. To ease reviews, I also added its nbconvert script version, eda.py.
  • This PR is still WIP – I will add at least one notebook for modeling, but I'd be happy to get feedback on the EDA section!

@adrinjalali (Contributor) left a comment

a few points I noticed:

  • upload dataset to openml and fetch from there
  • add dependencies as "doc" dependencies
  • remove the notebook and only leave the .py file
  • clean up the .py file so it matches the examples in scikit-learn's example galleries
  • plots require a bit more explanation beforehand of what we're looking at
  • the plots need to be importable from mandr.examples.fraud.eda_plots probably?

It seems this is, for now, an exploration of the data rather than modeling. We can then split the example into two files / steps: exploring the data, and modeling the fraud problem.

@Vincent-Maladiere (Member, Author) commented Jul 15, 2024

Hey Adrin! Thank you for this review. It took me some time to properly document the EDA notebook. I addressed most of your suggestions, but a few TODOs remain:

upload dataset to openml and fetch from there

I've tried multiple times on different days but can't upload a dataset to OpenML. I receive a 502 proxy error on every attempt.

(Screenshot, 2024-07-10)

I don't know where to give feedback, because their org repo looks rather inactive. WDYT?

I would be happy to try that again, but I left the CSV in the PR for now.


remove the notebook and only leave the .py file
clean up the .py file so it matches the examples in scikit-learn's example galleries

We don't have a doc-building process on Mandr for now. Do you think we should create one in another PR and then apply the changes you suggest, before merging this current PR?

@tuscland (Member)

It is a good idea to start having a doc CI process, even if incomplete or bad-looking. You are right that we should do this first in order to apply Adrin's advice. Could you please send a PR?

@adrinjalali (Contributor)

I haven't looked at the updates yet, but one high-level comment: this doc is pretty good at giving context, and we would need it for good documentation. It might also be a good idea, though, to focus on the notebook that comes after this one, taking the data and working on the parts that would actually use mandr.

And as for the docs, we should have a Sphinx + sphinx-gallery CI, and we can use Read the Docs to start with for an easy docs deployment.

@tuscland (Member)

Could we please make sure we don't commit the dataset to the git repo?

@adrinjalali (Contributor)

Could we please make sure we don't commit the dataset to the git repo?

Yes, that's being discussed as well.

Regarding the OpenML issue, have you tried asking them on their Slack @Vincent-Maladiere? They're usually responsive.

@Vincent-Maladiere (Member, Author)

OK, Sphinx + Read the Docs sounds great, I'll create a PR for that. I'll contact the OpenML team on Slack too.

You could argue that Mandr could also be useful during EDA to contextualise the dataset and display info next to the modeling part later, but I get your point. The modeling notebook is almost done and just needs a little more documentation.

@Vincent-Maladiere (Member, Author) commented Jul 19, 2024

The modeling notebook is now ready for review! Link for visualization.

TODO:

@rouk1 (Contributor) commented Jul 23, 2024

Just my 2 cents: example-related code should live outside the main mandr folder. Bonus: you may avoid linting issues if it's in another folder :)

Vincent-Maladiere added a commit that referenced this pull request Jul 25, 2024
Reference: #59

Since this repo is private, we don't currently use readthedocs to host
the example gallery. We can build it locally instead.
I added an example from scikit-learn temporarily, which needs to be
removed in a subsequent PR like
#8 or
#24
@Vincent-Maladiere (Member, Author)

Just my 2 cents: example-related code should live outside the main mandr folder. Bonus: you may avoid linting issues if it's in another folder :)

Right, I don't have a strong opinion on this. Let's keep it in the example folder then. WDYT @adrinjalali?

@Vincent-Maladiere (Member, Author) commented Jul 26, 2024

  1. So this PR is nearly done; I'm still waiting for Joaquin on the OpenML Slack to get back to me.
  2. The error in the CI is due to the typo checker running on the dataset CSV, which will be fixed when I upload it to OpenML.
  3. I added a CI doc builder, which uploads an artifact we can download and view locally. That way, we don't need to build the examples locally, which will ease reviews. To download the example gallery, go to the Summary of the GitHub Action "Build documentation", scroll down, and click the download button. Then, unzip the folder and open build/index.html.
  4. One caveat is that my current modeling example takes a while to complete (30 min). We clearly don't want to wait that long for the CI, so we need to either a) simplify the modeling example and/or b) not trigger the doc-building CI on every PR.

WDYT?

(Screenshot, 2024-07-26)

The thumbnails of the two examples: (Screenshot, 2024-07-26)

@Vincent-Maladiere (Member, Author)

So, I've contacted a member of OpenML for the third time, who said fixing the server was on his TODO list. Our first interaction was 3 weeks ago. I would like to move this PR forward and eventually put the datasets on their platform once they fix it.

In the meantime, I'd appreciate a more thorough review of the content of the notebooks. Note that I already introduced them to the skrub team since the modeling could be simplified with further development of skrub.

Ping @glemaitre, @ogrisel @koaning

@koaning (Contributor) commented Aug 8, 2024

I guess my main observation here is that we might not be able to trust the labels. If there is an indication of fraud I will gladly trust it ... but what about the non-fraud labels? Do we really know for sure that these have all been verified by a human? I will gladly hear it if this is out of scope for this challenge, but I felt compelled to at least make this point.

)
target_encoder.set_output(transform="pandas")

row_aggregate = FunctionTransformer(row_aggregate_post_target_encoder)
Contributor:

Shouldn't this be a stateful operation? The aggregations that are learned at train time, shouldn't they be used at eval time? Or am I misinterpreting the code here?

Member Author:

This part is pretty tricky to read, I agree: I use a TargetEncoder first, which is stateful, followed by the row-wise aggregation you mentioned, which is stateless.

# downloads encoders from HuggingFace. Since we don't fine-tune it, this is a stateless
# operation.

from sentence_transformers import SentenceTransformer
Contributor:

Why not embetter ;)?

The benefit is that the sentence transformer will now just behave like a normal sklearn estimator.

Contributor:

Another benefit is that the model is only loaded once. In this implementation it is loaded every time the inference is called, which can be a bunch of overhead.
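One common way to address this is to cache the loader so the expensive construction runs only once per process. A minimal sketch of the pattern, using a stand-in loader instead of an actual SentenceTransformer so the example stays self-contained and offline:

```python
from functools import lru_cache

load_count = []  # track how many times the expensive load actually runs


@lru_cache(maxsize=1)
def load_model():
    # Stand-in for the heavy step, e.g. SentenceTransformer("all-MiniLM-L6-v2").
    load_count.append(1)
    return object()  # placeholder for the loaded encoder


def encode(texts):
    model = load_model()  # cached: the heavy load runs only on the first call
    # placeholder for model.encode(texts)
    return [hash(t) for t in texts]


encode(["order #1"])
encode(["order #2"])
```

With this pattern, repeated inference calls no longer pay the model-loading overhead the reviewer mentions.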

Member Author:

Haha, well I wanted to stick to the strict minimum, but I can definitely use embetter, no problem here.

# **Can we compare where these two models disagree?**


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
Contributor:

This is a nice touch! It can serve as a stepping stone to dive more into the data quality!
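For readers following along, here is a hedged sketch of what such a disagreement helper could look like, assuming `results` maps model names to their test-set predictions (the PR's actual data structures may differ):

```python
import numpy as np
import pandas as pd


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
    """Return the test rows where the two models' predictions differ."""
    pred_1 = np.asarray(results[model_name_1])
    pred_2 = np.asarray(results[model_name_2])
    mask = pred_1 != pred_2
    disagreements = X_test.loc[mask].copy()
    disagreements["y_true"] = np.asarray(y_test)[mask]
    disagreements[model_name_1] = pred_1[mask]
    disagreements[model_name_2] = pred_2[mask]
    return disagreements


# Illustrative data: two hypothetical models, "hgbt" and "logreg".
X_test = pd.DataFrame({"amount": [10.0, 250.0, 42.0]})
y_test = [0, 1, 0]
results = {"hgbt": [0, 1, 1], "logreg": [0, 0, 1]}
rows = get_disagreeing_between("hgbt", "logreg", X_test, y_test, results)
```

Inspecting the disagreeing rows alongside `y_true` is exactly the stepping stone into data quality the reviewer describes: systematic disagreement on certain rows often points at noisy or untrustworthy labels.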

@koaning (Contributor) left a comment

I was only able to give it a glance, but I found a few points that might be worth mentioning.

@Vincent-Maladiere (Member, Author) commented Aug 21, 2024

Hey Vinnie, thank you for the feedback! I'll iterate on that.

Also note that I made a very concise demo which has already been merged. We can improve this, and I'm curious to know what you think: https://github.com/probabl-ai/mandr/blob/main/notebooks/skrub_demo.py

@tuscland (Member)

@Vincent-Maladiere it would be nice if we could close this PR by the end of the month; do you think that is possible?

@Vincent-Maladiere (Member, Author)

Yes, I will give it a last touch today and then it will be ready for merging. The only issue is that we don't have a dedicated place to store / host / render the gallery artifact, so these two notebooks will be tricky to preview. That can be addressed later, though.

@tuscland (Member)

As long as they can be rendered by generating the documentation, it is fine for now, I guess.

@Vincent-Maladiere (Member, Author)

So I gave embetter a spin, and I have to say I struggled a bit. Here is some feedback:

  • The documentation of SentenceEncoder is incomplete; I wanted to know what I can pass to fit as X (e.g. a series works but a dataframe doesn't). I eventually understood I needed to grab the column with ColumnGrabber for dataframes, but that's a little annoying compared to a simple function in a FunctionTransformer.
  • Unrelated, but the docstring example of ContrastiveLearner is the same as SbertLearner's.
  • The get_feature_names_out method is missing from the SentenceEncoder class, therefore I can't use set_output(transform="pandas").

So I will stick with my simpler solution for now.
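For reference, the "simple function in a FunctionTransformer" alternative described above can be sketched like this (the column name and data are illustrative, not the PR's actual code):

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def grab_text_column(df):
    # Stateless column selection on a dataframe, playing the role of
    # embetter's ColumnGrabber before a downstream text encoder.
    return df["text"]


grab = FunctionTransformer(grab_text_column)

df = pd.DataFrame({
    "text": ["wire transfer to new payee", "regular grocery order"],
    "amount": [980.0, 23.5],
})
texts = grab.fit_transform(df)  # a Series of raw texts, ready for an encoder
```

Placed at the head of a pipeline, this step hands the encoder a plain series of strings, which sidesteps the dataframe-input issue mentioned in the first bullet.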

@Vincent-Maladiere (Member, Author)

I replaced the CSVs with a parquet file, reducing the size 10x (from 25 MB to 2.5 MB).

@Vincent-Maladiere (Member, Author)

I guess my main observation here is that we might not be able to trust the labels. If there is an indication of fraud I will gladly trust it ... but what about the non-fraud labels? Do we really know for sure that these have all been verified by a human? I will gladly hear it if this is out of scope for this challenge, but I felt compelled to at least make this point.

@koaning I forgot to react to that, but yeah, I 100% agree. How trustworthy the labels are is a critical question, which we unfortunately can't answer here.

@Vincent-Maladiere (Member, Author)

@tuscland Let's wait until the CI is green (hopefully), and then this PR will be ready to merge.

@tuscland tuscland merged commit 2b38b71 into main Aug 23, 2024
2 checks passed
@tuscland tuscland deleted the add_binary_fraud_example branch August 23, 2024 07:28
thomass-dev pushed a commit that referenced this pull request Dec 2, 2024
thomass-dev pushed a commit that referenced this pull request Dec 2, 2024
5 participants