[WIP] Add the binary fraud example #24
Conversation
a few points I noticed:
- upload the dataset to openml and fetch from there (a minimal fetch sketch follows this list)
- add dependencies as "doc" dependencies
- remove the notebook and only leave the `.py` file
- clean up the `.py` file and have it the same as the examples in the example galleries we have in sklearn
- plots require a bit more of an explanation before the plot to explain what we're looking at
- the plots need to be importable from `mandr.examples.fraud.eda_plots` probably?
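On the first point, a minimal sketch of what fetching from OpenML could look like once the dataset is uploaded; the dataset name and version below are placeholders, since the upload hasn't happened yet:

```python
from sklearn.datasets import fetch_openml

# "fraud-ecommerce" is a placeholder name, not an existing OpenML entry:
# the actual name/version would be filled in once the dataset is published.
fraud = fetch_openml("fraud-ecommerce", version=1, as_frame=True, parser="auto")
X, y = fraud.data, fraud.target
```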
It seems this is for now an exploration of the data rather than modeling. We can then split the example into two files / steps: exploring the data, and modeling the fraud problem.
Hey Adrin! Thank you for this review. It took me some time to properly document the EDA notebook. I addressed most of your suggestions, but a few todos remain:

> upload the dataset to openml and fetch from there

I've tried multiple times on different days but can't upload a dataset on OpenML: I receive a 502 proxy error at every try. I don't know where to give feedback, because their org repo looks rather inactive. WDYT? I would be happy to try again, but I left the csv in the PR for now.

> remove the notebook and only leave the `.py` file

We don't have a doc builder process on Mandr for now. Do you think we should create one in another PR and then apply the changes you suggest, before merging this current PR?
It is a good idea to start having a doc CI process, even if it is incomplete or looks bad. You are right that we should do this first in order to apply Adrin's advice. Could you please send a PR?
I haven't looked at the updates yet, but one high-level comment might be that this doc is pretty good at giving context, and we would need it for good documentation. It might also be a good idea to focus on the notebook that comes after this, taking the data and working on the parts which would actually use mandr. And as for the docs, we should have a sphinx + readthedocs setup.
Could we please make sure we don't commit the dataset to the git repo?
Yes, that's being discussed as well. Regarding the openml issue, have you tried asking them on their slack @Vincent-Maladiere? They're usually responsive.
Ok, Sphinx Doc + Readthedocs sounds great, I'll create a PR for that. I'll contact the OpenML team on Slack too. You could argue that Mandr could also be useful during EDA to contextualise the dataset and display info next to the modeling part later, but I get your point. The modeling notebook is almost done and just needs a little more documentation.
The modeling notebook is now ready for review! Link for visualization. TODO:
Just my 2 cents: example-related code should be outside of the main mandr folder. Bonus: you may avoid linting issues if it lives in another folder :)
Right, I don't have a strong opinion on this. Let's keep it in the example folder then. WDYT @adrinjalali?
So, I've contacted a member of OpenML for the third time, who said fixing the server was on his TODO list. Our first interaction was 3 weeks ago. I would like to move this PR forward and eventually put the datasets on their platform when they fix it. In the meantime, I'd appreciate a more thorough review of the content of the notebooks. Note that I already introduced them to the skrub team, since the modeling could be simplified with further development of skrub. Ping @glemaitre, @ogrisel, @koaning
I guess my main observation here is that we might not be able to trust the labels. If there is an indication of fraud I will gladly trust it ... but what about the non-fraud labels? Do we really know for sure that these have all been verified by a human? I will gladly hear it if this is out of scope for this challenge, but I felt compelled to at least make this point.
)
target_encoder.set_output(transform="pandas")

row_aggregate = FunctionTransformer(row_aggregate_post_target_encoder)
Shouldn't this be a stateful operation? The aggregations that are learned at train time, should these not be used at eval time? Or am I misinterpreting the code here?
This part is pretty tricky to read, I agree: I use a TargetEncoder first, which is stateful, followed by the row-wise aggregation that you mentioned, which is stateless.
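For readers following along, here is a minimal sketch of the stateful-then-stateless pattern described above; the column names and the aggregation body are illustrative placeholders, not the PR's actual code:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, TargetEncoder


def row_aggregate_post_target_encoder(df):
    # Stateless: summarises the already-encoded columns row by row,
    # so there is nothing to learn at fit time.
    return pd.DataFrame(
        {"encoded_mean": df.mean(axis=1), "encoded_max": df.max(axis=1)}
    )


target_encoder = TargetEncoder()  # stateful: learns category encodings during fit
target_encoder.set_output(transform="pandas")

row_aggregate = FunctionTransformer(row_aggregate_post_target_encoder)  # stateless

encoder = make_pipeline(target_encoder, row_aggregate)
```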
# downloads encoders from HuggingFace. Since we don't fine-tune it, this is a stateless
# operation.

from sentence_transformers import SentenceTransformer
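The excerpt only shows the import; as context for the comments below, here is a hedged reconstruction of the stateless-wrapper pattern being discussed. The model name and helper function are assumptions, not the PR's exact code:

```python
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import FunctionTransformer


def encode_sentences(texts):
    # The model is instantiated inside the function, so it gets reloaded on
    # every call to transform, which is the overhead pointed out below.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(list(texts))


sentence_embedder = FunctionTransformer(encode_sentences)
```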
Why not embetter ;)?
The benefit is that the sentence transformer will now just behave like a normal sklearn estimator.
Another benefit is that the model is only loaded once. In this implementation it is loaded every time the inference is called, which can be a bunch of overhead.
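A minimal sketch of the embetter alternative being suggested, assuming embetter's ColumnGrabber/SentenceEncoder API; the model and column names are placeholders:

```python
from sklearn.pipeline import make_pipeline

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

# SentenceEncoder loads the sentence-transformers model once, at construction time,
# and then behaves like a regular scikit-learn transformer.
text_embedder = make_pipeline(
    ColumnGrabber("description"),  # placeholder column name
    SentenceEncoder("all-MiniLM-L6-v2"),
)
```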
Haha, well I wanted to stick to the strict minimum, but I can def use embetter, no problem here.
# **Can we compare where these two models disagree?**


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
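The helper's body isn't shown in this excerpt; purely as an illustration, here is one way such a function could work, assuming `results` maps model names to fitted estimators and `X_test` is a DataFrame:

```python
import numpy as np


def get_disagreeing_between(model_name_1, model_name_2, X_test, y_test, results):
    # Hypothetical sketch: keep only the rows where the two models disagree,
    # together with the true label and each model's prediction.
    pred_1 = results[model_name_1].predict(X_test)
    pred_2 = results[model_name_2].predict(X_test)
    mask = pred_1 != pred_2

    disagreements = X_test[mask].copy()
    disagreements["y_true"] = np.asarray(y_test)[mask]
    disagreements[f"pred_{model_name_1}"] = pred_1[mask]
    disagreements[f"pred_{model_name_2}"] = pred_2[mask]
    return disagreements
```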
This is a nice touch! It can serve as a stepping stone to dive more into the data quality!
Was only able to give it a glance but found a few points that might be worth mentioning.
Hey Vinnie, thank you for the feedback! I'll iterate on that. Also note that I made a very concise demo which has already been merged. We can improve this, and I'm curious to know what you think: https://github.com/probabl-ai/mandr/blob/main/notebooks/skrub_demo.py
@Vincent-Maladiere it would be nice if we can close this PR by the end of the month, do you think it is possible?
Yes, I will give it a last touch today and then it will be ready for merging. The only issue is that we don't have a dedicated place to store / host / render the gallery artifact, so these two notebooks will be tricky to preview. That can be addressed later, though.
As long as they can be rendered by generating the documentation, it is fine for now, I guess.
So I gave embetter a spin and I have to say that I struggled a bit. Here is some feedback:
So I will stick with my simpler solution for now.
I replaced the CSVs with a parquet file whose size is reduced 10x (from 25MB to 2.5MB).
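For reference, the conversion itself is a one-liner with pandas (file names below are placeholders); parquet's columnar encoding and compression account for the size drop:

```python
import pandas as pd

# Requires a parquet engine such as pyarrow or fastparquet to be installed.
df = pd.read_csv("fraud_transactions.csv")  # placeholder path
df.to_parquet("fraud_transactions.parquet", index=False)
```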
@koaning I forgot to react to that, but yeah I 100% agree. How trustworthy the labels are is a critical question, to which we can't bring an answer here, unfortunately.
@tuscland Let's wait until the CI is green (hopefully) and this PR will be ready to be merged.
**Related issues**

#25

**What does this PR implement?**

- This PR proposes to add an e-commerce fraud binary classification use case as an example. It currently doesn't use mandr.
- I uploaded the dataset because it is fairly small (25MB) and under an open-access license.
- I tried to keep the `eda.ipynb` notebook reasonably short by putting much of the plotting logic in a dedicated `eda_plot.py`. You can view the notebook at the following link: https://github.com/probabl-ai/mandr/blob/add_binary_fraud_example/examples/fraud/eda.ipynb. To ease reviews, I also added its nbconvert script version, `eda.py`.
- This PR is still WIP; I will add at least one notebook for modeling, but I'd be happy to get feedback on the EDA section!

Co-authored-by: Vincent Maladiere <vincentmaladiere@Vincents-Laptop.local>
Co-authored-by: Vincent Maladiere <vincentmaladiere@Vincents-Laptop.lan>