-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fuzzy_scaffold can now return a pandas dataframe that is more readable. #188
Conversation
Codecov Report
@@ Coverage Diff @@
## main #188 +/- ##
==========================================
+ Coverage 91.45% 91.47% +0.02%
==========================================
Files 46 46
Lines 3661 3672 +11
==========================================
+ Hits 3348 3359 +11
Misses 313 313
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
It seems like I am having trouble dealing with conditional return typing for |
@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119 |
I changed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put the refactoring of the function itself in a different PR.
@Pakman450 can you document the column name in the dataframes ?
ping @hadim for planning refactoring later.
datamol/scaffold/_fuzzy.py
Outdated
@@ -80,7 +82,8 @@ def fuzzy_scaffolding( | |||
additional_templates: Optional[List[Mol]] = None, | |||
ignore_non_ring: bool = False, | |||
mcs_params: Optional[Dict[Any, Any]] = None, | |||
) -> Tuple[set, Dict[str, dict], Dict[str, list]]: | |||
if_df: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Pakman450, can you replace by return_df
? otherwise LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I would suggest to instead totally drop if_df
and only return dataframe here. I am not too afraid about backward compat here (since it actually makes sense) and I don't see any good reasons for keeping the old returned types.
@maclandrol agree to merge here and tackle the refactoring later. We should likely revisit all the code and break it down in multiple functions.
@Pakman450 could you rebase this PR to |
@maclandrol Where do you want me to document them? In the comments? |
Sure. I give me a moment. |
0da3545
to
dc04ffa
Compare
rebase complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I'll let Manu doing a last pass before merging.
In the docstring of the function, so it's rendered in the documentation. |
LGTM, thanks @Pakman450 |
Hello! This is for issue #114. So this edit is quite opinionated. I don't know if you guys would this kind of organized dataframe. But first I have to discuss where I added this change. I decided to add a optional flag for datamol users in the
fuzzy_scaffolding
function.The flag is termedThe flag is termedif_df
, which is defaulted toFalse
.if_df
, which is defaulted toTrue
. Two separate pandas dataframe for eachscf2infos
andscf2groups
will be returned. The rationale is not to confuse users on howfuzzy_scaffolding
function would return previously.NOTE: the output below is my best attempt to express the pandas dataframe. The output is just a df that has
3
columns and not15
rows.Let's start with
scf2infos
The way scf2infos is structured is absolutely perfect to be pandas transposed. Every scaffold output has its corresponding rdkit
mol
list and itssmarts
pattern. This means every row will represent thescf
and everyscf
will have itsmol
list.output:
What about
scf2groups
?scf2groups is trickier because the core 'groups' are in their individual dictionaries contained in a list. This can be difficult to create a df due to these multi-valued attributes. Thankfully, Pandas can control for the multi-valued attributes by calling from the
.from_dict()
method and settingorient
to'index'
. Further, this allows the df for eachscf
row to haveNone
values if there are no results. So this df will dynamically createn_core_group
s if there is an output withn
number of core groups. So in the case below,scf
index1
has two core groups but the rest do not.Other than that. I hope you like the code. I also updated the test cases.
Thanks,
Steven Pak
Checklist:
news
entry.news/TEMPLATE.rst
tonews/my-feature-or-branch.rst
) and edit it.