Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

Merged
merged 9 commits into from
May 5, 2023

Conversation

Pakman450
Copy link
Contributor

@Pakman450 Pakman450 commented May 4, 2023

Hello! This is for issue #114. So this edit is quite opinionated. I don't know if you guys would this kind of organized dataframe. But first I have to discuss where I added this change. I decided to add a optional flag for datamol users in the fuzzy_scaffolding function. The flag is termed if_df, which is defaulted to False. The flag is termed if_df, which is defaulted to True. Two separate pandas dataframe for each scf2infos and scf2groups will be returned. The rationale is not to confuse users on how fuzzy_scaffolding function would return previously.

NOTE: the output below is my best attempt to express the pandas dataframe. The output is just a df that has 3 columns and not 15 rows.

Let's start with scf2infos

The way scf2infos is structured is absolutely perfect to be pandas transposed. Every scaffold output has its corresponding rdkit mol list and its smarts pattern. This means every row will represent the scf and every scf will have its mol list.

output:

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                                    mols  \
index                                                      
0        [<rdkit.Chem.rdchem.Mol object at 0x1108a1a80>]   
1      [<rdkit.Chem.rdchem.Mol object at 0x1108a06d0>...   
2        [<rdkit.Chem.rdchem.Mol object at 0x1108a0200>]   
3        [<rdkit.Chem.rdchem.Mol object at 0x1108a2dc0>]   
4        [<rdkit.Chem.rdchem.Mol object at 0x1108a2c00>]   

                                                  smarts  
index                                                     
0      [#6]1:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@[#6]:&@1)...  
1      [#6]-&!@[#8]-&!@[#6]1:&@[#6]:&@[#6]:&@[#6](:&@...  
2      [#6](-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]-&!@[#1...  
3      [#6]1-&@[#6]-&@[#6]-&@[#6]-&@[#6]-&@[#7]-&@1-&...  
4      [#15](=,-;!@[#8,#7])(-&!@[#8]-&!@[#6]-&!@[#6]1...  

What about scf2groups?

scf2groups is trickier because the core 'groups' are in their individual dictionaries contained in a list. This can be difficult to create a df due to these multi-valued attributes. Thankfully, Pandas can control for the multi-valued attributes by calling from the .from_dict() method and setting orient to 'index'. Further, this allows the df for each scf row to have None values if there are no results. So this df will dynamically create n_core_groups if there is an output with n number of core groups. So in the case below, scf index 1 has two core groups but the rest do not.

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                            0_core_group  \
index                                                      
0      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
2      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
3      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
4      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   

                                            1_core_group  
index                                                     
0                                                   None  
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...  
2                                                   None  
3                                                   None  
4                                                   None  

Other than that. I hope you like the code. I also updated the test cases.

Thanks,
Steven Pak

Checklist:

  • Was this PR discussed in a issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
  • Update the API documentation is a new function is added or an existing one is deleted.
  • Added a news entry.
    • copy news/TEMPLATE.rst to news/my-feature-or-branch.rst) and edit it.

@Pakman450 Pakman450 requested a review from hadim as a code owner May 4, 2023 18:41
@codecov
Copy link

codecov bot commented May 4, 2023

Codecov Report

Merging #188 (dc04ffa) into main (9bb6258) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   91.45%   91.47%   +0.02%     
==========================================
  Files          46       46              
  Lines        3661     3672      +11     
==========================================
+ Hits         3348     3359      +11     
  Misses        313      313              
Flag Coverage Δ
unittests 91.47% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
datamol/scaffold/_fuzzy.py 90.44% <100.00%> (+0.71%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@Pakman450
Copy link
Contributor Author

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@maclandrol
Copy link
Member

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119

@Pakman450
Copy link
Contributor Author

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119

I changed the if_df defaulted to true, so it will always bring dataframes. Should I completely remove that flag and change to return type just to bring dataframes and no dictionaries?

@Pakman450 Pakman450 changed the title fuzzy_scaffold can now return a pandas dataframe (when flagged) that is more readable. fuzzy_scaffold can now return a pandas dataframe that is more readable. May 5, 2023
Copy link
Member

@maclandrol maclandrol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put the refactoring of the function itself in a different PR.

@Pakman450 can you document the column name in the dataframes ?

ping @hadim for planning refactoring later.

@@ -80,7 +82,8 @@ def fuzzy_scaffolding(
additional_templates: Optional[List[Mol]] = None,
ignore_non_ring: bool = False,
mcs_params: Optional[Dict[Any, Any]] = None,
) -> Tuple[set, Dict[str, dict], Dict[str, list]]:
if_df: bool = True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Pakman450, can you replace by return_df ? otherwise LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I would suggest to instead totally drop if_df and only return dataframe here. I am not too afraid about backward compat here (since it actually makes sense) and I don't see any good reasons for keeping the old returned types.

@maclandrol agree to merge here and tackle the refactoring later. We should likely revisit all the code and break it down in multiple functions.

@hadim
Copy link
Contributor

hadim commented May 5, 2023

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

@Pakman450
Copy link
Contributor Author

Let's put the refactoring of the function itself in a different PR.

@Pakman450 can you document the column name in the dataframes ?

ping @hadim for planning refactoring later.

@maclandrol Where do you want me to document them? In the comments?

@Pakman450
Copy link
Contributor Author

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

@Pakman450 Pakman450 force-pushed the fuzzy_out_branch branch from 0da3545 to dc04ffa Compare May 5, 2023 17:58
@Pakman450
Copy link
Contributor Author

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

rebase complete.

Copy link
Contributor

@hadim hadim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I'll let Manu doing a last pass before merging.

@maclandrol
Copy link
Member

@maclandrol Where do you want me to document them? In the comments?

In the docstring of the function, so it's rendered in the documentation.

@maclandrol
Copy link
Member

LGTM, thanks @Pakman450

@hadim hadim merged commit 6020ec0 into datamol-io:main May 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants