fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

Pakman450 · 2023-05-04T18:40:59Z

Hello! This is for issue #114. So this edit is quite opinionated. I don't know if you guys would this kind of organized dataframe. But first I have to discuss where I added this change. I decided to add a optional flag for datamol users in the fuzzy_scaffolding function. ~~The flag is termed if_df, which is defaulted to False.~~ The flag is termed if_df, which is defaulted to True. Two separate pandas dataframe for each scf2infos and scf2groups will be returned. The rationale is not to confuse users on how fuzzy_scaffolding function would return previously.

NOTE: the output below is my best attempt to express the pandas dataframe. The output is just a df that has 3 columns and not 15 rows.

Let's start with `scf2infos`

The way scf2infos is structured is absolutely perfect to be pandas transposed. Every scaffold output has its corresponding rdkit mol list and its smarts pattern. This means every row will represent the scf and every scf will have its mol list.

output:

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                                    mols  \
index                                                      
0        [<rdkit.Chem.rdchem.Mol object at 0x1108a1a80>]   
1      [<rdkit.Chem.rdchem.Mol object at 0x1108a06d0>...   
2        [<rdkit.Chem.rdchem.Mol object at 0x1108a0200>]   
3        [<rdkit.Chem.rdchem.Mol object at 0x1108a2dc0>]   
4        [<rdkit.Chem.rdchem.Mol object at 0x1108a2c00>]   

                                                  smarts  
index                                                     
0      [#6]1:&@[#6]:&@[#6]:&@[#6](:&@[#6]:&@[#6]:&@1)...  
1      [#6]-&!@[#8]-&!@[#6]1:&@[#6]:&@[#6]:&@[#6](:&@...  
2      [#6](-&!@[#7]-&!@[#6](=&!@[#8])-&!@[#6]-&!@[#1...  
3      [#6]1-&@[#6]-&@[#6]-&@[#6]-&@[#6]-&@[#7]-&@1-&...  
4      [#15](=,-;!@[#8,#7])(-&!@[#8]-&!@[#6]-&!@[#6]1...

What about `scf2groups`?

scf2groups is trickier because the core 'groups' are in their individual dictionaries contained in a list. This can be difficult to create a df due to these multi-valued attributes. Thankfully, Pandas can control for the multi-valued attributes by calling from the .from_dict() method and setting orient to 'index'. Further, this allows the df for each scf row to have None values if there are no results. So this df will dynamically create n_core_groups if there is an output with n number of core groups. So in the case below, scf index 1 has two core groups but the rest do not.

                                         scf  \
index                                          
0                      CC(CC1CCCCC1)CC1CCCC1   
1                     CC(CCC1CCCCC1)CC1CCCC1   
2                    CC(CCCC1CCCC1)CCC1CCCC1   
3                      CC(CC1CCCCC1)C1CCCCC1   
4      CC(CCC1CCC(C2CCC3CCCCC32)C1)CC1CCCCC1   

                                            0_core_group  \
index                                                      
0      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
2      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
3      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   
4      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...   

                                            1_core_group  
index                                                     
0                                                   None  
1      {'Core': <rdkit.Chem.rdchem.Mol object at 0x11...  
2                                                   None  
3                                                   None  
4                                                   None

Other than that. I hope you like the code. I also updated the test cases.

Thanks,
Steven Pak

Checklist:

Was this PR discussed in a issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
Add tests to cover the fixed bug(s) or the new introduced feature(s) (if appropriate).
Update the API documentation is a new function is added or an existing one is deleted.
Added a news entry.
- copy news/TEMPLATE.rst to news/my-feature-or-branch.rst) and edit it.

codecov · 2023-05-04T18:49:41Z

Codecov Report

Merging #188 (dc04ffa) into main (9bb6258) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #188      +/-   ##
==========================================
+ Coverage   91.45%   91.47%   +0.02%     
==========================================
  Files          46       46              
  Lines        3661     3672      +11     
==========================================
+ Hits         3348     3359      +11     
  Misses        313      313

Flag	Coverage Δ
unittests	`91.47% <100.00%> (+0.02%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
datamol/scaffold/_fuzzy.py	`90.44% <100.00%> (+0.71%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Pakman450 · 2023-05-04T19:06:51Z

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

maclandrol · 2023-05-04T19:08:40Z

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119

Pakman450 · 2023-05-04T19:23:04Z

It seems like I am having trouble dealing with conditional return typing for fuzzy_scaffolding . I am currently trying to figure out how to do this. I found a overload functionality for my issue, so I am gonna try that. IF you guys have any recommendations. That would be great.

@Pakman450, it's fine to break compatibility and return only the new structure here. This code has not been maintained for a while, and we should likely refactor most of it. See here too: #119

I changed the if_df defaulted to true, so it will always bring dataframes. Should I completely remove that flag and change to return type just to bring dataframes and no dictionaries?

maclandrol

Let's put the refactoring of the function itself in a different PR.

@Pakman450 can you document the column name in the dataframes ?

ping @hadim for planning refactoring later.

maclandrol · 2023-05-05T17:09:22Z

datamol/scaffold/_fuzzy.py

@@ -80,7 +82,8 @@ def fuzzy_scaffolding(
    additional_templates: Optional[List[Mol]] = None,
    ignore_non_ring: bool = False,
    mcs_params: Optional[Dict[Any, Any]] = None,
-) -> Tuple[set, Dict[str, dict], Dict[str, list]]:
+    if_df: bool = True,


Thanks @Pakman450, can you replace by return_df ? otherwise LGTM

Maybe I would suggest to instead totally drop if_df and only return dataframe here. I am not too afraid about backward compat here (since it actually makes sense) and I don't see any good reasons for keeping the old returned types.

@maclandrol agree to merge here and tackle the refactoring later. We should likely revisit all the code and break it down in multiple functions.

hadim · 2023-05-05T17:17:20Z

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Pakman450 · 2023-05-05T17:49:42Z

Let's put the refactoring of the function itself in a different PR.

@Pakman450 can you document the column name in the dataframes ?

ping @hadim for planning refactoring later.

@maclandrol Where do you want me to document them? In the comments?

Pakman450 · 2023-05-05T17:50:17Z

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

Pakman450 · 2023-05-05T18:00:28Z

@Pakman450 could you rebase this PR to main please as well? So we can make sure this PR is compatible with the latest rdkit.

Sure. I give me a moment.

rebase complete.

hadim

Thanks!

I'll let Manu doing a last pass before merging.

maclandrol · 2023-05-05T18:36:13Z

@maclandrol Where do you want me to document them? In the comments?

In the docstring of the function, so it's rendered in the documentation.

maclandrol · 2023-05-05T18:37:14Z

LGTM, thanks @Pakman450

Pakman450 requested a review from hadim as a code owner May 4, 2023 18:41

Pakman450 changed the title ~~fuzzy_scaffold can now return a pandas dataframe (when flagged) that is more readable.~~ fuzzy_scaffold can now return a pandas dataframe that is more readable. May 5, 2023

maclandrol approved these changes May 5, 2023

View reviewed changes

Pakman450 added 9 commits May 5, 2023 13:54

feat: start dataframe implementation for fuzzy output

3896e7a

feat: add method to return df for scf2groups

fb0158d

feat: relabel the columns and index

f86ece7

feat: black the new edits for df creation

9719db8

feat: update test for enhanced output for fuzzy

36e9aa6

docs: update rst file and update comments

b01929e

fix: add correct "or" type

fdc8238

fix: unionize the return types and change default bool for if_df

f4ce857

fix: change fuzzy func to return only pandas

dc04ffa

Pakman450 force-pushed the fuzzy_out_branch branch from 0da3545 to dc04ffa Compare May 5, 2023 17:58

hadim approved these changes May 5, 2023

View reviewed changes

hadim merged commit 6020ec0 into datamol-io:main May 5, 2023

hadim mentioned this pull request Jun 16, 2023

New more robust datamol release process #197

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

Pakman450 commented May 4, 2023 •

edited

Loading

codecov bot commented May 4, 2023 •

edited

Loading

Pakman450 commented May 4, 2023

maclandrol commented May 4, 2023

Pakman450 commented May 4, 2023

maclandrol left a comment

maclandrol May 5, 2023

hadim May 5, 2023

hadim commented May 5, 2023

Pakman450 commented May 5, 2023

Pakman450 commented May 5, 2023

Pakman450 commented May 5, 2023

hadim left a comment

maclandrol commented May 5, 2023

maclandrol commented May 5, 2023

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

fuzzy_scaffold can now return a pandas dataframe that is more readable. #188

Conversation

Pakman450 commented May 4, 2023 • edited Loading

Let's start with scf2infos

What about scf2groups?

codecov bot commented May 4, 2023 • edited Loading

Codecov Report

Pakman450 commented May 4, 2023

maclandrol commented May 4, 2023

Pakman450 commented May 4, 2023

maclandrol left a comment

Choose a reason for hiding this comment

maclandrol May 5, 2023

Choose a reason for hiding this comment

hadim May 5, 2023

Choose a reason for hiding this comment

hadim commented May 5, 2023

Pakman450 commented May 5, 2023

Pakman450 commented May 5, 2023

Pakman450 commented May 5, 2023

hadim left a comment

Choose a reason for hiding this comment

maclandrol commented May 5, 2023

maclandrol commented May 5, 2023

Pakman450 commented May 4, 2023 •

edited

Loading

Let's start with `scf2infos`

What about `scf2groups`?

codecov bot commented May 4, 2023 •

edited

Loading