Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flow + Job magic methods #369

Merged
merged 28 commits into from
Aug 13, 2023
Merged

Flow + Job magic methods #369

merged 28 commits into from
Aug 13, 2023

Conversation

janosh
Copy link
Member

@janosh janosh commented Jul 17, 2023

Closes #368.

from pymatgen.core import Lattice, Structure
from atomate2.vasp.flows.mp import MPMetaGGARelax

struct = Structure(
    lattice=Lattice.cubic(3),
    species=("Fe", "Fe"),
    coords=((0, 0, 0), (0.5, 0.5, 0.5)),
)
flow = MPMetaGGARelax().make(structure)

len(flow)
>>> 3
print(flow)
>>> Flow(name='MP Meta-GGA Relax', uuid='d43cd776-fb64-40f0-bacf-1933ee2342b8')
consisting of 3 jobs:
  1. Job(name='MP pre-relax', uuid='99a016af-eb9d-4d7a-8454-85ecf6b59d41')
  2. Job(name='MP meta-GGA Relax', uuid='496006d6-5375-49ce-91a8-cf53483173e1')
  3. Job(name='MP meta-GGA Static', uuid='8fe17841-a536-4d39-9bfa-dd83bc7900a2')

@janosh janosh added enhancement New feature or request ux User experience labels Jul 17, 2023
@utf
Copy link
Member

utf commented Jul 18, 2023

This is great! Thanks very much.

I notice from the tests that I assumed that job comparison would depend not only on the UUIDs but also on the specific function arguments. I think this is just relying on the default dataclass __eq__() operator. See: https://docs.python.org/3/library/dataclasses.html#module-contents

Personally, I think having stricter equality is better approach but I am open to other thoughts.

@davidwaroquiers
Copy link
Contributor

This is indeed nice. Just to comment or push the discussion a bit further, what "should" be the len of a Flow if it is made of other Flows ? Could be the number of "steps" (jobs or flows) or the number of "flattened" jobs. I think it should be made clear what it is. Same for the print, what happens when you'd print a Flow with Flows in it. I guess it would/should print kind of an indented "structure" of the Flow with its subflows. Maybe it would be nice to have a choice on how "deep" you want to print.

Additional thought on the dynamical aspect. The Flow as such only exist at creation time while it will disappear once inserted into the database. A some point it would be nice to be able to somehow have the information of this Flow stored somewhere. And the len of this Flow can evolve in that case (if you have additions, replace, ...).

I'm maybe mixing up different issues here but they are somehow related to each other.

@utf
Copy link
Member

utf commented Jul 18, 2023

I think those are very good points @davidwaroquiers. To add to that, some comments:

  1. Flow already has a method iterflow which iterates through the flow in such an order that the job output references can always be resolved. E.g., root nodes of the graph are always returned first.
  2. The __iter__ method introduced here has different behaviour and only iterates through the jobs array (which can also contain other flows). This could be somewhat confusing.
  3. If we change the behaviour of __iter__ and __len__ to be consistent with subflows, then this would make the behaviour of __setitem__ and __getitem__ inconsistent.

Maybe its ok to leave the implementation as is, but I agree that the behaviour should be clarified. It would be nice to add examples of how to use these methods in the docs somewhere.

@janosh
Copy link
Member Author

janosh commented Jul 18, 2023

Great comments, thanks @utf @davidwaroquiers.

This is indeed nice. Just to comment or push the discussion a bit further, what "should" be the len of a Flow if it is made of other Flows ? Could be the number of "steps" (jobs or flows) or the number of "flattened" jobs.

I considered returning the flattened length of a Flow but I think the default length should be just the number of items in the jobs list.
I think the user should have to ask for the flattened length explicitly. We could add a property

@property
def flat(self):
    ...

which would recursively unpack subflows. Then people can use len(flow.flat).

Same for the print, what happens when you'd print a Flow with Flows in it.

@davidwaroquiers That's already handled actually. If __str__ is undefined, Python automatically falls back on __repr__ (not vice versa) and since Flow.__repr__ puts every subflow in f"{subflow}" which calls __str__ on it which calls __repr__ on that subflow, it's already recursive. I just tweaked __repr__ to increase indentation for every level down in the subflow hierarchy for added readability. Now this prints:

from jobflow import Flow, Job


def add(a, b):
    return a + b

def get_test_job():
    return Job(add, function_args=(1, 2))

subflow1 = Flow([get_test_job()])
subflow2 = Flow([get_test_job(), get_test_job()])
subsubflow = Flow([get_test_job(), get_test_job()])
subflow3 = Flow([get_test_job(), get_test_job(), subsubflow])
print(Flow([subflow1, subflow2, subflow3]))
Flow(name='Flow', uuid='f7ae544a-7e8d-4976-b768-13e431199fa6') consisting of 3 jobs:
  1. Flow(name='Flow', uuid='9e7b6459-adca-48b1-8c00-b86e15876a38') consisting of 1 jobs:
    1.1. Job(name='add', uuid='0d4a4621-3249-4f94-bdb1-a83910982921')
  2. Flow(name='Flow', uuid='d2b1cdbe-0075-4bba-8686-3851fe81a4b0') consisting of 2 jobs:
    2.1. Job(name='add', uuid='bf687f6b-3e5b-4e48-b579-e1646b0707c7')
    2.2. Job(name='add', uuid='9422bbf9-8c91-4153-a19b-5aad36d96431')
  3. Flow(name='Flow', uuid='b62ebaaf-bea2-489f-8e88-75a2eebfa7a9') consisting of 3 jobs:
    3.1. Job(name='add', uuid='5fae9b50-8051-438e-b295-14298569c527')
    3.2. Job(name='add', uuid='1dbba533-4cc7-4cef-be76-d9d773cfaf80')
    3.3. Flow(name='Flow', uuid='b6526c18-ce3c-48c2-ae26-6207859ccb46') consisting of 2 jobs:
      3.3.1. Job(name='add', uuid='c0c2eb67-65ab-4bcc-a4cc-465dd8898f2c')
      3.3.2. Job(name='add', uuid='42e8f5d2-42d1-4885-a627-13232362447d')

I'll have to get back to the other comments later.

@janosh
Copy link
Member Author

janosh commented Jul 18, 2023

Looking at it now, I guess we can drop the consisting of n jobs part.

@janosh janosh mentioned this pull request Jul 25, 2023
@janosh
Copy link
Member Author

janosh commented Jul 29, 2023

Just had time to take another look at this.

__eq__

I'm a bit torn about Job.__eq__ on the one hand I agree two flows should be different if they have different attributes. On the other hand, what's the point of a UUID if it doesn't uniquely identify a flow? UUID stands for universally unique identifier.

@utf Btw, Jobs are not dataclasses so before they had the default Python behavior of only being equal if they're literally the same object in memory.

class Job(MSONable):
"""

iterflow

I actually think it's a plus to have two ways of iterating through a Flow. As you said @utf, we just need to be clear in the docs on how for job in flow.iterflow and for job in flow differ. But I can see use cases for either one. pandas also has many different ways of iterating though a dataframe (df.iterrows, df.items, df.itertuples, for col in df, df.keys and maybe more).

__repr__

@davidwaroquiers The problem with __repr__ kwargs like depth: int is people usually aren't aware of them. They just print the object or return it from a Jupyter cell in which case Python/ipykernel implicitly calls __repr__ on that object without passing any arguments. So I don't think there would be much value in a depth keyword. To use it would mean

print(flow.__repr__(depth=2)

Happy to take further suggestions esp. about __eq__ but for now I'll start adding tests and docs for the implementation as is.

@janosh janosh force-pushed the flow-job-magic-methods branch from aadb604 to 1cf1b51 Compare August 7, 2023 15:46
@utf
Copy link
Member

utf commented Aug 7, 2023

On the __eq__ point, due to how some of the dynamic behaviour works (specifically the replace functionality), multiple jobs can have the same uuid but different indexs.

So maybe for __eq__ the check could just be whether both uuid and index match.

@janosh
Copy link
Member Author

janosh commented Aug 7, 2023

So maybe for eq the check could just be whether both uuid and index match.

Oh bad timing. I just changed it to self.__dict__ == other.__dict__. 🤦 Happy to revert in which case this test would need to be changed to expect ==:

assert test_job != resolved_job

@codecov
Copy link

codecov bot commented Aug 7, 2023

Codecov Report

Merging #369 (781d3b5) into main (108d2f4) will increase coverage by 0.00%.
Report is 3 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #369   +/-   ##
=======================================
  Coverage   99.86%   99.86%           
=======================================
  Files          19       19           
  Lines        1439     1505   +66     
  Branches      364      374   +10     
=======================================
+ Hits         1437     1503   +66     
  Misses          1        1           
  Partials        1        1           
Files Changed Coverage Δ
src/jobflow/core/reference.py 100.00% <ø> (ø)
src/jobflow/core/flow.py 100.00% <100.00%> (ø)
src/jobflow/core/job.py 100.00% <100.00%> (ø)

@janosh
Copy link
Member Author

janosh commented Aug 7, 2023

@utf Other than the __eq__ decision, I think this is in a good state and ready to go.

@utf
Copy link
Member

utf commented Aug 7, 2023

Thanks @janosh. I'm ok with the current __eq__ implementation but we could always switch it in a future PR if there is a strong need.

I don't mean to be a stickler, but is it possible to add a few more tests so that the test coverage doesn't downgrade. Part of the issue is that you have changed:

if typing.TYPE_CHECKING

to

if TYPE_CHECKING

in some cases, and that means the codecov is moaning about not checking those lines. The simple fix is to add the latter to the coverage exception list here:

jobflow/pyproject.toml

Lines 111 to 115 in 5602cfd

exclude_lines = [
'^\s*@overload( |$)',
'^\s*assert False(,|$)',
'if typing.TYPE_CHECKING:',
]

But there are a couple of if statements etc that are not currently covered.

@janosh
Copy link
Member Author

janosh commented Aug 7, 2023

On it!

@janosh
Copy link
Member Author

janosh commented Aug 7, 2023

@utf Coverage is back up to 100%. 🎉

@utf
Copy link
Member

utf commented Aug 13, 2023

Brilliant, thank you!

@utf utf merged commit 10aad48 into main Aug 13, 2023
@janosh janosh deleted the flow-job-magic-methods branch August 13, 2023 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request ux User experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flow + Job magic methods
3 participants