Generate metadata about a study #93

Closed
jsemler opened this issue Apr 27, 2018 · 6 comments
Assignees
Labels
enhancement Good First Issue If you're looking to get involved, these are good places to start! Infrastructure

Comments

@jsemler
Collaborator

jsemler commented Apr 27, 2018

It would be useful for MaestroWF to write metadata about a launched study. This would make it easier for post processing scripts to find output data and other information.

Metadata could include:

  • The MaestroWF version used to launch the study
  • Paths to workspaces, MaestroWF logs, etc.
  • Git hashes and other information that may not be in the YAML specification
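As a rough illustration of the list above, a metadata file could be written into the study workspace at launch time. This is only a sketch using the standard library: the function name `write_study_metadata`, the `metadata.json` file name, and the field names are assumptions for illustration, not part of MaestroWF.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_study_metadata(workspace, version, spec_path, git_hashes=None):
    """Write a hypothetical metadata.json file into a study workspace.

    All names here (file name, keys, arguments) are illustrative assumptions.
    """
    workspace = Path(workspace)
    metadata = {
        # Version string of the tool that launched the study.
        "maestrowf_version": version,
        # Launch time in UTC, ISO 8601 format.
        "launched_at": datetime.now(timezone.utc).isoformat(),
        # Absolute paths that post-processing scripts can use to find output.
        "workspace": str(workspace.resolve()),
        "specification": str(Path(spec_path).resolve()),
        # Mapping of dependency name to commit hash, if any were recorded.
        "git_hashes": dict(git_hashes or {}),
    }
    out_path = workspace / "metadata.json"
    out_path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return out_path
```

A post-processing script could then read the file back with `json.loads(path.read_text())` instead of guessing where the workspace lives.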
@FrankD412 FrankD412 added the Good First Issue If you're looking to get involved, these are good places to start! label May 19, 2018
@FrankD412
Member

I've been thinking about the extent of metadata Maestro would want to capture and I came up with some of the following:

  • Content hashing dependencies (for git dependencies it's as simple as the commit hash)
    • This one gets trickier for path dependencies. Those can't be tracked, so it's an issue of hashing the contents, which could be infeasible depending on the number of files in a directory.
  • I was also thinking of a hash of the timestamp, the username of the person who launched the study, and maybe even the YAML file itself.
    • I think it's also prudent to hash the YAML study itself so that in the future, when the functionality to reproduce a study is introduced, you can verify the specification.
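To make the hashing ideas above concrete, here is a minimal sketch of hashing a single file (such as the YAML specification) and the contents of a path dependency. The helper names `hash_file` and `hash_directory` are hypothetical, and SHA-256 is just one reasonable choice of digest. Note that the directory hash reads every file, which is exactly the cost concern for large directories.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_directory(root):
    """Return one SHA-256 hex digest covering every file under root.

    Files are visited in sorted order so the result is deterministic, and
    each relative path is mixed in so renames change the digest too.
    """
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hash_file(path).encode("ascii"))
    return digest.hexdigest()
```

Comparing a stored `hash_file` result against a fresh one would be one way to verify that a specification has not changed before reproducing a study.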

I had some other thoughts, but I can't seem to find them right now. I'll update as things come back to me.

@jsemler
Collaborator Author

jsemler commented Jun 15, 2018

I think that would be useful. It might also be helpful to capture diagnostic information.

Does Maestro keep the history of the DAG execution? The status file tracks the most recent state of the workflow, but there isn't really a way to see the full workflow history.

@FrankD412
Member

@jsemler -- The only history tracked is when the DAG changes the states of its records. The status is what gives the full history, since it records when a step was submitted and so on. Currently there is no full restart history with timestamps. The history tracking was purposely left somewhat lightweight since this all goes into a pickle; I planned to revisit it once things are backed by a proper database.

This ticket is going to have some implications on #95 -- the metadata related to the hashing of inputs and other information is going to dictate either how we restart or if we're able to restart at all. I'm going to be putting in some thoughts in the restart ticket.

@FrankD412
Member

@jsemler -- Another thought: some objects would need to be responsible for hashing themselves. Some of the ones that come to mind right off the bat:

  • Dependencies (gets into some of the issues above, especially for paths)
  • Specification (this could be a simple hashing of the file that we set when we load it)
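One way to picture objects being responsible for hashing themselves is a shared `content_hash()` method that each kind of object implements in its own way. The classes below are hypothetical stand-ins written for illustration and do not reflect Maestro's actual classes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class GitDependencyRecord:
    """Hypothetical record of a git dependency."""
    url: str
    commit: str

    def content_hash(self):
        # For git dependencies the commit hash already identifies the content.
        return self.commit


@dataclass
class SpecificationRecord:
    """Hypothetical record holding the raw text of a loaded specification."""
    text: str

    def content_hash(self):
        # Hash the specification text as it was when loaded.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()


def collect_hashes(named_objects):
    """Ask each object for its own hash and gather the results by name."""
    return {name: obj.content_hash() for name, obj in named_objects.items()}
```

The caller never needs to know how each hash is computed, which leaves room for a more expensive strategy for path dependencies later.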

Another thing to consider: we might need to pickle the Study instance in order to preserve how it was constructed. The specification alone isn't enough to construct an exact copy of the study, because that construction can vary based on the version of Maestro.

@FrankD412
Member

The start of this issue is in PR #120 -- the Study class now has metadata methods that create and load fixed metadata. These can be updated in later PRs as more metadata is needed.

@FrankD412 FrankD412 self-assigned this Jul 26, 2018
@jsemler
Collaborator Author

jsemler commented Jul 27, 2018

@FrankD412 -- I agree. I think PR #120 resolves this ticket.
