
Release of 1.1.0dev and Prioritization of next design phase #96

Closed
FrankD412 opened this issue May 7, 2018 · 9 comments

@FrankD412 (Member)

I wanted to spin up a discussion on the next phases of design for Maestro. This issue is going to be a place to note down thoughts and discuss the next steps and prioritization of upcoming features. I'm planning to mark the completion of the 1.1.0dev release with the merge of the bugfix/dependency_ordering branch, because it fixes a long-standing issue with topologically sorting nodes before they are added, alongside the host of features and bugfixes that have been introduced in the meantime.


So let's start by summarizing the discussions and outstanding feature requests for major additions:

  • The introduction of services to the specification and ExecutionGraph (Add the ability to specify services #77).
    • This feature opens up the ability to boot up, either locally or remotely, programs and tools that we consider "services". These "services" can include (but are not limited to) databases, daemons, APIs to external tools, and the like.
    • The more complicated workflows become, the more likely services will be required. We need to be able to spin up these services before a study begins (and, by implication, forgo starting the study if a service fails to boot).
  • Specification of resources outside of nodes and procs (or tasks) (Add the ability to specify other types of resources #76).
    • We're rapidly moving towards ubiquitous availability of compute resources that are not CPUs or cores. GPUs are available on the latest petaflop machines, and other hardware is being researched for compute-specific engines (machine learning, etc.). A flexible way of specifying resources would help both Maestro's ease of use and the maintainability of the specification itself.
    • This specificity also increasingly lends itself to designating resources outside of the study specification, as a separate input file passed to the maestro run <args> command line (a sketch of what such a file might look like follows this list).
  • Definitions of command recipes (Command Generation Recipes #71).
    • The discussion about recipes started but didn't really pick up. It's finding more relevance as outstanding requests accumulate for specifying different flavors of MPI when scheduling jobs through an adapter (say, a user who schedules with SLURM doesn't necessarily always want scripts generated using srun).
  • Restarting of a study (Implement restart capability #95)
    • Restarting a study at the top level sounds simple enough, but different users and use cases may define the concept differently. I'd like this feature added in the next phase, either with behavior general enough for most users or with command options that allow it to be used flexibly.
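
To make the resource-splitting item above concrete, here's a rough sketch of what a standalone resource file might look like. The file layout, keys, and the flag for passing it in are all hypothetical, purely for illustration:

```yaml
# resources.yaml -- hypothetical standalone resource file, passed
# alongside the spec, e.g.: maestro run study.yaml --resources resources.yaml
resources:
  run-simulation:     # step name from the study specification
    nodes: 4
    tasks: 32         # "procs" renamed to tasks, per the discussion below
    gpus: 4
  post-process:
    tasks: 1
```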

The next points are things that I would like to achieve that don't necessarily provide direct functionality, but set the stage for future improvements and features.

  • Generating metadata for studies launched with Maestro (Generate metadata about a study #93)
    • While metadata may not provide full provenance of a workflow/study, its standardized format and output provide useful information for tools built on top of a workflow, such as cataloging services, archiving services, and post-processing.
  • Creation of a backend object interface for managing study records.
    • Currently, the only way Maestro tracks a study is via a pickle file in the study output directory.
      • The benefits of the pickle are that it is platform agnostic, does not depend on securing ports to external servers, and requires no additional packages or tools (pickling is part of the Python standard library).
      • The downsides are that it lives on the user's file system, so its permissions are inherited from wherever the study is executed, and that the backend interface is not general, making it difficult to port to other technologies in the future.
    • Moving record keeping behind a standard interface, and rewriting the current pickle coupling to exercise that interface, provides the first step toward a standard backend (while letting current functionality serve as a test). The new interface opens up the ability to use other database technologies uniformly within Maestro (a minimal sketch follows this list).
    • The next steps would be to introduce access to a database and possibly a web service back end.
  • Ability to specify cluster information
    • Attempts to implement the LSF adapter have pointed out that the passing of cluster information may be required to achieve higher levels of flexibility. Job launchers like LSF require knowledge of the cluster to calculate resource sets. Additionally, having access to a cluster configuration allows for better validity checking of parameters for parallel jobs and batch submission.
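
As a rough illustration of the record-keeping interface idea above, the current pickle behavior could sit behind something like the following. The class and method names are hypothetical, not existing Maestro classes:

```python
import os
import pickle
from abc import ABC, abstractmethod


class StudyRecordStore(ABC):
    """Hypothetical interface for study record keeping."""

    @abstractmethod
    def save(self, study):
        """Persist the current state of a study."""

    @abstractmethod
    def load(self, identifier=None):
        """Reload a previously persisted study."""


class PickleRecordStore(StudyRecordStore):
    """The current pickle-in-output-directory behavior, behind the interface."""

    def __init__(self, output_dir):
        self._path = os.path.join(output_dir, "study.pkl")

    def save(self, study):
        # Same mechanics as today: serialize the study to the output directory.
        with open(self._path, "wb") as record:
            pickle.dump(study, record)

    def load(self, identifier=None):
        with open(self._path, "rb") as record:
            return pickle.load(record)
```

A database- or web-service-backed implementation would then be a drop-in replacement, which is exactly what the standard interface buys us.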

Any thoughts are welcome and appreciated.
@gonsie @dinatale2 @jsemler

@gonsie (Member) commented May 8, 2018

Re: cluster description:

I want to bring a sys admin on this one. What are the tools that they already use to describe a cluster (genders?) and how can those be leveraged? I know systems often describe themselves (/etc/proc??) but I don’t know the exact details.

@FrankD412 (Member Author)

@gonsie -- I welcome it. A platform-independent way to gather that information would be amazing.

@FrankD412 (Member Author)

Alright, 1.1.0 is now out. Time to prioritize!

@FrankD412 (Member Author)

Just released a quick x.x.1 update that corrects a critical bug with parameterized workspaces and rolls in some other small additions made since. @gonsie -- any progress with finding a sys admin?

I've also started formulating a set of high level priorities:

  1. Restarting a study is useful, but it also bridges into reproducing studies. Mixed with metadata, one could conceivably have a command that takes a previous study as a base and uses its metadata to set up a new study in its image. There are challenges, like passing a study off to other users and permissions on dependencies, but it's a workable model for a single user. @jsemler -- thoughts?
  2. Splitting out resources into their own configuration, passed along with the specification. This makes specifications a little more platform independent. I think we still need to retain some nomenclature in the base specification to know whether a step should be scheduled (maybe rename procs to tasks, and move nodes and other settings to their own YAML), because otherwise we're relying solely on a file passed in by the user. In this respect, a base specification would still work on its own, because tasks are logically all that's needed for an MPI call (see the sketch below).
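
As a sketch of that split (hypothetical syntax; none of this is implemented), a step in the base specification would keep only the nomenclature needed to mark it as scheduled:

```yaml
# Base specification: only "tasks" remains in the step to mark it as
# scheduled; nodes, gpus, etc. move to a separate resource YAML.
study:
  - name: run-simulation
    description: Run the parallel portion of the study.
    run:
      cmd: $(LAUNCHER) ./simulate
      tasks: 32    # formerly "procs"
```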

Any thoughts are welcome. I'd like to target high-priority features first.

@jsemler (Collaborator) commented May 30, 2018

I like the idea of being able to spin up new studies from other studies. I think splitting out resources is a good starting point for exploring this use case. It also allows sharing template studies that can be run with different resources.

I think there are two high-level use cases here:

  1. Copy an existing study specification and modify it to create a new study. The new study could use the same resources as the original study or define new resources.
  2. Take an existing study and run it with different resources without modifying the study specification.

I think both cases are a good starting point for considering the composability of a study.

@FrankD412 (Member Author)

#93 has gotten its start in PR #120 -- Just noting it here.

@FrankD412 (Member Author)

#120 has been merged -- that happened a couple of weeks ago; I forgot to update here. I'm looking at revisiting MPI launching and expanding it some. I have an older branch with some data structures in it, but from the looks of it I may have to start fresh and port some of those over. This feature would also be a prime opportunity to update the specification's resource definition to be its own key within the step dictionaries. That will break existing specifications, but that's probably better now than later, when more users have existing workflows.

On the thought of launchers, I'm starting to think that it might be beneficial to offer a different approach to using launchers. We currently have the following methods:

  • $(LAUNCHER)
  • $(LAUNCHER)[Nn,Pp] (here "procs" is a misnomer; it actually refers to tasks, but a user could choose to overload processors with more than one task)
  • No token at all, which just runs a set of commands within the specified allocation.

Since we are moving towards having a resources entry in the step, that opens us up to a token similar to the workspaces construct $(stepname.workspace). A user would specify their resources in the resources entry (with the usual nodes, procs, tasks, gpus, etc.) and then refer to entries via $(resources.nodes) for resources[nodes]. That would allow users to construct their own parallel calls without relying on the LAUNCHER construct, so a command might look something like mpirun -n $(resources.tasks) ... <command>.
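
Sketched out in a step, that might look like the following (the resources key and $(resources.*) tokens are proposed syntax, not anything currently implemented):

```yaml
- name: run-simulation
  description: Hand-built MPI call using the proposed resource tokens.
  run:
    cmd: mpirun -n $(resources.tasks) ./simulate
    resources:
      nodes: 2
      tasks: 16
      gpus: 0
```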

@gonsie -- Any developments on platform configuration information? I've been thinking the simplest solution is a YAML file in a hidden Maestro directory, likely under the home directory (customizable).
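
For illustration, I'm picturing something along these lines (the path and every key are invented for the sake of the example):

```yaml
# ~/.maestro/platform.yaml -- hypothetical cluster description
cluster:
  name: mycluster
  scheduler: slurm
  nodes: 1024
  cores_per_node: 36
  gpus_per_node: 0
  default_queue: pbatch
```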

Any thoughts? @jsemler @gonsie @tadesautels @kcathey

@FrankD412 (Member Author) commented Oct 21, 2018

Revisiting this ticket to realign priorities. I've released a PR (#152) that adds the ability to pass custom arguments to the custom parameter generation function a user may define via --pgen. It was more critical in terms of user functionality, so I bumped its priority.
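
For reference, a pgen file consuming such a custom argument might look roughly like this. It follows the documented get_custom_generator interface, but treat the details (especially how custom arguments arrive in kwargs) as an approximation of what #152 enables rather than its exact API:

```python
from maestrowf.datastructures.core import ParameterGenerator


def get_custom_generator(env, **kwargs):
    """Build study parameters, pulling a custom value passed on the command line."""
    p_gen = ParameterGenerator()
    # Hypothetical custom argument forwarded from the CLI; fall back to a default.
    count = int(kwargs.get("COUNT", 10))
    p_gen.add_parameter("TRIAL", list(range(count)), "TRIAL.%%")
    return p_gen
```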

Revisiting the priorities at the top of this ticket, #76 and #110 (the tweaks in the comment above) could be rolled into the second version of the YAML specification (#151). #151 introduces a factory for supporting a wider range of specifications, and the associated v2.0 of the specification should support a wider set of resource specifications. For more on that, see issue #147. The specification improvements should help springboard into a refactoring for MPI (#71).

Another priority that should be tackled is the notion of pre- and post-steps, which is being discussed in #98 -- there are still some outstanding questions about some of the details.

The final priority, already in progress but probably the lowest relative to the others here, is reworking local execution to mimic how a scheduler behaves. That could help consolidate some of the divergent logic for handling _StepRecord state transitions (since local execution skips the submitted state and runs all local steps sequentially).


As a thought for a future refactor -- it's looking like responsibility for expanding a study into its ExecutionGraph may need to shift to the conductor. Shifting this from the Maestro frontend to the backend helps with some probable future issues related to scale and in-situ expansion. It also makes more conceptual sense: the Maestro frontend has everything to do with generating the core class instances that represent a study, whereas the backend is responsible for everything required to actually get the study running on metal. Overall, it seems to be a more flexible and clearer division of responsibilities.

@FrankD412 (Member Author)

I'm going to close this issue and make a new one with some other priorities and a summarization of some of the points here.
