Skip to content

OrteFrameworks

Jeff Squyres edited this page Sep 2, 2014 · 3 revisions

ORTE Frameworks

This page is being used to give a brief overview of the current frameworks in Open RTE, as well as showing how they interact.

Currently, this is in draft form. Any updates are welcome.

Frameworks

  1. errmgr: Error Manager
    • This is used to report errors that have happened, and to decide what to do next. Currently, it is does little besides passing and printing error messages, and aborting processes (either directly or by sending commands to others to abort processes).
  2. filem: Remote File Manager (new in 1.3)
    • Was added for the FT code. Is a generic framework used for moving files around between nodes. Currently only works on top of scp/rcp
    • Can be used to preload user's applications to remote nodes, as well as move around checkpoint files.
  3. gpr: General Purpose Registry
    • Publish/subscribe data storage system
    • Used to store node, process, and job information, as well as to distribute this information to any process which requests it.
    • This is a general purpose system which cannot take advantage of any knowledge of data. It is expected that we will soon be using this less so we can optimize our data storage and distribution based on the type of data.
  4. gprcomm: Group Communications (new in 1.3)
    • Provides broadcast-like functions to send a message to an entire job. This is used mainly to send out startup information, such as peer IP addresses.
  5. iof: I/O forwarding
    • Used to forward I/O from user applications and ORTE daemons to/from the head node process.
  6. ns: Name service
    • Provides unique names for processes, as well as string to name and name to string functions.
    • This framework will likely be eliminated soon, and consolidated with the PLS, RMGR, and SMR frameworks into a new PLM framework (process lifecycle management)
  7. odls: Daemon local launch system
    • In systems where the daemons launch the application processes, this does the launching of the application processes, and monitors them
    • In all systems sets up I/O to be forwarded from children
  8. oob: Out of Band communication
    • Low bandwidth channel used for all runtime messaging
    • This is used above by the rml, and not directly by anyone else
  9. pls: Process Launch system
    • Used to launch daemons on the compute nodes. In some systems, this also launches the application processes.
    • Passes context information to launched processes, and does some monitoring of launched processes
    • This framework will likely be eliminated soon, and consolidated with the NS, RMGR, and SMR frameworks into a new PLM framework (process lifecycle management)
  10. ras: Resource allocation system
    • Determines what nodes we can use. Has various components to:
      • process command line -host
      • determine from environment what nodes are available for use (such as from SLURM or TM)
  11. rds: Resource Discovery system
    • Mostly unused. Currently only used to read in hostfiles. This framework will be eliminated soon, and the hostfile capability moved into the RAS
  12. rmaps: Resource Mapping Subsystem
    • Used to determine what processes (i.e. what application and rank) should be launched on each node.
  13. rmgr: Resource Manager Subsystem
    • Central coordinator for many runtime activities. Makes calls to other frameworks to allocate resources, map processes, and launch processes.
    • This framework will likely be eliminated soon, and consolidated with the NS, PLS, and SMR frameworks into a new PLM framework (process lifecycle management)
  14. rml: Runtime Messaging Layer
    • Provides basic point-to-point messaging services. Intended to eventually provide routing using the routed framework.
  15. routed: Routing table for the RML (new in v1.3)
    • provides a routing infrastructure for the RML. Deals only with process names, telling the RML where to send a message next.
  16. schema: Data schema for the GPR
    • provides some data layout functionality, as well as constants used for storing and retrieving data from the GPR.
    • Has no components, just base functionality.
    • Slated for elimination with the planned reduction in usage for the GPR
  17. sds: Start-up discovery service
    • Used by both daemons and application processes to determine their name and other information (such as contact information for mpirun) based on clues in the environment (or through pipes). Clues can come from mpirun, orteds, or the launcher (ex. SLURM).
    • It is planned to expand the capabilities and usage of the sds
  18. smr: State of health monitoring subsystem
    • Used to monitor and report the state of processes, nodes and jobs. Currently, this is not used for much.
    • This framework will likely be eliminated soon, and consolidated with the NS, PLS, and RMGR frameworks into a new PLM framework (process lifecycle management)
  19. snapc: Snapshot coordination (new in v1.3)
    • Added for FT code. Used to coordinate the taking of checkpoints, coordinating the movement of snapshot images, and generating a handle which users can us to restart with

Diagram of how frameworks interact

Still needed. Any volunteers?

Clone this wiki locally