Consolidate WLM functions #127

Closed
7 tasks done
Spartee opened this issue Jan 17, 2022 · 1 comment · Fixed by #199
Labels
  • API break (issues that include incompatible API changes)
  • area: launcher (issues related to any of the launchers within SmartSim)
  • type: feature (issues that include a feature request or feature idea)


Spartee commented Jan 17, 2022

Description

With the addition of #120, functions specific to a user's workload manager (WLM) are now spread across several places.

We should consolidate these in a new smartsim.wlm module.

Justification

Dealing with any scheduler can be a headache. SmartSim is designed to remove some of these impediments. With the addition of more specific WLM functions, we can make it even easier for users to write driver scripts.

Implementation Strategy and Acceptance Criteria

  • Move smartsim.slurm to smartsim.wlm.slurm
  • Emit a deprecation warning when smartsim.slurm is imported
  • Brainstorm and write tickets for future helpful user-facing WLM functions.
    For Slurm and PBS, users should also be able to call:
  • get_hosts
  • get_queue
  • get_tasks
  • get_tasks_per_node
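
One possible shape for the deprecation step above, as a minimal sketch rather than the actual SmartSim implementation: `warn_moved` is a hypothetical helper, and the shim shown in the trailing comment assumes the functions have already been relocated to smartsim.wlm.slurm.

```python
import warnings


def warn_moved(old_name: str, new_name: str) -> None:
    """Emit a DeprecationWarning that points users from old_name to new_name."""
    warnings.warn(
        f"`{old_name}` has moved to `{new_name}` and will be removed in a "
        "future release; please update your imports.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller of warn_moved
    )


# Hypothetical contents of a compatibility shim left at smartsim/slurm.py:
#
#     from smartsim.wlm.slurm import *  # re-export the relocated functions
#     warn_moved("smartsim.slurm", "smartsim.wlm.slurm")
```

Python hides DeprecationWarning by default outside of `__main__` and test runners, so users may only see the message when running with `-W default` or under pytest.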

Example function

For collecting hostnames on PBS

import os

def collect_db_hosts(num_hosts):
    """Collect hostnames from PBS_NODEFILE.

    Needed because we are using OpenMPI (not needed for aprun (ALPS),
    Slurm, etc.).
    """
    if "PBS_NODEFILE" not in os.environ:
        raise Exception("could not parse interactive allocation nodes from PBS_NODEFILE")

    hosts = []
    with open(os.environ["PBS_NODEFILE"], "r") as f:
        for line in f:
            # strip the trailing newline, then keep the short hostname
            host = line.strip().split(".")[0]
            if host:
                hosts.append(host)

    # account for mpiprocs causing repeats in PBS_NODEFILE;
    # dict.fromkeys drops duplicates while preserving file order
    hosts = list(dict.fromkeys(hosts))

    if len(hosts) >= num_hosts:
        return hosts[:num_hosts]
    raise Exception(f"PBS_NODEFILE had {len(hosts)} hosts, not {num_hosts}")

Can we generalize this function?
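
One way the generalization might look, as a rough sketch only: `get_hosts` matches the name proposed in the task list, but its signature, the `env` parameter (added so the function can be exercised without a real allocation), and the two private helpers are assumptions. The Slurm branch relies on `scontrol show hostnames` to expand the compressed node list, so it only works where `scontrol` is on the PATH.

```python
import os
import subprocess


def _pbs_hosts(env):
    """Read short hostnames from the file named by PBS_NODEFILE."""
    with open(env["PBS_NODEFILE"], "r") as f:
        names = [line.strip().split(".")[0] for line in f if line.strip()]
    # mpiprocs > 1 repeats hosts; dict.fromkeys dedupes and keeps file order
    return list(dict.fromkeys(names))


def _slurm_hosts(env):
    """Expand the compressed SLURM_JOB_NODELIST using scontrol."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", env["SLURM_JOB_NODELIST"]],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def get_hosts(num_hosts=None, env=None):
    """Return the hosts of the current allocation (hypothetical signature)."""
    env = os.environ if env is None else env
    if "PBS_NODEFILE" in env:
        hosts = _pbs_hosts(env)
    elif "SLURM_JOB_NODELIST" in env:
        hosts = _slurm_hosts(env)
    else:
        raise RuntimeError("no PBS or Slurm allocation detected in the environment")

    if num_hosts is None:
        return hosts
    if len(hosts) < num_hosts:
        raise RuntimeError(f"allocation has {len(hosts)} hosts, not {num_hosts}")
    return hosts[:num_hosts]
```

Dispatching on environment variables keeps driver scripts free of scheduler-specific branches; an explicit launcher argument would be an equally reasonable design.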

In the Runtime prototype, the functions are defined like this:

import os

def collect_entire_job_info():
    # job-wide values, identical on every node of the allocation
    env = os.environ.copy()
    stats = {}
    stats["user"] = env["SLURM_JOB_USER"]
    stats["jobid"] = env["SLURM_JOBID"]
    stats["jobname"] = env["SLURM_JOB_NAME"]
    stats["job-num-nodes"] = env["SLURM_NNODES"]
    stats["job-node-list"] = env["SLURM_JOB_NODELIST"]
    stats["queue"] = env["SLURM_JOB_PARTITION"]
    return stats

def collect_by_node_job_info():
    env = os.environ.copy()
    stats = {}
    # Slurm exposes a lot more here, but PBS does not
    # provide as much...
    stats["task-pid"] = env["SLURM_TASK_PID"]
    stats["node-id"] = env["SLURM_NODEID"]
    stats["task-id"] = env["SLURM_PROCID"]  # or SLURM_LOCALID??
    return stats
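
To cover PBS as well, the job-wide lookup could be driven by a per-scheduler table of environment variables. The following is only a sketch: `get_job_info`, `detect_wlm`, the `_JOB_VARS` table, and the neutral key names are all hypothetical. It uses SLURM_JOB_ID, which Slurm sets alongside the older SLURM_JOBID used above, and it returns None for variables the scheduler did not set.

```python
import os

# Scheduler-neutral key -> environment variable, per workload manager.
# The key names are illustrative; the variables are set by Slurm and PBS.
_JOB_VARS = {
    "slurm": {
        "user": "SLURM_JOB_USER",
        "jobid": "SLURM_JOB_ID",
        "jobname": "SLURM_JOB_NAME",
        "queue": "SLURM_JOB_PARTITION",
    },
    "pbs": {
        "user": "PBS_O_LOGNAME",
        "jobid": "PBS_JOBID",
        "jobname": "PBS_JOBNAME",
        "queue": "PBS_QUEUE",
    },
}


def detect_wlm(env):
    """Guess the workload manager from the job ID variable it sets."""
    if "SLURM_JOB_ID" in env:
        return "slurm"
    if "PBS_JOBID" in env:
        return "pbs"
    raise RuntimeError("not running inside a Slurm or PBS job")


def get_job_info(env=None):
    """Collect job-wide details under scheduler-neutral keys."""
    env = os.environ if env is None else env
    wlm = detect_wlm(env)
    # env.get leaves a key as None when the scheduler did not set the variable
    info = {key: env.get(var) for key, var in _JOB_VARS[wlm].items()}
    info["wlm"] = wlm
    return info
```

Per-task values such as SLURM_PROCID have no direct PBS counterpart in this table, which is the gap the comment in the prototype points at.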

@Spartee Spartee added area: launcher Issues related to any of the launchers within SmartSim type: feature Issues that include feature request or feature idea labels Jan 17, 2022
@al-rigazzi
Collaborator

We are going to add what was in #186: utilities for getting details about allocated (and running) jobs, which moves those functions out of the conftest.py file.

@Spartee Spartee added the API break Issues that include incompatible API changes label Apr 25, 2022
@MattToast MattToast linked a pull request May 17, 2022 that will close this issue