Application Execution and Examples

melrom edited this page Feb 24, 2012 · 1 revision

This guide provides a conceptual overview of BigJob and detailed information on how to use BigJob for developing workflows.

Is BigJob right for your needs?

1. Do you need to execute a large number of compute tasks on a busy HPC cluster? If so, BigJob avoids the queue waiting time that each task incurs when it is submitted individually through a traditional scheduling system.

2. Are you designing workflows? BigJob helps here because it decouples task submission from resource assignment.

Getting Started

Before starting development, make sure that BigJob is installed and loaded successfully. If the statement 'import bigjob' executes without error in a Python shell, BigJob has been installed and loaded correctly.

(python)-bash-3.2$ python
Python 2.7.1 (r271:86832, Jun 13 2011, 12:48:51) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> import bigjob
01/15/2012 10:05:23 AM - bigjob - DEBUG - Loading BigJob version: 0.4.23
01/15/2012 10:05:23 AM - bigjob - DEBUG - read configfile: /N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.4.23-py2.7.egg/bigjob/../bigjob.conf
01/15/2012 10:05:23 AM - bigjob - DEBUG - Using SAGA C++/Python.
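
The same check can be scripted, for example at the top of an application, so that a missing installation fails early with a clear message. The following is a minimal sketch that uses only the Python standard library; the helper name is_importable is hypothetical and is not part of BigJob.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "bigjob" is the module name used in the shell session above.
    if is_importable("bigjob"):
        print("BigJob is installed and loadable")
    else:
        print("BigJob could not be imported; check the installation")
```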

BigJob Jargon

Familiarity with the terms below will help you understand how BigJob works.

1. Application - A program that specifies the HPC resources to be used to execute a set of tasks and defines the dependencies between those tasks.

2. Sub-Job - A task together with the information needed to run it: the executable, the environment variables required to execute the task, the number of processes required, the arguments to the executable, the SPMD variation (serial vs MPI), the output file, and the error file.

3. BigJob-Manager - The BigJob-Manager stores the information about Sub-Jobs and is responsible for orchestrating the interactions between the BigJob-Agents and the manager.

4. BigJob-Agent - One BigJob-Agent is launched for each HPC resource specified. When the resource becomes available, the BigJob-Agent becomes active, pulls the stored Sub-Job information, and executes the Sub-Job on that HPC resource.

5. Coordination system - A database used by the BigJob-Manager to store Sub-Job information and to orchestrate the BigJob-Agents. Active BigJob-Agents use it to pull Sub-Job information and execute the Sub-Jobs on HPC resources.
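
The relationship between these terms can be illustrated with a toy, in-memory model. The sketch below is not BigJob code and does not reflect how BigJob is implemented: the classes ToyCoordination, ToyManager and ToyAgent and the resource names are all hypothetical, and "executing" a Sub-Job just records its name. It only shows the pull model described above, in which the manager stores Sub-Jobs and active agents fetch them.

```python
from collections import deque


class ToyCoordination(object):
    """Stands in for the coordination database (hypothetical, in-memory)."""

    def __init__(self):
        self._pending = deque()

    def push(self, subjob):
        self._pending.append(subjob)

    def pull(self):
        # Returns None when no Sub-Job is waiting.
        return self._pending.popleft() if self._pending else None


class ToyManager(object):
    """Stores Sub-Jobs in the coordination system."""

    def __init__(self, coordination):
        self.coordination = coordination

    def submit(self, subjob):
        self.coordination.push(subjob)


class ToyAgent(object):
    """Pulls Sub-Jobs for one resource and records them as executed."""

    def __init__(self, resource, coordination):
        self.resource = resource
        self.coordination = coordination
        self.executed = []

    def run_one(self):
        subjob = self.coordination.pull()
        if subjob is not None:
            self.executed.append(subjob)
        return subjob


coordination = ToyCoordination()
manager = ToyManager(coordination)
for name in ["task-0", "task-1", "task-2"]:
    manager.submit(name)

agents = [ToyAgent("cluster-a", coordination),
          ToyAgent("cluster-b", coordination)]

# Every agent gets one pull per round; stop when a round pulls nothing.
while any([agent.run_one() is not None for agent in agents]):
    pass

for agent in agents:
    print(agent.resource, agent.executed)
```

Note that the manager never assigns a task to a particular agent: whichever agent is active pulls the next task, which is the decoupling of task submission from resource assignment mentioned at the start of this guide.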

Developing Applications using BigJob API

a. Identify the coordination system to be used. Either the SAGA Advert service or Redis (see FAQ 6) can serve as the coordination system. Set a suitable COORDINATION_URL in the example scripts, as shown below. The lines below are alternatives; set only one of them.

Advert Service:

# Uses a sqlite3 database as the coordination system. Works only on localhost.
COORDINATION_URL = "advert://localhost/?dbtype=sqlite3"
# Uses the PostgreSQL database on advert.cct.lsu.edu at port 8080 as the coordination system.
# SAGA and SAGA_client are the user ID and password for the database.
COORDINATION_URL = "advert://SAGA:SAGA_client@advert.cct.lsu.edu:8080/?dbtype=postgresql"

Redis:

COORDINATION_URL = "redis://localhost:6379"  # Uses a local Redis database as the coordination system.
COORDINATION_URL = "redis://cyder.cct.lsu.edu:2525"  # Uses the Redis database on cyder.cct.lsu.edu at port 2525 as the coordination system.
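
To make the parts of a coordination URL explicit (backend, host, port, credentials and database type), the URL can be split with the Python standard library. This is a minimal sketch for inspection only; the function describe_coordination_url is a hypothetical helper, and nothing here claims that BigJob parses the URL this way.

```python
try:
    from urllib.parse import urlsplit, parse_qs  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs  # Python 2


def describe_coordination_url(url):
    """Split a coordination URL into its parts for inspection."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "backend": parts.scheme,    # "advert" or "redis"
        "host": parts.hostname,
        "port": parts.port,         # None if the URL gives no port
        "user": parts.username,     # None if the URL gives no credentials
        "dbtype": query.get("dbtype", [None])[0],
    }


for url in [
    "advert://localhost/?dbtype=sqlite3",
    "redis://localhost:6379",
]:
    print(describe_coordination_url(url))
```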

b. Identify the HPC clusters to be used and provide the resource specification for each: resource URL, number of nodes, processes per node, wall time, queue, allocation information, and working directory (where the BigJob agent executes). The resource URL depends on the adaptor that is suitable for the infrastructure. To scale to multiple HPC clusters, append one resource specification per cluster to the resource_list object. Make sure that password-less access is enabled when jobs are submitted to remote machines (see https://github.com/saga-project/BigJob/wiki/Configuration-of-SSH-for-Password-less-Authentication ).

example:

import os

resource_list = []  # one dictionary per HPC resource
resource_list.append({"resource_url": "pbs-ssh://eric1.loni.org", "processes_per_node": "4",
                      "number_of_processes": "4", "allocation": "TG-12321", "queue": "workq",
                      "working_directory": (os.getcwd() + "/agent"), "walltime": 10})
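
When several clusters are used, building each specification through a small helper keeps the dictionaries consistent. The sketch below is one possible way to do this: make_resource is a hypothetical helper, not a BigJob function, and the process counts for the second cluster are placeholder values chosen only for illustration.

```python
import os


def make_resource(resource_url, number_of_processes, processes_per_node,
                  walltime, queue=None, allocation=None,
                  working_directory=None):
    """Build one resource specification dictionary (hypothetical helper)."""
    spec = {
        "resource_url": resource_url,
        # The example above passes these two values as strings.
        "number_of_processes": str(number_of_processes),
        "processes_per_node": str(processes_per_node),
        "working_directory": working_directory
                             or os.path.join(os.getcwd(), "agent"),
        "walltime": walltime,
    }
    # Queue and allocation are only added when given.
    if queue is not None:
        spec["queue"] = queue
    if allocation is not None:
        spec["allocation"] = allocation
    return spec


resource_list = []
# Same values as the example above.
resource_list.append(make_resource("pbs-ssh://eric1.loni.org", 4, 4, 10,
                                   queue="workq", allocation="TG-12321"))
# Second cluster: appending another specification scales the application out.
# The process counts here are placeholders.
resource_list.append(make_resource("pbs-ssh://sierra.futuregrid.org", 8, 8, 10))

for spec in resource_list:
    print(spec["resource_url"], spec["number_of_processes"])
```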

Use a suitable resource URL based on the table below.

| Infrastructure | Supported adaptor | Description | Information |
| LONI | GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (grid-proxy-init) before executing the BigJob application. Example URL: gram://eric1.loni.org/jobmanager-pbs | Suggested |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example URL: fork://localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example URL: ssh://eric1.loni.org | |
| | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example URL: pbs-ssh://eric1.loni.org | Doesn't work since ssh adaptors are not available on LONI |
| XSEDE | GRAM | Uses Globus to submit jobs. Globus certificates are required. Initiate a grid proxy (myproxy-logon) before executing the BigJob application. Example URL: gram://gatekeeper.ranger.tacc.teragrid.org:2119/jobmanager-sge | Suggested. The Globus resource URLs of XSEDE machines are listed at https://www.xsede.org/wwwteragrid/archive/web/user-support/gram-gatekeepers-gateway.html |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example URL: fork://localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example URL: ssh://eric1.loni.org | |
| | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example URL: pbs-ssh://eric1.loni.org | Not suitable for HPC resources using the SGE scheduling system |
| FutureGrid | pbs-ssh | Submits jobs to the target machine's scheduling system. Password-less login to the target machine is required. Example URL: pbs-ssh://sierra.futuregrid.org | Suggested |
| | fork | Submits jobs only on the localhost head node. Password-less login to localhost is required. Example URL: fork://localhost | |
| | ssh | Submits jobs on the target machine's head node. Password-less login to the target machine is required. Example URL: ssh://sierra.futuregrid.org | |
| | PBSPro | Submits jobs to the local machine's scheduling system. Example URL: pbspro://localhost | |
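
The table can also be encoded as a small lookup, so that a script can warn about an unsuitable resource URL before anything is submitted. This is a sketch under the assumption that the adaptor is the part of the URL before the first colon; the names SUPPORTED_ADAPTORS, SUGGESTED, CAVEATS and check_resource_url are hypothetical and not part of BigJob.

```python
# Adaptors listed per infrastructure in the table above.
SUPPORTED_ADAPTORS = {
    "LONI": ["gram", "fork", "ssh", "pbs-ssh"],
    "XSEDE": ["gram", "fork", "ssh", "pbs-ssh"],
    "FutureGrid": ["pbs-ssh", "fork", "ssh", "pbspro"],
}

# Adaptor marked "Suggested" for each infrastructure.
SUGGESTED = {"LONI": "gram", "XSEDE": "gram", "FutureGrid": "pbs-ssh"}

# Restrictions noted in the "Information" column.
CAVEATS = {
    ("LONI", "pbs-ssh"): "ssh adaptors are not available on LONI",
    ("XSEDE", "pbs-ssh"): "not suitable for resources using SGE",
}


def check_resource_url(infrastructure, resource_url):
    """Return (adaptor, is_suggested, caveat) for a resource URL.

    Raises ValueError if the adaptor is not listed for the infrastructure.
    """
    # Assumption: the adaptor name is everything before the first colon.
    adaptor = resource_url.split(":", 1)[0].lower()
    if adaptor not in SUPPORTED_ADAPTORS[infrastructure]:
        raise ValueError("adaptor %r is not listed for %s"
                         % (adaptor, infrastructure))
    is_suggested = adaptor == SUGGESTED[infrastructure]
    caveat = CAVEATS.get((infrastructure, adaptor))
    return adaptor, is_suggested, caveat


print(check_resource_url("FutureGrid", "pbs-ssh://sierra.futuregrid.org"))
print(check_resource_url("LONI", "pbs-ssh://eric1.loni.org"))
```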

c. Start the BigJob agents on the HPC resources, passing the resource list and the coordination URL as parameters.

example:

mjs = many_job_service(resource_list, COORDINATION_URL)

d. Create Sub-Jobs with their specifications: the executable, the environment variables required to execute the Sub-Job, the arguments to the executable, the number of processes required, the SPMD variation (serial vs MPI), the output file, and the error file.

example:

# `mjs` is the many_job_service created in step c. `i` is the index of this
# Sub-Job, for example a loop counter when several Sub-Jobs are created.
i = 0
jd = description()
jd.executable = "/bin/cat"  # Executable name with absolute path
jd.number_of_processes = "1"  # Number of processes required for the Sub-Job
jd.environment = ["k=123", "HPATH=/home/usrk/"]  # Environment variables for the Sub-Job
jd.spmd_variation = "single"  # SPMD variation ("single" or "mpi")
jd.arguments = ["text.txt"]  # Arguments to the executable
jd.working_directory = "/home/pmanth2"  # Location where the Sub-Job executes
jd.output = "stdout-" + str(i) + ".txt"  # Sub-Job output file name
jd.error = "stderr-" + str(i) + ".txt"  # Sub-Job error file name
subjob = mjs.create_job(jd)  # Creates the Sub-Job from the given job description
subjob.run()  # Submits the Sub-Job for execution
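
The output and error file names above include an index i, which suggests creating Sub-Jobs in a loop. The sketch below wraps that loop in a function. It uses only the calls shown above (description(), create_job() and run()) and takes the service and the description class as arguments, so it does not assume how they are imported; the function name submit_subjobs is hypothetical.

```python
def submit_subjobs(mjs, description, count, working_directory):
    """Create and run `count` Sub-Jobs, each with its own output files.

    `mjs` is the many_job_service from step c and `description` is the
    job description class used in the example above.
    """
    subjobs = []
    for i in range(count):
        jd = description()
        jd.executable = "/bin/cat"
        jd.number_of_processes = "1"
        jd.spmd_variation = "single"
        jd.arguments = ["text.txt"]
        jd.working_directory = working_directory
        # One output/error file pair per Sub-Job, distinguished by index.
        jd.output = "stdout-" + str(i) + ".txt"
        jd.error = "stderr-" + str(i) + ".txt"
        subjob = mjs.create_job(jd)
        subjob.run()
        subjobs.append(subjob)
    return subjobs
```

With the objects from the previous steps, a call might look like submit_subjobs(mjs, description, 4, "/home/pmanth2"). Returning the Sub-Job objects keeps a handle on each one for later inspection.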

BigJob Examples

The following BigJob examples submit local or remote jobs and can be used as building blocks for developing applications. They can be downloaded from https://github.com/drelu/BigJob/tree/master/examples.

Example running a single BigJob and a single Sub-Job on localhost: https://raw.github.com/drelu/BigJob/master/examples/example_local_single.py

Example running a single BigJob and multiple Sub-Jobs on localhost: https://raw.github.com/drelu/BigJob/master/examples/example_local_multiple.py

Example running multiple BigJobs and executing Sub-Jobs on multiple/distributed resources: https://raw.github.com/drelu/BigJob/master/examples/example_manyjob_local.py

Execution

Run the BigJob example script:

python <example script>

Logs & Errors

Log and error files are written to the working directories given in the resource specification and in the Sub-Job specifications. A guide for debugging can be found on the Debugging page.
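
After a run, a quick way to spot failed Sub-Jobs is to look for error files that are not empty. The sketch below assumes the stderr-<i>.txt naming used in the Sub-Job example above and that the files are in a single working directory; the function nonempty_error_files is a hypothetical helper, not part of BigJob.

```python
import glob
import os


def nonempty_error_files(working_directory):
    """Return sorted paths of stderr-*.txt files that contain output."""
    pattern = os.path.join(working_directory, "stderr-*.txt")
    return sorted(path for path in glob.glob(pattern)
                  if os.path.getsize(path) > 0)


if __name__ == "__main__":
    # Checks the current directory; pass the Sub-Job working directory instead.
    for path in nonempty_error_files(os.getcwd()):
        print("Sub-Job wrote to its error file: " + path)
```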
