Distributed Parallel Support

Accelerator supports jobs that require multiple CPUs that can be located on different machines.

When a Distributed Parallel (DP) job is submitted to the scheduler, the job is divided into partial jobs. Each partial job can require a different set of resources and cumulatively requires all the resources of the Distributed Parallel job. Accelerator schedules each partial job separately. When all partial jobs have been dispatched, one designated partial job executes the actual Distributed Parallel job. Depending on the submission method, there are several ways to take advantage of the computing resources assigned to the other partial jobs.

Submission Methods

There are three common methods to submit Distributed Parallel jobs.

Use -dp to submit a Distributed Parallel job with N partial jobs
In the example below, this method creates a Distributed Parallel job with two partial jobs. When both partial jobs have been dispatched, the active partial job becomes the master and begins executing the command sleep 10, while the other partial job waits for the master to terminate. By default, the active partial job is the first one.
For example:
% nc run -dp 2 sleep 10
This method is rarely used due to the burden it puts on the tool integrator to find a way to use the other partial jobs.
Use -dp N with vovparallel LSB_HOSTS
When all N partial jobs have been dispatched, the active partial job executes vovparallel, which sets the environment variable LSB_HOSTS before invoking the master command. LSB_HOSTS contains the list of all hosts, possibly with repetitions, currently set aside to run all partial jobs of the Distributed Parallel job. The master command is expected to use rsh or ssh to reach those hosts and launch the appropriate software.
For example:
% nc run -dp 2 vovparallel sleep 10
In this case, the active partial job runs sleep 10 with the environment variable LSB_HOSTS set to the list of the two hosts that are running the partial jobs.
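A master command used with this method typically expands LSB_HOSTS and launches workers on the listed hosts. The sketch below is illustrative, not part of Accelerator: the dp_fanout helper and the tool_worker path are hypothetical, and it only prints the launches it would perform instead of calling ssh.

```shell
# Sketch of a master command for the LSB_HOSTS method.
# LSB_HOSTS is a space-separated host list, possibly with repetitions;
# each repetition stands for one slot reserved on that host.
dp_fanout() {
    for host in $LSB_HOSTS; do
        # A real wrapper would do something like:
        #   ssh "$host" /path/to/tool_worker
        echo "launch on $host"
    done
}
```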
Use -dp N together with vovparallel clone
This method clones the specified wrapper script across the selected hosts. The wrapper script then executes the tool-specific commands on the appropriate hosts and will have access to the Distributed Parallel environment variables and job properties.
For example:
% nc run -dp 2 vovparallel clone mywrapper
In this case, the active partial job executes vovparallel clone ... which in turn connects to the other partial jobs to invoke N similar instances of the script mywrapper on all partial jobs.
This is the preferred method because it doesn't require ssh/rsh functionality to activate the partial jobs, and it allows Accelerator to track memory and CPU utilization for each partial job.
At execution time, each partial job is given environment variables that tell it which role to play in the Distributed Parallel job. The variables are:
  • VOV_DP_TOPJOBID, with the VovId of the top level Distributed Parallel job
  • VOV_DP_COUNT, the overall number of partial jobs used for the Distributed Parallel job
  • VOV_DP_RANK, a number from 1 to VOV_DP_COUNT used to identify each partial job
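A wrapper can branch on these variables to decide its role; the dp_role helper below is an illustrative sketch, not part of Accelerator.

```shell
# Decide this partial job's role from the Distributed Parallel
# environment variables set at execution time.
dp_role() {
    if [ -z "${VOV_DP_RANK:-}" ]; then
        echo "not a Distributed Parallel job"
    elif [ "$VOV_DP_RANK" -eq 1 ]; then
        echo "master ($VOV_DP_RANK of $VOV_DP_COUNT, top job $VOV_DP_TOPJOBID)"
    else
        echo "worker ($VOV_DP_RANK of $VOV_DP_COUNT)"
    fi
}
```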

Partial Job Rank and Resources

The active partial job, which starts the execution upon completion of dispatching of all partial jobs, is normally partial job number 1. This can be overridden with the -dpactive option.
% nc run -dp 8 -dpactive 8 uname -a

To define the resource list for each partial job of the Distributed Parallel job, use the -dpres option. This option takes a comma-separated list of simple resource lists. The first element in the list is used for the first partial job, the second element for the second partial job, and so on. If there are more partial jobs than elements in the list, the last element in the list is used for all remaining partial jobs.

In the following example, the first partial job is dispatched to a Linux machine, the second partial job to a macosx machine, and all other partial jobs go to UNIX machines. The active partial job is number 2, which is the one running on macosx, so the tasker running macosx becomes the master for the Distributed Parallel job set. This also works with tasker names, or any other tasker resource for that matter (kernel, RAM, etc.).
% nc run -dp 10 -dpres "linux,macosx,unix" -dpactive 2 vovparallel clone mywrapper
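The mapping rule above ("the last element is reused for remaining partial jobs") can be sketched as a small helper; dpres_for_rank is illustrative only and not an Accelerator command.

```shell
# Return the -dpres element for a given partial job.
# $1 = comma-separated -dpres list, $2 = 1-based rank of the partial job.
# Past the end of the list, the last element is reused.
dpres_for_rank() {
    _list=$1
    _rank=$2
    IFS=,
    set -- $_list      # split the list on commas
    unset IFS          # restore default word splitting
    while [ $# -gt 1 ] && [ "$_rank" -gt 1 ]; do
        shift
        _rank=$((_rank - 1))
    done
    echo "$1"
}
```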

Distributed Parallel Properties

When a Distributed Parallel job is submitted, the partialTool job sets the following properties on the top-level job:
  • The rank of the active partial job, which is the partial job that launches the top-level job.
  • The number of times the partial jobs have failed to be allocated within the allowed time.
  • DP_COHORTWAIT: created when -nocohortwait is passed with nc run. This instructs partialTool to let each cohort task finish when its subtask process has finished, rather than wait for the primary job to complete (the default behavior). Using -nocohortwait sets DP_COHORTWAIT to zero (0); otherwise, the default is 1.
  • An explanation of failures.
  • DP_HOSTS: the list of hosts for the Distributed Parallel job. This property is set when the last partial job is launched. The list may contain duplicates.
  • The host and dp_port pair, set by partialTool for partial job X.
  • A counter used by the partial jobs to count the rank.
  • A second synchronization counter used by partialTool.
  • The ID of the set that contains the list of partial jobs.
  • DP_WAIT: how long a partial job waits before giving up the slot (default 30 seconds).
  • The maximum allowed value of DP_WAIT for a job (default 30 minutes).
In addition to these properties on the top-level job, the following property is set on all other partial jobs:
  • The ID of the top-level job.

Properties can be accessed via the vovprop command or via the Tcl API with vtk_prop_get. Once obtained, they can be used in conjunction with rsh/ssh, for example, to interact with the chosen hosts.
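For example, a wrapper might read the DP_HOSTS property and extract the master's host, as the example csh script later in this section does with `vovprop GET $VOV_DP_TOPJOBID DP_HOSTS`. The vovprop call is stubbed out below so the parsing stands alone; master_host is an illustrative helper.

```shell
# Extract the master host from a DP_HOSTS property value.
# In a real wrapper the value would come from:
#   vovprop GET $VOV_DP_TOPJOBID DP_HOSTS
# DP_HOSTS is space-separated and may contain duplicates; the first
# entry corresponds to partial job 1, the default active (master) job.
master_host() {
    set -- $1          # split the host list on whitespace
    echo "$1"
}
```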

Using Different Resources and Jobclasses

When specifying DP jobs, you can stack jobclasses so that the primary and component jobs have different resources and job classes.

This is done by setting VOV_JOB_DESC(dp,resources) and VOV_JOB_DESC(dp,jobclasses) to specify the resources and jobclass labels for the master and subcomponent DP jobs.

For example, two jobclass definitions, mycalibre.tcl and mycalibre2.tcl, are defined as follows:

# Copyright (c) 1995-2021, Altair Engineering
# All Rights Reserved.

# $Id: //vov/trunk/src/scripts/jobclass/short.tcl#3 $

set classDescription "My Calibre job resources"

puts "This is mycalibre jobclass"
if { [info exists VOV_JOB_DESC(dp,resources)] } {
  set VOV_JOB_DESC(dp,resources)  [string cat $VOV_JOB_DESC(dp,resources) ",RAM/200 CORES/2"]
  set VOV_JOB_DESC(dp,jobclasses) [string cat $VOV_JOB_DESC(dp,jobclasses) ",mycalibre_a"]
} else {
  set VOV_JOB_DESC(dp,resources)  "RAM/200 CORES/2"
  set VOV_JOB_DESC(dp,jobclasses) "mycalibre_a"
}

proc initJobClass {} {
}

The second jobclass file, mycalibre2.tcl:
# Copyright (c) 1995-2021, Altair Engineering
# All Rights Reserved.

# $Id: //vov/trunk/src/scripts/jobclass/short.tcl#3 $

set classDescription "My Calibre job resources"

puts "This is mycalibre2.tcl"
if { [info exists VOV_JOB_DESC(dp,resources)] } {
  puts "VOV_JOB_DESC(dp,resources) already exists.  This is good!"
} else {
  puts "VOV_JOB_DESC(dp,resources) does not appear to exist.  This is bad."
}

if { [info exists VOV_JOB_DESC(dp,resources)] } {
  set VOV_JOB_DESC(dp,resources)  [string cat $VOV_JOB_DESC(dp,resources) ",RAM/400 CORES/4"]
  set VOV_JOB_DESC(dp,jobclasses) [string cat $VOV_JOB_DESC(dp,jobclasses) ",mycalibre_b"]
} else {
  set VOV_JOB_DESC(dp,resources)  "RAM/400 CORES/4"
  set VOV_JOB_DESC(dp,jobclasses) "mycalibre_b"
}

proc initJobClass {} {
}
You then call nc run as:
nc run -v 5 -C mycalibre -C mycalibre2 -e BASE -J jeffjob -dp 4 vovparallel clone sleep 1

This results in the primary job having the jobclass "mycalibre_a" and the resources "RAM/200 CORES/2" while the secondary jobs have a jobclass of "mycalibre_b" and the resources "RAM/400 CORES/4".

Distributed Parallel Slot Timeout

With all of the methods described above, each partial job waits a finite amount of time for all other partial jobs to show up. If the time elapses, the partial job gives up, fails, and returns the slot to the farm. The wait time is 30 seconds by default, but may need to be larger if the farm is heavily loaded. The wait time can be specified using the -dpwait option.
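For example, to allow each partial job to wait up to two minutes for the others (the 120-second value is illustrative):

```
% nc run -dp 4 -dpwait 120 vovparallel clone mywrapper
```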

Regardless of the specified wait time, there is a maximum wait time (default 30 minutes) that can be changed by manually setting the DP_WAIT_MAX property on the job, using the -P option with nc run.

vovparallel Clone

An example of a script used with vovparallel clone is available in $VOVDIR/training/vnc/simple_dp_script.csh.
#!/bin/csh -f
# Example of a script to be used with  vovparallel clone.
# Example of usage:
# % nc run -dp 4 simple_dp_script.csh

set sleepTime = 30
if ( $#argv > 0 ) then
   set sleepTime = $1
endif

# Each application has its own way to determine a rendezvous port.
# In this example, it is a fixed port (the value is illustrative).
set APPLICATION_PORT = 6789

if ( ! $?VOV_DP_RANK ) then
    echo "ERROR: Variable VOV_DP_RANK not defined."
    echo "       This script needs to be run with vovparallel clone ..."
    exit 1
endif

echo "Hello! I am component $VOV_DP_RANK"

if ( $VOV_DP_RANK == 1 ) then
    echo "This is the master."
    echo "This should open a socket for communication, e.g. on port $APPLICATION_PORT"
    set DP_HOSTS = `vovprop GET $VOV_DP_TOPJOBID DP_HOSTS`
    echo $DP_HOSTS
    sleep $sleepTime
else
    echo "This is the tasker"
    set DP_HOSTS = `vovprop GET $VOV_DP_TOPJOBID DP_HOSTS`
    set masterHost = $DP_HOSTS[1]
    echo "This should communicate with master through ${masterHost}:$APPLICATION_PORT"
    sleep $sleepTime
endif

exit 0

A more advanced script to use with vovparallel clone is available in $VOVDIR/eda/MentorGraphics/vovcalibremt.

OpenMPI Support

If you have an application that uses OpenMPI, you can submit it as a Distributed Parallel application (options -dp, -dpres, etc.), using the wrapper vovmpirun.

For example, if you want the application to run with N components (possibly on different hosts), you can submit it with:
% nc run -dp <N> vovmpirun ./path/to/application
Note: This support relies on the fact that OpenMPI uses ssh to start orted on the remote hosts. OpenMPI is forced to use $VOVDIR/hidden_mpi/ssh.