Preemption Rules to Speed Up FairShare

Every node in the FairShare tree represents a FairShare group. Each job belongs to one and only one FairShare group. Every node in the FairShare tree is assigned a target share, which depends on both the weights assigned to the nodes in the tree and on the activity of the nodes. A FairShare node is considered active if it has at least one job that is queued, running or suspended. All nodes that are not active are assigned a FairShare target of zero.

The target share of a FairShare node is available through the field FS_TARGET for any job that belongs to that FairShare node. The FairShare target is a fractional number less than 1.0, but the FS_TARGET field is an integer in the range from 0 to 10,000, obtained by multiplying the FairShare target by 10,000. For example, an FS_TARGET of 8000 indicates that the FairShare node has a target share of 80% (0.8).
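
The scaling can be illustrated with a small Python sketch. It is not part of the product; the helper names `to_fs_field` and `from_fs_field` are hypothetical, and rounding to the nearest integer is an assumption, since the text only states the factor of 10,000.

```python
FS_SCALE = 10_000  # scaling factor stated in the text


def to_fs_field(fraction: float) -> int:
    """Scale a fractional share (0.0 to 1.0) to the integer field range.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(fraction * FS_SCALE)


def from_fs_field(value: int) -> float:
    """Convert a scaled field value back to a fractional share."""
    return value / FS_SCALE


print(to_fs_field(0.8))     # 8000
print(from_fs_field(8000))  # 0.8
```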

The field FS_RUNNING represents the fraction of running jobs in a FairShare group, relative to all running jobs in the system. This field is also scaled up by a factor of 10,000. The field FS_EXCESS_RUNNING is the difference FS_RUNNING minus FS_TARGET, and it measures how far a group is above or below its target. A positive value of FS_EXCESS_RUNNING means that the FairShare group is running more jobs than it should; a negative value means that it is running fewer.
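
As a worked example of the arithmetic, the following Python sketch computes both fields for a made-up group. The function names and the job counts are hypothetical, and rounding to the nearest integer is an assumption.

```python
FS_SCALE = 10_000  # scaling factor stated in the text


def fs_running(group_running: int, total_running: int) -> int:
    """Scaled share of running jobs held by one group."""
    if total_running == 0:
        return 0  # nothing is running, so no group holds a share
    return round(FS_SCALE * group_running / total_running)


def fs_excess_running(fs_running_value: int, fs_target: int) -> int:
    """Positive: above target. Negative: below target."""
    return fs_running_value - fs_target


# Hypothetical group: 30 of 100 running jobs, with a target share of 20%.
running = fs_running(30, 100)
excess = fs_excess_running(running, 2000)
print(running, excess)  # 3000 1000
```

Here the group holds 30% of the running jobs against a 20% target, so its excess is positive and it is a candidate for being preempted.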

The field FS_RUNNING_COUNT is the number of jobs a FairShare group is running.

The field FS_HISTORY represents the fraction of jobs that have been run by a FairShare group in the FairShare window (typically 2 hours) relative to all other jobs that have been run in the system. The difference between FS_HISTORY and FS_TARGET is FS_EXCESS_HISTORY, and is similar to FS_EXCESS_RUNNING explained above.

The field FS_RANK is computed by the scheduler and assigned to each FairShare group that has jobs in the queue. The jobs are dispatched to taskers in ascending order of rank, starting from the group of rank zero (0). Groups that have no jobs in the queue are assigned the conventional rank -1. For FairShare and preemption to work harmoniously, the rank of the preempted job must be greater than the rank of the preempting job, which is why the preemption rules should contain a term of the form FSRANK>@FSRANK@ in the -preemptable option. To also allow the preemption of running jobs whose group has no queued jobs, use the field FS_RANK9 instead. FS_RANK9 is the same as FS_RANK, except that its value for groups that have no queued jobs is 9,999,999 instead of -1, which simplifies the comparison in the preemptable rule: FSRANK9>@FSRANK9@.
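
The effect of the two rank conventions can be sketched in Python. This is only a model of the comparison described above, not the scheduler's code; the function names are hypothetical.

```python
NO_RANK = -1                # FS_RANK of a group with no queued jobs
RANK9_NO_QUEUE = 9_999_999  # FS_RANK9 of the same group


def fs_rank9(fs_rank: int) -> int:
    """Map FS_RANK to FS_RANK9 as described in the text."""
    return RANK9_NO_QUEUE if fs_rank == NO_RANK else fs_rank


def rank_allows_preemption(candidate_rank: int, preempting_rank: int) -> bool:
    """Model of the term FSRANK9>@FSRANK9@ (ranks given as FS_RANK values)."""
    return fs_rank9(candidate_rank) > fs_rank9(preempting_rank)


# A running job whose group has no queued jobs (FS_RANK == -1),
# compared with a queued job in the group of rank 0.
print(-1 > 0)                          # False: raw FS_RANK blocks preemption
print(rank_allows_preemption(-1, 0))   # True: FS_RANK9 permits it
```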


The figure below illustrates the difference between the fields FS_EXCESS_RUNNING and FS_EXCESS_RUNNING_LOCAL. The first considers the total number of running jobs in the system, while the second considers only the balance of running jobs at each local level of the tree. In the figure, the nodes of interest are /class/hsim and /class/vcs.

The node /class/vcs has a total of 4 running jobs and 2 children, with user u1 running 3 jobs and user u5 running 1. Assuming that all weights are the same in all branches, the target share for /class/vcs.u1 and /class/vcs.u5 is exactly the same. FS_EXCESS_RUNNING is negative for both nodes, because the node /class/hsim holds a large proportion of the running jobs. In this scenario, a preemption rule based on FS_EXCESS_RUNNING, as shown below, will not fire:
VovPreemptRule -rulename RuleThatDoesNotFire \
    -preempting   "JOBCLASS==vcs FS_EXCESS_RUNNING<0" \
    -preemptable  "JOBCLASS==vcs FS_EXCESS_RUNNING>0 FSRANK9>@FSRANK9@" \
    -pool fastfairshare -ruletype FAST_FAIRSHARE

Figure 1.
On the other hand, a local view of /class/vcs shows that the distribution of jobs is not balanced. To use preemption to speed up the return to balance, the FS_EXCESS_RUNNING_LOCAL field can be used as follows:
VovPreemptRule -rulename RuleThatFires \
    -preempting   "JOBCLASS==vcs FS_EXCESS_RUNNING_LOCAL<0" \
    -pool fastfairshare -ruletype FAST_FAIRSHARE
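
The difference between the global and the local view can be worked through numerically. The Python sketch below uses the job counts given for /class/vcs; the count of 12 running jobs for /class/hsim is an assumed value, since the text only says it holds a large proportion. The targets follow from the equal-weights assumption, and computing the local share among the siblings under /class/vcs is this sketch's reading of "local level", not a statement of the scheduler's exact formula.

```python
FS_SCALE = 10_000  # scaling factor stated in the text

# Running jobs per node. The vcs counts come from the text;
# the hsim count is an assumption made for illustration.
running = {"/class/hsim": 12, "/class/vcs.u1": 3, "/class/vcs.u5": 1}

# Equal weights: each class targets 50% of the system,
# and each user targets 50% of its class.
global_target = {"/class/vcs.u1": 0.25, "/class/vcs.u5": 0.25}
local_target = {"/class/vcs.u1": 0.50, "/class/vcs.u5": 0.50}


def excess(count: int, total: int, target: float) -> int:
    """Scaled (share of running jobs) minus (target share)."""
    return round(FS_SCALE * (count / total - target))


total_all = sum(running.values())                                 # 16
total_vcs = running["/class/vcs.u1"] + running["/class/vcs.u5"]   # 4

for node in ("/class/vcs.u1", "/class/vcs.u5"):
    g = excess(running[node], total_all, global_target[node])
    l = excess(running[node], total_vcs, local_target[node])
    print(node, "global:", g, "local:", l)
# /class/vcs.u1 global: -625 local: 2500
# /class/vcs.u5 global: -1875 local: -2500
```

Under these assumptions both users have a negative global excess, so no job qualifies as preemptable on FS_EXCESS_RUNNING>0, while the local view shows u1 above its target and u5 below it.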

Practical FairShare Driven Preemption

Fields frequently used in preemption rules are:
  • FS_RANK and FS_RANK9

Other fields are described in Node Fields; those fields also begin with FS_.

Preemption Based on FairShare

Preemption can be used to accelerate the FairShare mechanism. Instead of waiting for a job to finish and a slot to open up, the preemption daemon can detect imbalances in the FairShare and preempt a job of a group that has an excess share in favor of another group that has a deficit in its share.

A reference rule for this type of preemption can be found in $VOVDIR/etc/config/vovpreemptd/rules/fastfairshare.tcl, as shown below:
# Copyright (c) 1995-2020, Altair Engineering
# All Rights Reserved.

# $Id: $

# Use of preemption to speed-up fairshare.
# We assume a workload organized in jobclasses where each jobclass
# has its own fairshare node called /class/$JOBCLASS.
# The preemption is triggered if there is a job which locally has a
# deficit in the number of running jobs (FSEXCESSRUNNINGLOCAL<0) and has been waiting for at least 10 seconds.
# Also, if a fairshare group already has at least 4 jobs running, do not preempt.
# We do preemption within the same jobclass  (JOBCLASS==@JOBCLASS@)
# and we target the groups that have excessive share of running jobs (FSEXCESSRUNNINGLOCAL>0) and
# also a higher rank (FSRANK9>@FSRANK9@).  We use FSRANK9 instead of FSRANK to
# simplify the comparison of the ranks to include groups that have no rank.
# We do not want to preempt if the group has only one running job (FS_RUNNING_COUNT>1).
# We also consider priority (PRIORITY<=@PRIORITY@) to avoid preempting a job of higher priority.
VovPreemptRule -rulename FastFairshare \
    -preempting  "FSEXCESS<0 GROUP~/class FS_RUNNING_COUNT<=3" \
    -bucketage   10 \
    -killage 2m  \
    -pool FastFairshare \
    -ruletype FAST_FAIRSHARE
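
The conditions listed in the comments of the reference rule can be modeled as two predicates. The Python sketch below only restates those comments for illustration; it is not how vovpreemptd evaluates rules, and the `Job` class, its attribute names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    jobclass: str
    excess_running_local: int  # models FS_EXCESS_RUNNING_LOCAL of the job's group
    rank9: int                 # models FS_RANK9 of the job's group
    running_count: int         # models FS_RUNNING_COUNT of the job's group
    priority: int
    waited_seconds: int = 0    # only meaningful for queued jobs


def may_preempt(waiting: Job) -> bool:
    """Queued job side: local deficit, fewer than 4 running, waited 10 seconds."""
    return (waiting.excess_running_local < 0
            and waiting.running_count < 4
            and waiting.waited_seconds >= 10)


def is_preemptable(running: Job, waiting: Job) -> bool:
    """Running job side: same jobclass, local excess, higher rank,
    more than one running job, and priority not above the waiting job's."""
    return (running.jobclass == waiting.jobclass
            and running.excess_running_local > 0
            and running.rank9 > waiting.rank9
            and running.running_count > 1
            and running.priority <= waiting.priority)


# Hypothetical jobs from two groups of the same jobclass.
waiting = Job("vcs", -2500, 0, 1, 5, waited_seconds=15)
victim = Job("vcs", 2500, 9_999_999, 3, 5)

print(may_preempt(waiting))             # True
print(is_preemptable(victim, waiting))  # True
```

If the victim's group had only one running job, `is_preemptable` would return False, which matches the comment about not preempting a group's last running job.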