vovtaskermgr

The vovtaskermgr command is the main way to start, configure, and stop taskers. It acts on the VOV project that is enabled in the shell where the command is issued.

The file taskers.tcl in the project.swd directory stores the configuration information used by this command.
Note: Changes made to taskers.tcl are not automatically propagated to the running vovtaskers. To apply them, use the update subcommand.

A vovtasker listed in the taskers.tcl file may be running or stopped. The show subcommand gives information on the running vovtaskers currently connected to the vovserver. The list subcommand gives the names of all the vovtaskers defined in taskers.tcl, whether running or stopped.
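The update, list, and show subcommands described above can be combined into a small helper. This is a minimal sketch, not part of the product: the function name refresh_tasker_config is hypothetical, and it assumes that vovtaskermgr accepts the -nameonly option shown for SHOW in the usage message.

```shell
# Hypothetical helper: apply edits made to taskers.tcl, then report which
# taskers are defined and which are currently connected.
# Assumes the shell is already enabled for the VOV project.
refresh_tasker_config() {
    # Propagate the edited configuration to the running taskers.
    vovtaskermgr update || return 1

    echo "== Defined in taskers.tcl =="
    vovtaskermgr list

    echo "== Connected to vovserver =="
    # -nameonly is taken from the SHOW options in the usage message.
    vovtaskermgr show -nameonly
}
```

Run refresh_tasker_config after saving taskers.tcl. A tasker that appears in the first list but not in the second is defined but not connected.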


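Note: The usage message below uses the names vovslavemgr, slave, and slaves.tcl where this section uses vovtaskermgr, tasker, and taskers.tcl.

vovslavemgr: Usage Message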
  
  USAGE:
      vovslavemgr <SUBCOMMAND> [options] [slaveList]
  
      Subcommand is case-insensitive.
  
      The slaveList consists of slave names or slave id's.
  
  SUBCOMMAND is one of:
      LIST           -- To list all hosts named in the slaves.tcl file.
      RESTART        -- Same as STOP followed by START.
      REFRESH        -- Refresh cached environments and equivalences.
                        The default behavior is for slaves to obtain the
                        equivalences from the server. If changes are made to the
                        equiv.tcl file, the server will need to be instructed to
                        reread the file using the "vovproject reread" command
                        prior to requesting a slave refresh.
                        If VOVEQUIV_CACHE_FILE is set to "legacy", a host-based
                        equivalence cache file will be created and updated in
                        the SWD/equiv.caches directory. If VOVEQUIV_CACHE_FILE
                        is set to a file path, the specified file will be used
                        instead.
      SHOW           -- Show info about connected or down slaves.
      PRINTSTATUS    -- Tell slaves to print their status in their log file.
  
      START          -- Start configured slaves.  If a list of hosts is
                        given, start slaves only on those hosts.  Otherwise,
                        start all configured slaves that are not running.
      UPDATE         -- Update configuration of running slaves.
  
      RESERVE        -- To reserve specified slaves.
      RESERVESHOW    -- Show current slave reservations.
      CONFIGURE      -- To reconfigure the specified slaves on-the-fly.
                        Changes only persist until the slave is stopped.
      STOP           -- Stop slaves; let jobs finish, unless -force is given.
  
      CANCELSHUTDOWN -- Revert stopped but still running slaves to normal
                        so they continue running and accept new jobs.
      ROTATELOG      -- To recreate new log files for specified slaves
                        if log files are missing, create slave log directories
                        if needed, and have no impact on slave startup logs.
      CLOSE [MSG]    -- Close slaves from accepting jobs. Closed slaves will
                        start and run, but will do so in a suspended state,
                        displaying the closure message, until opened by the
                        administrator. The default closure message is
                        'Closed by administrator'.
      OPEN [MSG]     -- Open slaves to accept jobs. The accompanying message
                        will be displayed on running slaves until another
                        message is generated during the course of normal
                        operation. Slaves that are not running will not display
                        the message after starting. The default opening message
                        is an empty string.
  Global Options are:
      -l            -- Use longer format with LIST (may be repeated).
      -v            -- Increase verbosity of messages.
      -cfgfile      -- Specify path to slave config file, relative to SWD.
                       Default: slaves.tcl
      -failover     -- Restrict operation to dedicated failover slaves only.
  
  Options for SHOW are:
      -nameonly     -- Show only the names of the connected slaves.
      -nameid       -- Show only the names and id's of the connected slaves.
      -resourceonly -- Show only the resources of the connected slaves.
      -down         -- Show names of configured slaves that are down.
      -license      -- Show licensed capabilities of connected slaves.
      -slavegroups  -- Show slave group for each connected slave.
  
  Options for START and RESTART are:
      -server      -- Start the slaves by rsh/ssh from the vovserver host.
                      By default, the slaves are started
                      by the host that executes this script.
      -random      -- Start slaves in random order.
                      This is useful to start a large pool of slave,
                      by running multiple concurrent commands like:
                        % vovslavemgr start -random &
                        % vovslavemgr start -random &
                        % vovslavemgr start -random &
      -nolog       -- Redirect slave output to /dev/null.
                      Useful to avoid huge log files in /usr/tmp
  
  Options for RESERVE are:
      -user        -- Reserve the slave(s) for given list of users
                      (comma separated list)
      -group       -- Reserve the slave(s) for given list of fairshare groups
                      (comma separated list)
      -jobclass    -- Reserve the slave(s) for given list of jobclasses
                      (comma separated list)
      -jobproj     -- Reserve the slave(s) for given list of job projects
                      (comma separated list)
      -osgroup     -- Reserve the slave(s) for given list of Unix groups
                      (comma separated list)
      -bucketid    -- Reserve the slave(s) for given list of queue buckets
                      (comma separated list)
      -id          -- Reserve the slave(s) for given list of jobs
                      (comma separated list of job ids)
  
      -start       -- Reservation start time
      -end         -- Reservation end time
      -duration    -- Reservation duration (VOV timespec)
      -cancel      -- Cancel the reservation on slave(s)
  
  Options for STOP are:
      -force       -- Stop slaves with force. BEWARE: kills running jobs.
      -noconfirm   -- Do not prompt for confirmation. Default is to prompt.
      -all         -- Stop all running slaves.
      -sick [TIMESPEC]
                   -- Stop all slaves that have been sick for 
                      at least N seconds.
                      N is compared against the last time a heartbeat was
                      received by the server for each sick slave.
                      All jobs running on a sick slave being stopped will be
                      marked as failed in the server, even if the job does,
                      or has, completed successfully while the slave is sick.
                      It is recommended to check slave host connectivity before
                      using this function and allow for the slave to reconnect
                      and send a heartbeat in case connectivity is restored.
  
  Parameters for CONFIGURE are:
      -allowcoredump <bool>    -- Control core-dump behavior.
      -autokillmethod <d|n|v>  -- Control autokill method.
      -capacity <CAP>[MAXCAP]  -- Specify capacity and optionally the
                                  max-capacity of the slave. The capacity is
                                  the maximum number of jobs that can be run by
                                  slave. The max_capacity is the maximum slots
                                  a slave can be expanded to have when jobs are
                                  suspended. The default value for capacity is
                                  equal to the number of CORES present. The
                                  default value for max_capacity is 2*CAPACITY.
                                  Use N, N/N, CORES[-+*/]N, CORES[-+*/]N/N,
                                  N/CORES[-+*/]N, CORES[-+*/]N/CORES[-+*/]N to
                                  make adjustments from the default.
                                  Examples: 4, 4/8, CORES-2, CORES*0.8,
                                            CORES+0/20, CORES+2/CORES*2
      -cpus       <bool>       -- Number of CPU's in this machine.
      -debugcontainers <bool>  -- Enable debug logging of container activity.
      -debugjobcontrol <bool>  -- Enable debug logging of job control activity.
      -debugmultienv   <bool>  -- Enable debug logging of environment switching.
      -debugnuma       <bool>  -- Enable debug logging of NUMA activity.
      -debugusageinfo  <bool>  -- Enable debug logging of memory usage analysis.
      -maxload      <MAXLOAD>  -- Maximum load above which new jobs are refused.
                                  The default value for max_load is
                                  CAPACITY+0.5.
                                  Use 0 or less than 0 to specify default value.
                                  Use N or CAPACITY[-+*/]N to make adjustments
                                  from the default.
                                  Examples: 12.0, CAPACITY+2, CAPACITY*2
      -maxwaitnostart <N>      -- How long to wait for a job to start.
      -message    <string>     -- Set vovslave message.
      -numabindtosocket <bool> -- Bind to entire socket or individual cores.
                                  Experimental. Default is to bind to entire
                                  socket.
      -resources  <string>     -- vovslave resources.
      -slavegroup <string>     -- The slave group.
      -minramfree <N>          -- Minimum amount of free RAM in MB.
      -name       <string>     -- Name of vovslave.
      -ramsentry  <bool>       -- Activate/Deactivate RAM SENTRY.
      -efftotram  <N>          -- Effective total RAM in MB.
      -retrychdir <N>          -- Specify number of retries for failed chdirs.
      -retrychdirsleep <N>     -- Specify the sleep interval time between
                                  retries for failed chdirs.
      -retrychdirbackoff <N>   -- Specify the factor multiplied to the sleep
                                  interval to increase sleep interval between
                                  retries for failed chdirs.
      -liverecorder on|off     -- Enable/disable Live Recorder debugging
                                  capability (linux64 only).
      -liverecorder.logdir <string>
                               -- Specify the directory in which the Live
                                  Recorder recording file should be saved. The
                                  directory must exist. Default is ".", which is
                                  the CWD of the process that is running Live
                                  Recorder.
      -liverecorder.logsize <N> --
                                  Specify the Live Recorder log size in MB.
                                  Default: 256, Min: 256, Max: 65536.
      -rawpower                -- Specify a raw power figure for initial slave
                                  startup.
      -mindisk                 -- Specify minimum /tmp disk in MB or
                                  percentage (0%-99%, for example, 10%)
                                  required for slave startup.
      -coeff                   -- Specify a scaling factor from 0.01-100.0
                                  used to derate slave power.
      -sendenv  <name>         -- Send a named environment to a slave.
      -setenv   VAR=VALUE      -- Set a variable in the slave environment. 
                                  ("VAR=VALUE" must be quoted on Windows)
      -unsetenv VAR            -- Unset a variable in the slave environment.
  
  EXAMPLES:
      % vovslavemgr show
      % vovslavemgr show -nameid
      % vovslavemgr start
      % vovslavemgr start unix1
      % vovslavemgr start -random             -- Start slaves in random order.
      % vovslavemgr update
      % vovslavemgr restart
      % vovslavemgr stop                      -- Stop all slaves, let running
                                                 jobs finish.
      % vovslavemgr stop -noconfirm           -- Like above, no confirmation
                                                 required.
      % vovslavemgr stop -force               -- Kill running jobs now
                                                 (-noconfirm implied).
      % vovslavemgr reserve -user john \
               -duration 3h jupiter           -- Reserve slave jupiter for user
                                                 john for 3h from now
      % vovslavemgr configure -message "shutdown 1PM" farm11 farm12
      % vovslavemgr printstatus farm11
      % vovslavemgr rotatelog                 -- Recreate missing log files for
                                                 all connected slaves
      % vovslavemgr rotatelog farm2 farm11    -- Recreate missing log files for
                                                 slave farm2 farm11
  
      % vovslavemgr configure jupiter -sendenv BASE
                                              -- send the BASE environment to
                                                 slave jupiter
  

Starting Many Taskers in Parallel

Starting hundreds of taskers can take some time. You can speed up the process by running several start commands concurrently, each with the -random option so that every command works through the taskers in a different, random order.

For example:
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &
% vovtaskermgr start -random &

Tasker Configuration on the Fly

Many vovtasker characteristics can be changed on the fly using vovtaskermgr configure. Such changes persist only until the tasker is stopped. For example, you can change the capacity of a tasker, that is, the maximum number of jobs that the tasker can run, with:
% vovtaskermgr configure -capacity 8 pluto
Setting the capacity to zero effectively disables the tasker. A message can record the reason:
% vovtaskermgr configure -capacity 0 pluto
% vovtaskermgr configure -message "Temporarily disabled by John" pluto
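The disable-and-annotate sequence can be wrapped in a pair of helpers. This is a sketch only: the function names disable_tasker and enable_tasker and the default reason text are hypothetical. The two configure calls are kept separate, as in the commands above, because the source does not show whether several parameters may be combined in one call.

```shell
# Hypothetical helper: stop a tasker from taking new jobs and say why.
disable_tasker() {
    tasker="$1"
    reason="${2:-Temporarily disabled}"   # assumed default text
    vovtaskermgr configure -capacity 0 "$tasker" &&
        vovtaskermgr configure -message "$reason" "$tasker"
}

# Hypothetical helper: restore a capacity chosen by the caller.
# The previous capacity is not remembered, so it must be passed in.
enable_tasker() {
    tasker="$1"
    capacity="$2"
    vovtaskermgr configure -capacity "$capacity" "$tasker"
}
```

For example, disable_tasker pluto "Temporarily disabled by John" followed later by enable_tasker pluto 8. The message set while the tasker was disabled is not cleared by enable_tasker and would need a further configure -message call.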

Tasker Capacity

The core count and capacity of a vovtasker can be overridden manually. By default, the capacity follows the core count, but it can also be set explicitly with the -T option, or by defining the SLOTS/N consumable resource with the -r option, where N is a positive integer. In all cases, the capacity directly determines the number of slot licenses that will be requested.
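As a worked example of the defaults, the usage message states that capacity defaults to the number of cores, max_capacity to 2*CAPACITY, and max_load to CAPACITY+0.5. The sketch below only does that arithmetic for a given core count. The function name tasker_defaults is hypothetical, and the function does not query a real tasker.

```shell
# Hypothetical helper: print the default capacity figures that the usage
# message documents for a machine with the given number of cores.
tasker_defaults() {
    cores="$1"
    awk -v c="$cores" 'BEGIN {
        # capacity = CORES, max_capacity = 2*CAPACITY, max_load = CAPACITY+0.5
        printf "capacity=%d max_capacity=%d max_load=%.1f\n", c, 2 * c, c + 0.5
    }'
}

# tasker_defaults 16
# prints: capacity=16 max_capacity=32 max_load=16.5
```

Any manual override of the capacity changes the derived figures accordingly, since both max_capacity and max_load are defined relative to the capacity.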

Tasker Reservation

The following example uses vovtaskermgr to reserve the tasker named 'pluto' for user 'john' for 2 days:
% vovtaskermgr reserve -user john -duration 2d pluto

To have vovtaskers reserved as soon as they start, use the -reserve option in the taskers.tcl file.
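Reservations for several users or several taskers follow the same pattern. The sketch below is illustrative: the function names reserve_taskers and cancel_reservation are hypothetical, and it assumes that vovtaskermgr accepts the -user, -duration, and -cancel options exactly as listed for RESERVE in the usage message.

```shell
# Hypothetical helper: reserve one or more taskers for a comma-separated
# list of users, for a duration given as a VOV timespec (for example 2d).
reserve_taskers() {
    users="$1"
    duration="$2"
    shift 2   # the remaining arguments are tasker names
    vovtaskermgr reserve -user "$users" -duration "$duration" "$@"
}

# Hypothetical helper: cancel the reservation on the named taskers.
cancel_reservation() {
    vovtaskermgr reserve -cancel "$@"
}
```

For example, reserve_taskers john,mary 2d pluto jupiter reserves both taskers for both users, and cancel_reservation pluto releases one of them. The RESERVESHOW subcommand in the usage message lists the current reservations.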