Health Monitoring and vovnotifyd

You can set up basic Altair Accelerator health monitoring tests with mail notification. When the Accelerator system enters any of the conditions you have defined as "unhealthy", a list of configured users receives alert email notifications.

Tests are provided that check for the following conditions:
  • Long-running jobs that are stuck: stuck jobs do not use any CPU
  • A user has jobs waiting in the queue for too long
  • A user has an unusually high ratio of failed jobs
  • A host fails all of its jobs
  • The server size (and other server-related parameters) is growing
  • There are too many out-of-queue jobs (for Allocator, also known as MultiQueue, setups only)
The checks are all procedures whose names start with "doTestHealth". These checks are defined in one or more of the following files:
Role              Location                                   Notes
Global            $VOVDIR/tcl/vtcl/vovhealthlib.tcl          Part of the distribution
Site specific     $VOVDIR/local/vovhealthlib.tcl             Optional
Project specific  PROJECT.swd/vovnotifyd/vovhealthlib.tcl    Optional
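Because the site-specific and project-specific files are plain Tcl, you can define additional checks in them. The sketch below is hypothetical: the exact interface expected of a doTestHealth* procedure (its arguments and return convention) is not specified in this section, so the option name, the option parsing, and the message-list return shown here are assumptions for illustration only.

```tcl
# PROJECT.swd/vovnotifyd/vovhealthlib.tcl -- hypothetical sketch.
# The procedure name must start with "doTestHealth" so it is
# recognized as a health check; everything else here (the -threshold
# option and the list-of-messages return value) is an assumption.
proc doTestHealthExample { args } {
    set threshold 100
    foreach {opt val} $args {
        switch -- $opt {
            -threshold { set threshold $val }
        }
    }
    set msgList {}
    # ... inspect the system here and append one message per
    # unhealthy condition found, for example:
    # lappend msgList "Value $x exceeds threshold $threshold"
    return $msgList
}
```

Under this assumed convention, the check would then be enabled from config.tcl with a line such as: registerHealthCheck "doTestHealthExample -threshold 200" -checkFreq 1h -mailFreq 1d -recipients "$admin".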

Configuring Health Monitoring

By default, all checks defined in the vovhealthlib.tcl files are enabled.

To change the parameters of the health checks, edit the config.tcl file in the vovnotifyd directory.

The following is an example of vovnotifyd/config.tcl:
set NOTIFYD(server)        "tiger"
set NOTIFYD(port)          25
set NOTIFYD(sourcedomain)  ""

set admin  "dexin"
set cadmgr "john"

# Check if we have long stuck jobs. Check every 1 minute. If we have such
# jobs, send alert emails to the owner of the job (@USER@) and
# admin (here "dexin").
# Definition of "long stuck job": the job has been running for at least 10 hours but has used
# no more than 20 seconds of CPU time in total.
registerHealthCheck "doTestHealthLongJobsNoCpu -longJobDur 10h -minCpu 20"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

# Check if we have stuck jobs, i.e., jobs that are not using any CPU at all, and this
# situation has persisted for at least 10 minutes.
# Check every 1 minute. If we have such jobs, send alert emails to the owner of the
# job (@USER@) and admin (here "dexin").
# This check is similar to doTestHealthLongJobsNoCpu but will be quicker to detect stuck jobs.
registerHealthCheck "doTestHealthJobStuck -maxNoCpuTime 10m"  -checkFreq 1m  -mailFreq  1d  -recipients "@USER@ $admin"

# Check if any user has too many failed jobs. Check every 30 minutes. If we
# have such users, send alert emails to the owner of the jobs (@USER@),
# admin (here "dexin") and cadmgr (here "john").
# Definition of "too many failures": a user has at least 1000 jobs in NetworkComputer,
# of which at least 90% have failed.
registerHealthCheck "doTestHealthTooManyFailures -minJobs 1000 -failRatio 0.9"  -checkFreq 30m  -mailFreq  1d  -recipients "@USER@ $admin $cadmgr"

# Check if the server size is growing. Check other server related parameters as
# well, including number of jobs, number of queued jobs, etc.
# For everything that is checked, if the number grows by more than 60.0% compared to the
# last time it was checked, send alert emails to the admin (here "dexin")
# and cadmgr (here "john").
# Also send alert emails if the number of files is 5.0 times or more the number
# of jobs.
# Check every 2 hours and send such alert emails once a day (1d).
registerHealthCheck "doTestHealthServerSize -filejobsRatio 5.0 -warnPercent 60.0"  -checkFreq 2h  -mailFreq  1d  -recipients "$admin $cadmgr"

# Check if some user has jobs sitting in the queue for too long.
# Check every 2 hours.
# If we find such users, send alert emails to the user
# and cadmgr (here "john"). Send such alert emails once a day (1d).
# Definition of "waiting for too long": none of the jobs in a category (bucket)
# has been dispatched in the last 4 hours.
registerHealthCheck "doTestHealthJobsWaitingForTooLong -maxQueueTime 4h"  -checkFreq 2h  -mailFreq  1d  -recipients "@USER@ $cadmgr"

The config.tcl file is checked for updates at regular intervals controlled by the variable NOTIFYD(timeout).
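The re-read interval can itself be set in config.tcl. The fragment below is a hypothetical illustration: the value shown and the unit it is expressed in are assumptions, not documented defaults.

```tcl
# vovnotifyd/config.tcl -- hypothetical fragment.
# NOTIFYD(timeout) controls how often vovnotifyd re-checks this file
# for updates; the value 120 and its unit (seconds) are assumptions
# for illustration only.
set NOTIFYD(timeout) 120
```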