Create a Snapshot Using the HPC Cluster Snapshot Tool

Create a snapshot of an HPC cluster so it can be imported for use in simulations.

HPC Snapshot

Information about HPC snapshots.

A command line tool is available that creates a snapshot of an HPC cluster. The snapshot tool creates a compressed tarball (a snapshot) that includes HPC cluster information about the scheduler, server, execution nodes (MOMs), resource definitions, custom resources, jobs, queues, etc. The snapshot tarball is imported into Control for running simulations to perform what-if analysis.

Structure of the Tarball

An example of the structure of the tarball file is provided below.
total 281
drwxr-xr-x 2 root mysql    26 2019-06-07 08:51 comm_logs
-rw-r--r-- 1 root mysql    11 2019-03-11 03:42 ctime
drwxr-xr-x 2 root mysql    72 2019-06-07 08:51 hook
drwxr-xr-x 2 root mysql   275 2019-06-07 08:51 job
drwxr-xr-x 2 root mysql    26 2019-06-07 08:51 mom_logs
drwxr-xr-x 3 root mysql    73 2019-06-07 08:51 mom_priv
drwxr-xr-x 2 root mysql   311 2019-06-07 08:51 node
-rw-r--r-- 1 root mysql   176 2019-03-11 03:42 pbs.conf
-rw-r--r-- 1 root mysql 17034 2019-03-11 03:42 pbs_snapshot.log
drwxr-xr-x 2 root mysql    64 2019-06-07 08:51 reservation
drwxr-xr-x 2 root mysql    26 2019-06-07 08:51 sched_logs
drwxr-xr-x 2 root mysql   175 2019-06-07 08:51 sched_priv
drwxr-xr-x 2 root mysql    33 2019-06-07 08:51 scheduler
drwxr-xr-x 2 root mysql   176 2019-06-07 08:51 server
drwxr-xr-x 2 root mysql    26 2019-06-07 08:51 server_logs
drwxr-xr-x 5 root mysql   297 2019-06-07 08:51 server_priv
drwxr-xr-x 2 root mysql   356 2019-06-07 08:51 system

./comm_logs:
total 26
-rw-r--r-- 1 root mysql 2070 2019-03-11 02:48 20190311

./hook:
total 579
-rw-r--r-- 1 root mysql   1275 2019-03-11 03:42 qmgr_lpbshook.out
-rw-r--r-- 1 root mysql 272976 2019-03-11 03:42 qmgr_ph_default.out

./job:
total 134
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat_f_F_dsv.out
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat_f.out
-rw-r--r-- 1 root mysql 10083 2019-03-11 03:42 qstat_fx_F_dsv.out
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat_ns.out
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat.out
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat_tf.out
-rw-r--r-- 1 root mysql     0 2019-03-11 03:42 qstat_t.out
-rw-r--r-- 1 root mysql 11728 2019-03-11 03:42 qstat_xf.out
-rw-r--r-- 1 root mysql   545 2019-03-11 03:42 qstat_x.out

./mom_logs:
total 50
-rw-r--r-- 1 root mysql 14023 2019-03-11 03:22 20190311

./mom_priv:
total 85
-rwxr-xr-x 1 root mysql  51 2019-03-11 02:48 config
drwxr-xr-x 2 root mysql 390 2019-06-07 08:51 hooks
-rwxr-xr-x 1 root mysql   5 2019-03-11 02:48 mom.lock

./mom_priv/hooks:
total 929
-rwxr-xr-x 1 root mysql   1923 2019-03-11 02:48 pbs_cgroups.CF
-rwxr-xr-x 1 root mysql    219 2019-03-11 02:48 pbs_cgroups.HK
-rwxr-xr-x 1 root mysql 201494 2019-03-11 02:48 pbs_cgroups.PY

./node:
total 230
-rw-r--r-- 1 root mysql 560 2019-03-11 03:42 pbsnodes_aFdsv.out
-rw-r--r-- 1 root mysql 731 2019-03-11 03:42 pbsnodes_a.out
-rw-r--r-- 1 root mysql 371 2019-03-11 03:42 pbsnodes_aSj.out
-rw-r--r-- 1 root mysql 351 2019-03-11 03:42 pbsnodes_aS.out
-rw-r--r-- 1 root mysql 560 2019-03-11 03:42 pbsnodes_avFdsv.out
-rw-r--r-- 1 root mysql 371 2019-03-11 03:42 pbsnodes_avSj.out
-rw-r--r-- 1 root mysql 351 2019-03-11 03:42 pbsnodes_avS.out
-rw-r--r-- 1 root mysql 731 2019-03-11 03:42 pbsnodes_va.out
-rw-r--r-- 1 root mysql 457 2019-03-11 03:42 qmgr_pn_default.out

./reservation:
total 3
-rw-r--r-- 1 root mysql 0 2019-03-11 03:42 pbs_rstat_f.out
-rw-r--r-- 1 root mysql 0 2019-03-11 03:42 pbs_rstat.out

./sched_logs:
total 26
-rw-r--r-- 1 root mysql 6345 2019-03-11 03:32 20190311

./sched_priv:
total 177
-rwxr-xr-x 1 root mysql  1665 2019-03-11 02:47 dedicated_time
-rwxr-xr-x 1 root mysql  2537 2019-03-11 02:47 holidays
-rwxr-xr-x 1 root mysql  1782 2019-03-11 02:47 resource_group
-rwxr-xr-x 1 root mysql 16389 2019-03-11 02:47 sched_config
-rwxr-xr-x 1 root mysql     5 2019-03-11 02:48 sched.lock
-rwxr-xr-x 1 root mysql     0 2019-03-11 02:48 sched_out

./scheduler:
total 26
-rw-r--r-- 1 root mysql 294 2019-03-11 03:42 qmgr_lsched.out

./server:
total 129
-rw-r--r-- 1 root mysql   0 2019-03-11 03:41 qmgr_pr.out
-rw-r--r-- 1 root mysql 904 2019-03-11 03:41 qmgr_ps.out
-rw-r--r-- 1 root mysql 948 2019-03-11 03:41 qstat_Bf.out
-rw-r--r-- 1 root mysql 225 2019-03-11 03:41 qstat_B.out
-rw-r--r-- 1 root mysql 250 2019-03-11 03:41 qstat_Qf.out
-rw-r--r-- 1 root mysql 234 2019-03-11 03:41 qstat_Q.out

./server_logs:
total 194
-rw-r--r-- 1 root mysql 64378 2019-03-11 03:42 20190311

./server_priv:
total 194
drwxr-xr-x 2 root mysql  26 2019-06-07 08:51 accounting
-rwxr-xr-x 1 root mysql   5 2019-03-11 02:48 comm.lock
-rwxr-xr-x 1 root mysql  32 2019-03-11 02:47 db_password
-rwxr-xr-x 1 root mysql  10 2019-03-11 02:47 db_svrhost
-rwxr-xr-x 1 root mysql   7 2019-03-11 02:47 db_user
drwxr-xr-x 2 root mysql 668 2019-06-07 08:51 hooks
-rwxr-xr-x 1 root mysql  29 2019-03-11 02:48 prov_tracking
-rwxr-xr-x 1 root mysql   5 2019-03-11 02:48 server.lock
-rwxr-xr-x 1 root mysql   0 2019-03-11 03:21 svrlive
drwxr-xr-x 2 root mysql  28 2019-06-07 08:51 topology
-rwxr-xr-x 1 root mysql   0 2019-03-11 02:47 tracking

./server_priv/accounting:
total 26
-rw-r--r-- 1 root mysql 4853 2019-03-11 03:22 20190311

./server_priv/hooks:
total 1349
-rwxr-xr-x 1 root mysql   1923 2018-05-14 12:27 pbs_cgroups.CF
-rwxr-xr-x 1 root mysql    219 2018-05-14 12:27 pbs_cgroups.HK
-rwxr-xr-x 1 root mysql 201494 2018-05-14 12:27 pbs_cgroups.PY

./server_priv/topology:
total 26
-rwxr-xr-x 1 root mysql 6213 2019-03-11 02:48 vm18vm7

./system:
total 1858
-rw-r--r-- 1 root mysql   1023 2019-03-11 03:42 df_h.out
-rw-r--r-- 1 root mysql 286367 2019-03-11 03:42 dmesg.out
-rw-r--r-- 1 root mysql    135 2019-03-11 03:42 etc_hosts
-rw-r--r-- 1 root mysql   1746 2019-03-11 03:42 etc_nsswitch_conf
-rw-r--r-- 1 root mysql 675643 2019-03-11 03:42 lsof_pbs.out
-rw-r--r-- 1 root mysql     60 2019-03-11 03:42 os_info
-rw-r--r-- 1 root mysql     19 2019-03-11 03:42 pbs_environment
-rw-r--r-- 1 root mysql    177 2019-03-11 03:42 pbs_hostn_v.out
-rw-r--r-- 1 root mysql  19425 2019-03-11 03:42 pbs_probe_v.out
-rw-r--r-- 1 root mysql   6546 2019-03-11 03:42 process_info
-rw-r--r-- 1 root mysql  24028 2019-03-11 03:42 ps_leaf.out
-rw-r--r-- 1 root mysql    246 2019-03-11 03:42 vmstat.out
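
To inspect a snapshot's contents yourself, you can list or extract the tarball (a minimal sketch; snapshot_<timestamp>.tgz is the default output name described below, and the extracted directory name may differ):
  # List the contents without extracting
  tar -tzf snapshot_<timestamp>.tgz
  # Extract, then produce a recursive listing like the one above
  tar -xzf snapshot_<timestamp>.tgz
  ls -lR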

Snapshot Validation

Some degree of validation is done when a snapshot is imported to verify that its information is valid and supported by Control. For example, suppose a site has implemented backfilling by setting the backfill scheduler parameter in PBS_HOME/sched_priv/sched_config:
backfill: True prime

The backfill scheduler parameter is verified to ensure that the first value is a boolean and that backfilling has been applied to primetime, non-primetime, all of the time, or none of the time, i.e., that the second value is one of the following: 'ALL', 'all', 'none', 'NONE', 'prime', 'PRIME', 'non_prime', or 'NON_PRIME'.
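
To check how backfilling is configured at your site before creating a snapshot, you can inspect sched_config directly (a minimal sketch, assuming PBS_HOME is set to your PBS home directory):
  # Show the active backfill setting, if any, in the scheduler configuration
  grep '^backfill' $PBS_HOME/sched_priv/sched_config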

Snapshot Rejection

The snapshot is checked for the following options, attributes, and features, which make simulation behavior unpredictable: the simulation may hang, fail, or may not produce legitimate results. If any of these checks fail, the snapshot is rejected and is not added to the simulation Snapshots list. A sketch for pre-checking a snapshot follows the list.

Routing Queues
A queue that is a routing queue.
Example:
Qmgr: set queue <queuename> queue_type = route
Shrink to Fit Jobs
A job that requests the min_walltime resource.
Host-Level Dynamic Resources
Host-level dynamic resources defined in the PBS_HOME/sched_priv/sched_config file.
Example:
mom_resources: "LocalScratch"
Resource Limits for Cumulative Resources
Limits that are set at the server or queue level which depend on one of the following resources that accumulate over the length of a job: cpupercent, mem, vmem and cput.
Example:
Qmgr: set server max_queued_res.mem = [u:tsmith=100mb] 
Qmgr: set queue <queue name> max_queued_res.mem = [u:tsmith=100mb]
Run Attempts
A job run_count value greater than 22.
Load Balancing
Load balancing enabled in the PBS_HOME/sched_priv/sched_config file.
Example:
load_balancing: True
SMP Cluster Distribution
Distributing single-chunk jobs across a cluster by placing each job on the host with the lowest load average, that is, setting the smp_cluster_dist scheduler parameter in the PBS_HOME/sched_priv/sched_config file to lowest_load.
Example:
smp_cluster_dist: lowest_load
Peer Scheduling
Defining a peer_queue in the PBS_HOME/sched_priv/sched_config file.
Example:
peer_queue: "<pulling queue> <furnishing queue>@<furnishing server>.domain"
Missing Resource Definition File
The snapshot is rejected if it does not contain a resourcedef file in the server_priv directory.
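
As mentioned above, the following is a rough pre-check you can run inside an extracted snapshot directory to scan for some of these rejection triggers (a sketch; file names follow the tarball structure shown earlier, and this does not replace the validation that Control performs):
  # Routing queues show up in the queue attributes captured from qstat -Qf
  grep -i 'queue_type = route' server/qstat_Qf.out
  # Unsupported scheduler parameters live in sched_priv/sched_config
  grep -E '^(load_balancing|mom_resources|peer_queue|smp_cluster_dist)' sched_priv/sched_config
  # The resource definition file must be present, or the snapshot is rejected
  ls server_priv/resourcedef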

HPC Cluster Snapshot Tool

A tool for creating a snapshot of a PBS cluster. Snapshots are imported into Control for use in simulations.

Simulations evaluate how proposed changes to an HPC system will affect core utilization and job throughput. A snapshot of an HPC cluster is the basis for performing this what-if analysis. A snapshot consists of the HPC's workload over a specific time period and the HPC's current system configuration. The system configuration includes the number of execution nodes that comprise the HPC environment and their associated CPU and RAM configuration, as well as the Workload Manager server and scheduler settings. A command-line tool is available that captures both the workload and the current system configuration when run on the PBS Server.
Note: To run this command, you must be root on Linux or Admin on Windows.
For more information about the pbs_snapshot tool, see the PBS Professional Reference Guide.

Name

pbs_snapshot

Synopsis

pbs_snapshot [-o directory] [OPTIONS]

Where directory is the path to the directory where pbs_snapshot writes its output tarball. The path can be absolute or relative to the current working directory.

For example, if you specify -o /temp, the tarball is written to /temp/snapshot_<timestamp>.tgz.

The output directory path must already exist.

Options

-H server_hostname
Specifies the server hostname. By default, the value of the PBS_SERVER parameter in pbs.conf is used. When you use this option, server_hostname is used instead.
-l loglevel
Specifies the level at which pbs_snapshot writes its log. The log file is pbs_snapshot.log, in the output directory path specified using the -o directory option.

Valid values, from most comprehensive to least: DEBUG2, DEBUG, INFOCLI2, INFOCLI, INFO, WARNING, ERROR, FATAL

Default: INFOCLI2

-h, --help
Display usage information.
--daemon-logs=days

Specifies the number of days of daemon logs to be collected; this count includes the current day. All daemon logs are captured on the server host, and if you specify --additional-hosts=hostname[,hostname...], MoM logs are captured on those hosts as well.

The number of days must be >= 0:
  • If number of days is 0, no logs are captured.
  • If number of days is 1, only the logs for the current day are captured.

Default: pbs_snapshot collects 5 days of daemon logs

--accounting-logs=days
Specifies the number of days of accounting logs to be collected; this count includes the current day.
The number of days must be >= 0:
  • If number of days is 0, no logs are captured.
  • If number of days is 1, only the logs for the current day are captured.

Default: pbs_snapshot collects 30 days of accounting logs
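
For example, to collect two days of daemon logs and seven days of accounting logs (a sketch; /tmp is an arbitrary output directory):
  pbs_snapshot -o /tmp --daemon-logs=2 --accounting-logs=7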

--additional-hosts=hostname[,hostname...]

Specifies that pbs_snapshot should gather data from the specified list of non-server hosts. pbs_snapshot always gathers data from the server host.

The command collects the following information from the specified hosts:
  • MoM and comm logs, for the number of days of logs being captured, specified via the --daemon-logs=days option

  • The PBS_HOME/mom_priv directory
  • System information

This option can greatly increase the size of the snapshot and cause pbs_snapshot to take a long time, since it may copy large amounts of data over the network.
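
For example, to also gather MoM data from two execution hosts (a sketch; the hostnames are hypothetical):
  pbs_snapshot -o /tmp --daemon-logs=3 --additional-hosts=node01,node02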

--map=filepath

Specifies the path for the file containing the obfuscation map, which is a key/value mapping of obfuscated data. The path can be absolute or relative to the current working directory.

Default: pbs_snapshot writes its obfuscation map in a file called obfuscate.map in the location specified via the -o directory option.

Can only be used with the --obfuscate option.

--obfuscate

Obfuscates (anonymizes) or deletes sensitive PBS data captured by pbs_snapshot.

Obfuscates the following data: euser, egroup, project, Account_Name, operators, managers, group_list, Mail_Users, User_List, server_host, acl_groups, acl_users, acl_resv_groups, acl_resv_users, sched_host, acl_resv_hosts, acl_hosts, Job_Owner, exec_host, Host, Mom, resources_available.host, resources_available.vnode

Deletes the following data: Variable_List, Error_Path, Output_Path, mail_from, Mail_Points, Job_Name, jobdir, Submit_arguments, Shell_Path_List
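
For example, to capture an anonymized snapshot and write the obfuscation map to a custom location (a sketch; the paths are arbitrary):
  pbs_snapshot -o /tmp --obfuscate --map=/tmp/snapshot.map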

--version
Prints the PBS version number and exits. This option cannot be combined with any other option.

Create a Snapshot of an HPC Cluster

Create a snapshot of an HPC cluster so that it can be imported into Control for use in simulations.

A command-line tool, pbs_snapshot, is available that creates a snapshot of an HPC cluster. The following requirements must be met to run this tool:
  • The HPC cluster must be running PBS Professional 13.x or later on a Linux platform.
  • Python version 2.7 is required.
  • The following commands must be executed as root, or via the sudo command by a user with sudo privileges.
Review the pbs_snapshot documentation before using the tool.

The pbs_snapshot command line tool captures HPC cluster information about the scheduler, server, execution nodes (MOMs), resource definitions, custom resources, jobs, queues, etc. The snapshot is imported into Control for use in simulations to perform what-if analysis.

This tool is available with PBS Professional version 18.2.x and is also available in the unsupported directory for PBS Professional 14.2. If you are using PBS Professional 14.2 or 18.2.x, use the version of the tool shipped with PBS Professional. Otherwise, you can use the pbs_snapshot tool shipped with Control. Its location is PC_HOME/pbs-control-simulator/bin/OS_DIR, where OS_DIR corresponds to the supported OS. Currently, the pbs_snapshot tool shipped with Control is only supported on CentOS 6 & 7 and RHEL 6 & 7.

  1. Log in to the PBS Server.
  2. Create a snapshot of your HPC cluster by running the pbs_snapshot tool:
    ./pbs_snapshot -o DIRECTORY --accounting-logs=DAYS --daemon-logs=DAYS
    Where DAYS specifies the number of days of accounting and/or daemon logs to be collected (this count includes the current day) and -o DIRECTORY specifies where you want the command output written.
    A tarball is created and placed in the directory specified by -o DIRECTORY. A complete example is shown after this procedure.
  3. Create a new simulation using the snapshot tarball file as input.
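
A complete example run of step 2 (a sketch; /tmp and the log-day counts are arbitrary choices, and sudo is shown per the requirements above):
  sudo ./pbs_snapshot -o /tmp --accounting-logs=30 --daemon-logs=5
  # Confirm the tarball was written
  ls -lh /tmp/snapshot_*.tgz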

Unsupported PBS Options, Features and Attributes

A list of PBS Professional options, features and attributes that are not currently supported or taken into account when running a simulation.

If any of these options, attributes, or features are used at your site, then the simulation behavior may be unpredictable: the simulation may hang, fail, or may not produce legitimate results. You can turn these features and configurations off in the snapshot by opening the snapshot tarfile and editing the appropriate PBS Professional configuration file to remove the unsupported feature or configuration (see the sketch after the list below).
  • Standing reservations
  • Jobs submitted with the -a option (attribute: Execution_Time) - currently only supported for workload states
  • Job dependencies (Example: qsub -W depend=afterok:123.host1.domain.com /tmp/script)
  • Job states H (for job dependencies) and W (for qsub -a datetime)
  • Shrink-to-fit jobs
  • OS Provisioning
  • Other scheduling policies:
    • load_balancing
    • smp_cluster_dist (the 'lowest_load' option is not supported)
    • peer_queue (peer scheduling is not supported)
    • mom_resources
  • Handling node failures (and node_fail_requeue)
  • Hooks
  • Green Provisioning
  • Other limits that are enforced by the MoM: average CPU usage and CPU burst usage
  • Job attribute: substate
  • The job array attribute: array_state_count
  • Routing queues:
    • Limits: max_queued, max_queued_res, queued_jobs_threshold, queued_jobs_threshold_res, max_queuable, resources_max, resources_min
    • Permissions for routing
  • pbs_est (est_start_time_freq)
  • Server settings:
    • scheduler_iteration
    • scheduling (this might be used to "pause" the scheduler when the simulation is paused)
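
As referenced above, the following sketch shows one way to turn off an unsupported scheduler feature in a snapshot before importing it; it assumes the tarball extracts into a directory of the same base name, and the sed edit is illustrative only:
  # Extract the snapshot, comment out load balancing, and re-pack it
  tar -xzf snapshot_<timestamp>.tgz
  sed -i 's/^load_balancing/#load_balancing/' snapshot_<timestamp>/sched_priv/sched_config
  tar -czf snapshot_<timestamp>_edited.tgz snapshot_<timestamp>/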

Out of Scope PBS Professional Features and Attributes

The following features and attributes are not considered relevant to the simulation functionality:

  • Job states E, T and U
  • Logging for non-scheduler daemons (Server, Comm, MOM)
  • Server settings:
    • state_count
    • total_jobs
    • max_array_size
    • job_requeue_timeout
    • pbs_license_info, pbs_license_linger_time, pbs_license_max, pbs_license_min, license_count
    • rpp_highwater, rpp_retry
    • single_signon_password_enable
    • server_state
    • jobscript_max_size
  • Queue settings:
    • state_count, total_jobs, kill_delay
  • Vnode attributes:
    • hpcbp_enable, hpcbp_stage_protocol, hpcbp_user_name, hpcbp_webservice_address
    • ntype, Port, state (it's only updated to 'job-busy' and 'free')
    • license, license_info
  • Job attributes:
    • alt_id, block, Error_Path, Exit_status, forward_x11_cookie, forward_x11_port, group_list, Hold_Types, interactive, jobdir, Join_Path, Keep_Files, mtime, Output_Path, queue_rank, run_count, run_version, sandbox, session_id, Shell_Path_List, stagein, stageout, Stageout_status, umask, User_List
  • Permissions (except for routing queues)
  • Mailing
  • Job history