CHECKPOINT(5)                 File Formats Manual                CHECKPOINT(5)

NAME
       checkpoint - Univa Grid Engine checkpointing environment configuration
       file format

DESCRIPTION
       Checkpointing is a facility to save the complete status of an
       executing program or job and to restore and restart from this
       so-called checkpoint at a later point in time if the original program
       or job was halted, e.g. through a system crash.

       Univa Grid Engine provides various levels of checkpointing support
       (see sge_ckpt(1)). The checkpointing environment described here is a
       means to configure the different types of checkpointing in use for
       your Univa Grid Engine cluster or parts thereof. For that purpose you
       can define the operations which have to be executed when initiating a
       checkpoint generation, a migration of a checkpoint to another host, or
       a restart of a checkpointed application, as well as the list of queues
       which are eligible for a checkpointing method.

       Supporting different operating systems may easily force Univa Grid
       Engine to introduce operating system dependencies into the
       checkpointing configuration file, and updates of the supported
       operating system versions may lead to frequently changing
       implementation details. Please refer to the /ckpt directory for more
       information.

       Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the
       qconf(1) command to manipulate checkpointing environments from the
       command line, or use the corresponding qmon(1) dialogue for X-Windows
       based interactive configuration.

       Note that Univa Grid Engine allows backslashes (\) to be used to
       escape newline (\newline) characters. The backslash and the newline
       are replaced with a space (" ") character before any interpretation.

FORMAT
       The format of a checkpoint file is defined as follows:

   ckpt_name
       The name of the checkpointing environment as defined for ckpt_name in
       sge_types(1). It is used with the qsub(1) -ckpt switch or with the
       qconf(1) options mentioned above.

   interface
       The type of checkpointing to be used.
       Currently, the following types are valid:

       hibernator
              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

       cray-ckpt
              The Cray kernel level checkpointing is assumed.

       transparent
              Univa Grid Engine assumes that the jobs submitted with
              reference to this checkpointing interface use a checkpointing
              library such as provided by the public domain package Condor.

       userdefined
              Univa Grid Engine assumes that the jobs submitted with
              reference to this checkpointing interface perform their
              private checkpointing method.

       application-level
              Uses all of the interface commands configured in the
              checkpointing object, as with the kernel level checkpointing
              interfaces (cpr, cray-ckpt, etc.), except for the
              restart_command (see below). The restart_command is not used,
              even if it is configured; the job script is invoked instead
              when the job is restarted.

   ckpt_command
       A command-line type command string to be executed by Univa Grid
       Engine in order to initiate a checkpoint.

   migr_command
       A command-line type command string to be executed by Univa Grid
       Engine during a migration of a checkpointing job from one host to
       another.

   restart_command
       A command-line type command string to be executed by Univa Grid
       Engine when restarting a previously checkpointed application.

   clean_command
       A command-line type command string to be executed by Univa Grid
       Engine in order to clean up after a checkpointed application has
       finished.

   ckpt_dir
       A file system location in which checkpoints of potentially
       considerable size should be stored.

   ckpt_signal
       A Unix signal to be sent to a job by Univa Grid Engine to initiate a
       checkpoint generation. The value for this field can either be a
       symbolic name from the list produced by the -l option of the kill(1)
       command or an integer number which must be a valid signal on the
       systems used for checkpointing.

   when
       The points in time when checkpoints are expected to be generated.
       Valid values for this parameter are composed of the letters s, m, x
       and r, in any combination and without any separating character in
       between. The same letters are allowed for the -c option of the
       qsub(1) command, which overrides the definitions in the checkpointing
       environment used. The meaning of the letters is defined as follows:

       s      A job is checkpointed, aborted and, if possible, migrated if
              the corresponding sge_execd(8) is shut down on the job's
              machine.

       m      Checkpoints are generated periodically at the min_cpu_interval
              interval defined by the queue (see queue_conf(5)) in which a
              job executes.

       x      A job is checkpointed, aborted and, if possible, migrated as
              soon as the job gets suspended (manually as well as
              automatically).

       r      A job is rescheduled (not checkpointed) when the host on which
              the job currently runs goes into an unknown state and the time
              interval reschedule_unknown (see sge_conf(5)) defined in the
              global/local cluster configuration is exceeded.

RESTRICTIONS
       Note that the functionality of any checkpointing, migration or
       restart procedures provided by default with the Univa Grid Engine
       distribution, as well as the way they are invoked in the
       ckpt_command, migr_command or restart_command parameters of any
       default checkpointing environments, should not be changed; otherwise
       the functionality remains the full responsibility of the
       administrator configuring the checkpointing environment. Univa Grid
       Engine will just invoke these procedures and evaluate their exit
       status. If the procedures do not perform their tasks properly or are
       not invoked in a proper fashion, the checkpointing mechanism may
       behave unexpectedly; Univa Grid Engine has no means to detect this.

SEE ALSO
       sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1), qsub(1),
       sge_execd(8).

COPYRIGHT
       See sge_intro(1) for a full statement of rights and permissions.

Univa Grid Engine File Formats      UGE 8.5.4                    CHECKPOINT(5)
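
EXAMPLE
       The following Python sketch illustrates two rules from this page: a
       backslash followed by a newline is replaced with a space before any
       interpretation, and the when parameter is composed only of the
       letters s, m, x and r. It is not part of Univa Grid Engine. The
       helper names (join_continuations, parse_ckpt, check_when), the sample
       environment, its script paths and its values are all hypothetical.
       The sketch also assumes that each parameter is written as a name
       followed by whitespace and a value, and that an empty when value is
       rejected; neither assumption is stated in this page.

```python
# Letters documented for the `when` parameter.
VALID_WHEN = set("smxr")


def join_continuations(text):
    """Replace each backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


def parse_ckpt(text):
    """Parse 'name value' lines into a dict (this layout is an assumption)."""
    env = {}
    for line in join_continuations(text).splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        env[name] = value
    return env


def check_when(value):
    """Return True if value is a non-empty combination of s, m, x and r.

    Rejecting an empty string is an assumption of this sketch.
    """
    return bool(value) and set(value) <= VALID_WHEN


# Hypothetical environment; every name, path and value is illustrative.
SAMPLE = """\
ckpt_name        example_ckpt
interface        userdefined
ckpt_command     /hypothetical/ckpt.sh \\
                 --fast
migr_command     /hypothetical/migrate.sh
restart_command  /hypothetical/restart.sh
clean_command    /hypothetical/clean.sh
ckpt_dir         /tmp
ckpt_signal      USR2
when             sx
"""

if __name__ == "__main__":
    env = parse_ckpt(SAMPLE)
    print(env["ckpt_command"].split())
    print(check_when(env["when"]))
```

       In the sample, the ckpt_command value spans two physical lines. After
       the continuation is joined it is treated as a single logical line, so
       the script path and its argument end up in the same value.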