Diagnosing Univa Grid Engine - sge_diagnostics()

NAME
       Diagnostics - Diagnostics and Debugging of Univa Grid Engine

DESCRIPTION
       The sections below describe aspects of diagnosing qmaster behaviour
       and obtaining more detailed information about the state of Univa
       Grid Engine.

LOGGING
       Certain components such as sge_qmaster(1) or sge_execd(1) create
       informative, warning, error or debugging messages that are written
       to a message file of the corresponding component.

       The loglevel parameter of the global configuration of Univa Grid
       Engine allows changing the level of information that is written to
       the message file. When loglevel is set to log_debug, more detailed
       information is written that shows details of the internal state of
       the component and helps to debug certain error scenarios that would
       otherwise be difficult to diagnose.

   Received and Sent Messages
       When the loglevel log_debug is active, Univa Grid Engine writes log
       messages whenever sge_qmaster receives or sends messages. Messages
       have the following format:

       ACTION: HOSTNAME/COMPROC-NAME/COMPROC-ID/MESSAGE-ID:MESSAGE-TAG:SIZE

       o  ACTION: SEND or RECEIVE
       o  HOSTNAME: Identifies the host where the message was sent from.
       o  COMPROC-NAME: Name of the daemon or command that sent the
          message (e.g. qsub, execd, qmon, ...)
       o  COMPROC-ID: Univa Grid Engine internal ID used for communication
       o  MESSAGE-ID: Message ID that identifies the request on the
          communication layer.
       o  MESSAGE-TAG: Type of message: TAG_GDI_REQUEST, TAG_ACK_REQUEST,
          TAG_REPORT_REQUEST, ...
       o  SIZE: Size of the message in bytes

   Request execution
       When the loglevel log_debug is active, Univa Grid Engine writes log
       messages whenever sge_qmaster accepts new requests from client
       commands (e.g. qsub(1), qalter(1), qconf(1), ...), other server
       components (e.g. sge_execd) or qmaster internal threads (lothread
       when the Univa Grid Engine cluster is connected to Univa License
       Orchestrator).
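The SEND/RECEIVE lines described under Received and Sent Messages can be split mechanically into their named fields. A minimal sketch (the sample line is invented for illustration; the field layout follows the format described above):

```python
import re

# Field layout: ACTION: HOSTNAME/COMPROC-NAME/COMPROC-ID/MESSAGE-ID:MESSAGE-TAG:SIZE
LINE_RE = re.compile(
    r"^(?P<action>SEND|RECEIVE): "
    r"(?P<host>[^/]+)/(?P<comproc_name>[^/]+)/(?P<comproc_id>\d+)/"
    r"(?P<message_id>\d+):(?P<tag>[A-Z_]+):(?P<size>\d+)$"
)

def parse_message_line(line):
    """Split one SEND/RECEIVE log message into its named fields."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

# Hypothetical sample line:
sample = "RECEIVE: host1/qsub/101/7:TAG_GDI_REQUEST:2048"
fields = parse_message_line(sample)
```

A parser like this makes it easy to aggregate message counts per host or per component when hunting down a chatty client.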
       Incoming requests are stored in qmaster internal queues until a
       thread is available that is able to handle the request properly.
       Log messages are also written when one of the internal qmaster
       threads starts executing such a request and when request handling
       has finished. In low performing clusters this allows identifying
       the hosts, users, request types etc. that are the root cause of
       the performance decrease.

       Messages related to request execution have the following format:

       ACTION: HOSTNAME/COMPROC-NAME/COMPROC-ID/MESSAGE-ID:USER:SIZE:INTERFACE:REQUEST-DETAILS[:DURATION]

       o  ACTION: QUEUE, FETCHED, STARTED or FINISHED
       o  HOSTNAME: Identifies the host where the request was sent from.
       o  COMPROC-NAME: Name of the daemon or command that sent the
          request (e.g. qsub, execd, qmon, ...)
       o  COMPROC-ID: Univa Grid Engine internal ID used for communication
       o  MESSAGE-ID: Message ID that identifies the request on the
          communication layer.
       o  USER: Name of the user that caused the request to be sent to
          qmaster.
       o  SIZE: Size of the request in bytes (the commlib message) when
          receiving requests from external clients, else 0
       o  INTERFACE: Interface that was used to trigger the request (GDI
          or REP)
       o  REQUEST-DETAILS: For GDI requests this will show the operation
          type (e.g. ADD, MOD, DEL, ...) and the object type (JB for job
          object, CQ for cluster queue object, ...)
       o  DURATION: optional: time in seconds since the last action on the
          request, e.g. the time a request was queued, the time it took
          from fetching a request until it could be processed (acquiring
          locks), or the time for processing a request

       Messages related to non-GDI requests modifying event clients (e.g.
       acknowledging receipt of an event package) have the following
       format:

       ACTION(E): REQUEST:ID[:DURATION]

       o  ACTION: QUEUE, STARTED or FINISHED
       o  REQUEST: type of request, e.g. ACK
       o  ID: the event client id, see qconf -secl
       o  DURATION: optional: time in seconds since the last action on the
          request, e.g. the time a request was queued or the time for
          processing a request

JOB LIMITS
   SUPPORTED LIMITS
       The following table shows what kind of limits are supported via job
       submission and queue settings and where the observation is
       implemented (sge_execd, sge_shepherd or via cgroups):

       limit            execd  shepherd  cgroups  description
       -------------------------------------------------------------
       h_cpu/s_cpu       yes     yes       no     cpu time limit in
                                                  seconds
       h_vmem/s_vmem     yes*    yes*      yes    virtual memory size
       h_rss/s_rss       yes*    yes*      no     resident set size
       h_stack/s_stack   no      yes       no     stack size limit
       h_data/s_data     no      yes       no     data segment size
                                                  limit
       h_core/s_core     no      yes       no     max. size of a core
                                                  file**
       h_fsize/s_fsize   no      yes       no     max. file size**

       (*  = If supported by OS)
       (** = This kind of limit is not adjusted on pe settings)

       In order to set up limit observation by sge_execd or sge_shepherd,
       the "execd_params" parameter "ENFORCE_LIMITS" in the configuration
       of the execution hosts is used (see the sge_conf(5) man page). This
       parameter only allows settings for the supported limits (cpu, vmem
       and rss). The remaining limits (stack, data, core and fsize) cannot
       be switched off by this parameter.

       If the virtual memory size is set to be observed by cgroups, the
       sge_execd observation is disabled for the "h_vmem" limit. If the
       cgroups limit setting did not report any error at sge_shepherd
       startup, the "h_vmem" resource limit will be set to "infinity" with
       the setrlimit() system call. How to enable cgroups "h_vmem" limit
       observation is described in the man page sge_conf(5)
       ("h_vmem_limit" parameter of "cgroups_params").

       If a limit is observed by sge_execd, the execd is responsible for
       killing the job. For sge_shepherd the limits are set via the
       setrlimit() system call to let the kernel enforce the process
       limit. The cgroups implementation will write the corresponding
       limit for all processes of the job into the
       "memory.memsw.limit_in_bytes" file which is created in the cgroups
       directory of the job.
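The setrlimit() behaviour described above can be sketched with Python's resource module: an s_*/h_* pair maps onto the (soft, hard) tuple of a resource limit, and a value of "infinity" (as set for "h_vmem" when cgroups take over observation) maps onto RLIM_INFINITY. This is an illustration of the mapping, not the actual shepherd code; the helper name is invented:

```python
import resource

def to_rlimit_pair(s_value, h_value):
    """Translate an s_*/h_* limit pair (bytes or 'infinity') into the
    (soft, hard) tuple expected by resource.setrlimit()."""
    def conv(v):
        # 'infinity' corresponds to an unenforced limit
        return resource.RLIM_INFINITY if v == "infinity" else int(v)
    return (conv(s_value), conv(h_value))

# Applying the pair to the current process would then look like:
#   resource.setrlimit(resource.RLIMIT_AS, to_rlimit_pair("infinity", "infinity"))
# (not executed here, since it would constrain this process itself)
```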
   SUPPORTED LIMITS VIA EXECD CONFIGURATION
       The following table shows the limits that can be set by
       sge_shepherd at job start via the setrlimit() system call to enable
       kernel enforced process limits. The OS must of course support the
       limit type.

       Limit                          Description
       ------------------------------------------------------
       h_descriptors/s_descriptors    nr of open file descriptors
       h_maxproc/s_maxproc            max nr of processes
       h_locks/s_locks                nr of locks
       h_memorylocked/s_memorylocked  maximum number of bytes of
                                      memory locked into RAM

       Please see also the sge_conf(5) man page. The section
       "execd_params" contains information on how to enable these limits.

   LIMIT ADJUSTMENT DEPENDING ON PE SETUP
       For parallel jobs the resulting limit value depends on the settings
       of the parallel environment (PE) that is used. The following
       diagrams explain this in more detail.

       List of abbreviations:

       Name          Description
       ------------------------------------------
       slave limit   Value specified with "-l"
       master limit  Value specified with "-masterl"
       n             Nr of slave slots on this host
       CS            Boolean "control_slaves" PE option
       JFT           Boolean "job_is_first_task" PE option
       MFS           Boolean "master_forks_slaves" PE option
       DFS           Boolean "daemon_forks_slaves" PE option
       n/a           Not applicable situation

   Master task limit adjustments
       This diagram shows how the resulting limit for the master task of a
       parallel job is calculated:

                            MFS
                             |
                   TRUE-----+------FALSE
                     |                |
                    JFT            Case C
                     |
             TRUE----+----FALSE
               |            |
            Case A       Case B

       Case A:
       =======
                    master limit requested?
                              |
                TRUE----------+---------FALSE
                  |                       |
         master limit +          slave limit +
         (n-1) * slave limit     (n-1) * slave limit

       Case B:
       =======
                    master limit requested?
                              |
                TRUE----------+---------FALSE
                  |                       |
         master limit +          slave limit +
         n * slave limit         n * slave limit

       Case C:
       =======
                    master limit requested?
                              |
                TRUE----------+---------FALSE
                  |                       |
            master limit            slave limit

   Adjustments for slave tasks running on the master host
       This diagram shows the resulting limit for any slave task of a
       parallel job which is started on the master task host:

                       CS
                        |
               TRUE----+----FALSE
                 |            |
                MFS          n/a
                 |
            TRUE---+---FALSE
              |          |
             n/a        DFS
                         |
                    TRUE---+---FALSE
                      |          |
                     JFT     slave limit
                      |
                TRUE---+----FALSE
                  |           |
             (n-1) *         n *
             slave limit     slave limit

   Adjustments for slave tasks running on a slave host
       This diagram shows the resulting limit for any slave task of a
       parallel job which is started on a slave host:

                       CS
                        |
               TRUE----+----FALSE
                 |            |
                DFS          n/a
                 |
            TRUE----+----------FALSE
              |                  |
         n * slave limit    slave limit

   Examples for master_forks_slaves=true in the pe setting

       qsub -l h_vmem=1G -pe mpi 3
          h_vmem = 1G + 1G * 3 = 4G    (job_is_first_task = false)
          h_vmem = 1G + 1G * 2 = 3G    (job_is_first_task = true)

       qsub -masterl h_vmem=0.5G -l h_vmem=1G -pe mpi 3
          h_vmem = 0.5G + 3 * 1G = 3.5G    (job_is_first_task = false)
          h_vmem = 0.5G + 2 * 1G = 2.5G    (job_is_first_task = true)

       qsub -pe fixed 16 -masterl h_vmem=64G
          h_vmem = 64G + INFINITY * 16 = INFINITY    (job_is_first_task = false)
          h_vmem = 64G + INFINITY * 15 = INFINITY    (job_is_first_task = true)

       qsub -pe fixed 16 -masterl h_vmem=2G -l h_vmem=4G
          h_vmem = 2G + 4G * 16 = 2G + 64G = 66G    (job_is_first_task = false)
          h_vmem = 2G + 4G * 15 = 2G + 60G = 62G    (job_is_first_task = true)

       qsub -pe fixed 16 -l h_vmem=4G
          h_vmem = 4G + 4G * 16 = 4G + 64G = 68G    (job_is_first_task = false)
          h_vmem = 4G + 4G * 15 = 4G + 60G = 64G    (job_is_first_task = true)

   h_vmem limit for cgroups
       The cgroups h_vmem limit will be the sum of the limits of all tasks
       started on this host.
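The master task diagrams and h_vmem examples above condense into a small function. This is a sketch, assuming n counts the slots on the master host and modelling an unrequested slave limit as infinity; the function name is illustrative:

```python
INFINITY = float("inf")

def master_task_limit(slave_limit, n, mfs, jft, master_limit=None):
    """Resulting limit for the master task of a parallel job.

    slave_limit  -- value requested with -l (INFINITY if not requested)
    master_limit -- value requested with -masterl (None if not requested)
    n            -- number of slave slots on the master host
    mfs, jft     -- master_forks_slaves / job_is_first_task PE options
    """
    base = master_limit if master_limit is not None else slave_limit
    if not mfs:                       # Case C: master limit or slave limit only
        return base
    slaves = n - 1 if jft else n      # Case A (JFT true) vs. Case B (JFT false)
    return base + slaves * slave_limit

# Reproducing the h_vmem examples above (values in GB):
# qsub -l h_vmem=1G -pe mpi 3                      -> 4G (JFT false), 3G (JFT true)
# qsub -masterl h_vmem=0.5G -l h_vmem=1G -pe mpi 3 -> 3.5G / 2.5G
```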
       Once the individual limits for the master task, slave tasks on the
       master host and slave tasks on a slave host are calculated, the
       resulting sum for the cgroups h_vmem setting is computed the
       following way:

       On the master task host:

          resulting master task limit + nr of started slave tasks *
          resulting slave limit

       On a slave task host:

          nr of started slave tasks * resulting slave limit

       Note: The PE parameters daemon_forks_slaves and master_forks_slaves
       have an influence on the nr of slave tasks that can be started on
       each host. More information about these parameters can be found in
       the sge_pe(5) man page.

MONITORING
   MESSAGE FILE MONITORING
       Monitoring output of the sge_qmaster(1) component is disabled by
       default. It can be enabled by defining MONITOR_TIME as
       qmaster_param in the global configuration of Univa Grid Engine (see
       sge_conf(5)). MONITOR_TIME defines the time interval in which
       monitoring information is printed. The generated output provides
       information per thread and it is written to the message file or
       displayed with qping(1).

       The messages that are shown start with the name of a qmaster thread
       followed by a three digit number and a colon character (:). The
       number makes it possible to distinguish monitoring output of
       different threads that are part of the same thread pool.

       All counters are reset when the monitoring output is printed. This
       means that all numbers show activity characteristics of about one
       MONITOR_TIME interval. Please note that MONITOR_TIME is only a
       guideline and not a fixed interval. The interval that is actually
       used is shown by time in the monitoring output.

       For each thread type the output contains the following parameters:

       o  runs: [iterations per second] number of cycles per second a
          thread executed its main loop. Threads typically handle one work
          package (message, request) per iteration.

       o  out: [messages per second] number of outgoing TCP/IP
          communication messages per second.
          Outgoing messages are only triggered by threads that handle
          requests which were initiated by external commands or interfaces
          (client commands, DRMAA, ...).

       o  APT: [cpu time per message] average processing time per message
          or request.

       o  idle: [%] percentage of time the thread was idle and waiting for
          work.

       o  wait: [%] percentage of time the thread was waiting for required
          resources that were already in use by other threads.

       o  time: [seconds] time since the last monitoring output for this
          thread was written.

       Depending on the thread type the output will contain more details:

   LISTENER
       Listener threads listen for incoming messages that are sent to
       qmaster via the generic data interface, event client interface,
       mirror interface or reporting interface. Requests are unpacked and
       verified. For simple requests a response will also be sent back to
       the client, but in most cases the request will be stored in one of
       the request queues that are processed by reader threads, worker
       threads or the event master thread.

       o  IN g: [requests per second] number of requests received via the
          GDI interface.

       o  IN a: [messages per second] handled ack's for a request
          response.

       o  IN e: [requests per second] event client requests received from
          applications using the event client or mirror interface.

       o  IN r: [requests per second] number of reporting requests
          received from execution hosts.

       o  OTHER wql: [requests] number of pending read-write requests that
          can immediately be handled by a worker thread.

       o  OTHER rql: [requests] number of pending read-only requests that
          can immediately be handled by a reader thread.

       o  OTHER wrql: [requests] number of waiting read-only requests.
          Read-only requests in waiting state have to be executed as part
          of a GDI session, and the data store of the read-only thread
          pool is not in a state to execute those requests immediately.

   READER/WORKER
       Reader and worker threads handle GDI and reporting requests.
       Reader threads handle read-only requests only, whereas all requests
       that require read-write access are processed by worker threads.

       o  EXECD l: [reports per second] handled load reports per second.
       o  EXECD j: [reports per second] handled job reports per second.
       o  EXECD c: [reports per second] handled configuration version
          requests.
       o  EXECD p: [reports per second] handled processor reports.
       o  EXECD a: [messages per second] handled ack's for a request
          response.
       o  GDI a: [requests per second] handled GDI add requests per
          second.
       o  GDI g: [requests per second] handled GDI get requests per
          second.
       o  GDI m: [requests per second] handled GDI modify requests per
          second.
       o  GDI d: [requests per second] handled GDI delete requests per
          second.
       o  GDI c: [requests per second] handled GDI copy requests per
          second.
       o  GDI t: [requests per second] handled GDI trigger requests per
          second.
       o  GDI p: [requests per second] handled GDI permission requests per
          second.

   EVENT MASTER
       The event master thread is responsible for handling activities for
       registered event clients that use either the event client or the
       mirror interface. These interfaces can be used to register and to
       subscribe all or a subset of event types. Clients automatically
       receive updates for subscribed information as soon as it is added,
       modified or deleted within qmaster. Clients using those interfaces
       do not need to poll for required information.

       o  clients: [clients] connected event clients.
       o  mod: [modifications per second] event client modifications per
          second.
       o  ack: [messages per second] handled ack's per second.
       o  blocked: [clients] number of event clients blocked during send.
       o  busy: [clients] number of event clients busy during send.
       o  events: [events per second] newly added events per second.
       o  added: [events per second] number of all events per second.
       o  skipped: [events per second] ignored events per second (because
          no client has subscribed them).
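The push model described above (subscribe once, receive every matching change, skip events that nobody subscribed) can be illustrated with a generic publish/subscribe sketch. This is a conceptual illustration only, not the actual Univa Grid Engine event client API; all names are invented:

```python
class EventMaster:
    """Toy event master: callbacks are registered per event type and
    invoked on every matching change, so clients never poll."""

    def __init__(self):
        self._subscribers = {}          # event type -> list of callbacks

    def subscribe(self, event_type, callback):
        self._subscribers.setdefault(event_type, []).append(callback)

    def publish(self, event_type, payload):
        """Deliver an event to all subscribers of its type; types without
        subscribers are skipped (the 'skipped' counter above)."""
        delivered = 0
        for cb in self._subscribers.get(event_type, []):
            cb(payload)
            delivered += 1
        return delivered

master = EventMaster()
seen = []
master.subscribe("JOB_ADD", seen.append)
master.publish("JOB_ADD", {"job_id": 1})         # delivered to one client
master.publish("QUEUE_MOD", {"queue": "all.q"})  # no subscriber: skipped
```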
   TIMED EVENT
       The timed event thread is used within qmaster to trigger activities
       either once at a certain point in time or in regular time
       intervals.

       o  pending: [events] number of events waiting for their start time
          to be reached.
       o  executed: [events per second] executed events per second.

   QPING MONITORING
       The qping(1) command provides monitoring output of Univa Grid
       Engine components.

   REQUEST QUEUES
       Requests that are accepted by qmaster but cannot immediately be
       handled by one of the reader or worker threads are stored in
       qmaster internal request queues. qping(1) is able to show details
       about those pending requests when this is enabled by defining the
       parameter MONITOR_REQUEST_QUEUES as qmaster_param in the global
       configuration of Univa Grid Engine. The output format of requests
       is the same as for request log messages (explained in the section
       Logging -> Request execution above).

GRID ENGINE ERROR, FAILURE AND EXIT CODES
       Univa Grid Engine provides a number of job or feature related error
       and exit codes. These codes determine the resulting consequence for
       the job or the queue, e.g. whether a job is re-queued or a queue is
       set into an error state. The codes and their consequences are shown
       in the following tables.

   Job related error and exit codes
       The following table lists the consequences of different job-related
       error codes or exit codes. These codes are valid for every type of
       job.

       Script/Method   Exit or Error Code   Consequence
       ---------------------------------------------------------
       Job Scripts     0                    Success
                       99                   Re-queue
                       Rest                 Success: Exit code in
                                            accounting
       Epilog/Prolog   0                    Success
                       99                   Re-queue
                       100                  Job in Error state
                       Rest                 Queue in Error state,
                                            Job re-queued

   Parallel-Environment-Related Error or Exit Codes
       The following table lists the consequences of error codes or exit
       codes of jobs related to parallel environment (PE) configuration.
       Script/Method   Error or Exit Code   Consequence
       ---------------------------------------------------------
       pe_start        0                    Success
                       Rest                 Queue set to error state,
                                            job re-queued
       pe_stop         0                    Success
                       Rest                 Queue set to error state,
                                            job not re-queued

   Queue-Related Error or Exit Codes
       The following table lists the consequences of error codes or exit
       codes of jobs related to queue configuration. These codes are valid
       only if the corresponding methods were overridden.

       Script/Method   Error or Exit Code   Consequence
       ---------------------------------------------------------
       Job Starter     0                    Success
                       Rest                 Success, no other special
                                            meaning
       Suspend         0                    Success
                       Rest                 Success, no other special
                                            meaning
       Resume          0                    Success
                       Rest                 Success, no other special
                                            meaning
       Terminate       0                    Success
                       Rest                 Success, no other special
                                            meaning

   Checkpointing-Related Error or Exit Codes
       The following table lists the consequences of error or exit codes
       of jobs related to checkpointing.

       Script/Method   Error or Exit Code   Consequence
       ---------------------------------------------------------
       Checkpoint      0                    Success
                       Rest                 Success. For kernel
                                            checkpointing, however, this
                                            means that the checkpoint
                                            was not successful.
       Migrate         0                    Success
                       Rest                 Success. For kernel
                                            checkpointing, however, this
                                            means that the checkpoint
                                            was not successful.
                                            Migration will occur.
       Restart         0                    Success
                       Rest                 Success, no other special
                                            meaning
       Clean           0                    Success
                       Rest                 Success, no other special
                                            meaning

   qacct -j failed line Codes
       For jobs that run successfully, the qacct -j command output shows a
       value of 0 in the failed field, and the output shows the exit
       status of the job in the exit_status field. However, the shepherd
       might not be able to run a job successfully. For example, the
       epilog script might fail, or the shepherd might not be able to
       start the job. In such cases, the failed field displays one of the
       code values listed in the following table.
       Code  Description             Accounting  Meaning for Job
                                     valid
       --------------------------------------------------------------
       0     No failure              t           Job ran, exited normally
       1     Presumably before job   f           Job could not be started
       3     Before writing config   f           Job could not be started
       4     Before writing PID      f           Job could not be started
       5     On reading config file  f           Job could not be started
       6     Setting processor set   f           Job could not be started
       7     Before prolog           f           Job could not be started
       8     In prolog               f           Job could not be started
       9     Before pestart          f           Job could not be started
       10    In pestart              f           Job could not be started
       11    Before job              f           Job could not be started
       12    Before pestop           t           Job ran, failed before
                                                 calling PE stop procedure
       13    In pestop               t           Job ran, PE stop
                                                 procedure failed
       14    Before epilog           t           Job ran, failed before
                                                 calling epilog script
       15    In epilog               t           Job ran, failed in
                                                 epilog script
       16    Releasing processor     t           Job ran, processor set
             set                                 could not be released
       24    Migrating (check-       t           Job ran, job will be
             pointing jobs)                      migrated
       25    Rescheduling            t           Job ran, job will be
                                                 rescheduled
       26    Opening output file     f           Job could not be started,
                                                 stderr/stdout file could
                                                 not be opened
       27    Searching requested     f           Job could not be started,
             shell                               shell not found
       28    Changing to working     f           Job could not be started,
             directory                           error changing to start
                                                 directory
       29    No message -> AFS       f           Job could not be started
             problem
       30    Rescheduling on         f           Job ran until application
             application error                   failed, rescheduling
       31    Accessing sgepasswd     f           Job could not be started,
             file                                job failure
       32    Entry is missing in     f           Job could not be started,
             password file                       job failure
       33    Wrong password          f           Job could not be started,
                                                 job failure
       34    Communicating with      f           Job could not be started,
             Grid Engine Helper                  job failure
             Service
       35    Before job in Grid      f           Job could not be started,
             Engine Helper Service               job failure
       36    Checking configured     f           Job could not be started,
             daemons                             job failure
       37    Qmaster enforced h_rt   t           Job was killed by
             limit                               qmaster, enforcing a
                                                 resource limit, job
                                                 failure
       38    No Message ->           f           Job could not be started,
             ADD_GRP_ID can not                  ADD_GRP_ID can not be set
             be set
       100   Assumedly after job     t           Job ran, job killed by a
                                                 signal

       The Code column lists the value of the failed field. The
       Description column lists the text that appears in the qacct -j
       output. If Accounting valid is set to t, the job accounting values
       are valid. If Accounting valid is set to f, the resource usage
       values of the accounting record are not valid. The Meaning for Job
       column indicates whether the job ran or not.

SEE ALSO
       sge_intro(1), sge_qmaster(1), sge_execd(1), qconf(1), qping(1),
       sge_conf(5)

COPYRIGHT
       See sge_intro(1) for a full statement of rights and permissions.

AUTHORS
       Copyright (c) 2015-2017 Univa Corporation.