SCHED_CONF(5)                 File Formats Manual                SCHED_CONF(5)

NAME
       sched_conf - Univa Grid Engine default scheduler configuration file

DESCRIPTION
sched_conf defines the configuration file format for Univa Grid Engine's scheduler. In order to modify the configuration, use the graphical user interface qmon(1) or the -msconf option of the qconf(1) command. A default configuration is provided together with the Univa Grid Engine distribution package.

Note that Univa Grid Engine allows backslashes (\) to be used to escape newline (\newline) characters. The backslash and the newline are replaced with a space (" ") character before any interpretation.

FORMAT
The following parameters are recognized by the Univa Grid Engine scheduler if present in sched_conf:

algorithm

Note: Deprecated, may be removed in a future release.

Allows for the selection of alternative scheduling algorithms. Currently default is the only allowed setting.

load_formula

A simple algebraic expression used to derive a single weighted load value from all or part of the load parameters reported by sge_execd(8) for each host, and from all or part of the consumable resources (see complex(5)) being maintained for each host. The load formula expression syntax is that of a summation of weighted load values:

       {w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]

Note that no blanks are allowed in the load formula.

The load values and consumable resources (load_val1, ...) are specified by the name defined in the complex (see complex(5)).

Note: Administrator-defined load values (see the load_sensor parameter in sge_conf(5) for details) and consumable resources available for all hosts (see complex(5)) may be used, as well as the Univa Grid Engine default load parameters.

The weighting factors (w1, ...) are positive integers. After the expression has been evaluated for each host, the results are assigned to the hosts and are used to sort the hosts according to the weighted load. The sorted host list is subsequently used to sort the queues. The default load formula is "np_load_avg".

job_load_adjustments

The load imposed by the Univa Grid Engine jobs running on a system varies over time and often, e.g. for the CPU load, requires some amount of time to be reported in the appropriate quantity by the operating system. Consequently, if a job was started very recently, the reported load may not sufficiently represent the load which the job already imposes on that host. The reported load will adapt to the real load over time, but the period in which the reported load is too low may already lead to an oversubscription of that host. Univa Grid Engine allows the administrator to specify job_load_adjustments, which are used by the Univa Grid Engine scheduler to compensate for this problem.

The job_load_adjustments are specified as a comma-separated list of arbitrary load parameters or consumable resources and (separated by an equal sign) an associated load correction value. Whenever a job is dispatched to a host by the scheduler, the load parameter and consumable value set of that host is increased by the values provided in the job_load_adjustments list. These correction values are decayed linearly over time until, after load_adjustment_decay_time from the job start, the corrections reach the value 0. If the job_load_adjustments list is assigned the special denominator NONE, no load corrections are performed.

The adjusted load and consumable values are used to compute the combined and weighted load of the hosts with the load_formula (see above) and to compare the load and consumable values against the load threshold lists defined in the queue configurations (see queue_conf(5)). If the load_formula simply consists of the default CPU load average parameter np_load_avg, and if the jobs are very compute intensive, one might want to set the job_load_adjustments list to np_load_avg=1.00, which means that every new job dispatched to a host will require 100 % CPU time, and thus the machine's load is instantly increased by 1.00.
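As a combined illustration of the two parameters above (the weighting factor and the correction value are invented for this sketch, not recommendations; both np_load_avg and swap_used are assumed to be reported for all hosts, see complex(5)), a configuration that sorts hosts by normalized CPU load, penalizes hosts that are already swapping, and pre-loads every freshly dispatched job with half a CPU might read:

       load_formula            np_load_avg+swap_used*2
       job_load_adjustments    np_load_avg=0.50

Note again that the load formula must not contain blanks.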
load_adjustment_decay_time

The load corrections in the job_load_adjustments list above are decayed linearly over time: at the point of the job start the corresponding load or consumable parameter is raised by the full correction value, and after a time period of load_adjustment_decay_time the correction becomes 0. Proper values for load_adjustment_decay_time depend greatly on the load or consumable parameters used and on the specific operating system(s). Therefore, they can only be determined on-site and experimentally. For the default np_load_avg load parameter, a load_adjustment_decay_time of 7 minutes has proven to yield reasonable results.

maxujobs

The maximum number of jobs any user may have running in a Univa Grid Engine cluster at the same time. If set to 0 (the default), users may run an arbitrary number of jobs.

schedule_interval

At the time the scheduler thread initially registers at the event master thread in the sge_qmaster(8) process, schedule_interval is used to set the time interval in which the event master thread sends scheduling event updates to the scheduler thread. A scheduling event is a status change that has occurred within sge_qmaster(8) which may trigger or affect scheduler decisions (e.g. a job has finished and thus the allocated resources are available again). In the Univa Grid Engine default scheduler, the arrival of a scheduling event report triggers a scheduler run; otherwise the scheduler waits for event reports. schedule_interval is a time value (see queue_conf(5) for a definition of the syntax of time values).

queue_sort_method

This parameter determines the order in which several criteria are taken into account to produce a sorted queue list. Currently, two settings are valid: seqno and load. In both cases, Univa Grid Engine attempts to maximize the number of soft requests (see the qsub(1) -soft option) fulfilled by the queues for a particular job as the primary criterion. If the queue_sort_method parameter is set to seqno, Univa Grid Engine then uses the seq_no parameter as configured in the current queue configurations (see queue_conf(5)) as the next criterion to sort the queue list; the load_formula (see above) only has a meaning if two queues have equal sequence numbers. If queue_sort_method is set to load, the load according to the load_formula is the criterion after maximizing a job's soft requests, and the sequence number is only used if two hosts have the same load. The sequence number sorting is most useful if you want to define a fixed order in which queues are to be filled (e.g. the cheapest resource first). The default for this parameter is load.

halftime

When executing under a share based policy, the scheduler "ages" (i.e. decreases) usage to implement a sliding window for achieving the share entitlements as defined by the share tree. The halftime defines the time interval in which accumulated usage decays to half its original value. Valid values are specified in hours or according to the time format as specified in queue_conf(5). If the value is set to 0, the usage is not decayed; -1 results in immediate decay.
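With a halftime of h hours, usage recorded t hours ago therefore contributes to a user's or project's current usage with a weight of approximately

       0.5^(t/h)

so that, e.g. with an illustrative halftime of 168 (one week), week-old usage counts half and two-week-old usage counts a quarter.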
usage_weight_list

Univa Grid Engine accounts for the consumption of the resources wallclock-time, CPU-time, memory and IO to determine the usage which a job imposes on a system. A single usage value is computed from these four input parameters by multiplying the individual values by weights and adding them up. The weights are defined in the usage_weight_list. The format of the list is

       wallclock=wwallclock,cpu=wcpu,mem=wmem,io=wio

where wwallclock, wcpu, wmem and wio are the configurable weights. The weights are real numbers; the sum of all four weights should be 1.

compensation_factor

Determines how fast Univa Grid Engine should compensate for past usage below or above the share entitlement defined in the share tree. Recommended values are between 2 and 10, where 10 means faster compensation.

weight_user

The relative importance of the user shares in the functional policy. Values are of type real.

weight_project

The relative importance of the project shares in the functional policy. Values are of type real.

weight_department

The relative importance of the department shares in the functional policy. Values are of type real.

weight_job

The relative importance of the job shares in the functional policy. Values are of type real.

weight_tickets_functional

The maximum number of functional tickets available for distribution by Univa Grid Engine. Determines the relative importance of the functional policy. See sge_priority(5) for an overview of job priorities.

weight_tickets_share

The maximum number of share based tickets available for distribution by Univa Grid Engine. Determines the relative importance of the share tree policy. See sge_priority(5) for an overview of job priorities.

weight_deadline

The weight applied to the remaining time until a job's latest start time. Determines the relative importance of the deadline. See sge_priority(5) for an overview of job priorities.

weight_waiting_time

The weight applied to a job's waiting time since submission. Determines the relative importance of the waiting time. See sge_priority(5) for an overview of job priorities.

weight_urgency

The weight applied to a job's normalized urgency when determining the priority finally used. Determines the relative importance of urgency. See sge_priority(5) for an overview of job priorities.

weight_priority

The weight applied to a job's normalized POSIX priority when determining the priority finally used. Determines the relative importance of the POSIX priority. See sge_priority(5) for an overview of job priorities.

weight_ticket

The weight applied to the normalized ticket amount when determining the priority finally used. Determines the relative importance of the ticket policies. See sge_priority(5) for an overview of job priorities.

flush_finish_sec

This parameter is provided for tuning the system's scheduling behavior. By default, a scheduler run is triggered in the scheduler interval. When this parameter is set to 1 or larger, the scheduler will be triggered x seconds after a job has finished. Setting this parameter to 0 disables the flush after a job has finished.

flush_submit_sec

This parameter is provided for tuning the system's scheduling behavior. By default, a scheduler run is triggered in the scheduler interval. When this parameter is set to 1 or larger, the scheduler will be triggered x seconds after a job was submitted to the system. Setting this parameter to 0 disables the flush after a job was submitted.
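For example, to have the scheduler react within one second to job submissions and job completions instead of waiting for the next regular scheduling run (the values are chosen purely for illustration):

       flush_submit_sec    1
       flush_finish_sec    1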
schedd_job_info

The default scheduler can keep track of why jobs could not be scheduled during the last scheduler run. This parameter enables or disables the observation. The value true enables the monitoring; false turns it off. It is also possible to activate the observation only for certain jobs. This is done by setting the parameter to job_list followed by a comma-separated list of job ids. The user can obtain the collected information with the command qstat -j.

params

This is foreseen for passing additional parameters to the Univa Grid Engine scheduler. The following values are recognized:

DURATION_OFFSET

If set, overrides the default value of 60 seconds. This parameter is used by the Univa Grid Engine scheduler when planning resource utilization as the delta between net job runtimes and the total time until resources become available again. Net job runtime as specified with -l h_rt=... or -l s_rt=... or -l d_rt=... or default_duration always differs from total job runtime due to delays before and after the actual job start and finish. The delays before job start include the time until the end of a schedule_interval, the time it takes to deliver a job to sge_execd(8), and the delays caused by prolog in queue_conf(5), start_proc_args in sge_pe(5) and starter_method in queue_conf(5). The delays after the actual job finish include notify, terminate_method and checkpointing procedures, procedures run after the actual job finish such as stop_proc_args in sge_pe(5) or epilog in queue_conf(5), and the delay until a new schedule_interval.

If the offset is too low, resource reservations (see max_reservation) can be delayed repeatedly due to an overly optimistic job circulation time.

JC_FILTER

Note: Deprecated, may be removed in a future release.

If set to true, the scheduler limits the number of jobs it looks at during a scheduling run. At the beginning of the scheduling run it assigns each job a specific category, which is based on the job's requests, priority settings, and the job owner. All scheduling policies assign the same importance to each job in one category. Therefore the jobs within a category have a FIFO order, and their number can be limited to the number of free slots in the system. An exception are jobs which request a resource reservation; they are included regardless of the number of jobs in a category.

This setting is turned off by default, because in very rare cases the scheduler can make a wrong decision. It is also advised to turn report_pjob_tickets off; otherwise qstat -ext can report outdated ticket amounts. The information shown by qstat -j for a job that was excluded in a scheduling run is very limited.

PROFILE

If set equal to 1, the scheduler logs profiling information summarizing each scheduling run. In combination with WARN_DISPATCHING_TIME it is possible to get profiling data for the longest and shortest job scheduling.

MONITOR

If set equal to 1, the scheduler records information for each scheduling run, allowing one to reproduce the job resource utilization, in the file <sge_root>/<cell>/common/schedule. In order to see entries in the schedule file, resource reservation must be turned on (max_reservation must be greater than 0) and jobs need a runtime (using h_rt, s_rt, d_rt or setting a default_duration). Each line of the schedule file consists of the following colon-separated fields:

       job_id          The job's id.
       task_id         The array task id, or 1 in case of non-array jobs.
       state           One of RUNNING, SUSPENDED, MIGRATING, STARTING,
                       RESERVING.
       start_time      Start time in seconds after 1.1.1970.
       duration        Assumed job duration in seconds.
       level           One of {P, G, H, Q}, standing for {PE, Global,
                       Host, Queue}.
       object_name     The name of the PE, global, host or queue.
       resource_name   The name of the consumable resource.
       utilization     The resource utilization debited for the job.

A line "::::::::" marks the beginning of a new schedule interval.

Please note that this file is not truncated. Make sure the monitoring is switched off if you have no automated procedure set up that truncates the schedule file.
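A hypothetical entry for a job 1004 holding a reservation for one slot in queue all.q@host1 might look as follows (all values are invented for illustration):

       1004:1:RESERVING:1088684340:600:Q:all.q@host1:slots:1.000000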
PE_RANGE_ALG

This parameter sets the algorithm for the PE range computation. The default is "bin", which means that the scheduler uses a binary search to select the best slot amount. It should not be necessary to change this setting in normal operation. If a custom setting is needed, the following values are available:

       auto      the scheduler selects the best algorithm
       least     starts the resource matching with the lowest slot
                 amount first
       bin       starts the resource matching in the middle of the PE
                 slot range
       highest   starts the resource matching with the highest slot
                 amount first

PREFER_SOFT_REQUESTS

If this parameter is set, the scheduler will try to find an assignment or a resource reservation which matches as many soft requests as possible. "PREFER_SOFT_REQUESTS" only has an impact on parallel jobs. When dispatching jobs (no reservation), by default (PREFER_SOFT_REQUESTS not set) resources are preferred which provide more slots (in case of PE ranges); with the parameter set, resources are preferred which have fewer infringements of soft requests. For resource reservation without the parameter set, the scheduler reserves the earliest available resources in time, even when soft requests for the job cannot be fulfilled; with the parameter set, resources are preferred which have fewer infringements of soft requests.

PE_SORT_ORDER

When using wildcard parallel environment selection at submission time, the parallel environment the scheduler chooses is arbitrary. In order to fill up the parallel environments in a specific order, this parameter allows changing the sorting of matching parallel environments to an ascending or a descending order. When PE_SORT_ORDER is set to ASCENDING (or 1), the first PE which is tested for job selection is the alpha-numerically first one (test1pe before test2pe, and test10pe before test2pe, when submitting with -pe test*). When it is set to DESCENDING (or 2), the PE which is tested first is the alpha-numerically last one (test2pe in the previous example). When it is set to 0 or NONE, the first matching PE is arbitrary (the default), which is a good choice for balancing PEs and the same as when the parameter is absent.

COUNT_CORES_AS_THREADS

If set to 1 or TRUE, the scheduler treats the requested amount of cores of a job (with the -binding parameter) as a request for hardware supported threads. On hosts with SMT (topology string with threads, like SCTTCTT) the amount of requested cores is divided by the number of threads per core. In case a core would be filled only partially, the complete core is requested by the job. Example: When a job requests 3 cores, on a host with hyper-threading (2 hardware threads per core) the request is transformed to 2 cores (because 3 threads are needed). On a host without hyper-threading the job requests 3 cores, and on a host with 4 hardware threads supported per core the job requests 1 core.
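The transformation described above amounts to rounding up (this formula is a restatement of the examples, not a quotation from the implementation):

       requested_cores = ceil(requested / threads_per_core)

e.g. ceil(3/2) = 2 on a two-way SMT host, ceil(3/4) = 1 on a host with four hardware threads per core, and ceil(3/1) = 3 without SMT.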
WRITE_SCHEDD_RUNLOG

If set equal to 1, the scheduler will write trace messages of the next scheduling run to the file <sge_root>/<cell>/common/schedd_runlog when triggered by qconf -tsm. Writing the schedd_runlog file can have a significant impact on scheduler performance. This feature should only be enabled when the debugging information contained in the file is actually needed. The default setting is disabled.

MAX_SCHEDULING_TIME

This parameter can be used to specify a maximum time interval (time_specifier, see sge_types(1)) for one scheduling run. If the scheduler has not finished a dispatching run within this time interval, job dispatching is stopped for this one scheduling run. In the next scheduling run, job dispatching again starts with the highest priority job. The default for this parameter is 0 (do full dispatching from the highest priority job down to the lowest priority job). In huge clusters with a high number of pending jobs, setting this parameter to a reasonable value (e.g. one minute) can improve cluster utilization and the responsiveness of sge_qmaster.

MAX_DISPATCHED_JOBS

This parameter can be used to limit the number of jobs which get scheduled in one scheduling interval. It can be set to any positive number, or to 0 (do not limit the number of scheduled jobs). The default is 0. Limiting the number of jobs getting scheduled in a single scheduling interval can be useful to avoid overload on the cluster, especially on file servers, due to many jobs starting up at the same time. But use this option with care: setting it to too low a value can lead to bad utilization of the cluster.

HIGH_PRIO_DRAINS_CLUSTER

When this parameter is set to 1 or TRUE, the cluster will be drained until the highest priority job can be scheduled. This can be used as a workaround to avoid starvation of parallel jobs when resource reservation cannot be applied, e.g. because job runtimes are unknown. Use this parameter with care and only temporarily: it can lead to very bad utilization of the cluster.

WARN_DISPATCHING_TIME

When this parameter is set to a threshold in milliseconds, the Univa Grid Engine scheduler will print a warning to the sge_qmaster(8) messages file when dispatching a job takes longer than the given threshold. If this parameter is enabled and PROFILE is turned on, the profiling output will contain additional information about the longest and shortest job scheduling times. The default for "WARN_DISPATCHING_TIME" is 0 (switched off).

SHARE_BASED_ON_SLOTS

When this parameter is set to 1 or TRUE, the scheduler will consider the number of slots being used by running jobs and by pending jobs when pushing users and projects toward their sharing targets as defined by the share tree. That is, a parallel job using 4 slots will be considered equal to 4 serial jobs. When the parameter is set to FALSE (the default), every job is considered equal. The urgency_slots PE attribute in sge_pe(5) is used to determine the number of slots when a job is submitted with a PE range.

Changing params will take immediate effect. The default for params is none.

reprioritize_interval

The interval (HH:MM:SS) in which to reprioritize jobs on the execution hosts based on the current ticket amount of the running jobs. If the interval is set to 00:00:00, reprioritization is turned off. The default value is 00:00:00. The reprioritization tickets are calculated by the scheduler, and update events for running jobs are only sent after the scheduler has calculated new values. How often the scheduler should calculate the tickets is defined by the reprioritize_interval. Because the scheduler is only triggered in a specific interval (schedule_interval), the reprioritize_interval only has a meaning if set greater than the schedule_interval. For example, if the schedule_interval is 2 minutes and reprioritize_interval is set to 10 seconds, the jobs get re-prioritized every 2 minutes.
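The example above corresponds to the following settings (a sketch; the time values are the ones used in the description):

       schedule_interval        0:2:0
       reprioritize_interval    00:00:10

Since new ticket values only become available with each scheduler run, the effective reprioritization period here is two minutes, not ten seconds.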
report_pjob_tickets

This parameter allows tuning of the system's scheduling run time. It is used to enable or disable the reporting of pending job tickets to the qmaster. It does not influence the ticket calculation. When the reporting is turned off, the sort order of jobs in qstat and qmon is based only on the submit time. The reporting should be turned off in a system with a very large amount of jobs by setting this parameter to "false".

halflife_decay_list

The halflife_decay_list allows the configuration of different decay rates for the "finished_jobs" usage types, which are used in the pending job ticket calculation to account for jobs which have just ended. This allows the pending jobs algorithm to count finished jobs against a user or project for a configurable decayed time period. This feature is turned off by default, and the halftime is used instead.

The halflife_decay_list also allows one to configure different decay rates for each usage type being tracked (cpu, io, and mem). The list is specified in the following format:

       <usage_type>=<decay_time>[:<usage_type>=<decay_time>[:...]]
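A hypothetical entry that applies a dedicated decay to finished-job cpu usage while leaving io and mem at the default behavior might look like the following (purely illustrative; the decay time here is assumed to follow the same conventions as the halftime parameter described above):

       halflife_decay_list    cpu=168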