SLURM Node Selection Plugin API

Overview

This document describes SLURM node selection plugins and the API that defines them. It is intended as a resource to programmers wishing to write their own SLURM node selection plugins. This is version 0 of the API.

SLURM node selection plugins are SLURM plugins that implement the SLURM node selection API described herein. They are intended to provide a mechanism for both selecting nodes for pending jobs and performing any system-specific tasks for job launch or termination. The plugins must conform to the SLURM Plugin API with the following specifications:

const char plugin_type[]
The major type must be "select". The minor type can be any recognizable abbreviation for the type of node selection algorithm. We recommend, for example:

The plugin_name and plugin_version symbols required by the SLURM Plugin API require no specialization for node selection support. Note carefully, however, the versioning discussion below.
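As a minimal sketch of these identification symbols, a plugin source file might begin as shown here. The minor type "example", the name string, and the version value are placeholders. The uint32_t type of plugin_version is an assumption based on the generic SLURM Plugin API, not something this document specifies, and example_is_select_plugin() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Free-form, human-readable description (placeholder text). */
const char plugin_name[] = "Example node selection plugin";

/* Major type "select", a slash, then the minor type.
 * "example" is a hypothetical minor type used only for illustration. */
const char plugin_type[] = "select/example";

/* Placeholder value; see the versioning discussion. */
const uint32_t plugin_version = 1;

/* Hypothetical helper: returns 1 if a type string has the
 * "select" major type, 0 otherwise. */
int example_is_select_plugin(const char *type)
{
    assert(type != NULL);
    return strncmp(type, "select/", 7) == 0;
}
```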

A simplified flow of logic follows:

slurmctld daemon starts
if (select_p_state_restore() != SLURM_SUCCESS)
   abort

slurmctld reads the rest of its configuration and state information
if (select_p_node_init() != SLURM_SUCCESS)
   abort
if (select_p_block_init() != SLURM_SUCCESS)
   abort

wait for job 
if (select_p_job_test(all available nodes) != SLURM_SUCCESS) {
   if (select_p_job_test(all configured nodes) != SLURM_SUCCESS)
      reject the job and tell the user it can never run
   else
      leave the job queued for later execution
} else {
   update job's node list and node bitmap
   if (select_p_job_begin() != SLURM_SUCCESS)
      leave the job queued for later execution
   else {
      while (!select_p_job_ready())
        wait
      execute the job
      wait for job to end or be terminated
      select_p_job_fini()
   }
}

wait for slurmctld shutdown request
select_p_state_save()

Depending upon failure modes, it is possible that select_p_state_save() will not be called at slurmctld termination. When slurmctld is restarted, other function calls may be replayed. select_p_node_init() may be used to synchronize the plugin's state with that of slurmctld.

Data Objects

These functions are expected to read and/or modify data structures directly in the slurmctld daemon's memory. Slurmctld is a multi-threaded program with independent read and write locks on each data structure type. Therefore the type of operations permitted on various data structures is identified for each function.

These functions make use of bitmaps corresponding to the nodes in a table. The function select_p_node_init() should be used to establish the initial mapping of bitmap entries to nodes. Functions defined in src/common/bitmap.h should be used for bitmap manipulations (these functions are directly accessible from the plugin).

API Functions

The following functions must appear. Functions which are not implemented should be stubbed.

Global Node Selection Functions

int select_p_state_save (char *dir_name);

Description: Save any global node selection state information to a file within the specified directory. The actual file name used is plugin specific. It is recommended that the global node selection state contain a magic number for validation purposes. This function is called by the slurmctld daemon on shutdown.

Arguments: dir_name    (input) fully-qualified pathname of a directory into which user SlurmUser (as defined in slurm.conf) can create a file and write state information into that file. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_state_restore (char *dir_name);

Description: Restore any global node selection state information from a file within the specified directory. The actual file name used is plugin specific. It is recommended that any magic number associated with the global node selection state be verified. This function is called by the slurmctld daemon on startup.

Arguments: dir_name    (input) fully-qualified pathname of a directory containing a state information file from which user SlurmUser (as defined in slurm.conf) can read. Cannot be NULL.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, causing slurmctld to exit.
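A minimal sketch of a save/restore pair with a magic number might look like the following. The file name, the magic value, and the single saved counter are all hypothetical. SLURM_SUCCESS and SLURM_ERROR are defined locally as stand-ins for the values in SLURM's headers. The record is written in host byte order, which a production plugin may not want.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

#define EXAMPLE_STATE_FILE  "select_example_state"  /* hypothetical name */
#define EXAMPLE_STATE_MAGIC UINT32_C(0x53454c31)    /* arbitrary value */

/* Hypothetical global state: one counter. */
uint32_t example_node_cnt;

/* Build "<dir_name>/<file>" in out; fail if it does not fit. */
static int example_state_path(char *out, size_t len, const char *dir_name)
{
    int n = snprintf(out, len, "%s/%s", dir_name, EXAMPLE_STATE_FILE);
    return (n < 0 || (size_t) n >= len) ? SLURM_ERROR : SLURM_SUCCESS;
}

int select_p_state_save(char *dir_name)
{
    char path[4096];
    uint32_t rec[2];
    FILE *fp;
    int rc = SLURM_SUCCESS;

    assert(dir_name != NULL);
    rec[0] = EXAMPLE_STATE_MAGIC;
    rec[1] = example_node_cnt;
    if (example_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    if ((fp = fopen(path, "wb")) == NULL)
        return SLURM_ERROR;
    if (fwrite(rec, sizeof(rec[0]), 2, fp) != 2)
        rc = SLURM_ERROR;
    if (fclose(fp) != 0)
        rc = SLURM_ERROR;
    return rc;
}

int select_p_state_restore(char *dir_name)
{
    char path[4096];
    uint32_t rec[2];
    FILE *fp;
    size_t n;

    assert(dir_name != NULL);
    if (example_state_path(path, sizeof(path), dir_name) != SLURM_SUCCESS)
        return SLURM_ERROR;
    /* This sketch treats a missing file as an error. */
    if ((fp = fopen(path, "rb")) == NULL)
        return SLURM_ERROR;
    n = fread(rec, sizeof(rec[0]), 2, fp);
    fclose(fp);
    /* Reject short files and files with the wrong magic number. */
    if (n != 2 || rec[0] != EXAMPLE_STATE_MAGIC)
        return SLURM_ERROR;
    example_node_cnt = rec[1];
    return SLURM_SUCCESS;
}
```

Because a failed restore causes slurmctld to exit, a real plugin would probably need to decide how to handle a first start with no state file. This sketch does not address that case.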

int select_p_node_init (struct node_record *node_ptr, int node_cnt);

Description: Note the initialization of the node record data structure. This function is called when the node records are initially established and again when any nodes are added to or removed from the data structure.

Arguments:
node_ptr    (input) pointer to the node data records. Data in these records can be read. Nodes deleted after initialization may have the name field in the record cleared (zero length) rather than rebuilding the node records and bitmaps.
node_cnt    (input) number of node data records.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, causing slurmctld to exit.
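As a sketch of how a plugin might note the node table, the following records the table size and counts the records whose name has not been cleared. The struct node_record shown here is a one-field stand-in for the real slurmctld structure, and the two global counters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Stand-in: the real struct node_record is defined by slurmctld
 * and has many more fields. */
struct node_record {
    char name[64];
};

int example_node_cnt = 0;    /* total records, i.e. the bitmap size */
int example_active_cnt = 0;  /* records with a non-empty name */

int select_p_node_init(struct node_record *node_ptr, int node_cnt)
{
    int i;

    if (node_ptr == NULL || node_cnt < 0)
        return SLURM_ERROR;

    example_node_cnt = node_cnt;
    example_active_cnt = 0;
    for (i = 0; i < node_cnt; i++) {
        /* A zero-length name marks a node deleted after initialization;
         * it keeps its index so bitmap positions stay valid. */
        if (node_ptr[i].name[0] != '\0')
            example_active_cnt++;
    }
    assert(example_active_cnt <= example_node_cnt);
    return SLURM_SUCCESS;
}
```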

int select_p_block_init (List block_list);

Description: Note the initialization of the partition record data structure. This function is called when the partition records are initially established and again when any partition configurations change.

Arguments: block_list    (input) list of partition record entries. Note that some of these partitions may have no associated nodes. Also consider that nodes can be removed from one partition and added to a different partition.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, causing slurmctld to exit.

int select_p_update_block (update_part_msg_t *part_desc_ptr);

Description: This function is called when the admin needs to manually update the state of a block.

Arguments: part_desc_ptr    (input) partition description variable containing the block name and the state to which the block should be set.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_pack_node_info (time_t last_query_time, Buf *buffer_ptr);

Description: Pack node-specific information into a buffer.

Arguments:
last_query_time    (input) time that the data was last saved.
buffer_ptr    (input/output) buffer into which the node data is appended.

Returns: SLURM_SUCCESS if successful, SLURM_NO_CHANGE_IN_DATA if the data has not changed since it was last packed, otherwise SLURM_ERROR.
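The three return values can be illustrated with a sketch that compares a last-update timestamp against last_query_time before appending data. The buffer type, the timestamp, and the state array are hypothetical stand-ins. The numeric value given to SLURM_NO_CHANGE_IN_DATA is a local stand-in for the definition in SLURM's headers. A real plugin would use SLURM's own buffer type and packing routines rather than memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)
#define SLURM_NO_CHANGE_IN_DATA 1900  /* stand-in; take it from SLURM's headers */

/* Hypothetical fixed-size buffer standing in for SLURM's Buf. */
struct example_buf {
    unsigned char data[256];
    size_t used;
};

time_t   example_last_update;    /* when example_node_state last changed */
uint32_t example_node_state[4];  /* hypothetical per-node data */

int example_pack_node_info(time_t last_query_time, struct example_buf *buffer)
{
    size_t len = sizeof(example_node_state);

    assert(buffer != NULL);
    /* Nothing new since the caller's last query. */
    if (example_last_update <= last_query_time)
        return SLURM_NO_CHANGE_IN_DATA;
    /* Not enough room left to append the data. */
    if (buffer->used > sizeof(buffer->data) ||
        len > sizeof(buffer->data) - buffer->used)
        return SLURM_ERROR;
    memcpy(buffer->data + buffer->used, example_node_state, len);
    buffer->used += len;
    return SLURM_SUCCESS;
}
```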

Job-Specific Node Selection Functions

int select_p_job_test (struct job_record *job_ptr, bitstr_t *bitmap, int min_nodes, int max_nodes, int req_nodes, bool test_only);

Description: Given a job's scheduling requirement specification and a set of nodes which might be used to satisfy the request, identify the nodes which "best" satisfy the request. Note that nodes being considered for allocation to the job may include nodes already allocated to other jobs, even if node sharing is not permitted. This is done to ascertain whether or not the job may be allocated resources at some later time (when the other jobs complete). This permits SLURM to reject non-runnable jobs at submit time rather than after they have spent hours queued. Informing users of problems at job submission time permits them to quickly resubmit the job with appropriate constraints.

Arguments:
job_ptr    (input) pointer to the job being considered for scheduling. Data in this job record may safely be read. Data of particular interest include details->contiguous (set if allocated nodes should be contiguous), num_procs (minimum processors in allocation) and details->req_node_bitmap (specific required nodes).
bitmap    (input/output) bits representing nodes which might be allocated to the job are set on input. This function should clear the bits representing nodes not required to satisfy the job's scheduling request. Bits left set will represent nodes to be used for this job. Note that the bitmap will be a superset of the job's required nodes (details->req_node_bitmap) when the function is called.
min_nodes    (input) minimum number of nodes to allocate to this job. Note this reflects both job and partition specifications.
max_nodes    (input) maximum number of nodes to allocate to this job. Note this reflects both job and partition specifications.
req_nodes    (input) the requested (desired) number of nodes to allocate to this job. This reflects the job's maximum node specification (if supplied).
test_only    (input) if set then we only want to test our ability to run the job at some time, not necessarily now with currently available resources.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR and future attempts may be made to schedule the job.

int select_p_job_begin (struct job_record *job_ptr);

Description: Note that the initiation of the specified job is about to begin. This function is called immediately after select_p_job_test() successfully completes for this job.

Arguments: job_ptr    (input) pointer to the job being initialized. Data in this job record may safely be read or written. The nodes and node_bitmap fields of this job record identify the nodes which have already been selected for this job to use. For an example of a job record field that the plugin may write into, see select_id.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR, which causes the job to be requeued for later execution.

int select_p_job_ready (struct job_record *job_ptr);

Description: Test if resources are configured and ready for job execution. This function is only used in the job prolog for BlueGene systems to determine if the bglblock has been booted and is ready for use.

Arguments: job_ptr    (input) pointer to the job being initialized. Data in this job record may safely be read. The nodes and node_bitmap fields of this job record identify the nodes which have already been selected for this job to use.

Returns: 1 if the job may begin execution, 0 otherwise.

int select_p_job_fini (struct job_record *job_ptr);

Description: Note the termination of the specified job. This function is called as the termination process for the job begins (prior to killing the tasks).

Arguments: job_ptr    (input) pointer to the job being terminated. Data in this job record may safely be read or written. The nodes and/or node_bitmap fields of this job record identify the nodes which were selected for this job to use.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_job_suspend (struct job_record *job_ptr);

Description: Suspend the specified job. Release resources for use by other jobs.

Arguments: job_ptr    (input) pointer to the job being suspended. Data in this job record may safely be read or written. The nodes and/or node_bitmap fields of this job record identify the nodes which were selected for this job to use.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return a SLURM error code.

int select_p_job_resume (struct job_record *job_ptr);

Description: Resume the specified job which was previously suspended.

Arguments: job_ptr    (input) pointer to the job being resumed. Data in this job record may safely be read or written. The nodes and/or node_bitmap fields of this job record identify the nodes which were selected for this job to use.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return a SLURM error code.

int select_p_get_job_cores (uint32_t job_id, int alloc_index, int s);

Description: Get socket-specific core information from a job.

Arguments:
job_id    (input) ID of the job from which to obtain the data.
alloc_index    (input) index, among the nodes allocated to the job, of the node from which to obtain the data.
s    (input) socket index from which to obtain the data.

Returns: the number of cores allocated to the given socket on the given node for the given job. On failure, the plugin should return zero.
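The lookup and its zero-on-failure convention can be sketched as below. The allocation table, its dimensions, and its sample contents are hypothetical. A real plugin would consult whatever per-job allocation records it maintains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_NODES   4
#define EXAMPLE_MAX_SOCKETS 2

/* Hypothetical per-job record of cores allocated on each socket. */
struct example_job_alloc {
    uint32_t job_id;
    int node_cnt;    /* nodes allocated to the job */
    int socket_cnt;  /* sockets per node */
    int cores[EXAMPLE_MAX_NODES][EXAMPLE_MAX_SOCKETS];
};

/* Sample data: job 100 holds two nodes with two sockets each. */
struct example_job_alloc example_allocs[] = {
    { 100, 2, 2, { {4, 2}, {0, 4} } },
};

int select_p_get_job_cores(uint32_t job_id, int alloc_index, int s)
{
    size_t i, n = sizeof(example_allocs) / sizeof(example_allocs[0]);

    for (i = 0; i < n; i++) {
        const struct example_job_alloc *a = &example_allocs[i];

        if (a->job_id != job_id)
            continue;
        assert(a->node_cnt <= EXAMPLE_MAX_NODES &&
               a->socket_cnt <= EXAMPLE_MAX_SOCKETS);
        /* Out-of-range node or socket index: report zero cores. */
        if (alloc_index < 0 || alloc_index >= a->node_cnt ||
            s < 0 || s >= a->socket_cnt)
            return 0;
        return a->cores[alloc_index][s];
    }
    return 0;  /* unknown job */
}
```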

Get/set plugin information

int select_p_get_extra_jobinfo(struct node_record *node_ptr, struct job_record *job_ptr, enum select_data_info info, void *data);

Description: Get plugin-specific information related to the specified job and/or node.

Arguments:
node_ptr    (input) pointer to the node for which information is requested.
job_ptr    (input) pointer to the job for which information is requested.
info    (input) identifies the type of data requested.
data    (output) the requested data.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_get_select_nodeinfo(struct node_record *node_ptr, enum select_data_info info, void *data);

Description: Get plugin-specific information related to the specified node.

Arguments:
node_ptr    (input) pointer to the node for which information is requested.
info    (input) identifies the type of data requested.
data    (output) the requested data.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_update_nodeinfo(struct node_record *node_ptr);

Description: Update plugin-specific information related to the specified node. This is called after changes in a node's configuration.

Arguments: node_ptr    (input) pointer to the node whose information is to be updated.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_get_info_from_plugin(enum select_data_info info, void *data);

Description: Get plugin-specific information.

Arguments:
info    (input) identifies the type of data requested.
data    (output) the requested data.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.
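The get functions in this group share one pattern: an enum selects the item and the result is written through a void pointer whose real type depends on that enum. A minimal sketch of the pattern follows. The enum values, the two counters, and the function name are hypothetical, and the real enum select_data_info is defined by SLURM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLURM_SUCCESS 0
#define SLURM_ERROR  (-1)

/* Hypothetical selector; NOT the real enum select_data_info. */
enum example_data_info {
    EXAMPLE_INFO_NODE_CNT,  /* data points to a uint32_t */
    EXAMPLE_INFO_IDLE_CNT   /* data points to a uint32_t */
};

/* Hypothetical plugin state. */
uint32_t example_node_total = 16;
uint32_t example_node_idle  = 5;

int example_get_info_from_plugin(enum example_data_info info, void *data)
{
    uint32_t *out = (uint32_t *) data;

    if (data == NULL)
        return SLURM_ERROR;
    assert(example_node_idle <= example_node_total);

    switch (info) {
    case EXAMPLE_INFO_NODE_CNT:
        *out = example_node_total;
        break;
    case EXAMPLE_INFO_IDLE_CNT:
        *out = example_node_idle;
        break;
    default:
        return SLURM_ERROR;  /* unrecognized request */
    }
    return SLURM_SUCCESS;
}
```

The caller and the plugin must agree on the type behind data for each enum value, since the compiler cannot check it.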

int select_p_job_init(List job_list);

Description: Used at SLURM startup to synchronize plugin (and node) state with that of currently active jobs.

Arguments: job_list    (input) list of slurm jobs from slurmctld job records.

Returns: SLURM_SUCCESS if successful. On failure, the plugin should return SLURM_ERROR.

int select_p_update_node_state (int index, uint16_t state);

Description: Push a change of state into the plugin. The index should be the index of the node in slurmctld's table for the entire system. The state should be the same state that the node_record was set to in slurmctld.

Arguments:
index    (input) index of the node in reference to the entire system.
state    (input) new state of the node.

Returns: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.

int select_p_alter_node_cnt (enum select_node_cnt type, void *data);

Description: Used for systems such as BlueGene, where SLURM sees one node where many nodes really exist. In BlueGene's case, one SLURM node represents 512 real nodes; since 512 nodes is usually the smallest allocatable block, SLURM handles it as a single node. This function lets the user specify a 'real' node count, which the function then alters so that SLURM can interpret the request in its own terms.

Arguments:
type    (input) enum telling the plugin what the user actually wants.
data    (input/output) a void pointer; depending on the type passed in the first argument, the function adjusts the variable it points to, returning what the user is asking for.

Returns: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.

int select_p_reconfigure (void);

Description: Used to notify plugin of change in partition configuration or general configuration change. The plugin will test global variables for changes as appropriate.

Returns: SLURM_SUCCESS if successful, otherwise SLURM_ERROR.

Versioning

This document describes version 0 of the SLURM node selection API. Future releases of SLURM may revise this API. A node selection plugin conveys its ability to implement a particular API version using the mechanism outlined for SLURM plugins. In addition, the credential is transmitted along with the version number of the plugin that transmitted it. It is at the discretion of the plugin author whether to maintain data format compatibility across different versions of the plugin.

Last modified 8 October 2007

Lawrence Livermore National Laboratory
7000 East Avenue • Livermore, CA 94550
Operated by Lawrence Livermore National Security, LLC, for the Department of Energy's
National Nuclear Security Administration