SLURM: A Highly Scalable Resource Manager
SLURM is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
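The three functions above are visible in an ordinary job submission: a batch script requests an allocation, srun launches the parallel tasks on the allocated nodes, and the job waits in the pending queue until resources become available. A minimal sketch follows (the partition name "debug" and the resource limits are assumptions, not defaults):

```shell
#!/bin/sh
# Hypothetical batch script: request an allocation of 4 nodes
# for 30 minutes in an assumed partition named "debug"
# (function one: resource allocation).
#SBATCH --nodes=4
#SBATCH --time=30
#SBATCH --partition=debug

# Launch one task per allocated node; SLURM starts and
# monitors these tasks (function two: job execution).
srun hostname
```

Submitting the script with "sbatch job.sh" places it in the queue of pending work (function three); "squeue" reports its state while it waits and runs.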
SLURM is not a sophisticated batch system, but it does provide an Application Programming Interface (API) for integration with external schedulers such as the Maui Scheduler and Moab Cluster Suite. While other resource managers do exist, SLURM is unique in several respects:
- Its source code is freely available under the GNU General Public License.
- It is designed to operate in a heterogeneous cluster with up to 65,536 nodes.
- It is portable: written in C with a GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets. A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc.
- SLURM is highly tolerant of system failures, including failure of the node executing its control functions.
- It is simple enough for the motivated end user to understand its source and add functionality.
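The plugin mechanism mentioned above is driven by configuration: administrators select plugins by name in the slurm.conf file. The excerpt below is a sketch; the particular plugin choices shown are illustrative examples, not recommendations:

```
# Excerpt from a hypothetical slurm.conf
AuthType=auth/munge        # authentication plugin (Munge credentials)
SwitchType=switch/elan     # interconnect plugin (Quadrics Elan)
SchedulerType=sched/wiki   # scheduler plugin; sched/wiki defers
                           # scheduling decisions to an external
                           # scheduler such as Maui or Moab
```

Swapping one plugin for another is a configuration change rather than a code change, which is what makes SLURM practical across different interconnects and authentication infrastructures.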
SLURM provides resource management on approximately 1,000 computers worldwide, including many of the most powerful systems in existence:
- BlueGene/L at LLNL with 106,496 dual-core processors
- EKA at Computational Research Laboratories, India, with 14,240 Xeon processors and an InfiniBand interconnect
- ASC Purple, an IBM SP/AIX cluster at LLNL with 12,208 Power5 processors and a Federation switch
- MareNostrum, a Linux cluster at the Barcelona Supercomputing Center with 10,240 PowerPC processors and a Myrinet switch
SLURM is actively being developed, distributed, and supported by Lawrence Livermore National Laboratory, Hewlett-Packard, Bull, Cluster Resources, and SiCortex.
Last modified 29 November 2007