Runpod Slurm Clusters provide a managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup. For more information on working with Slurm, refer to the Slurm documentation.
Key features
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
- Zero configuration setup: Slurm and munge are pre-installed and fully configured.
- Instant provisioning: Clusters deploy rapidly with minimal setup.
- Automatic role assignment: Runpod automatically designates controller and agent nodes.
- Built-in optimizations: Pre-configured for optimal NCCL performance.
- Full Slurm compatibility: All standard Slurm commands work out-of-the-box.
Deploy a Slurm Cluster
- Open the Instant Clusters page on the Runpod console.
- Click Create Cluster.
- Select Slurm Cluster from the cluster type dropdown menu.
- Configure your cluster specifications:
- Cluster name: Enter a descriptive name for your cluster.
- Pod count: Choose the number of Pods in your cluster.
- GPU type: Select your preferred GPU type.
- Region: Choose your deployment region.
- Network volume (optional): Add a network volume for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
- Pod template: Select a Pod template or click Edit Template to customize start commands, environment variables, ports, or container/volume disk capacity.
- Click Deploy Cluster.
Connect to a Slurm Cluster
Once deployment completes, you can access your cluster from the Instant Clusters page. From this page you can select a cluster to view its component nodes, including a label indicating the Slurm controller (primary node) and Slurm agents (secondary nodes). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management. Connect to a node using the Connect button, or using any of the connection methods supported by Pods.
Submit and manage jobs
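As a sketch of a typical session on the controller node (the job script name and node counts below are illustrative, not part of a default cluster):

```shell
# Check cluster status and available resources on each node.
sinfo

# Run a command interactively across two nodes.
srun --nodes=2 hostname

# Submit a batch job script (example.sh is a hypothetical script).
sbatch example.sh

# Monitor queued and running jobs.
squeue
```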
All standard Slurm commands are available without additional configuration. For example, you can check cluster status and available resources with sinfo, submit batch jobs with sbatch, and monitor the queue with squeue.
Advanced configuration
While Runpod’s Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the web terminal or SSH. Access Slurm configuration files in their standard locations:
- /etc/slurm/slurm.conf: Main configuration file.
- /etc/slurm/gres.conf: Generic resource configuration.
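For example, from the controller node you can inspect the active configuration before making changes (a sketch using the standard Slurm paths and tools; your node and partition names will differ):

```shell
# View the node and partition definitions in the main config file.
grep -E '^(NodeName|PartitionName)' /etc/slurm/slurm.conf

# View GPU (generic resource) definitions.
cat /etc/slurm/gres.conf

# Print the configuration the controller is actually running with.
scontrol show config

# After editing configuration files, tell the daemons to re-read them.
scontrol reconfigure
```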
Troubleshooting
If you encounter issues with your Slurm Cluster, try the following:
- Jobs stuck in pending state: Check resource availability with sinfo and ensure the requested resources are available. If you need more resources, you can add more nodes to your cluster.
- Authentication errors: Munge is pre-configured, but if issues arise, verify that the munge service is running on all nodes.
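The checks above can be sketched as commands run on a cluster node (NODE is a hypothetical agent hostname; substitute one of your own):

```shell
# List partitions and node states; nodes marked 'down' or 'drain' cannot accept jobs.
sinfo

# Show pending jobs along with the reason each one is waiting.
squeue --states=PENDING --format="%.10i %.9P %.20j %.10u %.6D %R"

# Verify munge authentication locally: encode a credential and decode it.
munge -n | unmunge

# Verify munge agreement with another node (keys must match cluster-wide).
munge -n | ssh NODE unmunge
```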