Running Jobs on Perlmutter GPUs

Perlmutter uses Slurm for batch job scheduling. During Allocation Year 2021, jobs run on Perlmutter are free of charge.

Tip

To run a job on Perlmutter GPU nodes, you must submit the job using a project GPU allocation account name, which ends in _g (e.g., m9999_g). An account name without the trailing _g is for charging CPU jobs on Cori and on the Perlmutter Phase 2 CPU-only nodes.

For general information on how to submit jobs using Slurm, monitor jobs, etc., see the general NERSC documentation on running jobs.

Tips and Tricks

Use the -C gpu flag to access Perlmutter GPU nodes

The -C gpu or --constraint=gpu flag must be included in your script or on the command line when submitting a job (e.g., #SBATCH -C gpu). Failing to do so may result in output such as the following from sbatch:

sbatch: error: Job request does not match any supported policy.
sbatch: error: Batch job submission failed: Unspecified error
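
The constraint can also be supplied on the sbatch command line rather than inside the script; a minimal sketch (mxxxx_g and submit_gpu.sh are placeholder names):

sbatch -C gpu -A mxxxx_g submit_gpu.sh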

Always explicitly request GPU resources

The --gpus (or -G) flag is required to ensure that GPU resources are allocated and visible to your srun command. Typically you would add this flag to the #SBATCH preamble of your script, e.g., #SBATCH --gpus=4. You may also use the #SBATCH --gpus-per-task flag to set the number of GPUs per MPI task.

Failing to explicitly request GPU resources may result in output such as the following:

 no CUDA-capable device is detected
 No Cuda device found
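
To avoid this, request GPUs explicitly. The same flags also work for an interactive allocation; a minimal sketch, assuming the interactive QOS is available to your project (mxxxx_g is a placeholder account name):

salloc -N 1 -C gpu -q interactive -t 30 -A mxxxx_g --gpus=4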

--gpus-per-task does not enforce GPU affinity or binding

Despite what its name suggests, --gpus-per-task in the examples below only counts the number of GPUs to allocate to the job; it does not enforce any binding or affinity of GPUs to CPUs or tasks. For more information on affinity on Perlmutter, please see the Perlmutter section of the Process Affinity webpage.
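
To check which GPUs a job step can actually see from inside an allocation, one quick sketch using standard Slurm and NVIDIA tools (not the gpus_for_tasks program used below) is:

srun -n 4 --gpus-per-task=1 bash -c 'echo "task ${SLURM_PROCID}: $(nvidia-smi -L)"'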

Append _g to your project name when submitting to Perlmutter GPU nodes

GPU nodes and CPU nodes at NERSC are allocated separately. The "bank account" for GPU allocations is the project name with _g appended to it. To submit a job to run on the Perlmutter GPU nodes, you must use the _g account. Therefore your job script should contain a line like #SBATCH -A mxxxx_g, where mxxxx represents a common project naming pattern (the letter "m" followed by four digits); your project may be named differently.

If you do not use the _g version of your project name, your job submission will fail with errors such as:

sbatch: error: Job request does not match any supported policy.
sbatch: error: Batch job submission failed: Unspecified error
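
If you are unsure which account names are associated with your login, the standard Slurm accounting tool can list them; a minimal sketch (the exact output depends on site configuration):

sacctmgr show associations user=$USER format=account,qos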

The -c argument is based on the number of tasks per node

The whole-number argument to the -c flag (CPUs per task) decreases as the number of tasks per node increases. Each Perlmutter GPU node has 64 physical cores, each with 2 hyperthreads, so compute this value with

2 \times \left\lfloor \frac{64}{\text{tasks per node}} \right\rfloor

For example, if you want to run 5 MPI tasks per node, then your argument to the -c flag would be calculated as

2 \times \left\lfloor \frac{64}{5} \right\rfloor = 2 \times 12 = 24.
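
The same calculation can be scripted with shell arithmetic; a minimal sketch, assuming the 64 physical cores (128 hyperthreads) of a Perlmutter GPU node:

tasks_per_node=5
cpus_per_task=$(( 2 * (64 / tasks_per_node) ))   # bash integer division floors the quotient
echo "#SBATCH -c ${cpus_per_task}"               # prints: #SBATCH -c 24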

Example scripts

Tip

The examples below use a code called ./gpus_for_tasks. To build ./gpus_for_tasks for yourself, see the code and commands in the GPU affinity settings section.

1 node, 1 task, 1 GPU

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 128
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun ./gpus_for_tasks

Output:

Rank 0 out of 1 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:C1:00.0

1 node, 4 tasks, 4 GPUs, all GPUs visible to all tasks

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun ./gpus_for_tasks

Output:

Rank 0 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
1 for rank 0: 0000:41:00.0
2 for rank 0: 0000:81:00.0
3 for rank 0: 0000:C1:00.0
Rank 2 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:02:00.0
1 for rank 2: 0000:41:00.0
2 for rank 2: 0000:81:00.0
3 for rank 2: 0000:C1:00.0
Rank 3 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:02:00.0
1 for rank 3: 0000:41:00.0
2 for rank 3: 0000:81:00.0
3 for rank 3: 0000:C1:00.0
Rank 1 out of 4 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:02:00.0
1 for rank 1: 0000:41:00.0
2 for rank 1: 0000:81:00.0
3 for rank 1: 0000:C1:00.0

1 node, 4 tasks, 4 GPUs, 1 GPU visible to each task

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3

export SLURM_CPU_BIND="cores"
srun ./gpus_for_tasks

Output:

Rank 0 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
Rank 3 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:C1:00.0
Rank 2 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:81:00.0
Rank 1 out of 4 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:41:00.0

4 nodes, 16 tasks, 16 GPUs, all GPUs visible to all tasks

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 16
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1

export SLURM_CPU_BIND="cores"
srun ./gpus_for_tasks

Output:

Rank 13 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 13: 0000:02:00.0
1 for rank 13: 0000:41:00.0
2 for rank 13: 0000:81:00.0
3 for rank 13: 0000:C1:00.0
Rank 3 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:02:00.0
1 for rank 3: 0000:41:00.0
2 for rank 3: 0000:81:00.0
3 for rank 3: 0000:C1:00.0
Rank 11 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 11: 0000:02:00.0
1 for rank 11: 0000:41:00.0
2 for rank 11: 0000:81:00.0
3 for rank 11: 0000:C1:00.0
Rank 5 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 5: 0000:02:00.0
1 for rank 5: 0000:41:00.0
2 for rank 5: 0000:81:00.0
3 for rank 5: 0000:C1:00.0
Rank 15 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 15: 0000:02:00.0
1 for rank 15: 0000:41:00.0
2 for rank 15: 0000:81:00.0
3 for rank 15: 0000:C1:00.0
Rank 14 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 14: 0000:02:00.0
1 for rank 14: 0000:41:00.0
2 for rank 14: 0000:81:00.0
3 for rank 14: 0000:C1:00.0
Rank 12 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 12: 0000:02:00.0
1 for rank 12: 0000:41:00.0
2 for rank 12: 0000:81:00.0
3 for rank 12: 0000:C1:00.0
Rank 9 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 9: 0000:02:00.0
1 for rank 9: 0000:41:00.0
2 for rank 9: 0000:81:00.0
3 for rank 9: 0000:C1:00.0
Rank 10 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 10: 0000:02:00.0
1 for rank 10: 0000:41:00.0
2 for rank 10: 0000:81:00.0
3 for rank 10: 0000:C1:00.0
Rank 8 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 8: 0000:02:00.0
1 for rank 8: 0000:41:00.0
2 for rank 8: 0000:81:00.0
3 for rank 8: 0000:C1:00.0
Rank 1 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:02:00.0
1 for rank 1: 0000:41:00.0
2 for rank 1: 0000:81:00.0
3 for rank 1: 0000:C1:00.0
Rank 2 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:02:00.0
1 for rank 2: 0000:41:00.0
2 for rank 2: 0000:81:00.0
3 for rank 2: 0000:C1:00.0
Rank 0 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0
1 for rank 0: 0000:41:00.0
2 for rank 0: 0000:81:00.0
3 for rank 0: 0000:C1:00.0
Rank 6 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 6: 0000:02:00.0
1 for rank 6: 0000:41:00.0
2 for rank 6: 0000:81:00.0
3 for rank 6: 0000:C1:00.0
Rank 7 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 7: 0000:02:00.0
1 for rank 7: 0000:41:00.0
2 for rank 7: 0000:81:00.0
3 for rank 7: 0000:C1:00.0
Rank 4 out of 16 processes: I see 4 GPUs. Their PCI Bus IDs are:
0 for rank 4: 0000:02:00.0
1 for rank 4: 0000:41:00.0
2 for rank 4: 0000:81:00.0
3 for rank 4: 0000:C1:00.0

4 nodes, 16 tasks, 16 GPUs, 1 GPU visible to each task

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 16
#SBATCH --ntasks-per-node=4
#SBATCH -c 32
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3

export SLURM_CPU_BIND="cores"
srun ./gpus_for_tasks

Output:

Rank 6 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 6: 0000:81:00.0
Rank 13 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 13: 0000:41:00.0
Rank 1 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 1: 0000:41:00.0
Rank 10 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 10: 0000:81:00.0
Rank 15 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 15: 0000:C1:00.0
Rank 9 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 9: 0000:41:00.0
Rank 7 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 7: 0000:C1:00.0
Rank 14 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 14: 0000:81:00.0
Rank 11 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 11: 0000:C1:00.0
Rank 5 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 5: 0000:41:00.0
Rank 12 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 12: 0000:02:00.0
Rank 8 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 8: 0000:02:00.0
Rank 4 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 4: 0000:02:00.0
Rank 2 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 2: 0000:81:00.0
Rank 3 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 3: 0000:C1:00.0
Rank 0 out of 16 processes: I see 1 GPUs. Their PCI Bus IDs are:
0 for rank 0: 0000:02:00.0

Single-GPU tasks in parallel

Users who have many independent single-GPU tasks may wish to pack these into one job which runs the tasks in parallel on different GPUs. There are multiple ways to accomplish this; here we present one example.

srun

The Slurm srun command can be used to launch individual tasks, each allocated a portion of the resources requested by the job script. An example of this is:

#!/bin/bash
#SBATCH -A <account_g>
#SBATCH -C gpu
#SBATCH -N 1
#SBATCH -t 5
#SBATCH --ntasks-per-node=4

srun --exact -u -n 1 --gpus-per-task 1 -c 1 --mem-per-cpu=4G bash -c 'date +%M:%S; sleep 15; date +%M:%S' &
srun --exact -u -n 1 --gpus-per-task 1 -c 1 --mem-per-cpu=4G bash -c 'date +%M:%S; sleep 15; date +%M:%S' &
srun --exact -u -n 1 --gpus-per-task 1 -c 1 --mem-per-cpu=4G bash -c 'date +%M:%S; sleep 15; date +%M:%S' &
srun --exact -u -n 1 --gpus-per-task 1 -c 1 --mem-per-cpu=4G bash -c 'date +%M:%S; sleep 15; date +%M:%S' &

wait

The output shows that all steps started at the same time:

12:48
12:48
12:48
12:48
13:03
13:03
13:03
13:03

Each srun invocation requests one task and one GPU for that task. Specifying --exact allows the steps to launch in parallel as long as the remaining resources still fit on the node. It is therefore necessary to also specify CPU and memory usage with -c 1 --mem-per-cpu=4G; otherwise each step would claim all CPUs and memory (the default), causing the steps to wait for one another to free up resources. If these 4 tasks are all you wish to run on that node, you can give each task/GPU more CPUs and memory, e.g., -c 32 --mem-per-gpu=60G would split the node's resources into 4 equally sized parts. The & at the end of each line puts the task in the background, and the final wait command is needed to allow all of the tasks to run to completion.

Set CUDA_VISIBLE_DEVICES manually in each srun task

Currently, Slurm does not forward the correct device indices to the job steps, which will result in a failure when the code tries to acquire a GPU device. Set CUDA_VISIBLE_DEVICES=0 if you give each job step one GPU, CUDA_VISIBLE_DEVICES=0,1 if you give each job step two GPUs, and so on.
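
Combining the packing pattern above with this tip, a loop-based sketch (./my_single_gpu_app and its input names are placeholders, not part of the original example):

for i in 0 1 2 3; do
  srun --exact -u -n 1 --gpus-per-task 1 -c 32 --mem-per-gpu=60G \
       bash -c "export CUDA_VISIBLE_DEVICES=0; ./my_single_gpu_app input_${i}" &
done
wait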

Do not use srun for large numbers of tasks

This approach is feasible for relatively small numbers (i.e., tens) of tasks but should not be used for hundreds or thousands of tasks. To run larger numbers of tasks, GNU parallel is preferred; it will be provided on Perlmutter soon.