GPU Power Capping on Perlmutter

As HPC enters the exascale era, power has become a critical limiting factor. Power capping, one of the most commonly used power management approaches, can effectively keep a system and its jobs within a preset power limit. To prepare for power-constrained future systems, NERSC encourages users to explore power capping with their production workloads and see whether they can adopt it without significantly hurting performance.

Perlmutter allows end users to cap GPU power through a SLURM directive. A SLURM plugin developed at NERSC makes the functionality of the nvidia-smi -pl <power limit> command, which normally requires root privileges, available to regular users.
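
For reference, the vendor command mentioned above has the following form. It requires root privileges, so on Perlmutter you request the cap through the SBATCH directive described below rather than running it yourself; the GPU index and the 200 W value are only illustrative.

# Set a 200 W power limit on GPU 0 (shown for reference only; requires root).
nvidia-smi -i 0 -pl 200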

How to Apply a GPU Power Cap

The supported power limits on Perlmutter GPU nodes are:

  • A100 40 GB GPUs: Power cap range is 100 W - 400 W.
  • A100 80 GB GPUs: Power cap range is 100 W - 500 W.
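
If you are unsure which A100 model your job is running on, one way to check from a compute node is to query the GPU name and its maximum supported power limit with nvidia-smi. This is a quick sketch; the exact output depends on the driver version.

# Report the GPU model and its maximum supported power limit (run on a compute node).
nvidia-smi --query-gpu=name,power.max_limit --format=csv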

To request a specific power cap value for your job, use the following SLURM directive in your job script:

#SBATCH --gpu-power=200

or

#SBATCH --gpu-power=200W

This will apply a 200 W GPU power cap to all allocated nodes for your job.

Sample job script:

#!/bin/bash 

#SBATCH -J pc200w 
#SBATCH -q regular 
#SBATCH -C gpu 
#SBATCH -N 2 
#SBATCH -G 8
#SBATCH -t 4:00:00 
#SBATCH -A mxyz 
#SBATCH --gpu-power=200
#SBATCH -o %x-%j.out

srun -n 8 -c 32 --cpu-bind=cores -G 8 --gpu-bind=none ./a.out 

Note

  • GPU power capping also works for the shared QOS (#SBATCH -q shared) on Perlmutter, where it sets the power limit for the individual GPUs allocated on the shared node (see the sketch below).
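
A minimal shared-QOS job script with a GPU power cap might look like the following sketch; the job name, account (mxyz), time limit, single-GPU request, and 200 W cap are illustrative and should be adjusted for your workload.

#!/bin/bash

#SBATCH -J pc200w-shared
#SBATCH -q shared
#SBATCH -C gpu
#SBATCH -G 1
#SBATCH -c 32
#SBATCH -t 1:00:00
#SBATCH -A mxyz
#SBATCH --gpu-power=200
#SBATCH -o %x-%j.out

srun -n 1 -c 32 --cpu-bind=cores -G 1 ./a.out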

How to Track Power Cap Usage

You can track the GPU power cap usage via the sacct command, which reports the power cap in the AdminComment field on Perlmutter.

Example Workflow

  1. Request an interactive job with a 200 W power cap:
elvis@perlmutter:login34:~> salloc -C gpu -q interactive --gpu-power=200 -A mxyz 
...
salloc: Nodes nid001124 are ready for job
  2. Check the GPU power cap and power usage using nvidia-smi:
elvis@nid001124:~> nvidia-smi -q -i 0 -d POWER

Sample output:

============== NVSMI LOG ==============
Timestamp                                 : Mon Dec 16 11:34:42 2024
Driver Version                            : 535.216.01
CUDA Version                              : 12.2

Attached GPUs                             : 4
GPU 00000000:03:00.0
    GPU Power Readings
        Power Draw                        : 53.87 W
        Current Power Limit               : 200.00 W
        Requested Power Limit             : 200.00 W
        Default Power Limit               : 400.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 400.00 W
    Power Samples
        Duration                          : 2.39 sec
        Number of Samples                 : 119
        Max                               : 53.94 W
        Min                               : 53.87 W
        Avg                               : 53.88 W
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
  3. Exit the interactive session:
elvis@nid001124:~> exit
exit
salloc: Relinquishing job allocation 33994454
  4. Check the power cap usage with sacct:
elvis@perlmutter:login34:~> sacct -j 33994454 -XPno admincomment | jq . | fgrep gpuPower
"gpuPower": "200",
"gpuPowerRaw": "200",

The gpuPower field displays the applied power cap (e.g., 200 W in this case). If the #SBATCH --gpu-power=200W directive is used (note the "W" for watts), the gpuPowerRaw field will report "200W", while the gpuPower field will display "200" (without the unit).
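
If you prefer to pull out both fields in one step, a jq filter along the following lines can be used; the recursive-descent form is a hedge against the exact nesting of the AdminComment JSON, and 33994454 is just the example job ID from above.

# Extract the power-cap fields recorded for a job (the job ID is illustrative).
sacct -j 33994454 -XPno admincomment | jq '.. | objects | select(has("gpuPower")) | {gpuPower, gpuPowerRaw}'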

Notes

  • The power cap applies to all GPUs allocated to your job throughout the job duration (hence to all job steps).
  • Currently, power capping capability is not available via srun (per job step).
  • Ensure you specify the correct power limit based on the GPU model (e.g., A100 40 GB vs. A100 80 GB). If the requested power falls outside the allowed range, the power limit will be automatically adjusted to the nearest valid value, and your job will proceed with a message similar to the following:

    slurmstepd: error: gpu-power: nid001204: requested power 80W less than 100W, setting to 100W
    

    or

    slurmstepd: error: gpu-power: nid001021: requested power 600W greater than maximum rating of 400W, setting to 400W
    
  • The nvidia-smi tool can provide detailed power usage metrics, which can be useful for debugging or monitoring power consumption.
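
As an example of the monitoring mentioned in the last note, a loop such as the following can be run on a compute node alongside your application to watch power draw against the current limit; the 5-second interval and the selected fields are illustrative.

# Print the power draw and current power limit of the GPUs every 5 seconds.
nvidia-smi --query-gpu=timestamp,name,power.draw,power.limit --format=csv -l 5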

For more information about the nvidia-smi and sacct options used in this document, please refer to the nvidia-smi help output (nvidia-smi -h) and the sacct man page (man sacct).