Perlmutter Timeline

This page records a brief timeline of significant events and user environment changes on Perlmutter. Please see the known issues page for the current list of known issues.

October 28, 2022

Charging for all jobs began.

October 26, 2022

  • Slurm updated to 22.05
  • The 128-node limit on the regular QOS for GPU nodes has been removed. Regular can now accept jobs of all sizes.
  • The early_science QOS has been removed. Please use regular instead. All queued jobs in the early_science QOS have been moved to regular.
  • Numerous updates intended to improve system stability and networking

October 11, 2022

  • Major changes to the internal network and file system to get Perlmutter into its final configuration. Some tuning and changes are still required and will be applied over the next few weeks

September 15, 2022

  • Perlmutter scratch is now available, but it is still undergoing physical maintenance. We expect scratch performance to be degraded, and single-component failures could make the file system unavailable during this maintenance; we estimate a roughly 20% chance of this occurring in the next month. Please use scontrol hold <jobid> by noon on Friday (9/16) to hold any jobs with scratch licenses that you do not want to run during this period (see the example after this list).
  • Numerous updates intended to improve system stability and Community and Home File System access.
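
For example, a queued job can be held now and released after the maintenance with standard Slurm commands (the job ID is a placeholder):

```bash
# Hold a queued job so it does not start during the maintenance window.
scontrol hold 1234567

# Release it once you are ready for it to run again.
scontrol release 1234567
```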

September 7, 2022

  • The software environment has been retooled to better focus on GPU usage. These changes should be transparent to the vast majority of both GPU and CPU codes and will help remove the toil of reloading the same modules in every script for GPU-based codes. As our experience with the system grows, we expect to add more settings that are globally beneficial.
    • New gpu module added as a default module loaded at login. It includes:
      • module load cudatoolkit
      • module load craype-accel-nvidia80
      • Sets MPICH_GPU_SUPPORT_ENABLED=1 to enable access to CUDA-aware Cray MPICH at runtime
    • A companion cpu module
      • This module is mutually exclusive to the gpu module; if one is loaded, the other will be unloaded
      • In the future we may add modules or environment settings we find to be generally beneficial to CPU codes, but for now it is empty
      • Given the current contents, CPU users should be able to run their codes with the gpu module loaded; if there are any problems, users can module load cpu to swap out the gpu module (a short sketch follows this list)
    • Shifter users who want CUDA-aware Cray MPICH at runtime will need to use the cuda-mpich shifter module
  • Long-lived scrontab capabilities added to better support workflows
  • A number of performance counters (e.g., CPU, Memory) that are used by NERSC supported performance profiling tools have been re-enabled on the system
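
As a rough sketch based on the module contents listed above, the new defaults are equivalent to the following, and CPU-only users can swap them out if needed:

```bash
# What the default gpu module effectively provides (per the list above):
module load cudatoolkit
module load craype-accel-nvidia80
export MPICH_GPU_SUPPORT_ENABLED=1   # enables CUDA-aware Cray MPICH at runtime

# CPU-only codes that run into problems can swap to the companion module:
module load cpu   # mutually exclusive with gpu, so this unloads the gpu module
```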

August 24, 2022

  • Perlmutter Scratch file system unmounted for upgrading. All data on Perlmutter Scratch will be unavailable. Jobs already in the queue that were submitted from Perlmutter Scratch will be automatically held. If you submitted a job that depends on scratch from another file system, you can add a scratch license with scontrol update job=<job id> Licenses=scratch[,<other existing licenses>...] to have your job held until scratch is available (see the example after this list).
  • Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
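
For example, to add a scratch license to an already-queued job so it stays held until scratch is back (the job ID is a placeholder; append any other licenses your job already uses):

```bash
# The job will remain held until the scratch license becomes available again.
scontrol update job=1234567 Licenses=scratch
```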

August 15, 2022

  • All Slingshot10 GPU nodes are removed from the system along with their corresponding QOSes (e.g., regular_ss10)
    • Any queued jobs in the Slingshot10 QOSes were moved to their corresponding Slingshot11 QOSes
  • Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.

August 8, 2022

  • Added NVIDIA HPC SDK Version 22.7
    • To use: module load PrgEnv-nvidia nvidia/22.7 (a compile sketch follows this list)
  • Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
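
A possible compile sequence with the new SDK is sketched below; the extra GPU modules and the source file name are illustrative additions, not part of this announcement.

```bash
module load PrgEnv-nvidia nvidia/22.7
module load cudatoolkit craype-accel-nvidia80   # GPU toolchain modules referenced elsewhere in this timeline
CC -o my_app my_app.cpp                         # CC is the Cray C++ compiler wrapper (nvc++ under PrgEnv-nvidia)
```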

August 1, 2022

  • Default switched to Slingshot11 for GPU nodes.
    • Default QOS switched from GPU nodes using the Slingshot10 interconnect to nodes using the Slingshot11 interconnect. If you still wish to run on the Slingshot10 GPU nodes, you can add _ss10 to the QOS on your job submission line (e.g., -q regular_ss10 -C gpu). All queued jobs will run in the QOS that was active when they were submitted.
    • Use squeue --me -O JobID,Name,QOS,Partition to check which QOS and partition your jobs are in (see the example after this list).
    • Login nodes now use the Slingshot11 interconnect.
  • CUDA driver upgraded to version 515.48.07
  • NVIDIA HPC SDK (PrgEnv-nvidia) and CUDA Toolkit (cudatoolkit) module defaults upgraded to 22.5 and 11.7 respectively. The previous versions are still available.
    • CUDA compatibility libraries are no longer needed, so any workarounds you were using to remove them can be dropped.
  • Numerous internal updates to the software and network for the Phase-2 integration of Perlmutter.
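
For example, to check where queued jobs landed and to explicitly opt back into the Slingshot10 GPU nodes (the job script name is a placeholder):

```bash
# Check which QOS and partition your queued jobs are in.
squeue --me -O JobID,Name,QOS,Partition

# Explicitly request the remaining Slingshot10 GPU nodes.
sbatch -C gpu -q regular_ss10 my_job.sh
```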

July 18, 2022

  • The second set of GPU nodes has been upgraded to Slingshot11 and added to the regular_ss11 QOS (see the discussion in July 11, 2022).
    • We expect the number of Slingshot11 GPU nodes to change over the next few weeks, so we recommend you use sinfo to track the number of nodes in each partition. You can use sinfo --format="%.15b %.8D" for a concise summary of nodes or sinfo -o "%.20P %.5a %.10D %.16F" for more verbose output.

July 11, 2022

  • First GPU nodes are upgraded to use the Slingshot11 interconnect. These nodes have upgraded software and 4x25GB/s NICs (previously they had 2x12.5GB/s NICs). Jobs will need to explicitly request these nodes by adding _ss11 to the QOS, e.g., -C gpu -q regular_ss11.
    • There are currently 256 nodes converted to Slingshot11. We expect this number to change over the next few weeks, so we recommend you use sinfo to track the number of nodes in each partition. You can use sinfo --format="%.15b %.8D" for a concise summary of nodes or sinfo -o "%.20P %.5a %.10D %.16F" for more verbose output.
  • CPE default updated to 22.06. Notable changes:
    • NVIDIA compiler version 22.5 and cudatoolkit SDK version 11.7 now available on the system. These will become the defaults soon.
  • Shared QOS now available on the CPU nodes
  • Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter and make cvmfs more stable

June 20, 2022

Default striping of all user scratch directories set to stripe across a single OST because of a bug in the Progressive File Layout striping schema. If you are reading or writing files larger than 1GB, please see our recommendations for Lustre file striping; a brief example follows.
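
A minimal striping sketch using the standard Lustre lfs tool; the directory and stripe count are illustrative placeholders, so follow the NERSC recommendations for actual values.

```bash
# Inspect the current layout of an output directory.
lfs getstripe $SCRATCH/my_output_dir

# Stripe new files created in that directory across 8 OSTs (illustrative count).
lfs setstripe -c 8 $SCRATCH/my_output_dir
```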

June 6, 2022

  • Changes to the batch system
    • Users can now use just -A <account name> (i.e., the extra _g is no longer needed) for jobs requesting GPU resources (see the example after this list).
    • Xfer queue added for data transfers
    • Debug QOS now the default
  • The CUDA compatibility libraries were removed from the PrgEnv-nvidia module (specifically the nvidia module); they are now provided exclusively by the cudatoolkit module, which users must load when compiling code for the GPUs.
  • Second set of CPU nodes are now available to users.
  • Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter
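
As an illustration of the first change above, a GPU job submission might now look like the following (account name and job script are placeholders):

```bash
# The plain account name now works for GPU jobs; the extra _g suffix is no longer needed.
sbatch -A m9999 -C gpu -q regular my_gpu_job.sh
```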

June, 2022

Perlmutter achieved 70.9 Pflop/s (FP64 Tensor Core) using 1,520 compute nodes, placing the system at No. 7 in the Top500 list.

May 25, 2022

  • Maximum job walltime for regular (CPU and GPU nodes) and early_science (GPU nodes) QOSes increased to 12 hours

May 17, 2022

  • Perlmutter opened to all NERSC Users!
  • The default Programming Environment is changed to PrgEnv-gnu
  • Shifter MPI now working on CPU nodes
  • PrgEnv-aocc now working
  • Numerous internal updates to the software and network to prepare the Phase-2 integration of Perlmutter

May 11, 2022

April 29, 2022

  • CPE default updated to 22.04. You may choose to load an older CPE but the behavior is not guaranteed.
    • Notable changes: cray-mpich upgraded to 8.1.15
  • Nvidia driver has been updated to 470.103.01
  • Removed nvidia/21.9 (nvhpc sdk 21.9) from the system
  • Numerous internal upgrades (software and network stack) to prepare the Phase-2 integration of Perlmutter
    • A recompile is not needed, but if you are having issues, please try recompiling your application first.

April 21, 2022

April 7, 2022

March 25, 2022

  • Numerous internal updates to improve configuration, reliability, and performance

March 10, 2022

  • Nvidia HPC SDK v21.11 now default
  • Older cudatoolkit modules removed
  • Slurm upgraded to 21.08; codes that use gpu-binding will need to be reworked
  • CPE 21.11 has been retired
    • There will be no support for gcc/9.3
    • nvcc v11.0 (cudatoolkit/11.0) retired, will no longer be supported
  • Numerous internal updates to improve configuration, reliability, and performance

February 24, 2022

  • Cudatoolkit modules simplified
    • New modules with shorter names point to the most recent releases available
    • Old modules will remain on the system for a short time to allow time to switch over
  • Nvidia HPC SDK v21.11 now available
    • Default will remain 21.9 for a short time to allow time for testing
    • nvidia/21.9 does not support Milan, so the Cray compiler wrappers will build for Rome instead. We recommend that users switch to nvidia/21.11.
  • Upgraded to CPE 22.02. Major changes include:
    • MPICH 8.1.12 to 8.1.13
    • PMI 6.0.15 to 6.0.17
    • hdf5 1.12.0.7 to 1.12.1.1
    • netcdf 4.7.4 to 4.8.1.1
  • Changed sshproxy to support broader kinds of logins
  • Realtime QOS functionality added
  • Numerous internal updates to improve configuration, reliability, and performance

February 10, 2022

  • Node limit for all jobs temporarily lowered to 128 nodes
  • QOS priority modified to encourage wider job variety

January 25, 2022

January 11, 2022

  • Upgraded to CPE 21.12. Major changes include:
    • MPICH upgraded to v8.1.12 (from 8.1.11)
  • The previous programming environment can now be accessed using the cpe module.
  • Numerous internal upgrades to improve configuration and performance.

December 21, 2021

  • GPUs are back in "Default" mode (fixes Known Issue "GPUs are in "Exclusive_Process" instead of "Default" mode")
  • User access to hardware counters restored (fixes Known Issue "Nsight Compute or any performance profiling tool requesting access to h/w counters will not work")
  • Cuda 11.5 compatibility libraries installed and incorporated into Shifter
  • QOS priority modified to encourage wider job variety
  • Numerous internal upgrades

December 6, 2021

  • Major changes to the user environment. All users should recompile their code following our compile instructions
  • The cuda, cray-pmi, and cray-pmi-lib modules have been removed from the default environment
  • The darshan v3.3.1 module has been added to the default environment
  • Default NVIDIA compiler upgraded to v21.9
    • Users must load a cudatoolkit module to compile GPU codes
  • Upgraded to CPE 21.11
    • MPICH upgraded to v8.1.11 (from 8.1.10)
    • PMI upgraded to v6.0.16 (from 6.0.14)
    • FFTW upgraded to 3.3.8.12 (from 3.3.8.11)
    • Python upgraded to 3.9 (from 3.8)
  • Upgrade to SLES15sp2 OS
  • Numerous internal upgrades

November 30, 2021

  • Upgraded Slingshot (internal high speed network) to v1.6
  • Upgraded Lustre server
  • Internal configuration upgrades

November 16, 2021

This was a rolling update where the whole system was updated with minimal interruptions to users.

  • Set MPICH_ALLGATHERV_PIPELINE_MSG_SIZE=0 to improve MPI communication speed for large buffer sizes.
  • Added gpu and cuda-mpich Shifter modules to better support Shifter GPU jobs
  • Deployed fix for CUDA Unknown Error errors that occasionally happen for Shifter jobs using the GPUs
  • Changed ssh settings to reduce frequency of dropped ssh connections
  • Internal configuration updates

November, 2021

Perlmutter achieved 70.9 Pflop/s (FP64 Tensor Core) using 1,520 compute nodes, putting the system at No. 5 in the Top500 list.

November 2, 2021

  • Updated to CPE 21.10. A recompile is recommended but not required. See the documentation of CPE changes from HPE for a full list of changes. Major changes of note include:
    • Upgrade MPICH to 8.1.10 (from 8.1.9)
    • Upgrade DSMML to 0.2.2 (from 0.2.1)
    • Upgraded PMI to 6.0.14 (from 6.0.13)
  • Adjusted QOS configurations to facilitate Jupyter notebook job scheduling.
  • Added preempt QOS. Jobs submitted to this QOS may get preempted after two hours, but may start more quickly. Please see our instructions for running preemptible jobs for details.
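
A minimal submission sketch for the preempt QOS, assuming a GPU job (account name and job script are placeholders):

```bash
# Jobs in this QOS may start sooner but can be preempted after two hours.
sbatch -q preempt -C gpu -A m9999_g my_preemptible_job.sh
```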

October 20, 2021

External ssh access enabled for Perlmutter login nodes.

October 18, 2021

  • Updated slurm job priorities to more efficiently utilize the system and improve the diversity of running jobs.

October 14, 2021

  • Updated NVIDIA driver (to 450.162). This is not expected to have any user impact.
  • Upgraded internal management framework.

October 9, 2021

  • Screen and tmux installed
  • Installed boost v1.66
  • Upgraded nv_peer_mem driver to 1.2 (not expected to have any user impact)

October 5, 2021

Deployed sparewarmer QOS to assist with node-level testing. This is not expected to have any user impact.

October 4, 2021

Limited the wall time of batch jobs to 6 hours to allow a variety of jobs to run during testing. If you need to run jobs for longer than 6 hours, please open a ticket.

September 29, 2021

  • Numerous internal network and management upgrades.

New batch system structure deployed

  • Users will need to specify a QOS (with -q regular, debug, interactive, etc.) as well as a project GPU allocation account name which ends in _g (e.g., -A m9999_g); see the example after this list
  • Please see our Running Jobs Section for examples and an explanation of new queue policies
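
For example, an interactive GPU allocation under the new structure might look like the following (node count and walltime are placeholders):

```bash
# Note the _g suffix on the GPU allocation account name.
salloc -q interactive -C gpu -A m9999_g -N 1 -t 00:30:00
```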

September 24, 2021

  • Upgraded internal management software
  • Upgraded system I/O forwarding software and moved it to a more performant network
  • Fixed csh environment
  • Performance profiling tools that request access to hardware counters (such as Nsight Compute) should now work

September 16, 2021

  • Deployed numerous network upgrades and changes intended to increase responsiveness and performance
  • Increased robustness for login node load balancing

September 10, 2021

  • Updated to CPE 21.09. A recompile is recommended but not required. Major changes of note include:
    • Upgrade MPICH to 8.1.9 (from 8.1.8)
    • Upgrade DSMML to 0.2.1 (from 0.2.0)
    • Upgrade PALS to 1.0.17 (from 1.0.14)
    • Upgrade OpenSHMEMX to 11.3.3 (from 11.3.2)
    • Upgrade craype to 2.7.10 (from 2.7.9)
    • Upgrade CCE to 12.0.3 (from 12.0.2)
    • Upgrade HDF5 to 1.12.0.7 (from 1.12.0.6)
    • GCC 11.2.0 added
  • Added cuda module to the list of default modules loaded at startup
  • Set BASH_ENV to Lmod setup file
  • Deployed numerous network upgrades and changes intended to increase responsiveness and performance
  • Performed kernel upgrades to login nodes for better fail over support
  • Added the latest CMake release as cmake/git-20210830 and set it as the default cmake on the system

September 2, 2021

  • Updated NVIDIA driver (to nvidia-gfxG04-kmp-default-450.142.00_k4.12.14_150.47-0.x86_64). This is not expected to have any user impact.

August 30, 2021

Numerous changes to the NVIDIA programming environment

  • Changed default NVIDIA compiler from 20.9 to 21.7
  • Installed needed CUDA compatibility libraries
  • Added support for multi-CUDA HPC SDK
  • Removed the cudatoolkit and craype-accel-nvidia80 modules from default

Tips for users:

  • Please use module av cuda to see the available CUDA Toolkit modules and module load cuda to load one; the toolkit includes the CUDA C compiler nvcc and associated libraries and tools.
  • CMake may have trouble picking up the correct mpich include files. If it does, you can use set ( CMAKE_CUDA_FLAGS "-I/opt/cray/pe/mpich/8.1.8/ofi/nvidia/20.7/include") to force it to pick up the correct one.
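
The same include path can also be passed on the cmake command line; a rough sketch, assuming an out-of-source build directory:

```bash
# Pass the MPICH include directory from the tip above directly to CMake.
cmake -DCMAKE_CUDA_FLAGS="-I/opt/cray/pe/mpich/8.1.8/ofi/nvidia/20.7/include" ..
```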

June, 2021

Perlmutter achieved 64.6 Pflop/s (FP64 Tensor Core) using 1,424 compute nodes, putting it at No. 5 in the Top500 list.

May 27, 2021

Perlmutter supercomputer dedication.

November, 2020 - March, 2021

Perlmutter Phase 1 delivered.