# Perlmutter Timeline
This page records a brief timeline of significant events and user environment changes on Perlmutter.
## June 20, 2022
Default striping of all user scratch directories set to stripe across a single OST because of a bug in the Progressive File Layout striping schema. If you are reading or writing files larger than 1 GB, please see our recommendations for Lustre file striping.
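If you need wider striping for large files, the standard Lustre `lfs` tool can be used along the lines of the sketch below (the file and directory names and the stripe count of 8 are illustrative placeholders, not NERSC's official recommendation):

```bash
# Check the current striping layout of a file or directory on scratch
lfs getstripe $SCRATCH/my_large_file.h5

# Stripe a directory across 8 OSTs; new files written inside inherit this layout
lfs setstripe -c 8 $SCRATCH/my_large_output_dir
```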
## June 6, 2022
- Changes to the batch system
    - Users can now use just `-A <account name>` (i.e. the extra `_g` is no longer needed) for jobs requesting GPU resources; see the sketch after this list
    - Xfer queue added for data transfers
    - Debug QOS now the default
- The CUDA compatibility libraries were removed from the PrgEnv-nvidia module (specifically the `nvidia` module). The CUDA compatibility libraries are now exclusively in the `cudatoolkit` module and users are reminded to load this module if they are compiling code for the GPUs.
- A second set of CPU nodes is now available to users.
- Numerous internal updates to the software and network to prepare for the Phase-2 integration of Perlmutter
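For illustration, a minimal sketch of submitting jobs under these changes (the account name `m1234`, the script names, and the resource options are placeholders, not official NERSC recommendations):

```bash
# GPU job: the plain project account now works; the _g suffix is no longer required
sbatch -A m1234 -C gpu -q regular -N 1 --gpus-per-node=4 -t 01:00:00 gpu_job.sh

# Data-transfer job submitted to the new xfer queue
sbatch -A m1234 -q xfer -t 02:00:00 transfer_job.sh
```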
## May 25, 2022
- Maximum job walltime for `regular` (CPU and GPU nodes) and `early_science` (GPU nodes) QOSes increased to 12 hours
## May 17, 2022
- Perlmutter opened to all NERSC Users!
- The default Programming Environment is changed to PrgEnv-gnu
- Shifter MPI now working on CPU nodes
- PrgEnv-aocc now working
- Numerous internal updates to the software and network to prepare for the Phase-2 integration of Perlmutter
## May 11, 2022
- The first set of Slingshot11 CPU nodes are now available for user jobs. Please see the Perlmutter QOS policy page for QOS details.
## April 29, 2022
- CPE default updated to 22.04. You may choose to load an older CPE, but the behavior is not guaranteed.
    - Notable changes: cray-mpich upgraded to 8.1.15
- NVIDIA driver has been updated to 470.103.01
- Removed `nvidia/21.9` (NVHPC SDK 21.9) from the system
- `cudatoolkit/11.0` and `cudatoolkit/11.4` dropped as available modules.
    - You can continue to compile using older CUDA versions; see the note about CUDA compatibility libraries
- Numerous internal upgrades (software and network stack) to prepare for the Phase-2 integration of Perlmutter
- A re-compile is not needed, but if you're having issues, please try recompiling your application first.
## April 21, 2022
- Node limit restriction for the `early_science` QOS has been lifted. Please see the Perlmutter QOS policy page for details.
## April 7, 2022
- NVIDIA Data Center GPU Manager (dcgm) enabled on all nodes. Users will need to disable dcgm before running profiler tools that require access to hardware counters; see the sketch after this list.
- Newest versions (2022.x) of nsight-compute and nsight-systems removed pending vendor bug fixes
- Numerous internal updates to improve configuration, reliability, and performance
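One way to do this inside a job is to pause dcgm around the profiling run, roughly as sketched below (the Nsight Compute invocation and `./my_gpu_app` are placeholders; consult the NERSC profiling documentation for the recommended recipe):

```bash
# Pause DCGM monitoring on each node so the profiler can access hardware counters
srun --ntasks-per-node=1 dcgmi profile --pause

# Run the profiler (Nsight Compute shown here; ./my_gpu_app is a placeholder)
srun ncu -o report ./my_gpu_app

# Resume DCGM monitoring afterwards
srun --ntasks-per-node=1 dcgmi profile --resume
```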
## March 25, 2022
- Numerous internal updates to improve configuration, reliability, and performance
## March 10, 2022
- Nvidia HPC SDK v21.11 now default
- Older cudatoolkit modules removed
- Slurm upgraded to 21.08; codes that use gpu-binding will need to be reworked
- CPE 21.11 has been retired
    - There will be no support for `gcc/9.3`
- nvcc v11.0 (`cudatoolkit/11.0`) retired; it will no longer be supported
- Numerous internal updates to improve configuration, reliability, and performance
## February 24, 2022
- Cudatoolkit modules simplified
    - New modules with shorter names point to the most recent releases available
    - Old modules will remain on the system for a short time to allow time to switch over
- Nvidia HPC SDK v21.11 now available
    - Default will remain 21.9 for a short time to allow time for testing
    - `nvidia/21.9` does not support Milan, so the Cray compiler wrappers will build for Rome instead. We recommend that users switch to `nvidia/21.11`.
- Upgraded to CPE 22.02. Major changes include:
    - MPICH 8.1.12 to 8.1.13
    - PMI 6.0.15 to 6.0.17
    - hdf5 1.12.0.7 to 1.12.1.1
    - netcdf 4.7.4 to 4.8.1.1
- Change to sshproxy to support broader kinds of logins
- Realtime QOS functionality added
- Numerous internal updates to improve configuration, reliability, and performance
## February 10, 2022
- Node limit for all jobs temporarily lowered to 128 nodes
- QOS priority modified to encourage wider job variety
## January 25, 2022
- Cudatoolkit modules now link to correct math libraries (fixes Known Issue "Users will encounter problems linking CUDA math libraries").
- Update to DVS configuration to support CVMFS.
- The latest Nsight systems and Nsight compute performance tools are now available.
- Numerous internal upgrades to improve configuration and performance.
## January 11, 2022
- Upgraded to CPE 21.12. Major changes include:
    - MPICH upgraded to v8.1.12 (from 8.1.11)
    - The previous programming environment can now be accessed using the `cpe` module.
- Numerous internal upgrades to improve configuration and performance.
## December 21, 2021
- GPUs are back in "Default" mode (fixes Known Issue "GPUs are in "Exclusive_Process" instead of "Default" mode")
- User access to hardware counters restored (fixes Known Issue "Nsight Compute or any performance profiling tool requesting access to h/w counters will not work")
- CUDA 11.5 compatibility libraries installed and incorporated into Shifter
- QOS priority modified to encourage wider job variety
- Numerous internal upgrades
## December 6, 2021
- Major changes to the user environment. All users should recompile their code following our compile instructions
    - The `cuda`, `cray-pmi`, and `cray-pmi-lib` modules have been removed from the default environment
    - The `darshan` v3.3.1 module has been added to the default environment
    - Default NVIDIA compiler upgraded to v21.9
    - Users must load a `cudatoolkit` module to compile GPU codes; see the sketch after this list
- Upgraded to CPE 21.11
    - MPICH upgraded to v8.1.11 (from 8.1.10)
    - PMI upgraded to v6.0.16 (from 6.0.14)
    - FFTW upgraded to 3.3.8.12 (from 3.3.8.11)
    - Python upgraded to 3.9 (from 3.8)
- Upgrade to SLES15sp2 OS
- Numerous internal upgrades
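A minimal sketch of the new compile step (the source file is a placeholder; follow the linked compile instructions for the full, authoritative recipe):

```bash
# The cudatoolkit module is no longer part of the default environment;
# load it before compiling GPU code so nvcc and the CUDA libraries are available
module load cudatoolkit

# Compile a standalone CUDA source (my_kernel.cu is a placeholder)
nvcc -o my_kernel my_kernel.cu
```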
## November 30, 2021
- Upgraded Slingshot (internal high speed network) to v1.6
- Upgraded Lustre server
- Internal configuration upgrades
## November 16, 2021
This was a rolling update where the whole system was updated with minimal interruptions to users.
- Set `MPICH_ALLGATHERV_PIPELINE_MSG_SIZE=0` to improve MPI communication speed for large buffer sizes.
- Added `gpu` and `cuda-mpich` Shifter modules to better support Shifter GPU jobs
- Deployed fix for `CUDA Unknown Error` errors that occasionally happen for Shifter jobs using the GPUs
- Changed ssh settings to reduce frequency of dropped ssh connections
- Internal configuration updates
## November 2, 2021
- Updated to CPE 21.10. A recompile is recommended but not required. See the documentation of CPE changes from HPE for a full list of changes. Major changes of note include:
    - Upgrade MPICH to 8.1.10 (from 8.1.9)
    - Upgrade DSMML to 0.2.2 (from 0.2.1)
    - Upgrade PMI to 6.0.14 (from 6.0.13)
- Adjusted QOS configurations to facilitate Jupyter notebook job scheduling.
- Added `preempt` QOS. Jobs submitted to this QOS may get preempted after two hours, but may start more quickly. Please see our instructions for running preemptible jobs for details, and the sketch after this list for an example.
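A hedged sketch of a preemptible job script under the new QOS (the account `m1234_g`, the executable, and the walltime/requeue settings are placeholders, not official recommendations):

```bash
#!/bin/bash
#SBATCH -q preempt            # new preemptible QOS
#SBATCH -A m1234_g            # placeholder GPU allocation account
#SBATCH -C gpu
#SBATCH -N 1
#SBATCH -t 06:00:00
#SBATCH --time-min=02:00:00   # the job may be preempted after two hours
#SBATCH --requeue             # ask Slurm to requeue the job if it is preempted

srun ./my_gpu_app             # placeholder executable
```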
## October 20, 2021
External ssh access enabled for Perlmutter login nodes.
## October 18, 2021
- Updated slurm job priorities to more efficiently utilize the system and improve the diversity of running jobs.
## October 14, 2021
- Updated NVIDIA driver (to 450.162). This is not expected to have any user impact.
- Upgraded internal management framework.
## October 9, 2021
- `screen` and `tmux` installed
- Installed boost v1.66
- Upgraded nv_peer_mem driver to 1.2 (not expected to have any user impact)
## October 5, 2021
Deployed `sparewarmer` QOS to assist with node-level testing. This is not expected to have any user impact.
## October 4, 2021
Limited the wall time of batch jobs to 6 hours to allow a variety of jobs to run during testing. If you need to run jobs for longer than 6 hours, please open a ticket.
## September 29, 2021
- Numerous internal network and management upgrades.
### New batch system structure deployed
- Users will need to specify a QOS (with `-q regular`, `debug`, `interactive`, etc.) as well as a project GPU allocation account name, which ends in `_g` (e.g., `-A m9999_g`); see the example script after this list
    - We have some instructions for setting your default allocation account in our Slurm Defaults Section
- Please see our Running Jobs Section for examples and an explanation of the new queue policies
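For example, a job script under the new structure might look like the following (node count, walltime, GPU options, and the executable are illustrative; `m9999_g` is the placeholder account used above):

```bash
#!/bin/bash
#SBATCH -q regular            # a QOS must now be specified (regular, debug, interactive, ...)
#SBATCH -A m9999_g            # GPU allocation account name ending in _g
#SBATCH -C gpu
#SBATCH -N 2
#SBATCH --gpus-per-node=4
#SBATCH -t 01:00:00

srun ./my_gpu_app             # placeholder executable
```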
## September 24, 2021
- Upgraded internal management software
- Upgraded system I/O forwarding software and moved it to a more performant network
- Fixed csh environment
- Performance profiling tools that request access to hardware counters (such as Nsight Compute) should now work
## September 16, 2021
- Deployed numerous network upgrades and changes intended to increase responsiveness and performance
- Increased robustness for login node load balancing
## September 10, 2021
- Updated to CPE 21.09. A recompile is recommended but not required. Major changes of note include:
    - Upgrade MPICH to 8.1.9 (from 8.1.8)
    - Upgrade DSMML to 0.2.1 (from 0.2.0)
    - Upgrade PALS to 1.0.17 (from 1.0.14)
    - Upgrade OpenSHMEMX to 11.3.3 (from 11.3.2)
    - Upgrade craype to 2.7.10 (from 2.7.9)
    - Upgrade CCE to 12.0.3 (from 12.0.2)
    - Upgrade HDF5 to 1.12.0.7 (from 1.12.0.6)
    - GCC 11.2.0 added
- Added `cuda` module to the list of default modules loaded at startup
- Set BASH_ENV to the Lmod setup file
- Deployed numerous network upgrades and changes intended to increase responsiveness and performance
- Performed kernel upgrades to login nodes for better failover support
- Added latest CMake release as `cmake/git-20210830`; it is set as the default `cmake` on the system
## September 2, 2021
- Updated NVIDIA driver (to `nvidia-gfxG04-kmp-default-450.142.00_k4.12.14_150.47-0.x86_64`). This is not expected to have any user impact.
## August 30, 2021
### Numerous changes to the NVIDIA programming environment
- Changed default NVIDIA compiler from 20.9 to 21.7
- Installed needed CUDA compatibility libraries
- Added support for multi-CUDA HPC SDK
- Removed the `cudatoolkit` and `craype-accel-nvidia80` modules from the default environment
Tips for users:
- Please use `module av cuda` to see the available versions and `module load cuda` to get the CUDA Toolkit, including the CUDA C compiler `nvcc` and associated libraries and tools.
- CMake may have trouble picking up the correct mpich include files. If it does, you can use `set(CMAKE_CUDA_FLAGS "-I/opt/cray/pe/mpich/8.1.8/ofi/nvidia/20.7/include")` to force it to pick up the correct one.
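A hedged sketch of applying these tips from the command line (the include path is the one quoted above and is specific to that mpich/nvidia combination; `/path/to/source` is a placeholder):

```bash
# List the available CUDA Toolkit versions, then load one
module avail cuda
module load cuda

# If CMake picks up the wrong MPI headers, point CMAKE_CUDA_FLAGS at the
# Cray MPICH include directory quoted above
cmake -DCMAKE_CUDA_FLAGS="-I/opt/cray/pe/mpich/8.1.8/ofi/nvidia/20.7/include" /path/to/source
```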
## June, 2021
Perlmutter achieved 64.6 Pflop/s, putting it at No. 5 in the Top500 list.
## May 27, 2021
Perlmutter supercomputer dedication.
## November, 2020 - March, 2021
Perlmutter Phase 1 delivered.