Network
Perlmutter Slingshot Interconnect
Perlmutter uses Slingshot, a high-performance networking interconnect technology developed by Hewlett Packard Enterprise (HPE). The hardware component of Perlmutter's Slingshot interconnect is described in more detail on the Perlmutter Architecture page. We track some user-tunable features of the Slingshot host software (SHS) here. In general, more details can be found in the intro_mpi man page (man intro_mpi).
Adjusting message matching mode
Hardware message matching is a powerful tool for improving the performance, efficiency, and reliability of HPC network communication. It typically involves the use of specialized hardware components that are optimized for high-speed data processing. These components can perform message matching operations in real-time, enabling high-speed data transfer and processing in complex systems.
By default, SHS is set to use hardware message matching and only switch over to (slower) software message matching when the hardware counters are full. This is controlled by setting the environment variable FI_CXI_RX_MATCH_MODE to hybrid. This ensures that codes that generate many MPI messages will continue to function once they've filled up the hardware queues on the Network Interface Cards (albeit at a slower rate). This comes at the cost of a slight overhead in memory usage. The exact value depends on the number of ranks and the size of each request buffer, but will generally be less than 2% of the existing memory on the node. If users wish to use only hardware message matching, they can set FI_CXI_RX_MATCH_MODE=hardware (see the sketch after the warning below).
Defaulting to pure hardware mode can cause jobs to fail
Setting FI_CXI_RX_MATCH_MODE=hardware can cause jobs to fail when they exhaust the hardware message queue (usually by sending too many MPI messages). If this happens, the job will fail with a message like LE resources not recovered during flow control. FI_CXI_RX_MATCH_MODE=[hybrid|software] is required.
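For illustration, the match mode can be set in the job environment before the application is launched. The following is a minimal sketch of a Slurm batch script; the node count, constraint, time limit, and executable name (./my_mpi_app) are placeholders rather than recommendations from this page.

```bash
#!/bin/bash
#SBATCH --nodes=4                 # placeholder job size
#SBATCH --constraint=cpu          # placeholder; choose the constraint your job needs
#SBATCH --time=00:30:00

# hybrid is already the default: hardware matching with a fallback to
# (slower) software matching once the NIC hardware queues fill up.
export FI_CXI_RX_MATCH_MODE=hybrid

# To use hardware matching only (may fail if the hardware queue is exhausted):
# export FI_CXI_RX_MATCH_MODE=hardware

srun ./my_mpi_app                 # placeholder executable
```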
Network Performance Report
If desired, SHS can collect network data, such as the number of network timeouts and counter data for the Network Interface Cards, during an MPI job. It is often difficult to put these reports in context for a single job, but if you wish to enable this report, set the environment variable MPICH_OFI_CXI_COUNTER_REPORT to a value between 1 and 5. See man intro_mpi for a list of the different reporting levels available. Please note that if there are no network timeouts, no extra information will be printed.
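For example, a counter report could be requested for a single run as follows; level 2 is an arbitrary choice among the levels listed in man intro_mpi, and the rank count and executable are placeholders.

```bash
# Ask Cray MPICH for a network counter report at job end (levels 1-5; see man intro_mpi).
export MPICH_OFI_CXI_COUNTER_REPORT=2

srun -n 256 ./my_mpi_app          # placeholder rank count and executable

# If the job experienced no network timeouts, no extra output is printed.
```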
Cori: Message Routing on the Cray Dragonfly Network
Cori employs the "Dragonfly" topology for the interconnection network. This topology is a group of interconnected local routers connected to other similar router groups by high-speed global links. The groups are arranged such that data transfer from one group to another requires only one route through a global link. For more information about the network please see the Cori Interconnect page.
Message Routing
Messages that travel between two nodes are routed over the Cray Aries dragonfly network. Generally, on a low-traffic network, a message will take the shortest path between the two nodes. On a congested network, however, a message may take a longer route that detours around the traffic jam and therefore completes in less time than the direct route.
Network congestion can significantly impact application performance on HPC systems. The Cray Aries dragonfly network implements adaptive routing to provide alternative routes in the presence of congestion. In the default case data traverses the minimal route through the network. However, as congestion is detected in the network, the traffic adjusts to take an alternative path as illustrated in the figure below:
In this figure, congestion (indicated by the lightning bolt) is detected on the red minimal path link between Node A and Node B, causing the data to take the alternate route (indicated in dark bold blue).
Routing Configuration
The switch from a minimal to a non-minimal path in the network can be configured via two Cray environment variables: (1) MPICH_GNI_ROUTING_MODE and (2) MPICH_GNI_A2A_ROUTING_MODE. MPICH_GNI_ROUTING_MODE controls the routing policy for all communication within Cray MPI except all-to-all collectives, which are controlled by MPICH_GNI_A2A_ROUTING_MODE. These environment variables can be set to the following values (a usage sketch follows the list):
- ADAPTIVE_0: Least bias towards minimal; most likely to take an alternate route in the event of congestion
- ADAPTIVE_1: Slight bias towards minimal
- ADAPTIVE_2: Moderate bias towards minimal
- ADAPTIVE_3: High bias towards minimal; least likely to take an alternate route in the event of congestion
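As a sketch, either variable can be exported in a batch script before the application is launched; the values shown here are illustrative only, and the rank count and executable name are placeholders.

```bash
# Routing bias for point-to-point traffic and most collectives:
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_3      # high bias towards the minimal route

# Routing bias for all-to-all collectives (controlled separately):
export MPICH_GNI_A2A_ROUTING_MODE=ADAPTIVE_0  # strong preference for non-minimal routes

srun -n 1024 ./my_mpi_app                     # placeholder rank count and executable
```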
Pros and Cons of Minimal and Non-Minimal Routing
Users wanting to try different routing modes should consider the pros and cons of the adaptive settings.
Minimal Bias (ADAPTIVE_3 and ADAPTIVE_2)
- Pros:
    - Lower best-case latency
    - Fewer false positives (end-point congestion can’t be avoided by routing)
- Con:
    - Bisection-bandwidth-bound applications will not perform as well
Non-minimal Bias (ADAPTIVE_1 and ADAPTIVE_0)
- Pro:
    - Alternate route to bypass intermediate congestion
- Cons:
    - Switching routes may force a flush of data on the route, incurring delay
    - Double best-case latency
    - If an application is creating congestion for itself, it may just propagate the congestion across more routers by taking the longer route, which in turn slows down other applications
Recommendations
Tip
These recommendations are for the most common job sizes at NERSC (512 nodes and under). While larger jobs may benefit from them as well, we do not have sufficient data to make a recommendation for full system jobs.
There is no single setting that is universally best for all applications. However, we have characterized the workloads that typically run on NERSC systems and believe that ADAPTIVE_3 (high minimal bias) provides the best experience for the majority of our workloads; it is the default setting on Cori. This is because many of our applications are limited by latency-bound, small-message (e.g. 8-byte) MPI_Allreduce calls or similar operations that depend on the slowest process. By selecting a strong preference for the minimal path you favor lower best-case latencies and also reduce the likelihood of triggering a non-minimal route due to incast congestion. An example of an application that benefits from high minimal bias is MILC.
If your application is both bandwidth intensive (generally meaning message sizes > 16 KiB) and communicates across the bisection of the network (think all-to-all operations, 3D FFTs, transposes), it may benefit from a stronger bias towards non-minimal routing. Many applications will not need to specify an alternative value for MPICH_GNI_ROUTING_MODE, since these bandwidth-intensive operations typically occur in MPI_Alltoall, which is given a non-minimal bias by the separate variable MPICH_GNI_A2A_ROUTING_MODE. However, in some scenarios these operations are implemented through point-to-point send/recv operations; in these instances better performance is generally achieved by setting MPICH_GNI_ROUTING_MODE=ADAPTIVE_0, as in the sketch below. An example of an application in this category is HACC.
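As a sketch of that point-to-point case, a bisection-bandwidth-bound code (for example, one that implements its transpose with send/recv) might be launched with the non-minimal bias as follows; the rank count and executable name are placeholders.

```bash
# Favor non-minimal routes for point-to-point traffic that crosses the
# network bisection (e.g. a send/recv-based all-to-all or 3D FFT transpose).
export MPICH_GNI_ROUTING_MODE=ADAPTIVE_0

srun -n 4096 ./my_transpose_app               # placeholder rank count and executable
```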
More information about these and additional Cray MPI environment variables is available via man intro_mpi on Cori.