Selected Vendor Bug Reports
Updated on November 21, 2024.
Active Bugs
MPI-IO error when using `MPI_Type_indexed`
- Vendor: HPE
- Description: An MPI program built with Cray MPI that calls `MPI_Type_indexed` to concatenate multiple subarray MPI datatypes produces an incorrect result. The concatenated datatype is used to set the MPI file view. A test code fails with the following error:

      $ srun -N1 --ntasks-per-node=4 ./indexed_fsize -f dummy
      Error: expecting file size 800, but got 1200
      srun: error: nid004481: tasks 0-3: Exited with exit code 1
      srun: Terminating StepId=3XXXXXXX.0

  The same problem happens when calling `MPI_Type_create_hindexed`. It is seen with `PrgEnv-gnu`, `PrgEnv-cray` and `PrgEnv-nvidia`.

  The cause is a bug in MPICH, on which Cray's MPI-IO implementation is based. The error can be reproduced with MPICH versions 4.0.3 and earlier. The user who reported the problem subsequently opened an issue with MPICH.
- Status: In progress
`crayftn` error on unlimited polymorphic assumed rank argument
- Vendor: HPE
- Description: Using the `SELECT TYPE` and `SELECT RANK` constructs together to work with an unlimited polymorphic assumed-rank argument generates a compile error. A reproducer is provided.

      $ module load PrgEnv-cray
      $ ftn combined.f90

      module a
             ^
      ftn-855 ftn: ERROR A, File = combined.f90, Line = 1, Column = 8
        The compiler has detected errors in module "A". No module information file will be created for this module.

            select type (x)
                         ^
      ftn-1871 ftn: ERROR FOO, File = combined.f90, Line = 7, Column = 20
        The selector in a SELECT TYPE statement must be polymorphic.

              print *, x
                       ^
      ftn-620 ftn: ERROR FOO, File = combined.f90, Line = 9, Column = 18
        This reference to assumed-rank variable "X" is not valid.

        use a, only: foo
            ^
      ftn-894 ftn: ERROR MAIN, File = combined.f90, Line = 15, Column = 7
        Module "A" has compile errors, therefore declarations obtained from the module via the USE statement may be incomplete.

      Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)
      Cray Fortran : Compile time: 0.0027 seconds
      Cray Fortran : 17 source lines
      Cray Fortran : 4 errors, 0 warnings, 0 other messages, 0 ansi
      Cray Fortran : "explain ftn-message number" gives more information about each message.

- Status: In progress (Reopened)
E3SM codes hang at large scales during startup with Slingshot 11.0.1 on CPU nodes
- Vendor: HPE
- Description: E3SM codes hang consistently at 225 compute nodes, and sometimes at lower concurrencies.
- Status: In progress
- Workaround: Set the `FI_MR_CACHE_MONITOR` environment variable as follows:

      export FI_MR_CACHE_MONITOR=kdreg2
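To limit the workaround to a single launch instead of the whole shell session, the variable can be set only for the launched command. The following is a minimal sketch; the helper name `run_with_kdreg2` is hypothetical and not part of any Cray or Slurm tooling.

```shell
# Hypothetical helper (the name is an assumption): run one command with the
# workaround applied, leaving the calling shell's environment unchanged.
run_with_kdreg2() {
    env FI_MR_CACHE_MONITOR=kdreg2 "$@"
}
```

It would be used by prefixing the usual launch line, e.g. `run_with_kdreg2 srun ...` with the job's own `srun` arguments.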
CMake config files for `cray-fftw` module
- Vendor: HPE
- Description: The CMake files in the `cray-fftw` module (e.g., `/opt/cray/pe/fftw/3.3.10.6/x86_milan/lib/cmake/fftw3`) have an incorrect path:

      set (FFTW3f_INCLUDE_DIRS /tmp/tmp.VFDTh6VYy8/rpm/BUILDROOT/opt/cray/pe/fftw/3.3.10.3/x86_genoa/include)

  This causes a problem with `find_package(fftw3 REQUIRED)`.
- Status: In progress
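One way to check whether an installed set of CMake config files is affected is to search them for leftover build-root paths. The sketch below assumes GNU `grep` (for `--include`) and treats any occurrence of `/BUILDROOT/` as suspicious. The function name `find_buildroot_paths` is hypothetical.

```shell
# Hypothetical helper: list CMake files under a directory that still
# reference an RPM build root (a path containing "/BUILDROOT/").
find_buildroot_paths() {
    cmake_dir="$1"
    # -r: recurse; -l: print matching file names only; -F: fixed-string match
    grep -rlF --include='*.cmake' '/BUILDROOT/' "$cmake_dir"
}
```

Calling `find_buildroot_paths /opt/cray/pe/fftw/3.3.10.6/x86_milan/lib/cmake` would print the affected files, if any; the function returns a non-zero status when nothing matches.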
`crayftn` passing procedure pointer derived type components to associated function call
- Vendor: HPE
- Description: An ICE (internal compiler error) is triggered when the two arguments of an `associated` function call are procedure pointer components of a derived type. A reproducer is available.

      $ crayftn --version
      Cray Fortran : Version 17.0.0
      $ ftn -c example.f90
      Struct_Opr idx = 17 Cray parcel pointer rank = 0; line = 49, col = 34 Left opnd is IR_Tbl_Idx; line = 49, col = 31 Dv_Deref_Opr idx = 81 type(TYPE_T) typ_idx (587) dim = 0 rank = 0; line = 49, col = 31 Left opnd is AT_Tbl_Idx; line = 49, col = 31 LHS idx = 701 derived-type * 587 Right operand is NO_Tbl_Idx; Right operand is AT_Tbl_Idx; line = 49, col = 35 TEST_FUNCTION_ idx = 591 Cray parcel pointer * Cray_Parcel_Ptr_8
      ftn-1716 ftn: INTERNAL EQUALS, File = example.f90, Line = 49, Column = 34
        A multiparented node was encountered.
      ftn-2116 ftn: INTERNAL
        "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/ftnfe" was terminated due to receipt of signal 06: Aborted.

- Status: Fixed in CCE 19.0.0
- Availability: CCE 19.0.0 not available yet
`crayftn` argument that has `pointer` attribute causes a compile-time error about needing `pointer` attribute
- Vendor: HPE
- Description: When a user-defined constructor for a derived type takes a procedure argument with the `pointer` attribute, and the constructor's interface is in a `module` while its definition is in a `submodule`, the compiler reports that the procedure argument needs the `pointer` attribute even though it already has it. The error message from a reproducer code is:

      $ crayftn --version
      Cray Fortran : Version 17.0.0
      $ ftn -c example.f90

      submodule(pointer_attribute_bug_m) pointer_attribute_bug_s
                ^
      ftn-1800 ftn: ERROR CONSTRUCT, File = example.f90, Line = 32, Column = 11
        Procedure "TEST_FUNCTION" has the INTENT attribute, so it must be a procedure pointer. Add the POINTER attribute.

- Status: In progress
Cray Fortran 17.0.0 dummy argument type not recognized for module procedure in same module
- Vendor: HPE
- Description: A valid Fortran code produces the error shown below when compiled with the Cray Fortran compiler.

      $ ftn -c unimported-dummy-arg-type.f90

      module foo_m
             ^
      ftn-855 ftn: ERROR FOO_M, File = unimported-dummy-arg-type.f90, Line = 1, Column = 8
        The compiler has detected errors in module "FOO_M". No module information file will be created for this module.

          module function construct(bar) result(foo)
                                    ^
      ftn-1279 ftn: ERROR CONSTRUCT, File = unimported-dummy-arg-type.f90, Line = 11, Column = 31
        Procedure "CONSTRUCT" is defined at line 19 (unimported-dummy-arg-type.f90). The type of this argument does not agree with dummy argument "BAR".
                                                ^
      ftn-287 ftn: WARNING CONSTRUCT, File = unimported-dummy-arg-type.f90, Line = 11, Column = 43
        The result of function name "FOO" in the function subprogram is not defined.

      Cray Fortran : Version 17.0.0 (20231107223020_b59b7a8e9169719529cf5ab440f3c301e515d047)
      Cray Fortran : Compile time: 0.0039 seconds
      Cray Fortran : 21 source lines
      Cray Fortran : 2 errors, 1 warnings, 0 other messages, 0 ansi
      Cray Fortran : "explain ftn-message number" gives more information about each message

- Status: Fixed in CCE 18.0.1
- Availability: CCE 18.0.1 not available yet
Apps instrumented with `perftools-lite-gpu` get an MPI error
- Vendor: HPE
- Description: When MPI apps offloading with OpenMP are instrumented with `perftools-lite-gpu`, they get an MPI error:

      $ srun -n 4 -c 32 --cpu-bind=cores --gpus-per-task=1 --gpu-bind=none ./a.out
      CrayPat/X: Version 23.12.0 Revision 67ffc52e7 sles15.4_x86_64 11/13/23 21:04:20
      (GTL DEBUG: 1) cuPointerGetAttribute: (null), (null), line no 327
      MPICH ERROR [Rank 1] [job id 28173016.1] [Mon Jul 15 18:10:36 2024] [nid001161] - Abort(606152194) (rank 1 in comm 0): Fatal error in PMPI_Barrier: Invalid count, error stack:
      PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
      PMPI_Barrier(265)....................:
      MPIR_CRAY_Barrier(124)...............:
      MPIDI_Cray_shared_mem_coll_bcast(518):
      MPIR_Localcopy(95)...................:
      (unknown)(): Invalid count

      aborting job:
      Fatal error in PMPI_Barrier: Invalid count, error stack:
      PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
      PMPI_Barrier(265)....................:
      MPIR_CRAY_Barrier(124)...............:
      MPIDI_Cray_shared_mem_coll_bcast(518):
      MPIR_Localcopy(95)...................:
      (unknown)(): Invalid count
      (GTL DEBUG: 2) cuPointerGetAttribute: (null), (null), line no 327
      MPICH ERROR [Rank 2] [job id 28173016.1] [Mon Jul 15 18:10:36 2024] [nid001161] - Abort(606152194) (rank 2 in comm 0): Fatal error in PMPI_Barrier: Invalid count, error stack:
      PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
      PMPI_Barrier(265)....................:
      MPIR_CRAY_Barrier(124)...............:
      MPIDI_Cray_shared_mem_coll_bcast(463):
      MPIR_Localcopy(95)...................:
      (unknown)(): Invalid count
      ...

  Uninstrumented executables run fine.
- Status: In progress
`cray-libsci` segfaults when using multiple OpenMP threads
- Vendor: HPE
- Description: An OpenMP code using `cray-libsci` segfaults when run with multiple threads on a single node:

      $ srun -n 1 -c 16 --cpu_bind=cores -G 1 --gpu-bind=none ./code/STRUMPACK/build/examples/sparse/testPoisson3d 50 --sp_disable_gpu
      ...
      srun: error: nid001036: task 0: Segmentation fault
      srun: Terminating StepId=27511778.0

- Status: In progress
Valgrind4hpc needs to drop support for exp-sgcheck
- Vendor: HPE
- Description: Valgrind doesn't support exp-sgcheck any more but Valgrind4hpc lists it as a supported tool.
- Status: Fixed in CPE 24.11
- Availability: CPE 24.11 not available yet
`MPI_Allgatherv` fails for device buffers within a node
- Vendor: HPE
- Description: A user code fails at an `MPI_Allgatherv` call that uses the device buffer returned by the `omp_get_mapped_ptr` function. The failure occurs in a single-node job in the `PrgEnv-cray` and `PrgEnv-nvidia` environments:

      MPICH ERROR [Rank 1] [job id 26614636.0] [Sun Jun 9 07:53:26 2024] [nid001236] - Abort(86580482) (rank 1 in comm 0): Fatal error in PMPI_Allgatherv: Invalid count, error stack:
      PMPI_Allgatherv(491)......................: MPI_Allgatherv(sbuf=MPI_IN_PLACE, scount=0, MPI_DATATYPE_NULL, rbuf=0x7ff2db800000, rcounts=0xe34b3e0, displs=0xe339770, datatype=MPI_DOUBLE_COMPLEX, comm=MPI_COMM_WORLD) failed
      MPIR_CRAY_Allgatherv(466).................:
      MPIR_Allgatherv_impl(277).................:
      MPIR_Allgatherv_intra_auto(191)...........: Failure during collective
      MPIR_Allgatherv_intra_auto(186)...........:
      MPIR_Allgatherv_intra_ring(166)...........:
      MPIC_Sendrecv(338)........................:
      MPIC_Wait(71).............................:
      MPIR_Wait_impl(41)........................:
      MPID_Progress_wait(201)...................:
      MPIDI_Progress_test(105)..................:
      MPIDI_SHMI_progress(118)..................:
      MPIDI_POSIX_progress(412).................:
      MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
      MPIDI_CRAY_Common_lmt_handle_recv(44).....:
      MPIDI_CRAY_Common_lmt_import_mem(218).....:
      (unknown)(): Invalid count
      ...

- Status: In progress
Apps instrumented with `perftools-lite` or `perftools` hang or fail
- Vendor: HPE
- Description: When apps are instrumented with `perftools-lite` or `perftools`, they hang, segfault or fail for an unknown reason:

      # Hang
      $ ls -lrt | tail -2; sacct -o jobid,jobname,start,end,elapsed,state -j 26530829; date
      -rw------- 1 elvis elvis 5819 Jun 6 13:04 rsl.out.0000
      -rw------- 1 elvis elvis 5831 Jun 6 13:04 rsl.error.0000
      JobID JobName Start End Elapsed State
      ------------ ---------- ------------------- ------------------- ---------- ----------
      ...
      26530829.0 wrf.exe 2024-06-06T13:04:11 Unknown 00:29:02 RUNNING
      Thu 06 Jun 2024 01:33:13 PM PDT

      # Segfault
      srun: error: nid004203: task 9: Segmentation fault
      srun: Terminating StepId=26197475.0

      # Fail for an unknown reason
      srun: error: nid006953: tasks 0-127: Exited with exit code 255
      srun: Terminating StepId=26230127.0

- Status: In progress
- Workaround: Set the `PAT_RT_CALLSTACK_MODE` environment variable before your `srun` command:

      export PAT_RT_CALLSTACK_MODE=frames
Codes fail with '`cxil_map: write error`'
- Vendor: HPE
- Description: When built with the `-O2` or `-O3` flag in the `PrgEnv-gnu` environment, a CPU code called CROCO (Coastal and Regional Ocean Community model) fails with the error:

      cxil_map: write error
      cxil_map: write error
      cxil_map: write error
      ...
      cxil_map: write error
      MPICH ERROR [Rank 132] [job id 26513451.0] [Thu Jun 6 04:27:18 2024] [nid005207] - Abort(538553615) (rank 132 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
      PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffda83b8580, count=630, MPI_DOUBLE_PRECISION, src=115, tag=6, MPI_COMM_WORLD, request=0x7ffda83b84d0) failed
      MPID_Irecv(529)........:
      MPIDI_irecv_unsafe(163):
      MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)

      aborting job:
      Fatal error in PMPI_Irecv: Other MPI error, error stack:
      PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffda83b8580, count=630, MPI_DOUBLE_PRECISION, src=115, tag=6, MPI_COMM_WORLD, request=0x7ffda83b84d0) failed
      MPID_Irecv(529)........:
      MPIDI_irecv_unsafe(163):
      MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
      MPICH ERROR [Rank 137] [job id 26513451.0] [Thu Jun 6 04:27:18 2024] [nid005207] - Abort(941206799) (rank 137 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
      PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffdc3523ea0, count=630, MPI_DOUBLE_PRECISION, src=122, tag=8, MPI_COMM_WORLD, request=0x7ffdc35215f8) failed
      MPID_Irecv(529)........:
      MPIDI_irecv_unsafe(163):
      MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
      ...

  With a lower optimization level, the code runs fine. A similar error is observed with an old version of WRF.
- Status: In progress
`cray-netcdf` and `cray-parallel-netcdf` module issues with wrong lib directories
- Vendor: HPE
- Description: Many model build systems rely on `nc-config` and `pnetcdf-config` to get the correct compiler flags and link libraries for compiling with these tools. However, the modules `cray-parallel-netcdf/1.12.2.1`, `cray-netcdf/cray-netcdf/4.8.1.1`, and `cray-netcdf-hdf5parallel/4.8.1.1` all give incorrect results. For example,

      $ pnetcdf-config --libdir
      /opt/cray/pe/parallel-netcdf/1.12.2.1/gnu/8.2/lib

  when the correct path is `/opt/cray/pe/parallel-netcdf/1.12.2.1/INTEL/19.1/lib`. With `cray-netcdf-hdf5parallel/4.8.1.1`, `nc-config` gives several wrong results:

      --cflags -> -DpgiFortran
      --cxx4flags -> -DpgiFortran
      --libdir -> /opt/cray/pe/netcdf-hdf5parallel/4.8.1.1/gnu/8.2/lib

  The same problem is observed in later versions.
- Status: In progress
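A build script can guard against this by checking the reported library directory before using it. The sketch below assumes that the expected compiler family appears as a path component of the library directory, as in the paths above, and compares case-insensitively. The function name `libdir_matches_compiler` is hypothetical.

```shell
# Hypothetical helper: succeed only if the --libdir reported by a *-config
# tool contains the expected compiler directory (case-insensitive match).
libdir_matches_compiler() {
    config_tool="$1"
    expected=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    # Return 2 if the tool itself cannot be run or fails.
    libdir=$("$config_tool" --libdir) || return 2
    case "$(printf '%s' "$libdir" | tr '[:upper:]' '[:lower:]')" in
        *"/$expected/"*) return 0 ;;
        *) echo "mismatch: $config_tool reports $libdir" >&2
           return 1 ;;
    esac
}
```

For instance, `libdir_matches_compiler pnetcdf-config intel` would return a non-zero status for the `gnu/8.2` path shown above, so the build could stop early instead of linking against the wrong libraries.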
`crayftn` ICE when trying to build dftd4
- Vendor: HPE
- Description: An ICE occurs when building `dftd4` with the Cray compiler:

      ftn-1795 ftn: INTERNAL MULTICHARGE_MODEL, File = /global/homes/e/elvis/Repositories/multicharge/src/multicharge/model.F90, Line = 30, Column = 4
        FORTRAN FE ASSERT: "new_attr_idx" failed. ( /home/jenkins/crayftn/fe90/sources/module.c at line 1224).
      ftn-2116 ftn: INTERNAL
        "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/ftnfe" was terminated due to receipt of signal 06: Aborted.

- Status: Fixed in CCE 18.0.0
- Availability: CCE 18.0.0 not available yet
`crayftn` ICE on `WHERE` statement with defined assignment
- Vendor: HPE
- Description: The Cray Fortran compiler hits an ICE with a test code when it encounters a `WHERE` statement that uses defined assignment.

      ...
      Creating internal compiler error backtrace (please wait):
      [0x000000012d4e69] linux_backtrace /home/jenkins/crayftn/pdgcs/v_util.c:186
      [0x000000012d53a1] pdgcs_internal_error(char const*, char const*, int) /home/jenkins/crayftn/pdgcs/v_util.c:663
      [0x00000001ddcbc5] verify_binary_args(EXP_OP, EXP_INFO, EXP_INFO, TYPE, char const*, EXP_T_TYPE, bool, bool, bool) [clone .constprop.0] /home/jenkins/crayftn/pdgcs/v_expr_tbl.c:413
      ...
      [0x000000008a9fc9] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
      ftn-7991 ftn: INTERNAL EXAMPLE, File = example.f90, Line = 34
        INTERNAL COMPILER ERROR: "Array syntax flags do not match" (/home/jenkins/crayftn/pdgcs/v_expr_tbl.c, line 413, version b59b7a8e9169719529cf5ab440f3c301e515d047)
      ftn-2116 ftn: INTERNAL
        "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/optcg" was terminated due to receipt of signal 06: Aborted.
      ...

- Status: In progress
Code fails with '`MPIDI_OFI_send_normal:Resource temporarily unavailable)`'
- Vendor: HPE
- Description: A user app with GPU-aware MPI gets the following error with `cray-mpich/8.1.25` in multi-node jobs when GDRCopy is used:

      ...
      MPICH ERROR [Rank 10] [job id 24282816.0] [Thu Apr 11 15:49:07 2024] [nid008436] - Abort(739891471) (rank 10 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
      PMPI_Send(163)............: MPI_Send(buf=0x7fc1755b4f90, count=6300, MPI_DOUBLE, dest=12, tag=1, MPI_COMM_WORLD) failed
      MPID_Send(499)............:
      MPIDI_send_unsafe(58).....:
      MPIDI_OFI_send_normal(368): OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Resource temporarily unavailable)
      ...

  The user was using `PrgEnv-gnu`.
- Status: In progress
`crayftn` ICE on `MERGE` function call when the `MASK` argument is a derived type component
- Vendor: HPE
- Description: An ICE occurs when the intrinsic `MERGE` is called with a `MASK` argument that is a component of the derived type argument to a type-bound procedure, and the result of the `MERGE` call is then passed to the intrinsic `TRIM`. The error message from a Fortran test code is as follows:

      Creating internal compiler error backtrace (please wait):
      [0x000000012d4e69] linux_backtrace /home/jenkins/crayftn/pdgcs/v_util.c:186
      [0x000000012d53a1] pdgcs_internal_error(char const*, char const*, int) /home/jenkins/crayftn/pdgcs/v_util.c:663
      [0x000000015e3c52] llvm_cg::get_string_address(EXP_INFO) /home/jenkins/crayftn/pdgcs/llvm-substr.c:493
      ...
      [0x007f35e3e3e24c] ?? ??:0
      [0x000000008a9fc9] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
      ftn-7991 ftn: INTERNAL MYPROC, File = example.f90, Line = 15
        INTERNAL COMPILER ERROR: "get_string_address - the string is not a lsbstr" (/home/jenkins/crayftn/pdgcs/llvm-substr.c, line 493, version b59b7a8e9169719529cf5ab440f3c301e515d047)
      ftn-2116 ftn: INTERNAL
        "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/optcg" was terminated due to receipt of signal 06: Aborted.

- Status: Fixed in CCE 18.0.1
- Availability: CCE 18.0.1 not available yet
`sanitizers4hpc`'s output aggregation with ThreadSanitizer
- Vendor: HPE
- Description: Aggregation of ThreadSanitizer output by `sanitizers4hpc` needs improvement.
- Status: Fixed in CPE 24.11
- Availability: CPE 24.11 not available yet
`sanitizers4hpc` with Compute Sanitizer's memcheck produces output that is not aggregated
- Vendor: HPE
- Description: Compute Sanitizer output aggregation needs improvement.
- Status: Fixed in PE 24.07; closed
- Availability: PE 24.07 not available yet
`sanitizers4hpc` produces stack traces for '`Program hit CUDA_ERROR_INVALID_VALUE error`'
- Vendor: HPE
- Description: The following error message appears when using `sanitizers4hpc` with Compute Sanitizer's Memcheck, although the desired output is produced:

      Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
      Saved host backtrace up to driver entry point at error
      #0 0x2eae6f in /usr/local/cuda-12.2/compat/libcuda.so.1
      #1 0xda19 in /home/jenkins/src/gtlt/cuda/gtlt_cuda_query.c:344:gtlt_cuda_pointer_type /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0
      #2 0x4bd9 in /home/jenkins/src/comx/gtlx_query.c:25:mpix_gtl_pointer_type /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0
      #3 0x1fa2475 in MPIR_Cray_Memcpy_wrapper /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #4 0x1841ee9 in MPIDIG_handle_unexp_mrecv /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #5 0x18ac030 in MPIC_Sendrecv /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #6 0x17d6bcf in MPIR_Barrier_intra_dissemination /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #7 0x232900 in MPIR_Barrier_intra_auto /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #8 0x232ab5 in MPIR_Barrier_impl /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #9 0x1a1f15d in MPIR_CRAY_Barrier /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #10 0x1a15772 in MPIDI_Cray_shared_mem_coll_opt_cleanup /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #11 0x18cc799 in MPIDI_Cray_coll_finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #12 0x1b7a65f in MPID_Finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #13 0x605ab4 in MPI_Finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
      #14 0xf94 in /pscratch/sd/e/elvis/Memcheck/main.cc:14:main /pscratch/sd/e/elvis/Memcheck/./a.out
      #15 0x3524d in __libc_start_main /lib64/libc.so.6
      #16 0xe9a in ../sysdeps/x86_64/start.S:122:_start /pscratch/sd/e/elvis/Memcheck/./a.out

  This error doesn't occur when the code is run without `sanitizers4hpc`.
- Status: In progress
No source line number displayed when run with MemorySanitizer in `PrgEnv-cray`
- Vendor: HPE
- Description: MemorySanitizer does not report where in the source code an error occurs or where the memory was allocated.

      ==1068200==WARNING: MemorySanitizer: use-of-uninitialized-value
          #0 0x3007bc (/pscratch/sd/e/elvis/a.out+0x3007bc)
          #1 0x7fc5c262e24c (/lib64/libc.so.6+0x3524c) (BuildId: ddc393ac74ed8f90d4fdfff796432fbafd281e1b)
          #2 0x26a849 (/pscratch/sd/e/elvis/a.out+0x26a849)
      ...

  As it turned out, the same problem is seen with `PrgEnv-intel`.
- Status: Fixed in CCE 18.0.0 for `PrgEnv-cray`; the `PrgEnv-intel` problem won't be fixed
- Availability: CCE 18.0.0 not available yet
- Workaround: Set the environment variable:

      export MSAN_OPTIONS="allow_addr2line=true"
`disable_sanitizer_instrumentation` attribute doesn't work with `PrgEnv-aocc`
- Vendor: HPE
- Description: The `__attribute__((disable_sanitizer_instrumentation))` attribute doesn't disable sanitizer instrumentation in the AOCC compilers although the compilers are Clang-based.
- Status: In progress
Segfaulting with calls to `MPI_Win_allocate_shared` function on multiple CPU nodes
- Vendor: HPE
- Description: The FHI-aims code uses the MPI-3 Shared Memory model. When CPU nodes (as opposed to GPU nodes) are used and the number of MPI tasks per node exceeds a certain threshold (46 in the case tested), a multi-node run segfaults at a call to `MPI_Win_allocate_shared`.

      ...
      Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
      #0 0x14e493c23372 in ???
      #1 0x14e493c22505 in ???
      #2 0x14e493253dbf in ???
      ...
      #11 0x14e49424a3d4 in ???
      #12 0x2121b80 in __mpi_shm_MOD_allocate_shm_arr_1dr_diml at /pscratch/sd/e/elvis/FHIaims/src/mpi_shm.f90:152
      ...

  A different task-per-node count triggers a segfault at a different `MPI_Win_allocate_shared` call in the code.
- Status: In progress
- Status: Fixed in `cray-mpich/8.1.29`
- Availability: `cray-mpich` 8.1.29 or later not available yet
Error occurs when MPI window object is not freed
- Vendor: HPE
- Description: Messages about a fatal MPI finalize error are generated when an MPI window object is not freed before `MPI_Finalize`.

      MPICH ERROR [Rank 0] [job id 23533657.21] [Tue Mar 26 16:56:36 2024] [nid006635] - Abort(806971663) (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
      PMPI_Finalize(214)...............: MPI_Finalize failed
      PMPI_Finalize(161)...............:
      MPID_Finalize(710)...............:
      MPIDI_OFI_mpi_finalize_hook(1046): OFI endpoint close failed (ofi_init.c:1046:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
      ...

- Status: A fix is in `cray-mpich/8.1.29.34`
- Availability: `cray-mpich` 8.1.29.34 or later not available yet
CCE 17.0.0 Fortran compiler fails four Smart-Pointers tests
- Vendor: HPE
- Description: The Cray Fortran compiler fails four tests in the Smart-Pointers test suite.
- Status: In progress
`crayftn` runtime error with user defined operator on associate name
- Vendor: HPE
- Description: A segmentation fault occurs in a code when calling a user-defined operator on a name associated with a function/expression result.

      lib-4968 : WARNING An unallocated allocatable array 'STRING_' is referenced at at line 20 in file 'example.f90'.
      Segmentation fault

- Status: In progress
Valid coarray code rejected by `crayftn`
- Vendor: HPE
-
Description: A coarray code is incorrectly rejected with errors by the Cray Fortran compiler.
$ ftn coarrays.f90 -o coarrays.exe call assign_and_synchronize(lhs=u_half, rhs=u + (dt/2)*(nu*d_dx2(u,dx) - d_dx(half_uu,dx))) ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 74 Coarray t$14 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 74 Coarray t$14 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 87 Coarray t$19 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 87 Coarray t$19 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. call assign_and_synchronize(lhs=u, rhs=u + dt*(nu*d_dx2(u_half,dx) - d_dx(half_uu,dx))) ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 65 Coarray t$35 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 65 Coarray t$35 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 83 Coarray t$40 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. ^ ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 83 Coarray t$40 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions. 
Cray Fortran : Version 17.0.0 (20231107223020_b59b7a8e9169719529cf5ab440f3c301e515d047) Cray Fortran : Compile time: 0.1774 seconds Cray Fortran : 135 source lines Cray Fortran : 8 errors, 0 warnings, 0 other messages, 0 ansi Cray Fortran : "explain ftn-message number" gives more information about each message.
-
Status: In progress
Incorrect results and poor performance with `do concurrent` reduction
- Vendor: HPE
- Description: A code that does a `do concurrent` reduce operation gives incorrect results when built with the Cray Fortran compiler. When compiled with the `-h thread_do_concurrent` flag, the code shows poor performance.
- Status: In progress
TCP BTL fails to collect all interface addresses (when interfaces are on different subnets)
- Vendor: Open MPI
- Description: Multi-node Open MPI point-to-point communications using the `tcp` BTL component fail because, although one NIC on a Perlmutter node has two IP interfaces for different subnets (one private and one public), only one IP is used per peer kernel interface.

      Open MPI detected an inbound MPI TCP connection request from a peer that appears to be part of this MPI job (i.e., it identified itself as part of this Open MPI job), but it is from an IP address that is unexpected. This is highly unusual.
      The inbound connection has been dropped, and the peer should simply try again with a different IP interface (i.e., the job should hopefully be able to continue).
      Local host: nid002292
      Local PID: 1273838
      Peer hostname: nid002293 ([[9279,0],1])
      Source IP of socket: 10.249.13.210
      Known IPs of peer: 10.100.20.22 128.55.69.127 10.249.13.209 10.249.36.5 10.249.34.5

- Status: In progress
CP2K container builds with Open MPI with networking bug
- Vendor: Nvidia
- Description: The CP2K container images on NGC that NERSC recommends were built with old Open MPI versions (4.x), and bugs in those versions contribute to multi-node job failures. New images with Open MPI 5.x built against libfabric have been requested.
- Status: In progress
`cray-mpich` module does not set `LD_LIBRARY_PATH`
- Vendor: HPE
- Description: Loading the `cray-mpich` module doesn't update the `LD_LIBRARY_PATH` environment variable, so this has to be done manually.

      $ export MPICH_VERSION_DISPLAY=1  # Print the MPI version number that is being used
      $ ml -t
      ...
      cray-mpich/8.1.25
      ...
      $ echo $CRAY_MPICH_VERSION
      8.1.25
      $ ml cray-mpich/8.1.27  # Load a different version
      $ echo $CRAY_MPICH_VERSION
      8.1.27
      $ srun -n 1 ./a.out  # Still using the previous version
      MPI VERSION : CRAY MPICH version 8.1.25.17 (ANL base 3.4a2)
      ...
      $ export LD_LIBRARY_PATH=${CRAY_MPICH_DIR}/lib:$LD_LIBRARY_PATH
      $ srun -n 1 ./a.out  # Now using the intended version
      MPI VERSION : CRAY MPICH version 8.1.27.26 (ANL base 3.4a2)
      ...

- Status: In progress
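When the manual update is placed in a script that may be sourced more than once, it can help to avoid adding the same directory repeatedly. A minimal sketch, with a hypothetical helper name `prepend_ld_library_path`:

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH unless it is
# already listed, so repeated calls do not grow the variable.
prepend_ld_library_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present; leave the variable unchanged
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
           export LD_LIBRARY_PATH ;;
    esac
}
```

After loading the desired module, the call would be `prepend_ld_library_path "${CRAY_MPICH_DIR}/lib"`. Note that if an older `cray-mpich` directory is already on the path, the newly prepended one still takes precedence because it comes first.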
`PrgEnv-nvhpc` conflicts with `cudatoolkit` module
- Vendor: HPE
- Description: The `PrgEnv-nvhpc` environment loads the `nvhpc` module, which is listed as a conflict for the `cudatoolkit` module.

      $ ml PrgEnv-nvhpc
      $ ml -t
      ...
      cpe/23.12
      cudatoolkit/12.2
      craype-accel-nvidia80
      gpu/1.0
      nvhpc/23.9
      ...
      PrgEnv-nvhpc/8.5.0
      $ ml rm cudatoolkit
      $ ml cudatoolkit
      Lmod has detected the following error: Cannot load module "cudatoolkit/12.2" because these module(s) are loaded:
      nvhpc
      While processing the following module(s):
      Module fullname Module Filename
      --------------- ---------------
      cudatoolkit/12.2 /opt/cray/pe/lmod/modulefiles/core/cudatoolkit/12.2.lua
      $ ml -t  # cudatoolkit not in the list
      ...
      cpe/23.12
      craype-accel-nvidia80
      gpu/1.0
      nvhpc/23.9
      ...
      PrgEnv-nvhpc/8.5.0

- Status: The `nvhpc` and `PrgEnv-nvhpc` modules will be removed in CPE/24.11, in favor of `nvidia` and `PrgEnv-nvidia`
- Workaround: Use the `nvidia` and `PrgEnv-nvidia` modules instead
- Availability: CPE 24.11 not available yet
Regression in device memory growth issue with GPU-Aware MPI for XGC code
- Vendor: HPE
- Description: The code runs into a problem of device memory growth when GPU-Aware MPI is enabled.
- Status: In progress
- Workaround: Disable the memory registration (MR) cache:

      export FI_MR_CACHE_MAX_COUNT=0
OpenACC reduction with worker gives wrong answers
- Vendor: Nvidia
- Description: A procedure declared with an OpenACC `routine worker` directive returns wrong reduction values in the `PrgEnv-nvidia` environment when called from within a loop where `num_workers` and `vector_length` are set to 32.
- Status: In progress
Performance issue with `fi_write()` to GPU memory on Perlmutter
- Vendor: HPE
- Description: The GASNet-EX networking library implements RMA APIs with the vendor-provided libfabric and its cxi provider. RMA Put operations between two GPU nodes perform unexpectedly much worse than MPI when the destination address is in remote GPU memory. For other source/destination memory and Put/Get mode combinations, the GASNet-EX and MPI benchmarks show similar performance or GASNet-EX performs better.
- Status: In progress
RMA performance problems on Perlmutter with GASNet Codes
- Vendor: HPE
- Description: The GASNet-EX networking library implements RMA (Remote Memory Access) APIs with the `fi_read()` and `fi_write()` functions of the vendor-provided libfabric and its cxi provider. RMA operations perform very well under ideal conditions. When conditions are not ideal, performance decreases significantly for both host and GPU memory.
- Status: In progress
`crayftn` overloaded constructor with polymorphic argument in array constructor
- Vendor: HPE
- Description: The Cray Fortran compiler generates an internal compiler error for a code that passes a child type to an overloaded structure constructor within an array constructor, where the parent type has a deferred procedure.

      Creating internal compiler error backtrace (please wait):
      [0x00000000c75a43] linux_backtrace ??:?
      [0x00000000c76931] pdgcs_internal_error(char const*, char const*, int) ??:?
      [0x0000000125c2d0] _expr_type(EXP_INFO) ??:?
      ...
      [0x007f86c32e129c] ?? ??:0
      [0x00000000729d09] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
      Note: This is a non-debug compiler. Technical support should continue problem isolation using a compiler built for debugging.
      ftn-7991 ftn: INTERNAL EXAMPLE, File = example.f90, Line = 63
        INTERNAL COMPILER ERROR: "_expr_type: Invalid table type" (/home/jenkins/crayftn/pdgcs/v_expr_utl.c, line 7360, version 66f7391d6a03cf932f321b9f6b1d8612ef5f362c)

- Status: Fixed in CCE 18.0.0
- Availability: CCE 18.0.0 not available yet
Internal Compiler Error
- Vendor: HPE
- Description: An internal compiler error occurs when compiling the E3SM code with the AMD compilers.
- Status: Fixed in CCE 19.0.0
- Availability: CCE 19.0.0 not available yet
Code hangs when run on multiple nodes, sometimes showing the 'xpmem_attach error: : Cannot allocate memory
' message¶
- Vendor: HPE
-
Description: A code that runs fine with 128 MPI tasks on a single CPU node hangs when running on multiple nodes, sometimes generating the following message but not always.
xpmem_attach error: : Cannot allocate memory
-
Status: In progress
-
Workaround: Set the
FI_MR_CACHE_MONITOR
environment variable as follows:
export FI_MR_CACHE_MONITOR=kdreg2
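As a minimal sketch, the workaround could be applied near the top of a batch script, before the application is launched. The helper `has_xpmem_error`, the launch line, and the application name `./my_app` are hypothetical and only illustrate where the setting might go; they are not part of the vendor's guidance.

```shell
#!/bin/bash
# Workaround from this report: set the variable before the application starts
# so that the launched MPI tasks inherit it.
export FI_MR_CACHE_MONITOR=kdreg2

# Hypothetical helper: succeeds if a job output file contains the message
# quoted in this report. The hang does not always print the message, so its
# absence does not rule this bug out.
has_xpmem_error() {
    grep -qF 'xpmem_attach error' "$1"
}

# Placeholder launch line; the node count and ./my_app are assumptions.
# srun -N 2 --ntasks-per-node=128 ./my_app
```

The helper can be pointed at the output file of an earlier hung job, for example `has_xpmem_error slurm-<jobid>.out && echo "message found"`, to see whether that run printed the message.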
cray-mpich
with GTL not recognising pointer to device memory, that was returned by OpenCL clSVMAlloc
¶
- Vendor: HPE
- Description: When a pointer returned by the OpenCL
clSVMAlloc
function is used in one-sided MPI communication, the communication does not get the correct data. A workaround of wrapping the MPI RMA exposure epoch in clEnqueueSVMMap
/clEnqueueSVMUnmap
causes a large amount of data to be moved unnecessarily between host and device memory. Advice has been requested on using OpenCL with MPICH_GPU_SUPPORT_ENABLED
.
- Status: In progress
Resolved Bugs¶
Apps instrumented with perftools-lite-gpu
get an MPI error¶
- Vendor: HPE
-
Description: WRF fails when instrumented with
perftools-lite
or perftools
...
srun: error: nid004366: tasks 0-63: Exited with exit code 255
srun: Terminating StepId=25322400.0
...
-
Status: Closed as this is a duplicate of the active case 'Apps instrumented with
perftools-lite
or perftools
hang or fail'
Sudo permission issue for cuquantum-appliance:23.10
container¶
- Vendor: cuQuantum
-
Description: Users cannot access the
/home/cuquantum
directory in a container.
$ cd /home/cuquantum/
bash: cd: /home/cuquantum/: Permission denied
-
Status: Fixed in
neilmehta87/cuquantum-appliance:23.10
- Availability:
neilmehta87/cuquantum-appliance:23.10
available on Perlmutter
NCCL workload hitting node failures from network link flaps¶
- Vendor: HPE
- Description: Certain apps (not limited to those using the NCCL library, as it turned out) may expose a system condition as a link flap, making them susceptible to a job failure.
- Status: Closed
No scope analysis window pops up with Perftools/Reveal¶
- Vendor: HPE
- Description: The 'Scope Loop' button on the Reveal tool doesn't open a window that would normally show the scoping result for a selected loop.
- Status: Closed
crayftn
bug in assignment to unlimited polymorphic variable¶
- Vendor: HPE
-
Description: The Cray Fortran compiler generates an error for allocation on assignment to an unlimited polymorphic variable:
$ cat example1.f90
class(*), allocatable :: anything
anything =.true.
end
$ ftn example1.f90
anything =.true.
         ^
ftn-356 ftn: ERROR $MAIN, File = example1.f90, Line = 2, Column = 10
  Assignment of a LOGICAL expression to a unlimited polymorphic variable is not allowed.
Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)
Cray Fortran : Compile time: 0.0020 seconds
Cray Fortran : 3 source lines
Cray Fortran : 1 errors, 0 warnings, 0 other messages, 0 ansi
Cray Fortran : "explain ftn-message number" gives more information about each message.
-
Status: Closed
Cray HDF5 parallel modules used in CMake fails to configure C++ project but not Fortran¶
- Vendor: HPE
-
Description: Both reproducers (C++ and Fortran) work with the
cray-hdf5
module. Only the Fortran reproducer works with cray-hdf5-parallel
. The C++ code throws the following error for cray-hdf5-parallel
only:
$ cmake -B build .
-- The C compiler identification is GNU 11.2.0
-- The CXX compiler identification is GNU 11.2.0
-- Cray Programming Environment 2.7.19 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.19/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.19 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.19/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find HDF5 (missing: HDF5_INCLUDE_DIRS) (found version "")
Call Stack (most recent call first):
  /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindHDF5.cmake:1009 (find_package_handle_standard_args)
  CMakeLists.txt:4 (find_package)
-
Status: Closed
Issues when linking with PrgEnv-nvidia
and cuFFTMp
¶
- Vendor: HPE
- Description: An undefined reference to
MPI_Comm_f2c
is reported at link time.
- Status: Closed
OFI segfault, and intermittent loss of messages with GASNet¶
- Vendor: HPE
- Description: The segfault problem has since been fixed. Applications occasionally hang, possibly due to the loss of messages sent with
fi_send()
to be received in buffers posted using fi_recvmsg()
. This has been observed with the cxi provider of libfabric. A suggested workaround of setting certain environment variables does not appear to be fully effective, and it wastes a large amount of memory.
- Status: Closed
nvfortran
does not support the VALUE
attribute for arrays which are not assumed-size¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
nvfortran
does not support intrinsic elemental functions BGE
, BGT
, BLE
, BLT
¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
nvfortran
does not support intrinsic elemental functions DSHIFTL
, DSHIFTR
¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
nvfortran
does not support function references in a variable definition context¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
nvfortran
does not support intrinsic assignment to allocatable polymorphic variables¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
nvfortran
does not support %RE
and %IM
complex-part-designators in variables of COMPLEX
type¶
- Vendor: HPE
- Status: Nvidia will not support F2008; closed
User-defined reduction code segfaults with Intel compiler¶
- Vendor: HPE
- Description: A Fortran code using the
mpi_f08
interface of the MPI_User_function
fails to compile due to a problem in the mpi_f08_callbacks
module. (The title refers to the runtime behavior of an example code in the MPI standard manual when the bug was initially reported. A corrected version shown in later releases of the manual fails to compile.)
- Status: Closed
Missing mpi_f08
module¶
- Vendor: HPE
-
Description: A Fortran code that uses the
mpi_f08
module fails to compile with PrgEnv-nvidia
.
$ ftn main.f90
NVFORTRAN-F-0004-Unable to open MODULE file mpi_f08.mod (main.f90: 2)
NVFORTRAN/x86-64 Linux 23.9-0: compilation aborted
-
Status: Nvidia will not support F2008; closed