Selected Vendor Bug Reports

Updated on November 21, 2024.

Active Bugs

MPI-IO error when using MPI_Type_indexed

  • Vendor: HPE
  • Description: An MPI program built with Cray MPI that calls MPI_Type_indexed to concatenate multiple subarray MPI datatypes produces an incorrect result when the concatenated datatype is used to set the MPI file view. A test code fails with the following error:

    $ srun -N1 --ntasks-per-node=4 ./indexed_fsize -f dummy
    Error: expecting file size 800, but got 1200
    srun: error: nid004481: tasks 0-3: Exited with exit code 1
    srun: Terminating StepId=3XXXXXXX.0
    

    The same problem happens when calling MPI_Type_create_hindexed.

    This problem is seen with PrgEnv-gnu, PrgEnv-cray and PrgEnv-nvidia.

    This turns out to be due to a bug in MPICH, on which Cray's MPI-IO implementation is based. The error can be reproduced with MPICH versions 4.0.3 and earlier. The user who reported the problem subsequently opened an issue with MPICH.

  • Status: In progress

crayftn error on unlimited polymorphic assumed rank argument

  • Vendor: HPE
  • Description: Combining the SELECT TYPE and SELECT RANK constructs to handle an unlimited polymorphic assumed-rank argument generates compile errors. A reproducer is provided.

    $ module load PrgEnv-cray
    $ ftn combined.f90
    
    module a
           ^
    ftn-855 ftn: ERROR A, File = combined.f90, Line = 1, Column = 8
      The compiler has detected errors in module "A".  No module information file will be created for this module.
    
          select type (x)
                       ^
    ftn-1871 ftn: ERROR FOO, File = combined.f90, Line = 7, Column = 20
      The selector in a SELECT TYPE statement must be polymorphic.
    
            print *, x
                 ^
    ftn-620 ftn: ERROR FOO, File = combined.f90, Line = 9, Column = 18
      This reference to assumed-rank variable "X" is not valid.
    
      use a, only: foo
          ^
    ftn-894 ftn: ERROR MAIN, File = combined.f90, Line = 15, Column = 7
      Module "A" has compile errors, therefore declarations obtained from the module via the USE statement may be incomplete.
    
    Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)
    Cray Fortran : Compile time:  0.0027 seconds
    Cray Fortran : 17 source lines
    Cray Fortran : 4 errors, 0 warnings, 0 other messages, 0 ansi
    Cray Fortran : "explain ftn-message number" gives more information about each message.
    
  • Status: In progress (Reopened)

E3SM codes hang at large scales during startup with slingshot 11.0.1 on CPU nodes

  • Vendor: HPE
  • Description: E3SM codes hang consistently at 225 compute nodes. They also hang occasionally at lower node counts.
  • Status: In progress
  • Workaround: Set the FI_MR_CACHE_MONITOR environment variable as follows:

    export FI_MR_CACHE_MONITOR=kdreg2
    

CMake config files for cray-fftw module

  • Vendor: HPE
  • Description: The CMake config files in the cray-fftw module (e.g., /opt/cray/pe/fftw/3.3.10.6/x86_milan/lib/cmake/fftw3) contain an incorrect include path that points into a temporary build directory:

    set (FFTW3f_INCLUDE_DIRS /tmp/tmp.VFDTh6VYy8/rpm/BUILDROOT/opt/cray/pe/fftw/3.3.10.3/x86_genoa/include)
    

    This causes a problem with find_package(fftw3 REQUIRED).

  • Status: In progress
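
One way to spot this kind of stale path is to check whether each include directory named in a CMake config file exists on disk. The following is a minimal sketch, not a vendor-provided tool: the function name check_cmake_include_dirs is hypothetical, and it assumes the config file uses lines of the form shown above (set (NAME_INCLUDE_DIRS /path)) with paths that contain no whitespace.

```shell
# Print every *_INCLUDE_DIRS path set in a CMake config file that does not
# exist on disk. Returns 1 if any path is missing, 0 otherwise.
# Assumption: paths contain no whitespace or glob characters.
check_cmake_include_dirs() {
    config_file=$1
    rc=0
    # Extract the path from lines such as: set (FFTW3f_INCLUDE_DIRS /some/path)
    for dir in $(sed -n 's/^[[:space:]]*set[[:space:]]*([[:space:]]*[A-Za-z0-9_]*_INCLUDE_DIRS[[:space:]]\{1,\}\([^)[:space:]]*\).*/\1/p' "$config_file"); do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            rc=1
        fi
    done
    return $rc
}
```

It could be run over each *.cmake file in the directory named in the description, for example: for f in /opt/cray/pe/fftw/3.3.10.6/x86_milan/lib/cmake/fftw3/*.cmake; do check_cmake_include_dirs "$f"; done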

crayftn passing procedure pointer derived type components to associated function call

  • Vendor: HPE
  • Description: An ICE (internal compiler error) is triggered by a call to the associated function whose two arguments are procedure pointer derived type components. A reproducer is available.

    $ crayftn --version
    Cray Fortran : Version 17.0.0
    
    $ ftn -c example.f90
       Struct_Opr  idx = 17  Cray parcel pointer   rank = 0; line = 49, col = 34
       Left opnd is IR_Tbl_Idx;  line = 49, col = 31
          Dv_Deref_Opr  idx = 81  type(TYPE_T)    typ_idx (587)   dim = 0 rank = 0; line = 49, col = 31
          Left opnd is AT_Tbl_Idx;  line = 49, col = 31
             LHS  idx = 701  derived-type * 587
          Right operand is NO_Tbl_Idx;
       Right operand is AT_Tbl_Idx;  line = 49, col = 35
          TEST_FUNCTION_  idx = 591  Cray parcel pointer * Cray_Parcel_Ptr_8
    ftn-1716 ftn: INTERNAL EQUALS, File = example.f90, Line = 49, Column = 34
      A multiparented node was encountered.
    ftn-2116 ftn: INTERNAL
      "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/ftnfe" was terminated due to receipt of signal 06:  Aborted.
    
  • Status: Fixed in CCE 19.0.0

  • Availability: CCE 19.0.0 not available yet

crayftn argument that has pointer attribute causes a compile-time error about needing pointer attribute

  • Vendor: HPE
  • Description: When a user-defined constructor for a derived type takes a procedure argument with the pointer attribute, and the constructor's interface is in a module while its definition is in a submodule, the compiler reports that the procedure argument needs the pointer attribute even though it already has it. A reproducer code produces the following error:

    $ crayftn --version
    Cray Fortran : Version 17.0.0
    
    $ ftn -c example.f90
    
    submodule(pointer_attribute_bug_m) pointer_attribute_bug_s
              ^
    ftn-1800 ftn: ERROR CONSTRUCT, File = example.f90, Line = 32, Column = 11
      Procedure "TEST_FUNCTION" has the INTENT attribute, so it must be a procedure pointer.  Add the POINTER attribute.
    
  • Status: In progress

Cray Fortran 17.0.0 dummy argument type not recognized for module procedure in same module

  • Vendor: HPE
  • Description: A valid Fortran code produces the error shown below when compiled with the Cray Fortran compiler.

    $ ftn -c unimported-dummy-arg-type.f90
    module foo_m
    ^
    ftn-855 ftn: ERROR FOO_M, File = unimported-dummy-arg-type.f90, Line = 1, Column = 8
    The compiler has detected errors in module "FOO_M". No module information file will be created for this module.
    
    module function construct(bar) result(foo)
    ^
    ftn-1279 ftn: ERROR CONSTRUCT, File = unimported-dummy-arg-type.f90, Line = 11, Column = 31
    Procedure "CONSTRUCT" is defined at line 19 (unimported-dummy-arg-type.f90). The type of this argument does not agree with dummy argument "BAR".
    ^
    ftn-287 ftn: WARNING CONSTRUCT, File = unimported-dummy-arg-type.f90, Line = 11, Column = 43
    The result of function name "FOO" in the function subprogram is not defined.
    
    Cray Fortran : Version 17.0.0 (20231107223020_b59b7a8e9169719529cf5ab440f3c301e515d047)
    Cray Fortran : Compile time: 0.0039 seconds
    Cray Fortran : 21 source lines
    Cray Fortran : 2 errors, 1 warnings, 0 other messages, 0 ansi
    Cray Fortran : "explain ftn-message number" gives more information about each message
    
  • Status: Fixed in CCE 18.0.1

  • Availability: CCE 18.0.1 not available yet

Apps instrumented with perftools-lite-gpu get an MPI error

  • Vendor: HPE
  • Description: When MPI applications that offload to GPUs with OpenMP are instrumented with perftools-lite-gpu, they fail with an MPI error:

    $ srun -n 4 -c 32 --cpu-bind=cores --gpus-per-task=1 --gpu-bind=none ./a.out
    CrayPat/X:  Version 23.12.0 Revision 67ffc52e7 sles15.4_x86_64  11/13/23 21:04:20
    (GTL DEBUG: 1) cuPointerGetAttribute: (null), (null), line no 327
    MPICH ERROR [Rank 1] [job id 28173016.1] [Mon Jul 15 18:10:36 2024] [nid001161] - Abort(606152194) (rank 1 in comm 0): Fatal error in PMPI_Barrier: Invalid count, error stack:
    PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
    PMPI_Barrier(265)....................:
    MPIR_CRAY_Barrier(124)...............:
    MPIDI_Cray_shared_mem_coll_bcast(518):
    MPIR_Localcopy(95)...................:
    (unknown)(): Invalid count
    
    aborting job:
    Fatal error in PMPI_Barrier: Invalid count, error stack:
    PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
    PMPI_Barrier(265)....................:
    MPIR_CRAY_Barrier(124)...............:
    MPIDI_Cray_shared_mem_coll_bcast(518):
    MPIR_Localcopy(95)...................:
    (unknown)(): Invalid count
    (GTL DEBUG: 2) cuPointerGetAttribute: (null), (null), line no 327
    MPICH ERROR [Rank 2] [job id 28173016.1] [Mon Jul 15 18:10:36 2024] [nid001161] - Abort(606152194) (rank 2 in comm 0): Fatal error in PMPI_Barrier: Invalid count, error stack:
    PMPI_Barrier(280)....................: MPI_Barrier(comm=comm=0x84000001) failed
    PMPI_Barrier(265)....................:
    MPIR_CRAY_Barrier(124)...............:
    MPIDI_Cray_shared_mem_coll_bcast(463):
    MPIR_Localcopy(95)...................:
    (unknown)(): Invalid count
    ...
    

    Uninstrumented executables run fine.

  • Status: In progress

cray-libsci segfaults when using multiple OpenMP threads

  • Vendor: HPE
  • Description: An OpenMP code using cray-libsci segfaults when run with multiple threads on a single node:

    $ srun -n 1 -c 16 --cpu_bind=cores -G 1 --gpu-bind=none ./code/STRUMPACK/build/examples/sparse/testPoisson3d 50 --sp_disable_gpu
    ...
    srun: error: nid001036: task 0: Segmentation fault
    srun: Terminating StepId=27511778.0
    
  • Status: In progress

Valgrind4hpc needs to drop support for exp-sgcheck

  • Vendor: HPE
  • Description: Valgrind no longer supports exp-sgcheck, but Valgrind4hpc still lists it as a supported tool.
  • Status: Fixed in CPE 24.11
  • Availability: CPE 24.11 not available yet

MPI_Allgatherv fails for device buffers within a node

  • Vendor: HPE
  • Description: A user code fails in an MPI_Allgatherv call that uses a device buffer returned by the omp_get_mapped_ptr function. The failure occurs in a single-node job in the PrgEnv-cray and PrgEnv-nvidia environments:

    MPICH ERROR [Rank 1] [job id 26614636.0] [Sun Jun  9 07:53:26 2024] [nid001236] - Abort(86580482) (rank 1 in comm 0): Fatal error in PMPI_Allgatherv: Invalid count, error stack:
    PMPI_Allgatherv(491)......................: MPI_Allgatherv(sbuf=MPI_IN_PLACE, scount=0, MPI_DATATYPE_NULL, rbuf=0x7ff2db800000, rcounts=0xe34b3e0, displs=0xe339770, datatype=MPI_DOUBLE_COMPLEX, comm=MPI_COMM_WORLD) failed
    MPIR_CRAY_Allgatherv(466).................:
    MPIR_Allgatherv_impl(277).................:
    MPIR_Allgatherv_intra_auto(191)...........: Failure during collective
    MPIR_Allgatherv_intra_auto(186)...........:
    MPIR_Allgatherv_intra_ring(166)...........:
    MPIC_Sendrecv(338)........................:
    MPIC_Wait(71).............................:
    MPIR_Wait_impl(41)........................:
    MPID_Progress_wait(201)...................:
    MPIDI_Progress_test(105)..................:
    MPIDI_SHMI_progress(118)..................:
    MPIDI_POSIX_progress(412).................:
    MPIDI_CRAY_Common_lmt_ctrl_send_rts_cb(64):
    MPIDI_CRAY_Common_lmt_handle_recv(44).....:
    MPIDI_CRAY_Common_lmt_import_mem(218).....:
    (unknown)(): Invalid count
    ...
    
  • Status: In progress

Apps instrumented with perftools-lite or perftools hang or fail

  • Vendor: HPE
  • Description: When apps are instrumented with perftools-lite or perftools, they hang, segfault or fail for an unknown reason:

    # Hang
    $ ls -lrt | tail -2; sacct -o jobid,jobname,start,end,elapsed,state -j 26530829; date
    -rw------- 1 elvis elvis       5819 Jun  6 13:04 rsl.out.0000
    -rw------- 1 elvis elvis       5831 Jun  6 13:04 rsl.error.0000
    JobID           JobName               Start                 End    Elapsed      State
    ------------ ---------- ------------------- ------------------- ---------- ----------
    ...
    26530829.0      wrf.exe 2024-06-06T13:04:11             Unknown   00:29:02    RUNNING
    Thu 06 Jun 2024 01:33:13 PM PDT
    
    # Segfault
    srun: error: nid004203: task 9: Segmentation fault
    srun: Terminating StepId=26197475.0
    
    # Fail for an unknown reason
    srun: error: nid006953: tasks 0-127: Exited with exit code 255
    srun: Terminating StepId=26230127.0
    
  • Status: In progress

  • Workaround: Set the PAT_RT_CALLSTACK_MODE environment variable before your srun command:

    export PAT_RT_CALLSTACK_MODE=frames
    

Codes fail with 'cxil_map: write error'

  • Vendor: HPE
  • Description: When built with the -O2 or -O3 flag in the PrgEnv-gnu environment, a CPU code called CROCO (Coastal and Regional Ocean Community model) fails with the error:

    cxil_map: write error
    cxil_map: write error
    cxil_map: write error
    ...
    cxil_map: write error
    MPICH ERROR [Rank 132] [job id 26513451.0] [Thu Jun  6 04:27:18 2024] [nid005207] - Abort(538553615) (rank 132 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
    PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffda83b8580, count=630, MPI_DOUBLE_PRECISION, src=115, tag=6, MPI_COMM_WORLD, request=0x7ffda83b84d0) failed
    MPID_Irecv(529)........:
    MPIDI_irecv_unsafe(163):
    MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
    
    aborting job:
    Fatal error in PMPI_Irecv: Other MPI error, error stack:
    PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffda83b8580, count=630, MPI_DOUBLE_PRECISION, src=115, tag=6, MPI_COMM_WORLD, request=0x7ffda83b84d0) failed
    MPID_Irecv(529)........:
    MPIDI_irecv_unsafe(163):
    MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
    MPICH ERROR [Rank 137] [job id 26513451.0] [Thu Jun  6 04:27:18 2024] [nid005207] - Abort(941206799) (rank 137 in comm 0): Fatal error in PMPI_Irecv: Other MPI error, error stack:
    PMPI_Irecv(166)........: MPI_Irecv(buf=0x7ffdc3523ea0, count=630, MPI_DOUBLE_PRECISION, src=122, tag=8, MPI_COMM_WORLD, request=0x7ffdc35215f8) failed
    MPID_Irecv(529)........:
    MPIDI_irecv_unsafe(163):
    MPIDI_OFI_do_irecv(356): OFI tagged recv failed (ofi_recv.h:356:MPIDI_OFI_do_irecv:Bad address)
    ...
    

    With a lower optimization level, the code runs fine. A similar error is observed with an old version of WRF.

  • Status: In progress

cray-netcdf and cray-parallel-netcdf module issues with wrong lib directories

  • Vendor: HPE
  • Description: Many model build systems rely on nc-config and pnetcdf-config to get the correct compiler flags and link libraries. However, the modules cray-parallel-netcdf/1.12.2.1, cray-netcdf/4.8.1.1, and cray-netcdf-hdf5parallel/4.8.1.1 all give incorrect results. For example,

    $ pnetcdf-config --libdir
    /opt/cray/pe/parallel-netcdf/1.12.2.1/gnu/8.2/lib
    

    when the correct path is /opt/cray/pe/parallel-netcdf/1.12.2.1/INTEL/19.1/lib. With cray-netcdf-hdf5parallel/4.8.1.1, nc-config gives several wrong results:

    --cflags    -> -DpgiFortran
    --cxx4flags -> -DpgiFortran
    --libdir    -> /opt/cray/pe/netcdf-hdf5parallel/4.8.1.1/gnu/8.2/lib
    

    The same problem is observed in later versions.

  • Status: In progress
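
A build script can guard against this by comparing the reported library directory with what it expects before using it. The sketch below is only illustrative: check_config_libdir is a hypothetical helper, and the expected path fragment (for example INTEL, as in the correct path above) has to be supplied by the caller. Only the --libdir option shown above is used.

```shell
# Run "<tool> --libdir" and check that the printed path contains an expected
# fragment (e.g. the compiler name). Returns 0 on match, 1 on mismatch,
# 2 if the tool itself fails.
check_config_libdir() {
    tool=$1
    expected=$2
    libdir=$("$tool" --libdir) || return 2
    case $libdir in
        *"$expected"*)
            echo "ok: $libdir"
            ;;
        *)
            echo "mismatch: $libdir does not contain $expected"
            return 1
            ;;
    esac
}
```

For instance, check_config_libdir pnetcdf-config INTEL would report a mismatch for the gnu/8.2 path shown in the description, so the build could stop early instead of linking against the wrong libraries.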

crayftn ICE when trying to build dftd4

  • Vendor: HPE
  • Description: An ICE occurs when building dftd4 with the Cray compiler:

    ftn-1795 ftn: INTERNAL MULTICHARGE_MODEL, File = /global/homes/e/elvis/Repositories/multicharge/src/multicharge/model.F90, Line = 30, Column = 4
      FORTRAN FE ASSERT: "new_attr_idx" failed. ( /home/jenkins/crayftn/fe90/sources/module.c at line 1224).
    ftn-2116 ftn: INTERNAL
      "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/ftnfe" was terminated due to receipt of signal 06:  Aborted.
    
  • Status: Fixed in CCE 18.0.0

  • Availability: CCE 18.0.0 not available yet

crayftn ICE on WHERE statement with defined assignment

  • Vendor: HPE
  • Description: The Cray Fortran compiler hits an ICE with a test code when it encounters a WHERE statement that uses defined assignment.

    ...
    Creating internal compiler error backtrace (please wait):
    [0x000000012d4e69] linux_backtrace /home/jenkins/crayftn/pdgcs/v_util.c:186
    [0x000000012d53a1] pdgcs_internal_error(char const*, char const*, int) /home/jenkins/crayftn/pdgcs/v_util.c:663
    [0x00000001ddcbc5] verify_binary_args(EXP_OP, EXP_INFO, EXP_INFO, TYPE, char const*, EXP_T_TYPE, bool, bool, bool) [clone .constprop.0] /home/jenkins/crayftn/pdgcs/v_expr_tbl.c:413
    ...
    [0x000000008a9fc9] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
    ftn-7991 ftn: INTERNAL EXAMPLE, File = example.f90, Line = 34
      INTERNAL COMPILER ERROR:  "Array syntax flags do not match" (/home/jenkins/crayftn/pdgcs/v_expr_tbl.c, line 413, version b59b7a8e9169719529cf5ab440f3c301e515d047)
    ftn-2116 ftn: INTERNAL
      "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/optcg" was terminated due to receipt of signal 06:  Aborted.
    ...
    
  • Status: In progress

Code fails with 'MPIDI_OFI_send_normal:Resource temporarily unavailable)'

  • Vendor: HPE
  • Description: A user app with GPU-aware MPI gets the following error with cray-mpich/8.1.25 in multi-node jobs when GDRCopy is used:

    ...
    MPICH ERROR [Rank 10] [job id 24282816.0] [Thu Apr 11 15:49:07 2024] [nid008436] - Abort(739891471) (rank 10 in comm 0): Fatal error in PMPI_Send: Other MPI error, error stack:
    PMPI_Send(163)............: MPI_Send(buf=0x7fc1755b4f90, count=6300, MPI_DOUBLE, dest=12, tag=1, MPI_COMM_WORLD) failed
    MPID_Send(499)............:
    MPIDI_send_unsafe(58).....:
    MPIDI_OFI_send_normal(368): OFI tagged senddata failed (ofi_send.h:368:MPIDI_OFI_send_normal:Resource temporarily unavailable)
    ...
    

    The user was using PrgEnv-gnu.

  • Status: In progress

crayftn ICE on MERGE function call when the MASK argument is a derived type component

  • Vendor: HPE
  • Description: An ICE occurs when the intrinsic MERGE is called with a MASK argument that is a component of the derived type argument to a type-bound procedure, and the result of the MERGE call is then passed to the intrinsic TRIM. The error message from a Fortran test code is as follows:

    Creating internal compiler error backtrace (please wait):
    [0x000000012d4e69] linux_backtrace /home/jenkins/crayftn/pdgcs/v_util.c:186
    [0x000000012d53a1] pdgcs_internal_error(char const*, char const*, int) /home/jenkins/crayftn/pdgcs/v_util.c:663
    [0x000000015e3c52] llvm_cg::get_string_address(EXP_INFO) /home/jenkins/crayftn/pdgcs/llvm-substr.c:493
    ...
    [0x007f35e3e3e24c] ?? ??:0
    [0x000000008a9fc9] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
    ftn-7991 ftn: INTERNAL MYPROC, File = example.f90, Line = 15 
      INTERNAL COMPILER ERROR:  "get_string_address - the string is not a lsbstr" (/home/jenkins/crayftn/pdgcs/llvm-substr.c, line 493, version b59b7a8e9169719529cf5ab440f3c301e515d047)
    ftn-2116 ftn: INTERNAL  
      "/opt/cray/pe/cce/17.0.0/cce/x86_64/bin/optcg" was terminated due to receipt of signal 06:  Aborted.
    
  • Status: Fixed in CCE 18.0.1

  • Availability: CCE 18.0.1 not available yet

sanitizers4hpc's output aggregation with ThreadSanitizer

  • Vendor: HPE
  • Description: Aggregation of ThreadSanitizer output by sanitizers4hpc needs improvement.
  • Status: Fixed in CPE 24.11
  • Availability: CPE 24.11 not available yet

sanitizers4hpc with Compute Sanitizer's memcheck produces output that is not aggregated

  • Vendor: HPE
  • Description: Compute Sanitizer output aggregation needs improvement.
  • Status: Fixed in PE 24.07; closed
  • Availability: PE 24.07 not available yet

sanitizers4hpc produces stack traces for 'Program hit CUDA_ERROR_INVALID_VALUE error'

  • Vendor: HPE
  • Description: The following error message appears when using sanitizers4hpc with Compute Sanitizer's Memcheck although the desired output is produced:

    Program hit CUDA_ERROR_INVALID_VALUE (error 1) due to "invalid argument" on CUDA API call to cuPointerGetAttribute.
    Saved host backtrace up to driver entry point at error
        #0 0x2eae6f in /usr/local/cuda-12.2/compat/libcuda.so.1
        #1 0xda19 in /home/jenkins/src/gtlt/cuda/gtlt_cuda_query.c:344:gtlt_cuda_pointer_type /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0
        #2 0x4bd9 in /home/jenkins/src/comx/gtlx_query.c:25:mpix_gtl_pointer_type /opt/cray/pe/lib64/libmpi_gtl_cuda.so.0
        #3 0x1fa2475 in MPIR_Cray_Memcpy_wrapper /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #4 0x1841ee9 in MPIDIG_handle_unexp_mrecv /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #5 0x18ac030 in MPIC_Sendrecv /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #6 0x17d6bcf in MPIR_Barrier_intra_dissemination /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #7 0x232900 in MPIR_Barrier_intra_auto /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #8 0x232ab5 in MPIR_Barrier_impl /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #9 0x1a1f15d in MPIR_CRAY_Barrier /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #10 0x1a15772 in MPIDI_Cray_shared_mem_coll_opt_cleanup /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #11 0x18cc799 in MPIDI_Cray_coll_finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #12 0x1b7a65f in MPID_Finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #13 0x605ab4 in MPI_Finalize /opt/cray/pe/lib64/libmpi_gnu_123.so.12
        #14 0xf94 in /pscratch/sd/e/elvis/Memcheck/main.cc:14:main /pscratch/sd/e/elvis/Memcheck/./a.out
        #15 0x3524d in __libc_start_main /lib64/libc.so.6
        #16 0xe9a in ../sysdeps/x86_64/start.S:122:_start /pscratch/sd/e/elvis/Memcheck/./a.out
    

    This error doesn't occur when the code is run without sanitizers4hpc.

  • Status: In progress

No source line number displayed when run with MemorySanitizer in PrgEnv-cray

  • Vendor: HPE
  • Description: MemorySanitizer does not report where in the source code an error occurs or where the memory was allocated.

    ==1068200==WARNING: MemorySanitizer: use-of-uninitialized-value
        #0 0x3007bc  (/pscratch/sd/e/elvis/a.out+0x3007bc)
        #1 0x7fc5c262e24c  (/lib64/libc.so.6+0x3524c) (BuildId: ddc393ac74ed8f90d4fdfff796432fbafd281e1b)
        #2 0x26a849  (/pscratch/sd/e/elvis/a.out+0x26a849)
    ...
    

    The same problem was later found with PrgEnv-intel as well.

  • Status: Fixed in CCE 18.0.0 for PrgEnv-cray; the PrgEnv-intel problem won't be fixed

  • Availability: CCE 18.0.0 not available yet
  • Workaround: Set the environment variable:

    export MSAN_OPTIONS="allow_addr2line=true"
    

disable_sanitizer_instrumentation attribute doesn't work with PrgEnv-aocc

  • Vendor: HPE
  • Description: The __attribute__((disable_sanitizer_instrumentation)) attribute doesn't disable sanitizer instrumentation in the AOCC compilers although the compilers are Clang-based.
  • Status: In progress

Segfaulting with calls to MPI_Win_allocate_shared function on multiple CPU nodes

  • Vendor: HPE
  • Description: The FHI-aims code uses the MPI-3 Shared Memory model. When CPU nodes (as opposed to GPU nodes) are used and the number of MPI tasks per node exceeds a certain threshold (46 in the case tested), a multi-node run segfaults at a call to MPI_Win_allocate_shared.

    ...
    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
    
    #0  0x14e493c23372 in ???
    #1  0x14e493c22505 in ???
    #2  0x14e493253dbf in ???
    ...
    #11  0x14e49424a3d4 in ???
    #12  0x2121b80 in __mpi_shm_MOD_allocate_shm_arr_1dr_diml
            at /pscratch/sd/e/elvis/FHIaims/src/mpi_shm.f90:152
    ...
    

    A different task-per-node count triggers a segfault at a different MPI_Win_allocate_shared call in the code.

  • Status: Fixed in cray-mpich/8.1.29
  • Availability: cray-mpich 8.1.29 or later not available yet

Error occurs when MPI window object is not freed

  • Vendor: HPE
  • Description: Messages about a fatal MPI finalize error are generated when an MPI window object is not freed before MPI_Finalize.

    MPICH ERROR [Rank 0] [job id 23533657.21] [Tue Mar 26 16:56:36 2024] [nid006635] - Abort(806971663) (rank 0 in comm 0): Fatal error in PMPI_Finalize: Other MPI error, error stack:
    PMPI_Finalize(214)...............: MPI_Finalize failed
    PMPI_Finalize(161)...............:
    MPID_Finalize(710)...............:
    MPIDI_OFI_mpi_finalize_hook(1046): OFI endpoint close failed (ofi_init.c:1046:MPIDI_OFI_mpi_finalize_hook:Device or resource busy)
    ...
    
  • Status: Fixed in cray-mpich/8.1.29.34

  • Availability: cray-mpich 8.1.29.34 or later not available yet

CCE 17.0.0 Fortran compiler fails four Smart-Pointers tests

crayftn runtime error with user defined operator on associate name

  • Vendor: HPE
  • Description: A segmentation fault occurs in a code when calling a user defined operator on a name associated with a function/expression result.

    lib-4968 : WARNING
      An unallocated allocatable array 'STRING_' is referenced at
      at line 20 in file 'example.f90'.
    Segmentation fault
    
  • Status: In progress

Valid coarray code rejected by crayftn

  • Vendor: HPE
  • Description: A coarray code is incorrectly rejected with errors by the Cray Fortran compiler.

    $ ftn coarrays.f90 -o coarrays.exe
           call assign_and_synchronize(lhs=u_half, rhs=u + (dt/2)*(nu*d_dx2(u,dx) - d_dx(half_uu,dx)))
                                                                             ^                          
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 74 
      Coarray t$14 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                             ^                          
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 74 
      Coarray t$14 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                                          ^             
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 87 
      Coarray t$19 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                                          ^             
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 15, Column = 87 
      Coarray t$19 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
    
            call assign_and_synchronize(lhs=u, rhs=u + dt*(nu*d_dx2(u_half,dx) - d_dx(half_uu,dx)))
                                                                    ^                               
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 65 
      Coarray t$35 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                    ^                               
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 65 
      Coarray t$35 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                                      ^             
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 83 
      Coarray t$40 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
                                                                                      ^             
    ftn-1587 ftn: ERROR COARRAY_BURGERS_SOLVER, File = coarrays.f90, Line = 17, Column = 83 
      Coarray t$40 must have the ALLOCATABLE attribute in order to have a deferred shape in the coarray dimensions.
    
    Cray Fortran : Version 17.0.0 (20231107223020_b59b7a8e9169719529cf5ab440f3c301e515d047)
    Cray Fortran : Compile time:  0.1774 seconds
    Cray Fortran : 135 source lines
    Cray Fortran : 8 errors, 0 warnings, 0 other messages, 0 ansi
    Cray Fortran : "explain ftn-message number" gives more information about each message.
    
  • Status: In progress

Incorrect results and poor performance with do concurrent reduction

  • Vendor: HPE
  • Description: A code that performs a do concurrent reduction gives incorrect results when built with the Cray Fortran compiler. When compiled with the -h thread_do_concurrent flag, the code shows poor performance.
  • Status: In progress

TCP BTL fails to collect all interface addresses (when interfaces are on different subnets)

  • Vendor: Open MPI
  • Description: Multi-node Open MPI point-to-point communications using the tcp BTL component fail because, although one NIC on a Perlmutter node has two IP interfaces for different subnets (one private and one public), only one IP is used per peer kernel interface.

    Open MPI detected an inbound MPI TCP connection request from a peer
    that appears to be part of this MPI job (i.e., it identified itself as
    part of this Open MPI job), but it is from an IP address that is
    unexpected.  This is highly unusual.
    
    The inbound connection has been dropped, and the peer should simply
    try again with a different IP interface (i.e., the job should
    hopefully be able to continue).
    
      Local host:          nid002292
      Local PID:           1273838
      Peer hostname:       nid002293 ([[9279,0],1])
      Source IP of socket: 10.249.13.210
      Known IPs of peer:
        10.100.20.22
        128.55.69.127
        10.249.13.209
        10.249.36.5
        10.249.34.5
    
  • Status: In progress

CP2K container builds with Open MPI with networking bug

  • Vendor: Nvidia
  • Description: The CP2K container images on NGC that NERSC recommends were built with old Open MPI versions (4.x), and bugs in those versions contribute to multi-node job failures. New images with Open MPI 5.x built against libfabric have been requested.
  • Status: In progress

cray-mpich module does not set LD_LIBRARY_PATH

  • Vendor: HPE
  • Description: Loading a cray-mpich module doesn't update the LD_LIBRARY_PATH environment variable, so the variable has to be set manually.

    $ export MPICH_VERSION_DISPLAY=1  # Print the MPI version number that is being used
    
    $ ml -t
    ...
    cray-mpich/8.1.25
    ...
    $ echo $CRAY_MPICH_VERSION
    8.1.25
    
    $ ml cray-mpich/8.1.27            # Load a different version
    
    $ echo $CRAY_MPICH_VERSION
    8.1.27
    
    $ srun -n 1 ./a.out               # Still using the previous version
    MPI VERSION    : CRAY MPICH version 8.1.25.17 (ANL base 3.4a2)
    ...
    
    $ export LD_LIBRARY_PATH=${CRAY_MPICH_DIR}/lib:$LD_LIBRARY_PATH
    
    $ srun -n 1 ./a.out               # Now using the intended version
    MPI VERSION    : CRAY MPICH version 8.1.27.26 (ANL base 3.4a2)
    ...
    
  • Status: In progress
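
When the manual export above goes into a shell startup file or job script that may be sourced more than once, it can help to avoid adding the same directory repeatedly. The following is a small sketch under that assumption; prepend_path is a hypothetical helper, not part of the module system.

```shell
# Prepend a directory to a colon-separated path list unless it is already
# present somewhere in the list. Prints the resulting list and leaves the
# environment untouched.
prepend_path() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;
        *) printf '%s\n' "${dir}${list:+:$list}" ;;
    esac
}
```

Usage would look like export LD_LIBRARY_PATH=$(prepend_path "${CRAY_MPICH_DIR}/lib" "$LD_LIBRARY_PATH"). Note that the helper leaves the list unchanged if the directory is already present, even when it is not first, so it does not reorder an existing entry ahead of another MPI library directory.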

PrgEnv-nvhpc conflicts with cudatoolkit module

  • Vendor: HPE
  • Description: The PrgEnv-nvhpc environment loads the nvhpc module which is listed as a conflict for the cudatoolkit module.

    $ ml PrgEnv-nvhpc
    $ ml -t
    ...
    cpe/23.12
    cudatoolkit/12.2
    craype-accel-nvidia80
    gpu/1.0
    nvhpc/23.9
    ...
    PrgEnv-nvhpc/8.5.0
    
    $ ml rm cudatoolkit
    $ ml cudatoolkit
    Lmod has detected the following error:  Cannot load module "cudatoolkit/12.2" because these module(s) are
    loaded:
       nvhpc
    
    While processing the following module(s):
        Module fullname   Module Filename
        ---------------   ---------------
        cudatoolkit/12.2  /opt/cray/pe/lmod/modulefiles/core/cudatoolkit/12.2.lua
    
    $ ml -t        # cudatoolkit not in the list
    ...
    cpe/23.12
    craype-accel-nvidia80
    gpu/1.0
    nvhpc/23.9
    ...
    PrgEnv-nvhpc/8.5.0
    
  • Status: The nvhpc and PrgEnv-nvhpc modules will be removed in CPE/24.11, in favor of nvidia and PrgEnv-nvidia

  • Workaround: Use the nvidia and PrgEnv-nvidia modules instead
  • Availability: CPE 24.11 not available yet
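
    To spot the state shown at the end of the transcript above (nvhpc loaded, cudatoolkit missing), a terse module list can be scanned. This is only a sketch: the function name check_nvhpc_cudatoolkit is hypothetical, and it reads the list from standard input rather than calling Lmod itself.

    ```shell
    # Hypothetical helper: read a terse module list on stdin and report
    # whether nvhpc is loaded while cudatoolkit is absent.
    check_nvhpc_cudatoolkit() {
        list=$(cat)
        if printf '%s\n' "$list" | grep -q '^[[:space:]]*nvhpc/' &&
           ! printf '%s\n' "$list" | grep -q '^[[:space:]]*cudatoolkit/'; then
            echo "nvhpc loaded without cudatoolkit"
            return 1
        fi
        echo "ok"
    }
    ```

    A possible invocation is `ml -t 2>&1 | check_nvhpc_cudatoolkit`. The `2>&1` assumes that Lmod prints the module list to standard error, which is its usual default.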

Regression in device memory growth issue with GPU-Aware MPI for XGC code

  • Vendor: HPE
  • Description: The XGC code shows device memory growth when GPU-Aware MPI is enabled.
  • Status: In progress
  • Workaround: Disable the memory registration (MR) cache:

    export FI_MR_CACHE_MAX_COUNT=0
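
    If the setting should apply to a single job step rather than to the whole shell session, it can be scoped to one command. A minimal sketch, assuming the launched command inherits its environment from the caller; the wrapper name run_without_mr_cache is hypothetical:

    ```shell
    # Hypothetical wrapper: run one command with the MR cache disabled,
    # leaving the calling shell's environment untouched.
    run_without_mr_cache() {
        env FI_MR_CACHE_MAX_COUNT=0 "$@"
    }

    # Example use inside a batch script (not executed here):
    #   run_without_mr_cache srun -n 4 ./a.out
    ```

    Other commands in the same script then keep the default memory registration cache behavior.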
    

OpenACC reduction with worker gives wrong answers

  • Vendor: Nvidia
  • Description: A procedure declared with an OpenACC routine worker directive returns wrong reduction values in the PrgEnv-nvidia environment when called from within a loop where num_workers and vector_length are set to 32.
  • Status: In progress

Performance issue with fi_write() to GPU memory on Perlmutter

  • Vendor: HPE
  • Description: The GASNet-EX networking library implements RMA APIs with the vendor-provided libfabric and its cxi provider. RMA Put operations between two GPU nodes perform much worse than MPI when the destination address is in remote GPU memory. For other combinations of source/destination memory and Put/Get mode, GASNet-EX performs as well as or better than the MPI benchmarks.
  • Status: In progress

RMA performance problems on Perlmutter with GASNet Codes

  • Vendor: HPE
  • Description: The GASNet-EX networking library implements RMA (Remote Memory Access) APIs with the fi_read() and fi_write() functions of the vendor-provided libfabric and its cxi provider. RMA operations perform very well under ideal conditions, but performance drops significantly for both host and GPU memory when conditions are not ideal.
  • Status: In progress

crayftn overloaded constructor with polymorphic argument in array constructor

  • Vendor: HPE
  • Description: The Cray Fortran compiler generates an internal compiler error for a code that passes a child type to an overloaded structure constructor within an array constructor, where the parent type has a deferred procedure.

    Creating internal compiler error backtrace (please wait):
    [0x00000000c75a43] linux_backtrace ??:?
    [0x00000000c76931] pdgcs_internal_error(char const*, char const*, int) ??:?
    [0x0000000125c2d0] _expr_type(EXP_INFO) ??:?
    ...
    [0x007f86c32e129c] ?? ??:0
    [0x00000000729d09] _start /home/abuild/rpmbuild/BUILD/glibc-2.31/csu/../sysdeps/x86_64/start.S:120
    
    Note:  This is a non-debug compiler.  Technical support should
           continue problem isolation using a compiler built for
           debugging.
    
    ftn-7991 ftn: INTERNAL EXAMPLE, File = example.f90, Line = 63
      INTERNAL COMPILER ERROR:  "_expr_type: Invalid table type" (/home/jenkins/crayftn/pdgcs/v_expr_utl.c, line 7360, version 66f7391d6a03cf932f321b9f6b1d8612ef5f362c)
    
  • Status: Fixed in CCE 18.0.0

  • Availability: CCE 18.0.0 not available yet

Internal Compiler Error

  • Vendor: HPE
  • Description: An internal compiler error occurs when compiling the E3SM code with the AMD compilers.
  • Status: Fixed in CCE 19.0.0
  • Availability: CCE 19.0.0 not available yet

Code hangs when run on multiple nodes, sometimes showing the 'xpmem_attach error: : Cannot allocate memory' message

  • Vendor: HPE
  • Description: A code that runs fine with 128 MPI tasks on a single CPU node hangs when run on multiple nodes. It sometimes, but not always, prints the following message:

    xpmem_attach error: : Cannot allocate memory
    
  • Status: In progress

  • Workaround: Set the FI_MR_CACHE_MONITOR environment variable as follows:

    export FI_MR_CACHE_MONITOR=kdreg2
    

cray-mpich with GTL not recognising pointer to device memory, that was returned by OpenCL clSVMAlloc

  • Vendor: HPE
  • Description: When a pointer returned by the OpenCL clSVMAlloc function is used in one-sided MPI communication, the communication does not get the correct data. A workaround of wrapping the MPI RMA exposure epoch in clEnqueueSVMMap/clEnqueueSVMUnmap causes a large amount of data to be moved unnecessarily between host and device memory. The reporter is asking for advice on using OpenCL with MPICH_GPU_SUPPORT_ENABLED.
  • Status: In progress

Resolved Bugs

Apps instrumented with perftools-lite-gpu get an MPI error

Sudo permission issue for cuquantum-appliance:23.10 container

  • Vendor: cuQuantum
  • Description: Users cannot access the /home/cuquantum directory in a container.

    $ cd /home/cuquantum/
    bash: cd: /home/cuquantum/: Permission denied
    
  • Status: Fixed in neilmehta87/cuquantum-appliance:23.10

  • Availability: neilmehta87/cuquantum-appliance:23.10 available on Perlmutter

  • Vendor: HPE
  • Description: Certain apps (not limited to those using the NCCL library, as it turned out) may expose a system condition as a link flap, making them susceptible to a job failure.
  • Status: Closed

No scope analysis window pops up with Perftools/Reveal

  • Vendor: HPE
  • Description: The 'Scope Loop' button on the Reveal tool doesn't open a window that would normally show the scoping result for a selected loop.
  • Status: Closed

crayftn bug in assignment to unlimited polymorphic variable

  • Vendor: HPE
  • Description: The Cray Fortran compiler generates an error with allocation on assignment to an unlimited polymorphic variable:

    $ cat example1.f90 
    class(*), allocatable :: anything
    anything =.true.
    end
    
    $ ftn example1.f90 
    
    anything =.true.
             ^       
    ftn-356 ftn: ERROR $MAIN, File = example1.f90, Line = 2, Column = 10 
      Assignment of a LOGICAL expression to a unlimited polymorphic variable is not allowed.
    
    Cray Fortran : Version 15.0.1 (20230120205242_66f7391d6a03cf932f321b9f6b1d8612ef5f362c)
    Cray Fortran : Compile time:  0.0020 seconds
    Cray Fortran : 3 source lines
    Cray Fortran : 1 errors, 0 warnings, 0 other messages, 0 ansi
    Cray Fortran : "explain ftn-message number" gives more information about each message.
    
  • Status: Closed

Cray HDF5 parallel modules used in CMake fails to configure C++ project but not Fortran

  • Vendor: HPE
  • Description: Both the C++ and Fortran reproducers work with the cray-hdf5 module, but only the Fortran reproducer works with cray-hdf5-parallel. The C++ code throws the following error with cray-hdf5-parallel only:

    $ cmake -B build .
    -- The C compiler identification is GNU 11.2.0
    -- The CXX compiler identification is GNU 11.2.0
    -- Cray Programming Environment 2.7.19 C
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working C compiler: /opt/cray/pe/craype/2.7.19/bin/cc - skipped
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Cray Programming Environment 2.7.19 CXX
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /opt/cray/pe/craype/2.7.19/bin/CC - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    CMake Error at /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
     Could NOT find HDF5 (missing: HDF5_INCLUDE_DIRS) (found version "")
    Call Stack (most recent call first):
     /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
     /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/FindHDF5.cmake:1009 (find_package_handle_standard_args)
     CMakeLists.txt:4 (find_package)
    
  • Status: Closed

Issues when linking with PrgEnv-nvidia and cuFFTMp

  • Vendor: HPE
  • Description: Undefined reference to MPI_Comm_f2c reported at link time.
  • Status: Closed

OFI segfault, and intermittent loss of messages with GASNet

  • Vendor: HPE
  • Description: The segfault problem has since been fixed. Applications occasionally hang, possibly because messages sent with fi_send() and expected in buffers posted with fi_recvmsg() are lost. This has been observed with libfabric's cxi provider. A suggested workaround of setting certain environment variables doesn't appear to be fully effective and wastes a large amount of memory.
  • Status: Closed

nvfortran does not support the VALUE attribute for arrays which are not assumed-size

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

nvfortran does not support intrinsic elemental functions BGE, BGT, BLE, BLT

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

nvfortran does not support intrinsic elemental functions DSHIFTL, DSHIFTR

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

nvfortran does not support function references in a variable definition context

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

nvfortran does not support intrinsic assignment to allocatable polymorphic variables

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

nvfortran does not support %RE and %IM complex-part-designators in variables of COMPLEX type

  • Vendor: HPE
  • Status: Nvidia will not support F2008; closed

User-defined reduction code segfaults with Intel compiler

  • Vendor: HPE
  • Description: A Fortran code using the mpi_f08 interface of the MPI_User_function fails to compile due to a problem in the mpi_f08_callbacks module. (The title refers to runtime behavior of an example code in the MPI standard manual when the bug was initially reported. A corrected version shown in later releases of the manual fails to compile.)
  • Status: Closed

Missing mpi_f08 module

  • Vendor: HPE
  • Description: A Fortran code that uses the mpi_f08 module fails to compile with PrgEnv-nvidia.

    $ ftn main.f90
    NVFORTRAN-F-0004-Unable to open MODULE file mpi_f08.mod (main.f90: 2)
    NVFORTRAN/x86-64 Linux 23.9-0: compilation aborted
    
  • Status: Nvidia will not support F2008; closed