Example Dockerfiles for Shifter

The easiest way to run your software in a Shifter container is to create a Docker image, push it to Docker Hub, and then pull it with the Shifter ImageGateway, which creates the corresponding Shifter image. More details are available in the Shifter documentation.
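A minimal sketch of that workflow (the repository name and tag are placeholders; shifterimg is the NERSC client for the ImageGateway):

# On your workstation: build the image and push it to Docker Hub
docker build -t <dockerhub-user>/myimage:latest .
docker push <dockerhub-user>/myimage:latest

# On a NERSC login node: ask the ImageGateway to convert and stage the image
shifterimg pull <dockerhub-user>/myimage:latest
shifterimg images | grep myimage   # verify that the Shifter image is available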

HEP/HENP Software Stacks

DESI

DESI jobs use community-standard, publicly available software, independent of the Linux distribution. This example is built with an Ubuntu base.

The Dockerfile performs the following steps:

  1. Start from an Ubuntu base image.
  2. Install the needed standard packages from the Ubuntu repositories.
  3. Compile Astrometry.net, Tractor, TMV, and GalSim.
  4. Set up the needed environment variables along the way.

The full Dockerfile:

# Build DESI software environment on top of an Ubuntu base
FROM ubuntu:16.04
MAINTAINER Mustafa Mustafa <mmustafa@lbl.gov>

# install astrometry and tractor dependencies
RUN apt-get update && \
    apt-get install -y wget make git python python-dev python-matplotlib \
                       gcc swig python-numpy libgsl2 gsl-bin pkg-config \
                       zlib1g-dev libcairo2-dev libnetpbm10-dev netpbm \
                       libpng12-dev libjpeg-dev python-pyfits \
                       libbz2-dev libcfitsio3-dev python-photutils python-pip && \
    pip install fitsio

ENV PYTHONPATH /desi_software/astrometry_net/lib/python:$PYTHONPATH
ENV PATH /desi_software/astrometry_net/lib/python/astrometry/util:$PATH
ENV PATH /desi_software/astrometry_net/lib/python/astrometry/blind:$PATH

# ------- install astrometry
RUN mkdir -p /desi_software/astrometry_net && \
         git clone https://github.com/dstndstn/astrometry.net.git && \
         cd astrometry.net && \
         make install INSTALL_DIR=/desi_software/astrometry_net && \
         cd / && \
         rm -rf astrometry.net

# ------- install tractor
RUN mkdir -p /desi_software/tractor && \
         git clone https://github.com/dstndstn/tractor.git && \
         cd tractor && \
         make && \
         python setup.py install --prefix=/desi_software/tractor/

ENV PYTHONPATH /desi_software/tractor/lib/python2.7/site-packages:$PYTHONPATH

# ------- install missing GalSim dependencies (others have been installed above)
RUN apt-get install -y python-future python-yaml python-pandas scons fftw3-dev libboost-all-dev

# ------- install TMV
RUN wget https://github.com/rmjarvis/tmv/archive/v0.73.tar.gz -O tmv.tar.gz && \
         gunzip tmv.tar.gz && \
         mkdir tmv && tar xf tmv.tar -C tmv --strip-components 1 && \
         cd tmv && \
         scons && \
         scons install && \
         cd / && \
         rm -rf tmv.tar tmv

# ------- install GalSim
RUN wget https://github.com/GalSim-developers/GalSim/archive/v1.4.2.tar.gz -O GalSim.tar.gz && \
         gunzip GalSim.tar.gz && \
         mkdir GalSim && tar xf GalSim.tar -C GalSim --strip-components 1 && \
         cd GalSim && \
         scons && \
         scons install && \
         cd / && \
         rm -rf GalSim.tar GalSim

STAR

The STAR experiment software stack is typically built and run on Scientific Linux.

There are two ways to build the STAR image: the first is to compile all of the stack components one by one; the other is to copy pre-compiled libraries into the image. We chose the latter in this example.

We use a publicly available SL6.4 Docker base image, install the needed RPMs, extract tarballs of pre-compiled binaries into the image, and finally install some additional software needed to run STAR jobs.

# Example Dockerfile to show how to build STAR
# environment image from binaries tarballs. Not necessarily
# the one currently used for STAR docker image build
FROM ringo/scientific:6.4
MAINTAINER Mustafa Mustafa <mmustafa@lbl.gov>

# RPMs
RUN yum -y install libxml2 tcsh libXpm.i686 libc.i686 libXext.i686 \
                   libXrender.i686 libstdc++.i686 fontconfig.i686 \
                   zlib.i686 libgfortran.i686 libSM.i686 mysql-libs.i686 \
                   gcc-c++ gcc-gfortran glibc-devel.i686 xorg-x11-xauth \
                   wget make libxml2.so.2 gdb libXtst.{i686,x86_64} \
                   libXt.{i686,x86_64} glibc glibc-devel gcc-c++

# Dev Tools
RUN wget -O /etc/yum.repos.d/slc6-devtoolset.repo \
     https://linuxsoft.cern.ch/cern/devtoolset/slc6-devtoolset.repo && \
 yum -y install devtoolset-2-toolchain
COPY enable_scl /usr/local/star/group/templates/

# untar STAR OPT
COPY optstar.sl64_gcc482.tar.gz /opt/star/
COPY installstar /
RUN python installstar SL16c && \
 rm -f installstar &&         \
 rm -f optstar.sl64_gcc482.tar.gz

# untar ROOT
COPY rootdeb-5.34.30.sl64_gcc482.tar.gz /usr/local/star/
COPY installstar /
RUN python installstar SL16c && \
 rm -f installstar && \
 rm -f rootdeb-5.34.30.sl64_gcc482.tar.gz

# untar STAR library
COPY SL16d.tar.gz /usr/local/star/packages/
COPY installstar /
RUN python installstar SL16d && \
 rm -f installstar && \
 rm -f /usr/local/star/packages/SL16d.tar.gz

# DB load balancer
COPY dbLoadBalancerLocalConfig_generic.xml /usr/local/star/packages/SL16d/StDb/servers/

# production pipeline utility macros
COPY Hadd.C /usr/local/star/packages/SL16d/StRoot/macros/
COPY lMuDst.C /usr/local/star/packages/SL16d/StRoot/macros/
COPY checkProduction.C /usr/local/star/packages/SL16d/StRoot/macros/

# Special RPMs for production at NERSC; Open MPI, mysql-server
RUN yum -y install libibverbs.x86_64 environment-modules infinipath-psm-devel.x86_64 \
 librdmacm.x86_64 opensm.x86_64 papi.x86_64 && \
 wget https://mirror.centos.org/centos/6.8/os/x86_64/Packages/openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 rpm -i openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 rm -f openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 yum -y install glibc-devel devtoolset-2-libstdc++-devel.i686 && \
 yum -y install mysql-server mysql

# add open mpi library to LD Path
ENV LD_LIBRARY_PATH /usr/lib64/openmpi-1.10/lib/

STAR MySQL DB

STAR jobs need access to a read-only MySQL server which provides conditions and calibration tables.

We have found that job scalability is not ideal if the MySQL server is outside the system network. Our solution was to run a local MySQL server on each node; this server serves all the threads running on that node (e.g. 32 threads on a 32-core node). We chose to overcommit the cores, i.e. 32 production threads plus 1 MySQL server on 32 cores.

The DB payload (~30 GB) resides on Lustre. We found that a server reading the payload directly from the Lustre file system does not perform well for this I/O pattern: it takes more than 30 minutes to cache the first few requests. Here the XFS image mount capability (perNodeCache) came in handy. As soon as the job starts, we copy the payload from Lustre into an XFS file mount and point the DB server at this copy. Copying the 30 GB payload takes 1-3 minutes. The performance was stunning: caching time dropped from 30 minutes to less than 1 minute, and it also gave us trivial scalability in the number of concurrent jobs.

Below are the relevant lines from our slurm batch file:

Request a 50 GB perNodeCache and mount it at /mnt in the Shifter image.

#!/bin/bash
#SBATCH --image=mmustafa/sl64_sl16d:v1_pdsf6
#SBATCH --volume=/global/cscratch1/sd/mustafa/:/mnt:perNodeCache=size=50G

Launch the Shifter container:

shifter /bin/csh <<EOF

Copy the payload to /mnt, then launch the DB server:

#Prepare DB...
cd /mnt
cp -r -p /global/cscratch1/sd/mustafa/mysql51VaultStar6/ .
/usr/bin/mysqld_safe --defaults-file=/mnt/mysql51VaultStar6/my.cnf --skip-grant-tables &
sleep 30
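
After the production threads finish, the local server can be shut down and the csh heredoc closed. A sketch of the tail of the batch script (assuming the mysqladmin client from the MySQL RPMs installed in the image):

# ... run the STAR production threads here ...

# shut down the local DB server before the job ends
/usr/bin/mysqladmin --defaults-file=/mnt/mysql51VaultStar6/my.cnf shutdown
EOF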

Using Open MPI - example ORCA chemistry package

Some applications are hard coded to require a certain version of Open MPI. One example is the ORCA chemistry package. These applications can be run on our system in a Shifter image.

Here's an example Dockerfile to build an image with Open MPI. It downloads the Open MPI tarball, installs it in /usr, and configures MPI to communicate via ssh. Note that ORCA requires the --disable-builtin-atomics flag.

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update && apt-get install -y build-essential apt-utils ssh gfortran

RUN cd / && wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.1.tar.bz2 \
    && tar xvjf openmpi-4.1.1.tar.bz2 && cd openmpi-4.1.1 \
    && ./configure --prefix=/usr --disable-builtin-atomics && make && make install \
    && rm -rf /openmpi-4.1.1 && rm -rf openmpi-4.1.1.tar.bz2

RUN echo "--mca plm ^slurm" > /usr/etc/openmpi-mca-params.conf

You can build and upload this image yourself, or you can use our copy at stephey/orca:3.0.

We will demonstrate how to use this image to run ORCA.

The ORCA binaries are precompiled and released in several tarballs. Typically we suggest that you install your software stack inside your image, but in this case we will not. Because the total ORCA package is large (~40 GB), we suggest that you install it on our filesystems rather than inside the image itself. To make sure your ORCA binaries aren't purged, install them on our /global/common/software filesystem. Since the default quotas on this filesystem can be small, you may need to request a quota increase first; we almost always grant these requests, so please don't hesitate to open a ticket to ask for more space.
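
For example, staging the ORCA tarballs might look like the following (a sketch; the project, username, and tarball name are placeholders for whatever you downloaded from the ORCA site):

# unpack ORCA on /global/common/software so the binaries are not purged
mkdir -p /global/common/software/<your project>/<your username>/orca
cd /global/common/software/<your project>/<your username>/orca
tar xf ~/<orca tarball>.tar.xz --strip-components=1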

Here is an example jobscript which uses the ORCA package installed on our /global/common/software filesystem inside our Open MPI Shifter image. Note that you'll need to adjust this to include your project directory and your username.

#!/bin/bash
#SBATCH -q debug
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH -C cpu
#SBATCH --image=stephey/orca:3.0

#populate the node list
scontrol show hostnames $SLURM_JOB_NODELIST > parent_salen.nodes
shifter /global/common/software/<your project>/<your username>/orca/orca parent_salen.inp

Combining PrgEnv software with Shifter

Launching executables which depend on shared libraries that live on parallel file systems can be problematic when running at scale, e.g., at high MPI concurrency. Attempts to srun these kinds of applications often encounter Slurm timeouts, since each task in the job is attempting to dlopen() the same set of ~20 shared libraries on the parallel file system, which can take a very long time. These timeouts often manifest in job logs as error messages like the following:

Tue Apr 21 20:02:50 2020: [PE_517381]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=58, pes_this_node=64, timeout=180 secs

Shifter solves this scalability problem by storing shared libraries in memory on each compute node, but it has the disadvantage of isolating the user from the full Cray programming environment on which the user may depend for compilation and linking of applications. For example, the application may depend on a shared library provided by a NERSC module, which is located in /usr/common/software, a parallel file system, rather than in /opt or /usr, which are encoded directly into the compute node's boot image.
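
A quick way to check whether an executable picks up libraries from a parallel file system is to inspect its resolved dependencies, e.g. (the executable name and paths are illustrative):

ldd ./my_app.ex | grep -E '/usr/common/software|/global/common'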

This example works around that limitation by combining the scalability of Shifter with the existing Cray programming environment that many users rely on for compilation and linking.

First, one must gather together all of the shared libraries on which a dynamically linked executable depends. This can be accomplished with the lddtree utility, which recurses through the tree of shared libraries dynamically linked to an executable. After one has downloaded lddtree.sh and put it in $PATH, one can run the following script, which finds all shared libraries that the application links against, copies them into a temporary directory, and then creates a tarball from that directory.

#!/bin/bash

# Script to gather shared libs for a parallel code compiled on a Cray XC
# system. The libs and exe can be put into a Shifter image. Depends on the
# `lddtree` utility provided by Gentoo 'pax-utils':
# https://gitweb.gentoo.org/proj/pax-utils.git/tree/lddtree.sh

if [ -z "$1" ]; then
  printf "%s\n" "Usage: "$0" <file>"
  exit 1
elif [ "$#" -ne 1 ]; then
  printf "%s\n" "ERROR: this script takes exactly 1 argument"
  exit 1
fi

if [ ! $(command -v lddtree.sh) ]; then
  printf "%s\n" "ERROR: lddtree.sh not in \$PATH"
  exit 1
fi

exe="$1"

if [ ! -f $exe ]; then
  printf "%s\n" "ERROR: file ${exe} does not exist"
  exit 1
fi

# Ensure file has dynamic symbols before we proceed. If the file is a static
# executable, we don't need this script anyway; the user can simply sbcast the
# executable to /tmp on each compute node.
dyn_sym=$(objdump -T ${exe})
if [ $? -ne 0 ]; then
  printf "%s\n" "ERROR: file has no dynamic symbols"
  exit 1
fi

target_dir=$(mktemp -d -p ${PWD} $(basename $exe)-tmpdir.XXXXXXXXXX)
tar_file="$(basename $exe).tar.bz2"
printf "%s\n" "Putting libs into this dir: ${target_dir}"

# First copy all of the shared libs which are dynamically linked to the exe.

lddtree.sh ${exe} | while read f; do
  file_full=$(echo $f | grep -Po "=> (.*)" | cut -d" " -f2)
  cp ${file_full} ${target_dir}
done

# Then find the network libs that Cray manually dlopen()s. (Just running
# lddtree on the compiled executable won't find these.) Shifter has to do this
# step manually too: see
# https://github.com/NERSC/shifter/blob/master/extra/prep_cray_mpi_libs.py#L282.
# Also copy all of their dependent shared libs.

for f in $(find /opt \
  -name '*wlm_detect*\.so' -o \
  -name '*rca*\.so' -o \
  -name '*alps*\.so' 2>/dev/null); do
  lddtree.sh $f | while read f; do
    file_full=$(echo $f | grep -Po "=> (.*)" | cut -d" " -f2)
    cp ${file_full} ${target_dir}
  done
done

tar cjf ${tar_file} -C ${target_dir} .
mv ${tar_file} ${target_dir}
printf "%s\n" "Combined shared libraries into this tar file: ${target_dir}/${tar_file}"

cat << EOF > ${target_dir}/Dockerfile
FROM opensuse/leap:15.2
ADD ${tar_file} /my_dynamic_exe
ENV PATH="/my_dynamic_exe:\${PATH}"
ENV LD_LIBRARY_PATH="/my_dynamic_exe:\${LD_LIBRARY_PATH}"
EOF
printf "%s\n\n" "Created this Dockerfile: ${target_dir}/Dockerfile"

printf "%s\n" "Now copy the following files to your workstation:"
printf "%s\n" ${target_dir}/${tar_file}
printf "%s\n" ${target_dir}/Dockerfile
printf "\n"
printf "%s\n" "Then create a Docker image and push it to the NERSC Shifter registry."
printf "%s\n" "Instructions for doing this are provided here:"
printf "%s\n" "https://docs.nersc.gov/languages/shifter/how-to-use/#using-nerscs-private-registry"
printf "%s\n" "After your image is in the NERSC Shifter registry, you can execute your code"
printf "%s\n" "using a script like the following:"
printf "\n"
printf "%s\n" "#!/bin/bash"
printf "%s\n" "#SBATCH -C <arch>"
printf "%s\n" "#SBATCH -N <num_nodes>"
printf "%s\n" "#SBATCH -t <time>"
printf "%s\n" "#SBATCH -q <queue>"
printf "%s\n" "#SBATCH -J <job_name>"
printf "%s\n" "#SBATCH -o <job_log_file>"
printf "%s\n" "#SBATCH --image=registry.services.nersc.gov/${USER}/<image_name>:<version>"
printf "\n"
printf "%s\n" "srun <args> shifter $(basename ${exe}) <inputs>"

One can test this script on a simple MPI example:

program main
  use mpi
  implicit none

  integer :: ierr, world_size, world_rank

  ! Initialize the MPI environment
  call MPI_Init(ierr)

  ! Get the number of processes
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ! Get the rank of the process
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Print off a hello world message
  print *, "Hello world from rank ", world_rank, " out of ", world_size, " processors"

  ! Finalize the MPI environment.
  call MPI_Finalize(ierr)

end program main

Then one can run the script shown above to gather the relevant shared libraries into a new directory in $PWD, and create a tarball from those shared libraries:

user@login03:~> ftn -o mpi-hello-world.ex main.f90
user@login03:~> ./shifterize.sh mpi-hello-world.ex
Putting libs into this dir: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX
Combined shared libraries into this tar file: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/mpi-hello-world.ex.tar.bz2
Created this Dockerfile: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/Dockerfile

Now copy the following files to your workstation:
/global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/mpi-hello-world.ex.tar.bz2
/global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/Dockerfile

Then create a Docker image and push it to the NERSC Shifter registry.
Instructions for doing this are provided here:
https://docs.nersc.gov/languages/shifter/how-to-use/#using-nerscs-private-registry
After your image is in the NERSC Shifter registry, you can execute your code
using a script like the following:

#!/bin/bash
#SBATCH -C <arch>
#SBATCH -N <num_nodes>
#SBATCH -t <time>
#SBATCH -q <queue>
#SBATCH -J <job_name>
#SBATCH -o <job_log_file>
#SBATCH --image=registry.services.nersc.gov/user/<image_name>:<version>

srun <args> shifter mpi-hello-world.ex <inputs>
user@login03:~>

where the new tar file mpi-hello-world.ex.tar.bz2 contains the dynamic executable and all of its dependent shared libraries:

user@login03:~> tar tjf mpi-hello-world.ex.tar.bz2 | head
./
./libalps.so
./libjob.so.0
./libmunge.so.2
./libudreg.so.0
./libintlc.so.5
./libnodeservices.so.0
./libugni.so.0
./libalpsutil.so
./libalpslli.so

and the newly generated Dockerfile is already configured to build a working Shifter image:

FROM opensuse/leap:15.2
ADD mpi-hello-world.ex.tar.bz2 /my_dynamic_exe
ENV PATH="/my_dynamic_exe:${PATH}"
ENV LD_LIBRARY_PATH="/my_dynamic_exe:${LD_LIBRARY_PATH}"

Development in Shifter using VSCode

Here's how to do remote development at NERSC, inside of Shifter containers, using Visual Studio Code on your local machine. This makes the remote VS-Code server instance at NERSC run inside the container instance, so the container's file system is fully visible to VS-Code.

VS-Code doesn't natively support non-Docker container runtimes yet, but the workaround described below works well in practice. The procedure is a bit involved; the result is worth the effort, though.

Requirements

You'll need VS-Code >= v1.64 (older versions don't support the SSH RemoteCommand setting).

Step 1

At NERSC, create a script $HOME/.local/bin/run-shifter that looks like this:

#!/bin/sh
export XDG_RUNTIME_DIR="${TMPDIR:-/tmp}/`whoami`/run"
exec shifter --image="$1"

This is necessary because VS-Code tries to access $XDG_RUNTIME_DIR. At NERSC, $XDG_RUNTIME_DIR points to /run/user/YOUR_UID by default, which is not accessible from within Shifter container instances, so we need to override the default location.
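
Remember to make the script executable, or the RemoteCommand in the next step will fail:

chmod +x "$HOME/.local/bin/run-shifter"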

Step 2

In your "$HOME/.ssh/config" on your local system, add something like

Host someimage~*
  RemoteCommand ~/.local/bin/run-shifter someorg/someimage:latest
  RequestTTY yes

Host otherimage~*
  RemoteCommand ~/.local/bin/run-shifter someorg/otherimage:latest
  RequestTTY yes

Host perlmutter*.nersc.gov
  IdentityFile ~/.ssh/nersc

Host perlmutter someimage~perlmutter otherimage~perlmutter
  HostName perlmutter.nersc.gov
  User YOUR_NERSC_USERNAME

Test whether this works by running ssh someimage~perlmutter on your local system. This should drop you into an SSH session running inside of an instance of the someorg/someimage:latest container image at NERSC.

Step 3

In your VS-Code settings on your local system, set

"remote.SSH.enableRemoteCommand": true

Step 4

Since VS-Code reuses remote server instances, the above is not sufficient to run multiple container images on the same NERSC host at the same time. To get separate (per container image) VS-Code server instances on the same host, add something like this to your VS-Code settings on your local system:

"remote.SSH.serverInstallPath": {
  "someimage~perlmutter": "~/.vscode-container/someimage",
  "otherimage~perlmutter": "~/.vscode-container/otherimage"
}

Step 5

Connect to NERSC from VS-Code running on your local system:

F1 > "Connect to Host" > "someimage~perlmutter” should now start a remote VS-Code session with the VS-Code server component running inside a Shifter container instance at NERSC. The same for "otherimage~perlmutter".

Tips and tricks

If things don't work, try "Kill server on remote" from VS-Code and reconnect.

You can also try starting over from scratch with brute force: Close the VS-Code remote connection. Then, from an external terminal, kill the remote VS-Code server instance (and everything else):

ssh perlmutter
pkill -9 node

(This will kill all Node.js processes you own on the remote host.)

Remove the ~/.vscode-server directory in your NERSC home directory.
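
For example, from an external terminal (this also removes the per-image server directories from Step 4, if you used them):

ssh perlmutter
rm -rf ~/.vscode-server ~/.vscode-container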