Example Dockerfiles for Shifter

The easiest way to run your software in a Shifter container is to create a Docker image, push it to Docker Hub, then pull it to the Shifter ImageGateway, which creates the corresponding Shifter image.
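
In outline, the workflow looks like the following sketch. The image name, tag, and account (myuser/myimage:v1) are placeholders, and the docker: prefix on the pull is one common form of the shifterimg syntax:

# Build the image locally and push it to Docker Hub
docker build -t myuser/myimage:v1 .
docker push myuser/myimage:v1

# On Cori, ask the ImageGateway to pull and convert the image
shifterimg pull docker:myuser/myimage:v1

# Run a command inside the resulting Shifter container
shifter --image=myuser/myimage:v1 /bin/bash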

HEP/HENP Software Stacks

DESI

DESI jobs use community-standard, publicly available software, independent of the Linux distribution flavor. This example is built with an Ubuntu base.

This Dockerfile is also available:

  * GitHub
  * Docker Hub - mmustafa/desi
  * Shifter on Cori under mmustafa/desi:v0

The build proceeds in the following steps:

  1. Start with an Ubuntu base image
  2. Install all the needed standard packages from the Ubuntu repositories
  3. Compile Astrometry.net, Tractor, TMV and GalSim
  4. Set up the needed environment variables along the way

The Dockerfile used to build the image:

# Build DESI software environment on top of an Ubuntu base
FROM ubuntu:16.04
MAINTAINER Mustafa Mustafa <mmustafa@lbl.gov>

# install astrometry and tractor dependencies
RUN apt-get update && \
    apt-get install -y wget make git python python-dev python-matplotlib \
                       gcc swig python-numpy libgsl2 gsl-bin pkg-config \
                       zlib1g-dev libcairo2-dev libnetpbm10-dev netpbm \
                       libpng12-dev libjpeg-dev python-pyfits \
                       libbz2-dev libcfitsio3-dev python-photutils python-pip && \
    pip install fitsio

ENV PYTHONPATH /desi_software/astrometry_net/lib/python:$PYTHONPATH
ENV PATH /desi_software/astrometry_net/lib/python/astrometry/util:$PATH
ENV PATH /desi_software/astrometry_net/lib/python/astrometry/blind:$PATH

# ------- install astrometry
RUN mkdir -p /desi_software/astrometry_net && \
         git clone https://github.com/dstndstn/astrometry.net.git && \
         cd astrometry.net && \
         make install INSTALL_DIR=/desi_software/astrometry_net && \
         cd / && \
         rm -rf astrometry.net

# ------- install tractor
RUN mkdir -p /desi_software/tractor && \
         git clone https://github.com/dstndstn/tractor.git && \
         cd tractor && \
         make && \
         python setup.py install --prefix=/desi_software/tractor/

ENV PYTHONPATH /desi_software/tractor/lib/python2.7/site-packages:$PYTHONPATH

# ------- install missing GalSim dependencies (others have been installed above)
RUN apt-get install -y python-future python-yaml python-pandas scons fftw3-dev libboost-all-dev

# ------- install TMV
RUN wget https://github.com/rmjarvis/tmv/archive/v0.73.tar.gz -O tmv.tar.gz && \
         gunzip tmv.tar.gz && \
         mkdir tmv && tar xf tmv.tar -C tmv --strip-components 1 && \
         cd tmv && \
         scons && \
         scons install && \
         cd / && \
         rm -rf tmv.tar tmv

# ------- install GalSim
RUN wget https://github.com/GalSim-developers/GalSim/archive/v1.4.2.tar.gz -O GalSim.tar.gz && \
         gunzip GalSim.tar.gz && \
         mkdir GalSim && tar xf GalSim.tar -C GalSim --strip-components 1 && \
         cd GalSim && \
         scons && \
         scons install && \
         cd / && \
         rm -rf GalSim.tar GalSim
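
After the image builds, pushing it to Docker Hub and pulling it into Shifter follows the usual pattern. A minimal sketch, using the tags published above (the final import is just an illustrative smoke test, relying on the PYTHONPATH set in the image):

docker build -t mmustafa/desi:v0 .
docker push mmustafa/desi:v0

# on Cori
shifterimg pull docker:mmustafa/desi:v0
shifter --image=mmustafa/desi:v0 python -c "import tractor"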

STAR

The STAR experiment software stack is typically built and run on Scientific Linux.

There are two ways to build the STAR image: compile all the stack components one by one, or install pre-compiled libraries by copying them into the image. We chose the latter in this example.

We use a publicly available SL6.4 Docker base image, install the needed RPMs, extract pre-compiled binary tarballs into the image, and finally install some software needed to run STAR jobs at Cori. The latest image is available at Cori (mmustafa/sl64_sl16d:v1_pdsf6).

# Example Dockerfile to show how to build the STAR
# environment image from binary tarballs. Not necessarily
# the one currently used for the STAR docker image build
FROM ringo/scientific:6.4
MAINTAINER Mustafa Mustafa <mmustafa@lbl.gov>

# RPMs
RUN yum -y install libxml2 tcsh libXpm.i686 libc.i686 libXext.i686 \
                   libXrender.i686 libstdc++.i686 fontconfig.i686 \
                   zlib.i686 libgfortran.i686 libSM.i686 mysql-libs.i686 \
                   gcc-c++ gcc-gfortran glibc-devel.i686 xorg-x11-xauth \
                   wget make libxml2.so.2 gdb libXtst.{i686,x86_64} \
                   libXt.{i686,x86_64} glibc glibc-devel gcc-c++

# Dev Tools
RUN wget -O /etc/yum.repos.d/slc6-devtoolset.repo \
     https://linuxsoft.cern.ch/cern/devtoolset/slc6-devtoolset.repo && \
 yum -y install devtoolset-2-toolchain
COPY enable_scl /usr/local/star/group/templates/

# untar STAR OPT
COPY optstar.sl64_gcc482.tar.gz /opt/star/
COPY installstar /
RUN python installstar SL16c && \
 rm -f installstar &&         \
 rm -f optstar.sl64_gcc482.tar.gz

# untar ROOT
COPY rootdeb-5.34.30.sl64_gcc482.tar.gz /usr/local/star/
COPY installstar /
RUN python installstar SL16c && \
 rm -f installstar && \
 rm -f rootdeb-5.34.30.sl64_gcc482.tar.gz

# untar STAR library
COPY SL16d.tar.gz /usr/local/star/packages/
COPY installstar /
RUN python installstar SL16d && \
 rm -f installstar && \
 rm -f /usr/local/star/packages/SL16d.tar.gz

# DB load balancer
COPY dbLoadBalancerLocalConfig_generic.xml /usr/local/star/packages/SL16d/StDb/servers/

# production pipeline utility macros
COPY Hadd.C /usr/local/star/packages/SL16d/StRoot/macros/
COPY lMuDst.C /usr/local/star/packages/SL16d/StRoot/macros/
COPY checkProduction.C /usr/local/star/packages/SL16d/StRoot/macros/

# Special RPMs for production at Cori; Open MPI, mysql-server
RUN yum -y install libibverbs.x86_64 environment-modules infinipath-psm-devel.x86_64 \
 librdmacm.x86_64 opensm.x86_64 papi.x86_64 && \
 wget https://mirror.centos.org/centos/6.8/os/x86_64/Packages/openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 rpm -i openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 rm -f openmpi-1.10-1.10.2-2.el6.x86_64.rpm && \
 yum -y install glibc-devel devtoolset-2-libstdc++-devel.i686 && \
 yum -y install mysql-server mysql

# add open mpi library to LD Path
ENV LD_LIBRARY_PATH /usr/lib64/openmpi-1.10/lib/
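
On Cori, the published image can then be pulled and entered interactively; for example:

shifterimg pull docker:mmustafa/sl64_sl16d:v1_pdsf6
shifter --image=mmustafa/sl64_sl16d:v1_pdsf6 /bin/csh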

STAR MySQL DB

STAR jobs need access to a read-only MySQL server which provides conditions and calibration tables.

We have found that job scalability is not ideal if the MySQL server is outside Cori's network. Our solution was to run a local MySQL server on each node; the server serves all the threads running on that node (e.g. 32 core threads). We chose to overcommit the cores, i.e. 32 production threads + 1 MySQL server running on 32 cores.

The DB payload (~30 GB) resides on Lustre. We have found that a server accessing the payload directly from the Lustre FS performs poorly for this IO pattern: it takes more than 30 minutes to cache the first few requests. Here the XFS image mount capability (perNodeCache) came in handy. As soon as the job starts we copy the payload from the Lustre FS into an XFS file mount, then point the DB server at this copy. Copying the 30 GB payload takes 1-3 minutes. The performance gain was striking: caching time dropped from 30 minutes to less than 1 minute, and it gave us trivial scalability in the number of concurrent jobs.

Below are the relevant lines from our Slurm batch file.

Request a perNodeCache of 50 GB and mount it at /mnt in the Shifter image:

#!/bin/bash
#SBATCH --image=mmustafa/sl64_sl16d:v1_pdsf6
#SBATCH --volume=/global/cscratch1/sd/mustafa/:/mnt:perNodeCache=size=50G

Launch the Shifter container:

shifter /bin/csh <<EOF

Copy the payload to /mnt, then launch the DB server:

#Prepare DB...
cd /mnt
cp -r -p /global/cscratch1/sd/mustafa/mysql51VaultStar6/ .
/usr/bin/mysqld_safe --defaults-file=/mnt/mysql51VaultStar6/my.cnf --skip-grant-tables &
sleep 30
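
Putting these pieces together, a complete batch script would look roughly like the following sketch (note the closing EOF, which terminates the heredoc; the final production launch line is a site-specific placeholder):

#!/bin/bash
#SBATCH --image=mmustafa/sl64_sl16d:v1_pdsf6
#SBATCH --volume=/global/cscratch1/sd/mustafa/:/mnt:perNodeCache=size=50G

shifter /bin/csh <<EOF
# Prepare DB: copy the payload into the per-node XFS cache
cd /mnt
cp -r -p /global/cscratch1/sd/mustafa/mysql51VaultStar6/ .
/usr/bin/mysqld_safe --defaults-file=/mnt/mysql51VaultStar6/my.cnf --skip-grant-tables &
sleep 30

# ... launch the STAR production threads here ...
EOF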

Using Non-system MPI

Some applications are hard-coded to require a certain version of Open MPI (e.g. ORCA). These applications can be run on our system in a Shifter image. However, please keep in mind that this will not perform as well as the system MPI, because it must use ssh to communicate, which is not as fast as the native libraries. Wherever possible, we recommend recompiling your executable to use the system libraries.

Here's an example Dockerfile to build an image with Open MPI. It downloads the Open MPI tarball, installs it in /usr, and configures Open MPI to communicate via ssh by disabling its native Slurm launcher (the --mca plm ^slurm setting below).

FROM ubuntu:16.10

RUN apt-get update && apt-get install -y build-essential apt-utils ssh wget

RUN cd / && wget https://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.1.tar.bz2 \
    && tar xvjf openmpi-2.1.1.tar.bz2 && cd openmpi-2.1.1 \
    && ./configure --prefix=/usr && make && make install \
    && cd / && rm -rf /openmpi-2.1.1 /openmpi-2.1.1.tar.bz2

RUN echo "--mca plm ^slurm" > /usr/etc/openmpi-mca-params.conf

Build and upload this image to NERSC. Below is an example script for running ORCA (a chemistry package). In this particular case, the ORCA binaries are precompiled and released as a tarball. Since this tarball is nearly 20 GB, we chose to install it onto the scratch directory instead of into the image. For smaller packages, we recommend installing them into the Shifter image.

#!/bin/bash
#SBATCH -q debug
#SBATCH -N 2
#SBATCH -t 00:10:00
#SBATCH -C haswell
#SBATCH -L SCRATCH
#SBATCH --image=lgerhardt/openmpi_2.1.1:v1
# (substitute your own Open MPI image above)

# populate the node list
scontrol show hostnames $SLURM_JOB_NODELIST > your_input_file_name.nodes
shifter $SCRATCH/orca_4_0_0_2_linux_x86-64/orca <your_input_file_name>.inp
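
Before a full ORCA run, one can sanity-check that the container picks up the Open MPI we built; a quick illustrative check (not part of the original recipe):

shifter --image=lgerhardt/openmpi_2.1.1:v1 which mpirun    # expect /usr/bin/mpirun
shifter --image=lgerhardt/openmpi_2.1.1:v1 mpirun --version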

Combining PrgEnv software with Shifter

Launching executables which depend on shared libraries that live on parallel file systems can be problematic when running at scale, e.g., at high MPI concurrency. Attempts to srun these kinds of applications often encounter Slurm timeouts, since each task in the job is attempting to dlopen() the same set of ~20 shared libraries on the parallel file system, which can take a very long time. These timeouts often manifest in job logs as error messages like the following:

Tue Apr 21 20:02:50 2020: [PE_517381]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=58, pes_this_node=64, timeout=180 secs

Shifter solves this scalability problem by storing shared libraries in memory on each compute node, but has the disadvantage of isolating the user from the full Cray programming environment on which the user may depend for compilation and linking of applications. E.g., the application may depend on a shared library provided by a NERSC module, which is located in /usr/common/software, a parallel file system, rather than in /opt or /usr, which are encoded directly into the compute node's boot image.

This example works around this limitation by combining the scalability of Shifter with the existing Cray programming environment that many users rely on for compilation and linking.

First, one must gather all of the shared libraries on which a dynamically linked executable depends. This can be accomplished with the lddtree utility, which recurses through the tree of shared libraries dynamically linked to an executable. After obtaining lddtree.sh and putting it in $PATH, one can run the following script, which finds all shared libraries that the application is linked to, copies them into a temporary dir, and then creates a tarball from that temporary dir.

#!/bin/bash

# Script to gather shared libs for a parallel code compiled on a Cray XC
# system. The libs and exe can be put into a Shifter image. Depends on the
# `lddtree` utility provided by Gentoo 'pax-utils':
# https://gitweb.gentoo.org/proj/pax-utils.git/tree/lddtree.sh

if [ -z "$1" ]; then
  printf "%s\n" "Usage: "$0" <file>"
  exit 1
elif [ "$#" -ne 1 ]; then
  printf "%s\n" "ERROR: this script takes exactly 1 argument"
  exit 1
fi

if ! command -v lddtree.sh > /dev/null; then
  printf "%s\n" "ERROR: lddtree.sh not in \$PATH"
  exit 1
fi

exe="$1"

if [ ! -f "${exe}" ]; then
  printf "%s\n" "ERROR: file ${exe} does not exist"
  exit 1
fi

# Ensure file has dynamic symbols before we proceed. If the file is a static
# executable, we don't need this script anyway; the user can simply sbcast the
# executable to /tmp on each compute node.
if ! objdump -T "${exe}" > /dev/null 2>&1; then
  printf "%s\n" "ERROR: file has no dynamic symbols"
  exit 1
fi

target_dir=$(mktemp -d -p ${PWD} $(basename $exe)-tmpdir.XXXXXXXXXX)
tar_file="$(basename $exe).tar.bz2"
printf "%s\n" "Putting libs into this dir: ${target_dir}"

# First copy all of the shared libs which are dynamically linked to the exe.

lddtree.sh ${exe} | while read f; do
  file_full=$(echo $f | grep -Po "=> (.*)" | cut -d" " -f2)
  cp ${file_full} ${target_dir}
done

# Then find the network libs that Cray manually dlopen()s. (Just running
# lddtree on the compiled executable won't find these.) Shifter has to do this
# step manually too: see
# https://github.com/NERSC/shifter/blob/master/extra/prep_cray_mpi_libs.py#L282.
# Also copy all of their dependent shared libs.

for f in $(find /opt \
  -name '*wlm_detect*\.so' -o \
  -name '*rca*\.so' -o \
  -name '*alps*\.so' 2>/dev/null); do
  lddtree.sh $f | while read f; do
    file_full=$(echo $f | grep -Po "=> (.*)" | cut -d" " -f2)
    cp ${file_full} ${target_dir}
  done
done

tar cjf ${tar_file} -C ${target_dir} .
mv ${tar_file} ${target_dir}
printf "%s\n" "Combined shared libraries into this tar file: ${target_dir}/${tar_file}"

cat << EOF > ${target_dir}/Dockerfile
FROM opensuse/leap:15.2
ADD ${tar_file} /my_dynamic_exe
ENV PATH="/my_dynamic_exe:\${PATH}"
ENV LD_LIBRARY_PATH="/my_dynamic_exe:\${LD_LIBRARY_PATH}"
EOF
printf "%s\n\n" "Created this Dockerfile: ${target_dir}/Dockerfile"

printf "%s\n" "Now copy the following files to your workstation:"
printf "%s\n" ${target_dir}/${tar_file}
printf "%s\n" ${target_dir}/Dockerfile
printf "\n"
printf "%s\n" "Then create a Docker image and push it to the NERSC Shifter registry."
printf "%s\n" "Instructions for doing this are provided here:"
printf "%s\n" "https://docs.nersc.gov/languages/shifter/how-to-use/#using-nerscs-private-registry"
printf "%s\n" "After your image is in the NERSC Shifter registry, you can execute your code"
printf "%s\n" "using a script like the following:"
printf "\n"
printf "%s\n" "#!/bin/bash"
printf "%s\n" "#SBATCH -C <arch>"
printf "%s\n" "#SBATCH -N <num_nodes>"
printf "%s\n" "#SBATCH -t <time>"
printf "%s\n" "#SBATCH -q <queue>"
printf "%s\n" "#SBATCH -J <job_name>"
printf "%s\n" "#SBATCH -o <job_log_file>"
printf "%s\n" "#SBATCH --image=registry.services.nersc.gov/${USER}/<image_name>:<version>"
printf "\n"
printf "%s\n" "srun <args> shifter $(basename ${exe}) <inputs>"

One can test this script on a simple MPI example:

program main
  use mpi
  implicit none

  integer :: ierr, world_size, world_rank

  ! Initialize the MPI environment
  call MPI_Init(ierr)

  ! Get the number of processes
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ! Get the rank of the process
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! Print off a hello world message
  print *, "Hello world from rank ", world_rank, " out of ", world_size, " processors"

  ! Finalize the MPI environment.
  call MPI_Finalize(ierr)

end program main

Then one can run the script shown above to gather the relevant shared libraries into a new directory in $PWD, and create a tarball from those shared libraries:

user@cori03:~> ftn -o mpi-hello-world.ex main.f90
user@cori03:~> ./shifterize.sh mpi-hello-world.ex
Putting libs into this dir: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX
Combined shared libraries into this tar file: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/mpi-hello-world.ex.tar.bz2
Created this Dockerfile: /global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/Dockerfile

Now copy the following files to your workstation:
/global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/mpi-hello-world.ex.tar.bz2
/global/homes/u/user/mpi-hello-world.ex-tmpdir.GbITsvCZZX/Dockerfile

Then create a Docker image and push it to the NERSC Shifter registry.
Instructions for doing this are provided here:
https://docs.nersc.gov/languages/shifter/how-to-use/#using-nerscs-private-registry
After your image is in the NERSC Shifter registry, you can execute your code
using a script like the following:

#!/bin/bash
#SBATCH -C <arch>
#SBATCH -N <num_nodes>
#SBATCH -t <time>
#SBATCH -q <queue>
#SBATCH -J <job_name>
#SBATCH -o <job_log_file>
#SBATCH --image=registry.services.nersc.gov/user/<image_name>:<version>

srun <args> shifter mpi-hello-world.ex <inputs>
user@cori03:~>

where the new file mpi-hello-world.ex.tar.bz2 contains the contents of the dynamic executable and all of its dependent shared libraries:

user@cori03:~> tar tjf mpi-hello-world.ex-tmpdir.GbITsvCZZX/mpi-hello-world.ex.tar.bz2 | head
./
./libalps.so
./libjob.so.0
./libmunge.so.2
./libudreg.so.0
./libintlc.so.5
./libnodeservices.so.0
./libugni.so.0
./libalpsutil.so
./libalpslli.so

and the newly generated Dockerfile is already configured to build a working Shifter image:

FROM opensuse/leap:15.2
ADD mpi-hello-world.ex.tar.bz2 /my_dynamic_exe
ENV PATH="/my_dynamic_exe:${PATH}"
ENV LD_LIBRARY_PATH="/my_dynamic_exe:${LD_LIBRARY_PATH}"
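
Once the image is pushed to the NERSC registry under the name used in the generated batch script (here "user" stands in for your username and v1 is a hypothetical tag), pulling and running it at scale is then a sketch like:

shifterimg pull registry.services.nersc.gov/user/mpi-hello-world:v1

# inside a batch job that requested the image via #SBATCH --image=...
srun -n 64 shifter mpi-hello-world.ex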