Podman at NERSC¶

Podman (Pod Manager) is an open-source, OCI-compliant container framework that is under active development by Red Hat. In many ways Podman can be treated as a drop-in replacement for Docker.

Since "out of the box" Podman currently lacks several key capabilities for HPC users, NERSC has been working with Red Hat to adapt Podman for HPC use-cases and has developed an add-on called podman-hpc. podman-hpc is now available to all users on Perlmutter. podman-hpc enables improved performance, especially at large scale, and makes using common HPC tools like Cray MPI and NVIDIA CUDA capabilities easy.

Users may be interested in using Podman Desktop on their local machines. It is a free alternative to Docker Desktop.

Why `podman-hpc`?¶

Users who are comfortable with Shifter, the current NERSC production container runtime, may wonder what advantages Podman offers over Shifter. Here are a few:

- podman-hpc doesn't impose many of the restrictions that Shifter does:
  - No container modules will be loaded by default.
  - Most environment variables will not be automatically propagated into the container.
  - Applications that require root permission inside the container will be allowed to run. This is securely enabled via Podman's rootless mode.
  - Users can modify the contents of their containers at runtime.
- Users can build images directly on Perlmutter.
- Users can choose to run these images directly via podman-hpc without uploading to an intermediate repository.
- Podman is an OCI-compliant framework (like Docker). Users who are familiar with Docker will find that Podman has very similar syntax and can often be used as a drop-in replacement for Docker. Users may also find that this makes their workflow more portable.
- podman-hpc is a transparent wrapper around Podman. Users will find that they can pass standard unprivileged Podman commands to podman-hpc.
- Podman is a widely used tool that is not specific to NERSC.

How to use `podman-hpc`¶

podman-hpc is available on Perlmutter.

To see all available commands, users can issue the podman-hpc --help command:

elvis@nid001036:~> podman-hpc --help
Manage pods, containers and images ... on HPC!

Description:
  The podman-hpc utility is a wrapper script around the Podman container
  engine. It provides additional subcommands for ease of use and
  configuration of Podman in a multi-node, multi-user high performance
  computing environment.

Usage: podman-hpc [options] COMMAND [ARGS]...

Options:
  --additional-stores TEXT  Specify other storage locations
  --squash-dir TEXT         Specify alternate squash directory location
  --help                    Show this message and exit.

Commands:
  infohpc     Dump configuration information for podman_hpc.
  migrate     Migrate an image to squashed.
  pull        Pulls an image to a local repository and makes a squashed...
  rmsqi       Removes a squashed image.
  shared-run  Launch a single container and exec many threads in it This is...
...

Users can issue the podman-hpc images command to see any images that they have built or pulled.

elvis@nid001036:~> podman-hpc images
REPOSITORY                                     TAG         IMAGE ID      CREATED       SIZE        R/O
elvis@nid001036:~>

This should show there are no images yet.

Building images¶

Users should generate a Containerfile or Dockerfile. (A Containerfile is a more general form of a Dockerfile; they follow the same syntax and can usually be used interchangeably.) From the directory that contains this file, users can build and tag the image via a command like:

podman-hpc build -t elvis:test .
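As an illustration, the sketch below writes a minimal Containerfile and then builds it. The base image, the installed package, and the elvis-test directory name are assumptions chosen for the example rather than NERSC recommendations, and the build step only runs where podman-hpc is installed.

```shell
# Create a build context containing a minimal Containerfile (hypothetical contents).
mkdir -p elvis-test
cat > elvis-test/Containerfile <<'EOF'
FROM docker.io/library/ubuntu:22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
CMD ["python3", "--version"]
EOF

# Build and tag the image only if podman-hpc is available on this machine.
if command -v podman-hpc >/dev/null 2>&1; then
    podman-hpc build -t elvis:test elvis-test
else
    echo "podman-hpc not found; skipping build"
fi
```

Passing the directory name as the build context has the same effect as changing into that directory and running the build with a trailing dot.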

podman-hpc images and caches are stored in local storage

podman-hpc build artifacts and cache files will be stored on the login node where the user performed the build. If a user logs onto a different node, they will not have access to these cached files and will need to build from scratch. At the moment we have no purge policy for the local image build storage, although users can likely expect one in the future.

If a user would like their image to be usable in a job, they will need to migrate it with the command:

podman-hpc migrate elvis:test

This will convert the image into a suitable squashfile format for podman-hpc. These images can be directly accessed and used in a job. If you migrate your image, you will notice that there are two kinds of images listed by podman-hpc images:

elvis@perlmutter:login01:/> podman-hpc images
REPOSITORY                               TAG         IMAGE ID      CREATED         SIZE        R/O
localhost/elvis                          test        f55898589b7a  11 seconds ago  80.3 MB     false
elvis@perlmutter:login01:/> podman-hpc migrate elvis:test
elvis@perlmutter:login01:/> podman-hpc images
REPOSITORY                               TAG         IMAGE ID      CREATED         SIZE        R/O
localhost/elvis                          test        f55898589b7a  45 seconds ago  80.3 MB     false
localhost/elvis                          test        f55898589b7a  45 seconds ago  80.3 MB     true
elvis@perlmutter:login01:/>

The migrated squashfile is listed as read-only (R/O) in this display. However, you will be able to modify the image at runtime since podman-hpc adds an overlay filesystem on top of the squashed image.
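One way to check whether an image already has a migrated copy is to look for a row whose R/O column reads true. The helper below is a minimal sketch; the function name has_squashed_image is hypothetical, and it assumes the column layout shown in the listing above (repository first, tag second, R/O last).

```shell
# Succeed if the image listing on stdin contains a read-only (squashed) row
# for the given repository and tag. Assumes R/O is the last column.
has_squashed_image() {
    awk -v repo="$1" -v tag="$2" '
        NR > 1 && $1 == repo && $2 == tag && $NF == "true" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Migrate only when no squashed copy exists yet (runs only where podman-hpc is installed).
if command -v podman-hpc >/dev/null 2>&1; then
    if ! podman-hpc images | has_squashed_image localhost/elvis test; then
        podman-hpc migrate elvis:test
    fi
fi
```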

Pulling images¶

Users can pull public images via podman-hpc with no additional configuration.

elvis@perlmutter:login01:/> podman-hpc pull ubuntu:22.04
Trying to pull docker.io/library/ubuntu:22.04...
Getting image source signatures
Copying blob 2ab09b027e7f skipped: already exists  
Copying config 08d22c0ceb done  
Writing manifest to image destination
Storing signatures
08d22c0ceb150ddeb2237c5fa3129c0183f3cc6f5eeb2e7aa4016da3ad02140a
INFO: Migrating image to /pscratch/sd/e/elvis/storage
elvis@perlmutter:login01:/>

Images that a user pulls from a registry will be automatically converted into a suitable squashfile format for podman-hpc. These images can be directly accessed and used in a job.

If a user needs to pull an image in a private registry, they must first log in to their registry via podman-hpc. In this case we are logging into Dockerhub.

elvis@nid001036:~> podman-hpc login docker.io
Username: elvis
Password: 
Login Succeeded!

The user can then pull the image

elvis@nid001036:~> podman-hpc pull elvis/hello-world:1.0
Trying to pull docker.io/elvis/hello-world:1.0...
Getting image source signatures
Copying blob sha256:7b1a6ab2e44dbac178598dabe7cff59bd67233dba0b27e4fbd1f9d4b3c877a54
Copying config sha256:0849b79544d682e6149e46977033706b17075be384215ef8a69b5a37037c7231
Writing manifest to image destination
Storing signatures
0849b79544d682e6149e46977033706b17075be384215ef8a69b5a37037c7231
elvis@nid001036:~> podman-hpc images
REPOSITORY                              TAG         IMAGE ID      CREATED        SIZE        R/O
docker.io/elvis/hello-world             1.0         0849b79544d6  16 months ago  75.2 MB     true

Using `podman-hpc` as a container runtime¶

Unlike Shifter, the Slurm --image flag is not required

podman-hpc can be used on a login node or in a job. Unlike Shifter, no Slurm flags are needed to use a podman-hpc image in a job. The only requirement is that the user has either pulled an image (which is migrated automatically) or built and migrated one.

Users can use podman-hpc as a container runtime. Early benchmarking has shown that in many cases, performance is comparable to Shifter and bare metal.

Our goal has been to design podman-hpc so that standard Podman commands still work. Please see the Podman documentation for podman run for a full list of its capabilities.

Users can use podman-hpc in both interactive and batch jobs without requesting any special resources. They only need to have previously built or pulled an image via podman-hpc. Users may choose to run a container in interactive mode, as in this example:

elvis@nid001036:~> podman-hpc run --rm -it registry.nersc.gov/library/nersc/mpi4py:3.1.3 /bin/bash
root@d23b3ea141ed:/opt# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@d23b3ea141ed:/opt# exit
exit
elvis@nid001036:~>

Here we see that the container is using the Ubuntu 20.04 (Focal Fossa) OS.

Users may also choose to run a container in standard run mode:

elvis@nid001036:~> podman-hpc run --rm registry.nersc.gov/library/nersc/mpi4py:3.1.3 echo $SLURM_JOB_ID
198507
elvis@nid001036:~>

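Here we print the Slurm job ID using a command run inside the container. Because $SLURM_JOB_ID is unquoted, the host shell expands it before the container starts.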
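Quoting determines where a variable such as SLURM_JOB_ID is expanded: in the host shell, or in the shell inside the container. The sketch below makes that concrete without calling podman-hpc. It uses env -i with /bin/bash as a stand-in for a container that starts with a clean environment, which is an assumption made so that the example runs anywhere.

```shell
# Use the real job ID if Slurm set one; otherwise fall back to a demo value.
: "${SLURM_JOB_ID:=198507}"
export SLURM_JOB_ID

# Double quotes: the host shell substitutes the value before the child shell starts.
host_expanded=$(env -i /bin/bash -c "echo $SLURM_JOB_ID")

# Single quotes: the child shell does the expansion and finds nothing in its clean environment.
child_expanded=$(env -i /bin/bash -c 'echo $SLURM_JOB_ID')

# Passing the variable explicitly makes it visible to the child shell.
passed=$(env -i SLURM_JOB_ID="$SLURM_JOB_ID" /bin/bash -c 'echo $SLURM_JOB_ID')

echo "host=[$host_expanded] child=[$child_expanded] passed=[$passed]"
```

With podman-hpc, the analogue of the last form would be to pass the variable with -e, for example podman-hpc run --rm -e SLURM_JOB_ID IMAGE bash -c 'echo $SLURM_JOB_ID', where IMAGE is a placeholder for an image that provides bash.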

Unlike Shifter, podman-hpc does not enable any MPI or GPU capability by default. Users must request the additional utilities they need.

| Module Name | Function |
|---|---|
| --mpi | Uses optimized Cray MPI |
| --cuda-mpi | Uses CUDA-aware optimized Cray MPI |
| --gpu | Enable NVIDIA GPU |
| --cvmfs | Enable the CVMFS filesystem |
| --openmpi-pmi2 | Helper module for PMI2 support in OpenMPI |
| --openmpi-pmix | Helper module for PMIx support in OpenMPI |

Note that the --cuda-mpi flag must be used together with the --gpu flag.
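The pairing rule for --cuda-mpi and --gpu can be enforced in a small wrapper. The function below is a sketch only; normalize_hpc_flags is a hypothetical helper and not part of podman-hpc. It echoes the module flags it is given and appends --gpu when --cuda-mpi is requested without it.

```shell
# Print the given module flags, adding --gpu if --cuda-mpi was requested without it.
normalize_hpc_flags() {
    want_cuda_mpi=0
    have_gpu=0
    for flag in "$@"; do
        case "$flag" in
            --cuda-mpi) want_cuda_mpi=1 ;;
            --gpu)      have_gpu=1 ;;
        esac
    done
    if [ "$want_cuda_mpi" -eq 1 ] && [ "$have_gpu" -eq 0 ]; then
        set -- "$@" --gpu
    fi
    printf '%s\n' "$*"
}

normalize_hpc_flags --cuda-mpi
```

The result could then be spliced into a run command, for example podman-hpc run --rm $(normalize_hpc_flags --cuda-mpi) followed by the image and the command to run.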

When using the openmpi helper modules, OpenMPI must be installed in the user image. The module helps create the correct configuration, but doesn't provide OpenMPI itself.

Please see the Known Issues section below for advice about using the podman-hpc modules.

More modules will be added soon.

Unlike Shifter, no capabilities are loaded by default

Shifter users may be aware that MPICH and GPU capabilities are loaded by default. In podman-hpc, we take the opposite (and more OCI-compliant) approach, in which users must explicitly request all capabilities they need.

Using Cray MPICH in `podman-hpc`¶

Using Cray MPICH in podman-hpc is very similar to what we describe in our MPI in Shifter documentation. To be able to use Cray MPICH at runtime, users must first include a standard implementation of MPICH in their image. If users add the podman-hpc --mpi flag, our current Cray MPICH will be inserted at runtime, replacing the MPICH in their container.

Here is an example of running an MPI-enabled task in podman-hpc in an interactive job:

elvis@nid001037:~> srun -n 2 podman-hpc run --rm --mpi registry.nersc.gov/library/nersc/mpi4py:3.1.3 python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001037.
Hello, World! I am process 1 of 2 on nid001041.
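The same command can be placed in a batch script. The sketch below assumes the image has already been pulled with podman-hpc; the #SBATCH values (node count, time limit, QOS, and constraint) are illustrative assumptions and should be adapted to your allocation. The guard only exists so that the script exits cleanly if it is run outside a Slurm environment.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --qos=debug          # assumption: adjust to your QOS
#SBATCH --constraint=cpu     # assumption: adjust to the node type you need

# Image used in the interactive example above; it must already be pulled via podman-hpc.
IMAGE=registry.nersc.gov/library/nersc/mpi4py:3.1.3

if command -v srun >/dev/null 2>&1 && command -v podman-hpc >/dev/null 2>&1; then
    # One MPI rank per node, with Cray MPICH enabled through the --mpi module.
    srun podman-hpc run --rm --mpi "$IMAGE" python3 -m mpi4py.bench helloworld
else
    echo "srun or podman-hpc not available; submit this script with sbatch on Perlmutter" >&2
fi
```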

Using OpenMPI in `podman-hpc`¶

Our current model for using podman-hpc requires that the user container has its own build of OpenMPI. podman-hpc provides two helper modules, one for pmi2 and one for pmix, to streamline the required settings, but doesn't insert OpenMPI itself.

The OpenMPI build inside the user container should be built with Slurm support, i.e. using the --with-slurm flag. It can be built with either pmi2 or pmix support. The PMI in the container will interface with Slurm PMI outside the container. This will enable the appropriate multinode MPI wireup.

Users will need to instruct Slurm to use either pmi2 or pmix, e.g. srun --mpi=pmix, and this in turn will connect to the PMI inside the container.

elvis@nid001005:~> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001005.
Hello, World! I am process 1 of 2 on nid001005.

Using NVIDIA GPUs in `podman-hpc`¶

Accessing NVIDIA GPUs in a container requires that the NVIDIA CUDA user drivers and other utilities are present in the container at runtime. If users add the podman-hpc --gpu flag, this will ensure all required utilities are enabled at runtime.

Here is an example of running a GPU-enabled task in podman-hpc in an interactive job:

elvis@nid001037:~> srun -n 2 -G 2 podman-hpc run --rm --gpu registry.nersc.gov/library/nersc/mpi4py:3.1.3 nvidia-smi
Sat Jan 14 01:16:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
Sat Jan 14 01:16:06 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   27C    P0    52W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Network options with the container¶

If you would like your container to share its network with the host, try running with the --net host option:

podman-hpc run --net host elvis:test ...

The underlying default network tool was changed from slirp4netns to pasta with a recent update of Podman. Users have reported performance degradation in some cases, including with wget and git for large files. If you experience slow data transfers, please use the --net slirp4netns option to set the network tool back to the previous default:

podman-hpc build --net slirp4netns -t elvis:1.1 ...

Graphics forwarding in `podman-hpc`¶

Here is an example of setting up graphics forwarding with an application running inside a podman-hpc container. In this example we pass podman-hpc the DISPLAY environment variable and set the container to share the host network.

podman-hpc run -it --rm --gpu -e DISPLAY -v /tmp:/tmp --net host --volume="$HOME/.Xauthority:/root/.Xauthority:rw" -v $(pwd):/workspace  nvcr.io/hpc/vmd:1.9.4a44

Container images can be shared between users by storing them in a common directory accessible to all intended users.

Creating Shared Images¶

First, identify a shared directory location accessible to all users. This can be either Perlmutter Scratch (note the purge policy) or CFS. Use the --squash-dir option with podman-hpc to pull or migrate images to this shared location.

Before pulling an image to the shared directory, make sure the shared directory is owned by the desired Unix group and has the setgid bit set, e.g.

ls -ld /global/cfs/cdirs/<project>/shared_images
drwxrwx--x 127 <collab_account> <project> 16384 Jul 17 23:14 /global/cfs/cdirs/<project>/shared_images

chmod g+s /global/cfs/cdirs/<project>/shared_images

ls -ld /global/cfs/cdirs/<project>/shared_images
drwxrws--x 127 <collab_account> <project> 16384 Jul 17 23:14 /global/cfs/cdirs/<project>/shared_images

To pull an image from a registry:

podman-hpc --squash-dir /global/cfs/cdirs/<project>/shared_images pull ubuntu:latest

To migrate a locally built image:

podman-hpc --squash-dir /global/cfs/cdirs/<project>/shared_images migrate ubuntu:locally_built

Lastly, change permissions of directories and files under the shared directory to make them group-executable (for directories only) and group-readable.

find /global/cfs/cdirs/<project>/shared_images -type f -exec chmod g+r {} \;
find /global/cfs/cdirs/<project>/shared_images -type d -exec chmod g+rx {} \;
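The permission steps above can be collected into one helper that is safe to re-run after each pull or migrate. This is a minimal sketch; share_image_dir is a hypothetical function name, and it uses the "+" form of find -exec so that chmod is invoked once per batch of files rather than once per file.

```shell
# Make a shared image directory usable by its group:
# setgid on the top-level directory, g+rx on directories, g+r on files.
share_image_dir() {
    dir=$1
    chmod g+s "$dir"
    find "$dir" -type d -exec chmod g+rx {} +
    find "$dir" -type f -exec chmod g+r {} +
}

# Example call (placeholder path, so it is left commented out):
# share_image_dir /global/cfs/cdirs/<project>/shared_images
```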

Using Shared Images¶

Before running containers with shared images, set the PODMANHPC_ADDITIONAL_STORES environment variable to point to the shared directory:

export PODMANHPC_ADDITIONAL_STORES=/global/cfs/cdirs/<project>/shared_images

I/O Performance Optimization on compute nodes¶

When using CFS for shared image storage and running containers on compute nodes, you can improve initialization performance by using the DVS read-only mount:

export PODMANHPC_ADDITIONAL_STORES=/dvs_ro/cfs/cdirs/<project>/shared_images
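A job script that also runs on login nodes can pick the path at run time. The sketch below assumes that /global/cfs/... paths map to /dvs_ro/cfs/... as in the two export lines above, and it treats a set SLURM_JOB_ID as a rough signal that the script is running inside a job. The function name to_dvs_ro and the project name myproject are hypothetical.

```shell
# Map a /global/cfs/... path to its /dvs_ro/cfs/... counterpart.
# Any other path is passed through unchanged.
to_dvs_ro() {
    case "$1" in
        /global/cfs/*) printf '/dvs_ro/cfs/%s\n' "${1#/global/cfs/}" ;;
        *)             printf '%s\n' "$1" ;;
    esac
}

store=/global/cfs/cdirs/myproject/shared_images   # hypothetical project directory

# Assumption: inside a Slurm job we are on a compute node and want the read-only mount.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    store=$(to_dvs_ro "$store")
fi
export PODMANHPC_ADDITIONAL_STORES="$store"
```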

Running a container as a user instead of root¶

If you wish to run a container as your user rather than as root, try running with the --userns keep-id option:

podman-hpc run --userns keep-id elvis:test ...

Profiling an application in `podman-hpc`¶

Profiling a containerized application can be more complex than a bare-metal application. There are several possible approaches:

Option 1: Profile a containerized application using a profiling tool already installed in the container. This may be possible using NVIDIA NGC containers, which often ship with nsys and related tools. The user will need to bind-mount a directory on the host system to a mount point within the running container, so that they can access the output file written by the profiler. For example:

podman-hpc run --gpu --rm -w /work -v $PWD:/work nsys:test nsys profile mytest

Option 2: Profile a containerized application using a profiling tool from the host system mounted into the running container. This may be necessary if the container does not ship with the profiling tool installed and/or the user does not have the source code to build the profiling tool themselves. In this case the user may need to adjust PATH and LD_LIBRARY_PATH in the running container so that the profiling tool and its dependencies can be used. For example:

podman-hpc run --gpu --rm -w /work -v $PWD:/work \
-v $EBROOTNSIGHTMINSYSTEMS/target-linux-x64:$EBROOTNSIGHTMINSYSTEMS/target-linux-x64 \
-v $EBROOTNSIGHTMINSYSTEMS/host-linux-x64:$EBROOTNSIGHTMINSYSTEMS/host-linux-x64 \
nsys:test ./profile.sh

where profile.sh is:

#!/bin/bash

export LD_LIBRARY_PATH=$EBROOTNSIGHTMINSYSTEMS/target-linux-x64:$EBROOTNSIGHTMINSYSTEMS/host-linux-x64:$LD_LIBRARY_PATH
export PATH=$EBROOTNSIGHTMINSYSTEMS/target-linux-x64:$PATH

nsys profile mytest

Note that in this case nsys should be mounted into the container using the original path on the host to ensure that all dependencies can be correctly found.

Option 3: Profile a containerized application using a profiling tool on the host (i.e. outside the container). This means that the container must be launched with additional settings that enable the collection of various system and kernel metrics. For example:

strace -e trace=file -f -o podman.strace podman-hpc run --rm --net=host --privileged --cap-add SYS_ADMIN --cap-add SYS_PTRACE openmpi:test date

Note that Option 3 is currently known to work with strace, but not currently with nsys.

Known issues¶

The December 18, 2024 maintenance included an update to the underlying Podman that breaks --userns=keep-id functionality. Users can continue to run podman-hpc with their own id by pointing it at an older version of Podman with

export PODMANHPC_PODMAN_BIN=/global/common/shared/das/podman-4.7.0/bin/podman

The upgrade of Podman in the December 18, 2024 maintenance included a change in the default networking settings. The network backend was changed from slirp4netns to pasta. Users seeing decreased performance for network-heavy tasks, such as slower downloads with git, wget, or curl, are encouraged to use --net slirp4netns for the previous backend, or --network=host. Note that --network=host will work for most podman-hpc modules, with the exception of --gpu.
ENTRYPOINT settings will interfere with the mpi and openmpi modules. Users should disable their ENTRYPOINT when using these modules. For example: srun -N 2 podman-hpc run --rm --mpi --entrypoint= ....
We have had reports that the screen command mangles podman-hpc commands. We suggest using podman-hpc in a bare shell.
Setting PYTHONPATH can impact podman-hpc since this wrapper is written in Python. We advise that you do not set PYTHONPATH when using podman-hpc.
Some multinode MPI applications run into an error that looks like:

19:21:49 MPIClusterSvc 0 INFO Initializing MPI
19:23:52 MPICH ERROR [Rank 0] [job id 18444561.0] [Mon Nov 20 19:23:51 2023] [nid005062] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
19:23:52 MPIR_Init_thread(170).......:
19:23:52 MPID_Init(501)..............:
19:23:52 MPIDI_OFI_mpi_init_hook(816):
19:23:52 create_endpoint(1361).......: OFI EP enable failed (ofi_init.c:1361:create_endpoint:No space left on device)

The workaround is to add the Slurm setting --network=no_vni. In an example command this looks like: srun --network=no_vni -N 2 ... podman-hpc run --rm --mpi ....

If you discover what appears to be an issue or bug, please let us know by filing a NERSC ticket or filing an issue in our issue tracker.

Troubleshooting¶

Sometimes podman-hpc can get into a bad configuration state. You can try clearing several storage areas.

On a local login node you can delete:
- /images/<userid>_hpc

On a compute or login node you can delete:
- $SCRATCH/storage
- ~/.local/share/containers
- ~/.config/containers
- /run/user/<userid>/overlay*

On a compute node you can delete:
- /tmp/<userid>_hpc

You can also try running:

podman-hpc system reset
podman-hpc system migrate

Note you can determine your userid (uid) using the command id.

Note that you may need to enter the user namespace with podman unshare to obtain the file permissions needed to delete these files:

podman unshare
rm -rf /images/<userid>_hpc
exit
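Before deleting anything, it can help to see which of these storage areas actually exist on the current node. The helper below is a read-only sketch that deletes nothing; list_podman_hpc_state is a hypothetical name, and it assumes that the userid in the paths above is the numeric uid reported by id -u.

```shell
# Report which podman-hpc storage areas exist on this node. Nothing is deleted.
list_podman_hpc_state() {
    uid=$(id -u)
    set -- "/images/${uid}_hpc" \
           "$HOME/.local/share/containers" \
           "$HOME/.config/containers" \
           "/tmp/${uid}_hpc"
    # $SCRATCH is only defined on NERSC systems, so include it only when set.
    if [ -n "${SCRATCH:-}" ]; then
        set -- "$@" "$SCRATCH/storage"
    fi
    # An unmatched overlay* pattern stays literal and is reported as absent.
    for p in "$@" /run/user/"$uid"/overlay*; do
        if [ -e "$p" ]; then
            echo "exists: $p"
        else
            echo "absent: $p"
        fi
    done
}

list_podman_hpc_state
```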

If clearing these areas doesn't fix your issue, please contact us at help.nersc.gov so we can help.

Perlmutter CHANGELOG¶

Upgrade to v1.1.2, December 18, 2024

- add convenience modules for volume mounts
- improve startup time for shared runs
- fixes for migrated images

Upgrade to v1.1.0, November 29, 2023

- crun replaces runc as the default
- crun now enables users to mount files that are not owned by them, i.e. files owned by collaboration members on CFS
- openmpi-pmi2 and openmpi-pmix helper modules added, along with changes to enable OpenMPI
- bug in fuse-overlayfs identified. Will be fixed in next maintenance.

Upgrade to v1.0.4, October 10, 2023

- Fixes additionalimagestore bug introduced in v1.0.3

Upgrade to v1.0.3, September 28, 2023

- Fixes many issues related to inconsistencies in pulling and building arguments

Upgrade to v1.0.2, June 16, 2023

- Fixes bugs found in early testing

v1.0.1 added to Perlmutter, February 9, 2023
