TensorFlow is a deep learning framework developed by Google and released in 2015. It is actively maintained and regularly incorporates results from recent deep learning research. As a result, TensorFlow supports a large variety of state-of-the-art neural network layers, activation functions, optimizers and tools for analyzing, profiling and debugging deep neural networks. To deliver good performance, the TensorFlow installation at NERSC uses the optimized MKL-DNN library from Intel, as well as CUDA and related libraries from NVIDIA. Explaining the full framework is beyond the scope of this website. For users who want to get started, we recommend browsing the TensorFlow tutorials. The TensorFlow page also provides complete API documentation.
TensorFlow at NERSC
To use TensorFlow at NERSC, load the TensorFlow module via
module load tensorflow/<version>
<version> should be replaced with the version string you are trying to load. To see which ones are available use
module avail tensorflow.
Running TensorFlow on a single node is the same as on a local machine, just invoke the script with
Want to integrate your own packages with TensorFlow at NERSC? There are two suggested solutions:
Install your packages on top of our TensorFlow + Python installations - You can use the
$PYTHONUSERBASE environment variable (set automatically when you load one of our modules) and user installations with
pip install --user ... to install your own packages on top of our TensorFlow installations. For example, to add the netCDF package:
module load tensorflow
pip install netCDF --user
Install TensorFlow into your custom conda environments - You can set up a conda environment as described in our Python documentation and install TensorFlow into it. TensorFlow documentation recommends installing via
For TensorFlow's GPU functionality you need to have the requisite CUDA libraries available. One option, as mentioned in the TensorFlow docs, is to install
cudnn via conda, into your custom conda environment. Alternatively, you can simply load the
cudnn modules on Perlmutter. Note that you should take care to match the CUDA version of the module against your TensorFlow version.
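One way to check the match is to compare the CUDA version TensorFlow was built against with the version of the module you loaded. The sketch below assumes a TensorFlow 2 build that exposes `tf.sysconfig.get_build_info()`. The helper names `same_major_minor` and `tensorflow_cuda_version` are illustrative, and the comparison is deliberately strict, so treat a mismatch as a prompt to investigate rather than as proof of a problem.

```python
def same_major_minor(built, loaded):
    """Return True if two version strings agree on their major.minor part."""
    return built.split(".")[:2] == loaded.split(".")[:2]


def tensorflow_cuda_version():
    """Return the CUDA version TensorFlow was built with (None for CPU-only builds)."""
    # Imported lazily so that same_major_minor() is usable without TensorFlow.
    import tensorflow as tf

    return tf.sysconfig.get_build_info().get("cuda_version")
```

For example, `same_major_minor(tensorflow_cuda_version(), "11.8.0")` would compare the build against a hypothetical module version string taken from `module list`.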
Please contact us at
firstname.lastname@example.org if you want to build Horovod for your private build.
It is also possible to use your own Docker containers with TensorFlow using Shifter. Refer to the NERSC Shifter documentation for help deploying your own containers.
On Perlmutter, we provide NVIDIA GPU Cloud (NGC) containers, with a few extra packages added for convenience. They are named like
To use interactively, you can run in a container with a command like:
shifter --image=nersc/tensorflow:ngc-22.09-tf2-v0 --module=gpu,nccl-2.15
To run in a container in batch jobs we strongly recommend using Slurm image shifter options for best performance:
#SBATCH --image=nersc/tensorflow:ngc-22.09-tf2-v0
#SBATCH --module=gpu,nccl-2.15
srun shifter python my_python_script.py
On Perlmutter, best performance for multi-node distributed training is achieved via usage of the
nccl-2.15 shifter module, along with the default
gpu shifter module.
To add python packages that are not in the image you can install them under
$PYTHONUSERBASE. You can then use the Shifter
--env option to set the path. For example, to add the netCDF package:
shifter --image=nersc/tensorflow:ngc-22.09-tf2-v0 --module gpu,nccl-2.15 --env PYTHONUSERBASE=$HOME/.local/perlmutter/my_tf_ngc-22.09-tf2-v0_env pip install netCDF --user
You also need to set the
$PYTHONUSERBASE in your Slurm batch scripts to use your custom libraries at runtime:
#SBATCH --image=nersc/tensorflow:ngc-22.09-tf2-v0
#SBATCH --module=gpu,nccl-2.15
srun shifter --env PYTHONUSERBASE=$HOME/.local/perlmutter/my_tf_ngc-22.09-tf2-v0_env python my_python_script.py
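A quick way to confirm that the user base is picked up at runtime is to query Python's standard `site` module from inside the container. This is a minimal sketch that uses only the standard library; the function name `user_site_status` is illustrative.

```python
import os
import site
import sys


def user_site_status():
    """Report the user base directory and whether its site-packages is importable."""
    user_base = site.getuserbase()          # honours $PYTHONUSERBASE when it is set
    user_site = site.getusersitepackages()  # site-packages directory under user_base
    return {
        "env": os.environ.get("PYTHONUSERBASE"),
        "user_base": user_base,
        "user_site": user_site,
        "on_sys_path": user_site in sys.path,
    }


if __name__ == "__main__":
    for key, value in user_site_status().items():
        print(f"{key}: {value}")
```

If `on_sys_path` is reported as false, packages installed with `--user` will likely not be importable; note that Python only adds the user site directory to `sys.path` when that directory exists.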
You can also customize the images further by building your own Docker/Shifter image based on NERSC or NGC images following the standard Shifter image building instructions. The recipes for NERSC NGC images, which are built on top of NVIDIA's NGC images, are a good starting point for building optimized GPU-enabled containers.
NGC TensorFlow containers on Perlmutter
Please note that for running multi-node distributed training with Horovod in NGC TensorFlow containers, you will need to include
--mpi=pmi2 and --module=gpu,nccl-2.15 as options to
srun and shifter (respectively). The full job step command would look something like
srun --mpi=pmi2 ... shifter --module=gpu,nccl-2.15 ....
Performance issues may arise in older NGC container versions when running multi-node jobs
Older NGC containers (e.g. 21.08) may experience performance variability when running distributed training with Horovod. The issue can be fixed by upgrading to a more recent container image. Please refer to our known issues for additional problems and suggested workarounds.
We recommend using Uber Horovod for distributed data parallel training. The version of Horovod we provide is compiled against the optimized Cray MPI and thus integrates well with Slurm. Check out our example Slurm scripts for running Horovod on Perlmutter, using modules and containers. Also, Horovod provides TensorFlow 1 and 2 examples.
It is important to note that splitting the data among the nodes is up to the user and needs to be done in addition to the modifications stated above. Here, utility functions can be used to determine the number of independent ranks via
hvd.size() and the global rank id of the current process via
hvd.rank(). If multiple ranks are employed per node,
hvd.local_rank() and hvd.local_size() return the node-local rank id and the number of ranks on the node. If the dataset API is being used we recommend using the
dataset.shard option to split the dataset. In other cases, the data sharding needs to be done manually and is application dependent.
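For the manual case, one possible approach is to give each rank a contiguous block of the file list. The sketch below is plain Python and makes no claim about how any particular application should shard its data. The helper names `shard_bounds` and `shard_files` and the file names are hypothetical, and in a real job `rank` and `size` would come from `hvd.rank()` and `hvd.size()` after `hvd.init()`.

```python
def shard_bounds(num_samples, rank, size):
    """Return the [start, stop) index range owned by `rank` out of `size` ranks.

    Samples are split into contiguous, non-overlapping blocks; the first
    `num_samples % size` ranks receive one extra sample.
    """
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    base, extra = divmod(num_samples, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def shard_files(filenames, rank, size):
    """Select this rank's share of a file list (manual sharding)."""
    start, stop = shard_bounds(len(filenames), rank, size)
    return filenames[start:stop]


if __name__ == "__main__":
    files = [f"part-{i:03d}.tfrecord" for i in range(10)]  # hypothetical file names
    for rank in range(4):
        print(rank, shard_files(files, rank, 4))
```

With the dataset API, the analogous call is `dataset.shard(num_shards=hvd.size(), index=hvd.rank())`, which assigns every `size`-th element to a rank rather than a contiguous block.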
Frequently Asked Questions
I/O Performance and Data Feeding Pipeline
For performance reasons, we recommend storing the data on the scratch filesystem, accessible via the
SCRATCH environment variable. For efficient data feeding we recommend using the
TFRecord data format and using the
dataset API to feed data to the model. In particular, note that the
TFRecordDataset constructor takes
buffer_size and num_parallel_reads options, which allow for prefetching and multi-threaded reads. These should be tuned for good performance, but note that a thread is dispatched for every independent read. Therefore, the number of inter-op threads needs to be adjusted accordingly (see "Potential Issues" below). The
buffer_size parameter is meant to be in bytes and should be an integer multiple of the node-local batch size for optimal performance.
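The sizing rule above can be sketched as follows. This example assumes TensorFlow 2.4 or later (for `tf.data.AUTOTUNE`) and roughly fixed-size serialized records. The helper names, the factor of 8 batches per buffer, and the default of 4 parallel reads are illustrative starting points, not tuned recommendations.

```python
def read_buffer_bytes(record_bytes, local_batch_size, batches=1):
    """Size a read buffer as an integer multiple of the node-local batch, in bytes."""
    if min(record_bytes, local_batch_size, batches) < 1:
        raise ValueError("all arguments must be positive integers")
    return record_bytes * local_batch_size * batches


def make_dataset(filenames, record_bytes, local_batch_size, num_parallel_reads=4):
    """Build a TFRecord input pipeline with parallel reads and prefetching."""
    # Imported lazily so that read_buffer_bytes() is usable without TensorFlow.
    import tensorflow as tf

    dataset = tf.data.TFRecordDataset(
        filenames,
        buffer_size=read_buffer_bytes(record_bytes, local_batch_size, batches=8),
        num_parallel_reads=num_parallel_reads,
    )
    # Parsing the serialized records (dataset.map(...)) is application
    # specific and omitted from this sketch.
    return dataset.batch(local_batch_size).prefetch(tf.data.AUTOTUNE)
```

Here `record_bytes` is the approximate size of one serialized example, so the buffer holds a whole number of node-local batches.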
On Perlmutter, there is 126GB of node-local DRAM temporary storage, mounted at
/tmp. This can be made use of to speed up data pipelines, either by staging data there once, at the start of a job, or by caching dataset elements there via the
cache() option for