TensorFlow is a deep learning framework released by Google in 2015. It is actively maintained and regularly updated to incorporate results of recent deep learning research. As a result, TensorFlow supports a large variety of state-of-the-art neural network layers, activation functions, and optimizers, as well as tools for analyzing, profiling, and debugging deep neural networks. To deliver good performance, the TensorFlow installation at NERSC uses the optimized MKL-DNN library from Intel. Explaining the full framework is beyond the scope of this website. For users who want to get started, we recommend reading the TensorFlow getting started page. The TensorFlow page also provides complete API documentation.
TensorFlow at NERSC
To use TensorFlow at NERSC, load the TensorFlow module via
module load tensorflow/<version>
Replace <version> with the version string you want to load. To see which versions are available, use
module avail tensorflow. Modules built for Cori GPU have
gpu in the version name, while those built for CPU nodes have
intel in the version name.
Running TensorFlow on a single node is the same as on a local machine: just invoke the script with python.
Want to integrate your own packages with TensorFlow at NERSC? There are two suggested solutions:
- Install your packages on top of our TensorFlow + Python installations - You can use the
$PYTHONUSERBASE environment variable (set automatically when you load one of our modules) and user installations with
pip install --user ... to install your own packages on top of our TensorFlow installations. For example, to add the netCDF package:
module load tensorflow
pip install netCDF --user
- Install TensorFlow into your custom conda environments - You can set up a conda environment as described in our Python documentation and install TensorFlow into it. You can choose to install builds that target either CPU or GPU. For example, to choose the CPU-optimized version, first look up the available builds:
conda search tensorflow
This outputs a long list like this one:
tensorflow  2.1.0  eigen_py37h1a52d58_0  pkgs/main
tensorflow  2.1.0  gpu_py37h7a4bb67_0    pkgs/main
tensorflow  2.1.0  mkl_py36h23468d9_0    pkgs/main
tensorflow  2.1.0  mkl_py37h80a91df_0    pkgs/main
The builds that have
mkl in the build string are optimized for CPU. You can now choose one to install:
conda install tensorflow=2.1.0=mkl_py37h80a91df_0
To install the GPU version, you need to load the module of the CUDA version against which TensorFlow has been compiled. For example, to install the TensorFlow 2.1.0 GPU build:
module load cuda/10.1.243
conda install tensorflow=2.1.0=gpu_py37h7a4bb67_0
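After installing a CPU or GPU build, it can help to confirm which build the active environment actually picks up. The following is a minimal sketch, assuming TensorFlow 2.1 or newer; the helper name describe_tensorflow is hypothetical and only used for illustration.

```python
import importlib.util


def describe_tensorflow():
    """Return basic facts about the importable TensorFlow build, or None.

    Hypothetical helper for illustration; it is not part of TensorFlow.
    """
    # Avoid an ImportError if the current environment has no TensorFlow.
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return {
        "version": tf.__version__,
        # True for builds compiled with CUDA support.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Names of the GPUs TensorFlow can see on this node (empty on CPU nodes).
        "visible_gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


info = describe_tensorflow()
print(info if info is not None else "TensorFlow is not installed in this environment")
```

On a CPU node with an mkl build, built_with_cuda would be expected to be False and the GPU list empty.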
Please contact us at
firstname.lastname@example.org if you want to build Horovod for your private build.
It is also possible to use your own Docker containers with TensorFlow on Cori with Shifter. Refer to the NERSC Shifter documentation for help deploying your own containers.
On Cori-GPU, we provide NVIDIA GPU Cloud (NGC) containers for TF versions 1 and 2. They are named like
nersc/tensorflow:ngc-20.09-tf2-v0 (this example is a TF version 2 image).
To run interactively in a container:
shifter --module none --image=nersc/tensorflow:ngc-20.09-tf2-v0
To run in a container in batch jobs, we strongly recommend using the Slurm Shifter image options for best performance:
#SBATCH --image=nersc/tensorflow:ngc-20.09-tf2-v0
srun shifter python my_python_script.py
To add Python packages that are not in the image, you can install them under
$PYTHONUSERBASE. You can also use the Shifter
--env option to set the path. For example, to add the netCDF package:
shifter --image=nersc/tensorflow:ngc-20.09-tf2-v0 --env PYTHONUSERBASE=$HOME/.local/cori/my_tf_ngc-20.09-tf2-v0_env pip install netCDF --user
You need to set the same
$PYTHONUSERBASE in your Slurm batch scripts:
#SBATCH --image=nersc/tensorflow:ngc-20.09-tf2-v0
#SBATCH --volume="/dev/infiniband:/sys/class/infiniband_verbs"
srun shifter --env PYTHONUSERBASE=$HOME/.local/cori/my_tf_ngc-20.09-tf2-v0_env python my_python_script.py
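To check that a job picks up the intended user installation directory, you can print what the Python interpreter resolves. This is a small sketch using only the Python standard library; the function name user_install_locations is hypothetical.

```python
import os
import site


def user_install_locations():
    """Report where `pip install --user` packages are looked up.

    Hypothetical helper for illustration only.
    """
    return {
        # Value exported by the module or passed with `shifter --env`, if any.
        "PYTHONUSERBASE": os.environ.get("PYTHONUSERBASE"),
        # Base directory Python actually uses for user installations.
        "user_base": site.getuserbase(),
        # Directory that holds the user-installed packages themselves.
        "user_site": site.getusersitepackages(),
    }


for key, value in user_install_locations().items():
    print(f"{key}: {value}")
```

If PYTHONUSERBASE is set as intended, user_base should normally show the same path.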
You can also customize the images further by building your own Docker/Shifter image based on NERSC or NGC images following the standard Shifter image building instructions.
Running TensorFlow on Perlmutter
Running TensorFlow on Perlmutter is currently essentially the same as running on Cori-GPU.
As of this writing, we have one module available with TensorFlow 2.6 built with NCCL 2.9.8. Note that on Perlmutter we use Lmod for modules, but the syntax is familiar for basic usage:
module load tensorflow/2.6.0
Users can also use TensorFlow with custom conda environments, or with NGC containers in Shifter, by following the instructions above.
For recent NGC TensorFlow containers (e.g. 21.XX), you currently need to adjust your job step command as follows:
srun --mpi=pmi2 ... shifter --module=gpu ... bash -c "python my_python_script.py"
--mpi=pmi2 is needed if you're running Horovod distributed training
bash -c "..." is needed to load the CUDA driver compatibility libraries
Please refer to Perlmutter known issues for additional problems and suggested workarounds.
We recommend using Uber Horovod for distributed data-parallel training. The version of Horovod we provide is compiled against the optimized Cray MPI and thus integrates well with Slurm. Check out our example Slurm scripts for running Horovod on Cori CPU and GPU, using modules and containers. Horovod also provides TensorFlow 1 and 2 examples.
Note that splitting the data among the nodes is up to the user and needs to be done in addition to the modifications stated above. Utility functions can be used to determine the number of independent ranks via
hvd.size() and the rank ID of the current process via
hvd.rank(). If multiple ranks are employed per node,
hvd.local_rank() and hvd.local_size() return the node-local rank ID and the number of ranks on the node, respectively. If the dataset API is being used, we recommend using the
dataset.shard option to split the dataset. In other cases, the data sharding needs to be done manually and is application dependent.
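The following sketch illustrates the round-robin split that dataset.shard performs, written in plain Python so that it runs without TensorFlow or Horovod. The function name shard_indices and the sample counts are hypothetical; in a real job, num_shards would come from hvd.size() and index from hvd.rank().

```python
def shard_indices(num_samples, num_shards, index):
    """Indices assigned to shard `index`, round-robin like `dataset.shard`.

    Hypothetical helper: element i belongs to the shard where
    i % num_shards == index.
    """
    if not 0 <= index < num_shards:
        raise ValueError("index must satisfy 0 <= index < num_shards")
    return list(range(index, num_samples, num_shards))


# In a Horovod job, num_ranks would be hvd.size() and rank hvd.rank().
# Both are hard-coded here so the sketch runs without Horovod.
num_ranks = 4
shards = [shard_indices(10, num_ranks, rank) for rank in range(num_ranks)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Each sample lands on exactly one rank, and shard sizes differ by at most one when the number of samples is not divisible by the number of ranks.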
Frequently Asked Questions
I/O Performance and Data Feeding Pipeline
For performance reasons, we recommend storing the data on the scratch directory, accessible via the
SCRATCH environment variable. At high concurrency, i.e. when many nodes need to read the files, we recommend staging them into the burst buffer. For efficient data feeding we recommend using the
TFRecord data format and using the
dataset API to feed data to the CPU. In particular, please note that the
TFRecordDataset constructor takes
buffer_size and num_parallel_reads options, which allow for prefetching and multi-threaded reads. These should be tuned for good performance, but please note that a thread is dispatched for every independent read. Therefore, the number of inter-threads needs to be adjusted accordingly (see "Potential Issues" below). The
buffer_size parameter is specified in bytes and should be an integer multiple of the node-local batch size for optimal performance.
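A minimal sketch of such a pipeline is shown below. It assumes fixed-size records and interprets "integer multiple of the node-local batch size" as a multiple of the bytes in one batch; both are assumptions. The names buffer_size_bytes, make_dataset, record_bytes, and batches_in_buffer are hypothetical, and the default values are placeholders to be tuned.

```python
def buffer_size_bytes(record_bytes, local_batch_size, batches_in_buffer=4):
    """Read-buffer size in bytes as an integer multiple of one node-local batch.

    Assumes every record has the same size (`record_bytes`); hypothetical helper.
    """
    return record_bytes * local_batch_size * batches_in_buffer


def make_dataset(filenames, record_bytes, local_batch_size, num_parallel_reads=4):
    """Build a TFRecord input pipeline (requires TensorFlow to be installed)."""
    import tensorflow as tf  # imported lazily so the helper above works without it

    dataset = tf.data.TFRecordDataset(
        filenames,
        buffer_size=buffer_size_bytes(record_bytes, local_batch_size),
        num_parallel_reads=num_parallel_reads,  # number of files read in parallel
    )
    # Batch per node and keep one batch prefetched ahead of the consumer.
    return dataset.batch(local_batch_size).prefetch(1)


print(buffer_size_bytes(record_bytes=1024, local_batch_size=32))
```

Parsing of the serialized records is omitted here, since it depends on how the TFRecord files were written.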
On Cori GPU, each node has 1 TB of node-local temporary storage on an NVMe SSD mounted at
/tmp. You can use it to speed up data pipelines, either by staging data there once at the start of a job, or by caching dataset elements there via the
cache() option of the dataset API.
For best MKL-DNN performance, the module already sets several OpenMP environment variables, and we encourage users not to change them, especially the
OMP_NUM_THREADS variable. Setting this variable incorrectly can cause a resource starvation error, which manifests as TensorFlow reporting that too many threads are spawned. If that happens, we encourage you to adjust the inter- and intra-task parallelism by changing the thread-count environment variables, including
NUM_INTRA_THREADS. These parameters can also be changed from within the TensorFlow Python script.
Please note that
num_total_threads is 64 on Haswell or 272 on KNL.
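As an illustration, the sketch below picks inter- and intra-op thread counts from the num_total_threads values quoted above and applies them in a TensorFlow 2 script. The rule that the product of the two counts should not exceed num_total_threads is an assumption made for this example, not something stated on this page. The function names split_threads and apply_thread_settings are hypothetical.

```python
def split_threads(num_total_threads, num_inter_threads):
    """Pick an intra-op thread count so that inter * intra <= num_total_threads.

    The product bound is an assumption made for this sketch.
    """
    if not 1 <= num_inter_threads <= num_total_threads:
        raise ValueError("num_inter_threads must be between 1 and num_total_threads")
    return num_inter_threads, num_total_threads // num_inter_threads


def apply_thread_settings(num_inter_threads, num_intra_threads):
    """Apply the settings inside a TensorFlow 2 script (requires TensorFlow).

    Must be called before TensorFlow executes any operation.
    """
    import tensorflow as tf  # imported lazily so split_threads works without it

    tf.config.threading.set_inter_op_parallelism_threads(num_inter_threads)
    tf.config.threading.set_intra_op_parallelism_threads(num_intra_threads)


# 64 total threads on Haswell and 272 on KNL, as stated above.
print(split_threads(64, 2))
print(split_threads(272, 4))
```

When num_parallel_reads is raised in the input pipeline, the inter-op count chosen here may need to grow accordingly.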