Brief introduction to Python at NERSC
The Python we provide at NERSC is Anaconda Python. We believe that Anaconda provides a good compromise between productivity and performance. What does this mean for you?
Our data show that about 80 percent of NERSC Python users use custom conda environments, so you might find that these are a good solution for you, too. At large scale, however, we strongly suggest using a Shifter container. We provide many more details on our "Using Python at NERSC" page.
Python on your laptop vs. Python at NERSC
Perhaps the most important difference between running Python on your laptop and running Python at NERSC is the NERSC shared filesystems. On your laptop, you can run a script that reads data and writes output files without much thought. At NERSC, however, you'll need to think about 1) where your Python installation/environment is located, 2) where your Python scripts/software are located, and 3) where your data are located (both input and output files), and whether these filesystems are designed for how you intend to use them. At smaller scale you can often get away with using our filesystems in a non-recommended way, but at larger scale both your performance and your fellow users' performance will suffer. We provide some suggested best practices below.
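As a quick way to answer the three "where is it located" questions above, the following is a minimal sketch that reports which filesystem a path lives on. It assumes `$HOME`, `$SCRATCH`, and `$CFS` are set in the environment, as they are referred to on this page. The function names `filesystem_roots` and `classify_path` are hypothetical helpers, not part of any NERSC tool, and the check is purely a path-prefix comparison that does not resolve symbolic links.

```python
import os
import sys
from pathlib import Path


def filesystem_roots(env=os.environ):
    """Map a label to each known filesystem root.

    /global/common/software is the path named on this page; the other
    roots are read from environment variables when they are set.
    """
    roots = {"/global/common/software": Path("/global/common/software")}
    for var in ("HOME", "SCRATCH", "CFS"):
        if env.get(var):
            roots["$" + var] = Path(env[var])
    return roots


def classify_path(path, roots):
    """Return the label of the deepest root that contains path, else 'other'."""
    path = Path(path)
    best_label, best_depth = "other", -1
    for label, root in roots.items():
        if path == root or root in path.parents:
            depth = len(root.parts)
            if depth > best_depth:
                best_label, best_depth = label, depth
    return best_label


if __name__ == "__main__":
    roots = filesystem_roots()
    # Hypothetical choice: treat the current directory as the data location.
    locations = {
        "Python installation/environment": sys.prefix,
        "script": Path(sys.argv[0]).resolve(),
        "data (current directory)": Path.cwd(),
    }
    for what, where in locations.items():
        print(f"{what}: {where} -> {classify_path(where, roots)}")
```

If the environment is reported under `$HOME` or `$CFS` and you plan to run on many nodes, the recommendations below suggest moving it.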
Best: Shifter
If you plan to run your code at any large scale (10+ nodes), our current recommended best practice is to use Shifter whenever possible to reduce filesystem traffic and improve performance. For optimal performance, you should put as much of your Python stack and your codebase inside the container as possible. We have some Python examples to help you. To improve I/O, Shifter has a local XFS filesystem so users can read or write temporary files on the node itself without using shared filesystems.
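To illustrate keeping temporary files off the shared filesystems, here is a minimal sketch using Python's standard `tempfile` module. Which directory is backed by the node-local XFS filesystem depends on how the container is launched, and this page does not specify it. The sketch therefore assumes that the default temporary directory (which `tempfile` takes from `TMPDIR` when that variable is set) points at node-local storage. The function name `process_with_local_scratch` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def process_with_local_scratch(payload: bytes) -> int:
    """Write payload to a temporary file, read it back, and return its size.

    Assumption: the default temporary directory is node-local storage.
    The directory and its contents are removed when the block exits.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch_file = Path(tmpdir) / "intermediate.bin"  # hypothetical name
        scratch_file.write_bytes(payload)
        return len(scratch_file.read_bytes())
```

Only intermediate data belongs in such a directory; anything you want to keep after the job ends still needs to be written to a persistent filesystem.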
Better: /global/common/software (Python installation/environment) and $SCRATCH (data)
If you cannot use Shifter, the best and fastest place for your code and Python installation/conda environment is /global/common/software because it is mounted read-only on the compute nodes. The best and fastest place for your data is $SCRATCH. We don't recommend installing your Python stack/conda environment on $SCRATCH as it will eventually be purged.
Avoid: $HOME
Without any intervention, conda package data and environments are installed at $HOME/.conda. Running a 10+ node job will result in a lot of filesystem traffic across $HOME, which is not designed for this type of use. This will slow down your application and other users' applications, too. We provide an example of how to install your custom conda environment on /global/common/software.
Avoid: $CFS
Like $HOME, our $CFS filesystem is not meant to be used for many-node compute jobs or heavy I/O. You may cause or experience slowdowns due to filesystem pressure by using $CFS at scale. It is better to stage your data to $SCRATCH if at all possible.
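Staging data to `$SCRATCH` before a large job can be as simple as copying a directory tree. The sketch below is one way to do this with the standard library. The function name `stage_to_scratch`, the `staged_input` subdirectory, and the example project path in the comments are all hypothetical. It needs Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def stage_to_scratch(source_dir, scratch_root, subdir="staged_input"):
    """Copy source_dir into scratch_root/subdir and return the new location.

    Existing files at the destination are overwritten, so re-running the
    staging step refreshes the copy instead of failing.
    """
    source_dir = Path(source_dir)
    destination = Path(scratch_root) / subdir / source_dir.name
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source_dir, destination, dirs_exist_ok=True)
    return destination


# Example use (hypothetical project layout; run once before the job starts):
#   import os
#   source = Path(os.environ["CFS"]) / "my_project" / "input"
#   staged = stage_to_scratch(source, os.environ["SCRATCH"])
```

Run the copy once, from a single process, and point the job at the returned path; having every task copy the same data would recreate the filesystem pressure that staging is meant to avoid.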
How to run Python jobs at NERSC
Please see our general overview of running jobs at NERSC.
You have many options for running Python at NERSC:
- Our login nodes (only for very small testing and debugging). Please see our login node policies.
- Jupyter, for interactive notebooks that are well suited to visualization and machine learning tasks.
- CPU compute nodes or GPU compute nodes for any substantial computation (either interactively or via a batch job).