Frequently Asked Questions and Troubleshooting

If you have questions about Python at NERSC, please take a look at this collection of common user questions and problems.

If this information does not help and you still have a problem, please open a ticket at help.nersc.gov with the following information:

  1. Are you using a Python module? If so which one?
  2. Which of our four Python options are you using?
  3. If you are using a custom conda environment, what is its name?
  4. Have you checked your shell resource files for anything that may be causing your issues?
  5. How can we reproduce your error?

Providing this information up front will help us find and solve your problem more quickly.

Is Python broken?

If Python seems broken or is exhibiting odd or unexpected behavior, the first thing to do is check your shell resource files (also known as dotfiles).

Some developers like to add things to their shell resource files (e.g. .bashrc and .bash_profile) to avoid having to type things over and over again. OK, we get it: nobody likes unnecessary typing. Dotfiles can be a good resource, but you should periodically check whether they need to be changed or updated. In particular, check these files for conflicting Python versions, modules, or additions to PATH and PYTHONPATH that may be causing unexpected behavior in your Python setup.
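As a rough aid for this check, the sketch below reports which interpreter is running, what PYTHONPATH contains, and which PATH directories hold a python or python3 executable. The function name describe_python_environment is hypothetical and is not a NERSC tool; it only reads the current environment.

```python
import os
import sys


def describe_python_environment(environ=None):
    """Collect settings that commonly explain unexpected Python behavior."""
    environ = os.environ if environ is None else environ

    def split_paths(name):
        # Split a path-list variable into its non-empty entries.
        return [p for p in environ.get(name, "").split(os.pathsep) if p]

    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "PYTHONPATH": split_paths("PYTHONPATH"),
        # PATH directories containing a python or python3 file, in search order.
        "python_on_PATH": [
            d for d in split_paths("PATH")
            if any(os.path.isfile(os.path.join(d, exe))
                   for exe in ("python", "python3"))
        ],
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print(f"{key}: {value}")
```

If the first entry under python_on_PATH is not the interpreter you expected, or PYTHONPATH lists directories you did not intend, a dotfile is a likely source.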

If you would like to reset your shell resource files to the NERSC defaults, you can run the command fixdots. This will back up your old files into a folder called KeepDots.<date> and reset your files.

Should I use Python 2 or Python 3?

Python 3! Python 2 reached its end of life on Jan 1, 2020. Python 2 will remain on Cori for now, but will not be available on Perlmutter.

If you are still using Python 2 at NERSC, you may have noticed our warning:

ATTENTION: Python 2 reached end-of-life Jan 1, 2020.
We urge you to transition to Python 3.

Developers of many packages including NumPy, SciPy, Matplotlib, pandas, and scikit-learn pledged to drop support for Python 2 "no later than 2020." You can expect support for all Python 2 libraries to continue to wither away. Using Python 2 past end of life is a risk as new issues will likely go unaddressed by developers. You may already have noticed deprecation warnings from your Python applications' outputs; please do not ignore these warnings.

Can I install my own Anaconda Python "from scratch?"

Yes. One reason you might consider this is that you want to install Anaconda Python on /global/common/software or in a Shifter image to improve launch-time performance for large-scale applications. Or you might want more complete control over what versions of packages are installed and don't want to worry about whether NERSC will upgrade packages to versions that break backwards compatibility you depend on. See here for more information on how you can do this.

How do I use the Intel Distribution for Python at NERSC?

Intel Math Kernel Library (MKL), Data Analytics Acceleration Library (DAAL), Threading Building Blocks (TBB), and Integrated Performance Primitives (IPP) are available through Intel Community Licensing. This enabled both Continuum Analytics and Intel to provide access to Intel's performance libraries through Python for free.

Create a conda environment for your Intel Distribution for Python installation:

module load python
conda create -n idp -c intel intelpython3_core python=3
source activate idp

Can I use virtualenv on Cori?

The virtualenv tool is not compatible with the conda tool used for maintaining Anaconda Python. But this is not necessarily bad news as conda is an excellent replacement for virtualenv and addresses many of its shortcomings. And of course, there is nothing preventing you from doing a from-source installation of Python of your own, and then using virtualenv if you prefer.

Why does my mpi4py time out? Or why is it so slow?

Running mpi4py on a large number of nodes can become slow due to all the metadata that must move across our filesystems. You may experience timeouts that look like this:

srun: job 33116771 has been allocated resources
Mon Aug 3 18:24:50 2020: [PE_224]:inet_connect:inet_connect: connect failed after 301 attempts
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_inet_setup:inet_connect failed
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_init:_pmi_inet_setup (full) returned -1
[Mon Aug 3 18:24:50 2020] [c0-0c2s7n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1

Easy (but temporary) fix:

export PMI_MMAP_SYNC_WAIT_TIME=300

This does not fix the underlying problem; it only gives your job more time to start up.

Medium fix: move your software stack to /global/common/software.

Hard (but most effective) fix: use mpi4py in a Shifter container.

Can I use my conda environment in Jupyter?

Yes! Your conda environment can easily become a Jupyter kernel. If you would like to use your custom environment myenv in Jupyter:

source activate myenv
conda install ipykernel
python -m ipykernel install --user --name myenv --display-name MyEnv

Then when you log into jupyter.nersc.gov you should see MyEnv listed as a kernel option.

For more information about using your kernel at NERSC please see our Jupyter docs.
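If the kernel does not show up, it can help to check what was actually registered. The sketch below assumes the kernel was installed with --user on Linux, where kernel specs are normally written to ~/.local/share/jupyter/kernels/<name>/kernel.json. The function name list_user_kernels is hypothetical.

```python
import json
from pathlib import Path


def list_user_kernels(kernels_dir=None):
    """Map each registered kernel name to its display name and interpreter."""
    if kernels_dir is None:
        # Assumption: default per-user kernel spec location on Linux.
        kernels_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    kernels = {}
    for spec_file in sorted(Path(kernels_dir).glob("*/kernel.json")):
        spec = json.loads(spec_file.read_text())
        kernels[spec_file.parent.name] = {
            "display_name": spec.get("display_name"),
            # The first argv entry is the program the kernel launches.
            "python": (spec.get("argv") or [None])[0],
        }
    return kernels


if __name__ == "__main__":
    for name, info in list_user_kernels().items():
        print(f"{name}: {info['display_name']} -> {info['python']}")
```

For the myenv example, the reported interpreter path should point inside the myenv conda environment.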

My conda environments have put me over quota. What do I do?

Conda and all its related files and packages can really add up. If you are installing packages to $HOME and exceed your quota (you can check via myquota), cleaning up your conda files can make a big difference:

conda clean --all

will clean up all unused files and packages. See here for more information about conda clean.
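To see where the space is going before or after cleaning, a small script can total the size of each subdirectory. This is a minimal sketch; the ~/.conda location is an assumption about where your environments and package cache live, and the function names are hypothetical.

```python
import os


def directory_size(path):
    """Return the total size in bytes of the regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so a link is not counted at its target's size.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_subdirectories(parent, top=5):
    """List (name, bytes) for the largest immediate subdirectories of parent."""
    sizes = [
        (entry.name, directory_size(entry.path))
        for entry in os.scandir(parent)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Assumption: conda keeps environments and package caches under ~/.conda.
    conda_dir = os.path.expanduser("~/.conda")
    if os.path.isdir(conda_dir):
        for name, size in largest_subdirectories(conda_dir):
            print(f"{size / 1e9:8.2f} GB  {name}")
```

Conda often hard-links the same file into several environments, so these totals can overstate actual disk usage; treat them as a guide to which directories to inspect, not as your quota figure.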

How can I fix my broken conda environment?

Conda environments are disposable. If something goes wrong, it is often faster and easier to build a new environment than to debug the old environment.

Can I use pip at NERSC?

Yes. For more information about using pip at NERSC please see here.

How can I checkpoint my Python code?

Checkpointing your code can make your workflow more robust to:

  • System issues. If your job crashes because of a system issue, you will be able to restart the checkpointed calculation in a resubmitted job later and it can pick up where it left off.
  • User error. The most common use case here is that the calculation takes longer than the user expected when the job was submitted, and doesn't finish before the time limit.
  • Preemption. Some HPC systems offer preemptable queues, where jobs can be run with discount charging because they may be interrupted for higher priority jobs. If your code can be preempted because it can checkpoint, you can take advantage of discount charging or submit shorter jobs. The net effect may actually be faster throughput for your workflow.

This example repo demonstrates one simple way to add graceful error handling and checkpointing to a Python code. Note that mpi4py jobs must be run with srun on Cori. For example,

srun -n 2 ./main.py

is suitable for checkpointing. For checkpointing to work, other Python jobs must be run with exec,

exec ./main.py

so that the SIGINT signal is forwarded to the Python process. (Bash will not forward it on its own.) The InterruptHandler class in this example demonstrates how to catch SIGINT, checkpoint your work, and shut down if necessary.
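The general pattern can be sketched in a few lines. The code below is not the example repo's InterruptHandler; the names CheckpointOnInterrupt and run, and the JSON checkpoint format, are hypothetical and chosen only for illustration. It catches SIGINT, lets the current step finish, and records the next step so a resubmitted job can resume.

```python
import json
import os
import signal


class CheckpointOnInterrupt:
    """Context manager that records SIGINT instead of exiting immediately."""

    def __init__(self, sig=signal.SIGINT):
        self.sig = sig
        self.interrupted = False

    def __enter__(self):
        self.interrupted = False
        # Install our handler and remember the previous one.
        self._previous = signal.signal(self.sig, self._handle)
        return self

    def _handle(self, signum, frame):
        # Only set a flag; the main loop decides when it is safe to stop.
        self.interrupted = True

    def __exit__(self, exc_type, exc, tb):
        # Restore the handler that was active before entering the block.
        signal.signal(self.sig, self._previous)
        return False


def run(total_steps, checkpoint_path, work=lambda step: None):
    """Run steps from the last checkpoint; return the next step to execute."""
    step = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            step = json.load(f)["next_step"]
    with CheckpointOnInterrupt() as handler:
        while step < total_steps and not handler.interrupted:
            work(step)
            step += 1
    # Write to a temporary file first so a crash cannot leave a partial checkpoint.
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_step": step}, f)
    os.replace(tmp_path, checkpoint_path)
    return step
```

Calling run(1000, "state.json", work=my_step) in a job and again in a resubmitted job would continue from the saved step, where my_step stands for your own per-step function. A real application would also save its working data in the checkpoint, not just the step counter. Python only allows signal handlers to be installed from the main thread.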