
Frequently Asked Questions and Troubleshooting

If you have questions about Python at NERSC, please take a look at this collection of common user questions and problems. If this information does not help and you still have a problem, please open a ticket at help.nersc.gov.

Is Python broken?

If Python seems broken or is exhibiting odd or unexpected behavior, the first thing to do is check your shell resource files (also known as dotfiles).

Some developers like to add things to their shell resource files (e.g., .bashrc and .bash_profile) to avoid having to type things over and over again. Dotfiles can be a good resource, but you should periodically check them to see whether they need to be changed or updated. In particular, check these files for conflicting Python versions, modules, or additions to PATH and PYTHONPATH that may be causing unexpected behavior in your Python setup.
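For example, one quick check (a minimal sketch; adjust the file names to whichever dotfiles your shell actually reads) is to search them for Python-related settings:

grep -n -E "PYTHONPATH|conda|module load" ~/.bashrc ~/.bash_profile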

conda takes forever to resolve and install packages

Try the mamba tool instead. It's already installed when you module load python. You can use mamba exactly how you'd use conda, but it's usually much faster.
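For example, a typical conda workflow maps directly onto mamba (the environment name and packages below are just placeholders):

module load python
mamba create -n myenv python=3.9 numpy scipy
source activate myenv
mamba install matplotlib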

Help, I'm over quota!

To delete unused conda files and packages:

module load python
conda clean -a

To delete unwanted conda environments:

conda env list
conda env remove -n <env>

pip-installed packages are stored in $HOME/.local/cori/<python module version>; you can delete some or all of the directories there to free up space.
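For example, you could check how much space these directories use before removing any of them (the directory names depend on which Python modules you have loaded in the past):

du -sh $HOME/.local/cori/*
rm -rf $HOME/.local/cori/<python module version>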

Break adjusted to free malloc space

If you see this error

*** Error in `python': break adjusted to free malloc space: 0x0000010000000000 ***

it most likely means you should rebuild your code and all dependent packages after running module unload craype-hugepages2M. If unloading this module doesn't help, please open a ticket so we can help you troubleshoot further.
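As a rough sketch, rebuilding a compiled package with pip after unloading the module might look like this (mpi4py is only an illustration; rebuild whichever compiled packages your code actually depends on):

module unload craype-hugepages2M
pip uninstall -y mpi4py
pip install --no-cache-dir --no-binary=mpi4py mpi4py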

What is /opt/mods/ and why is it in PYTHONPATH?

You may have noticed that your default PYTHONPATH is

elvis@cori07:~> echo $PYTHONPATH
/opt/mods/lib/python3.6/site-packages:/opt/ovis/lib/python3.6/site-packages

The /opt/mods part of this path enables our system-wide Python monitoring. If you allow PYTHONPATH to remain set, we are able to collect data on your Python job and use it to make more informed decisions to better support Python users at NERSC. To learn more about the data we collect, please visit our MODS webpage.

Can I use my conda environment in Jupyter?

Yes! Your conda environment can easily become a Jupyter kernel. If you would like to use your custom environment myenv in Jupyter:

source activate myenv
conda install ipykernel
python -m ipykernel install --user --name myenv --display-name MyEnv

Then when you log into jupyter.nersc.gov you should see MyEnv listed as a kernel option.

For more information about using your kernel at NERSC please see our Jupyter docs.

How can I fix my broken conda environment?

Conda environments are disposable. If something goes wrong, it is often faster and easier to build a new environment than to debug the old environment.
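For example, replacing a broken environment with a fresh one (the environment name and packages below are placeholders):

module load python
conda env remove -n myenv
conda create -n myenv python=3.9 numpy scipy
source activate myenv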

Can I install my own Anaconda Python "from scratch"?

Yes, you are welcome to build your own Python installation.
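For example, one common approach is a Miniconda installation in a directory you own (a sketch; check the Miniconda download page for the current installer and pick an install prefix with enough quota):

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate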

Can I use virtualenv on Cori?

The virtualenv tool is not compatible with the conda tool used for maintaining Anaconda Python. This is not necessarily bad news, as conda is an excellent replacement for virtualenv and addresses many of its shortcomings. And of course, nothing prevents you from doing your own from-source installation of Python and then using virtualenv if you prefer.
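For example, the conda equivalent of a typical virtualenv workflow might look like this (the environment name and requirements.txt file are placeholders):

module load python
conda create -n analysis python=3.9 pip
source activate analysis
pip install -r requirements.txt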

Why does my mpi4py time out? Or why is it so slow?

Running mpi4py on a large number of nodes can become slow due to all the metadata that must move across our filesystems. You may experience timeouts that look like this:

srun: job 33116771 has been allocated resources
Mon Aug 3 18:24:50 2020: [PE_224]:inet_connect:inet_connect: connect failed after 301 attempts
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_inet_setup:inet_connect failed
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_init:_pmi_inet_setup (full) returned -1
[Mon Aug 3 18:24:50 2020] [c0-0c2s7n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1

Easy (but temporary) fix:

    export PMI_MMAP_SYNC_WAIT_TIME=300

This doesn't fix the underlying problem; it just gives your job more time to start up.

Medium fix: move your software stack to /global/common/software
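For example, a sketch of building an environment under /global/common/software rather than your home directory (replace <project> with your project's directory there):

module load python
conda create --prefix /global/common/software/<project>/myenv python=3.9
source activate /global/common/software/<project>/myenv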

Hard (but most effective) fix: use mpi4py in a Shifter container

Why is my code slow?

First, please review our brief overview of filesystem best practices at NERSC. Moving to Shifter or a different filesystem may substantially improve your performance. If this doesn't help, you can consider profiling your code.
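If you do decide to profile, a quick first look with Python's built-in cProfile might look like this (the script name is a placeholder):

python -m cProfile -s cumtime ./myscript.py > profile.txt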

How can I checkpoint my Python code?

Checkpointing your code can make your workflow more robust to:

  • System issues. If your job crashes because of a system issue, you can resubmit it later and the checkpointed calculation will pick up where it left off.
  • User error. The most common case is that the calculation takes longer than expected when the job was submitted and doesn't finish before the time limit.
  • Preemption. Some HPC systems offer preemptable queues, where jobs run at a discounted charge because they may be interrupted by higher-priority jobs. If your code checkpoints and can therefore tolerate preemption, you can take advantage of discounted charging or submit shorter jobs. The net effect may actually be faster throughput for your workflow.

This example repo demonstrates one simple way to add graceful error handling and checkpointing to a Python code. Note that mpi4py jobs must be run with srun on Cori. For example:

srun -n 2 ./main.py

is suitable for checkpointing. For checkpointing to work, other Python jobs must be run with exec:

exec ./main.py

so that the SIGINT signal is forwarded to your Python process (bash itself will not forward it). The InterruptHandler class in this example demonstrates how to catch SIGINT, checkpoint your work, and shut down if necessary.
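As a sketch, a batch script for a non-MPI Python job that should receive signals directly might end with exec (the #SBATCH options are placeholders; use whatever your job actually needs):

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=00:30:00

# exec replaces the batch shell with the Python process,
# so a SIGINT sent to the job reaches Python directly
exec ./main.py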