Frequently Asked Questions and Troubleshooting

If you have questions about Python at NERSC, please take a look at this collection of common user questions and problems. If this information does not help and you still have a problem, please open a ticket at help.nersc.gov.

Is Python broken?

If Python seems broken or is exhibiting odd or unexpected behavior, the first thing to do is check your shell resource files (also known as dotfiles).

Some developers add commands to their shell resource files (e.g. .bashrc and .bash_profile) to avoid retyping them. Dotfiles are useful, but you should periodically check whether they need to be changed or updated. In particular, check these files for conflicting Python versions, module loads, or additions to PATH and PYTHONPATH that may be causing unexpected behavior in your Python setup.
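As a rough aid for that review, the following sketch prints the non-comment lines of your dotfiles that mention PATH, PYTHONPATH, conda, or module load. The function name suspicious_lines and the choice of keywords are illustrative assumptions, not a NERSC tool; adjust the pattern to whatever you want to look for.

```python
import re
from pathlib import Path

# Keywords that often explain unexpected Python behavior (an illustrative choice)
PATTERN = re.compile(r"\b(PYTHONPATH|PATH|conda|module\s+load)\b")


def suspicious_lines(text):
    """Return (line_number, line) pairs for non-comment lines matching PATTERN."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and shell comments
        if not stripped or stripped.startswith("#"):
            continue
        if PATTERN.search(stripped):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    for name in (".bashrc", ".bash_profile"):
        path = Path.home() / name
        if path.is_file():
            for number, line in suspicious_lines(path.read_text(errors="replace")):
                print(f"{path}:{number}: {line}")
```

Any line it reports is only a candidate for review; many such lines are perfectly fine.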

`conda` takes forever to resolve and install packages

Try the mamba tool instead. It's already installed when you module load python. You can use mamba exactly how you'd use conda, but it's usually much faster.

Trouble installing conda packages

We occasionally see reports of errors installing packages or creating environments with conda (or mamba) that look like:

RuntimeError: Could not open file '/global/homes/e/elvis/.conda/pkgs/cache/47929eba.json'

It seems that the conda (or mamba) utility can leave certain cache files in a state with bad permissions. This likely occurs when using Ctrl+C to interrupt a conda create or conda install command.

To overcome this situation, we recommend purging your unused conda packages and caches with conda clean -a. If you see file permission errors, you may need to fix the permissions first. For example, you can recursively give yourself read and write permission on all the files in ~/.conda/pkgs/cache with chmod -R u+rw ~/.conda/pkgs/cache. After successfully cleaning your package caches, you should no longer see that error when installing conda packages.
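If you would like to see which cache files are affected before changing anything, a small sketch like the following can list regular files that lack user read or write permission. The helper name files_missing_user_rw is hypothetical, and the default path assumes your package cache is in ~/.conda/pkgs/cache.

```python
import os
import stat


def files_missing_user_rw(root):
    """Return regular files under root whose owner lacks read or write permission."""
    needed = stat.S_IRUSR | stat.S_IWUSR
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            # Only inspect regular files; skip symlinks and special files
            if stat.S_ISREG(mode) and (mode & needed) != needed:
                problems.append(path)
    return sorted(problems)


if __name__ == "__main__":
    cache = os.path.expanduser("~/.conda/pkgs/cache")  # assumed default location
    for path in files_missing_user_rw(cache):
        print(path)
```

The sketch only reports files; it does not change any permissions.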

Help, I'm over quota

Creating many conda environments and/or installing several packages can often use many GB of disk quota.

If you want to see which files and subdirectories are taking up space, you can run this command within a directory, for example $HOME:

du -hs $(ls -A) | sort -h
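If you prefer to do the same survey from Python, the sketch below totals the file sizes under each entry of a directory and sorts them from smallest to largest. The function names are illustrative. Note that it sums apparent file sizes, so the numbers can differ somewhat from what du reports.

```python
import os


def tree_size(path):
    """Sum the sizes in bytes of all files below path, without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it
                pass
    return total


def entry_sizes(directory):
    """Return (size_in_bytes, name) for each entry in directory, smallest first."""
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = tree_size(entry.path)
            else:
                size = entry.stat(follow_symlinks=False).st_size
            sizes.append((size, entry.name))
    return sorted(sizes)


if __name__ == "__main__":
    for size, name in entry_sizes(os.path.expanduser("~")):
        print(f"{size / 1e9:8.2f} GB  {name}")
```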

Cleaning up conda packages

The following command can be used to check the size of your conda environments and your conda package cache. If your conda environments or package cache directories are not using the default base path at $HOME/.conda, then you will need to specify your custom paths instead.

du -csh $HOME/.conda/envs/* $HOME/.conda/pkgs

To delete unwanted conda environments:

conda env list
conda env remove -n <env>

To delete unused conda files and packages:

conda clean -a

Warnings during conda clean

You may see many warnings when running conda clean -a such as the following:

WARNING conda.gateways.disk.delete:unlink_or_rename_to_trash(143): Could not remove or rename /global/common/software/nersc/pm-2021q4/sw/python/3.9-anaconda-2021.11/pkgs/curl-7.78.0-h1ccaba5_0/info/about.json.  Please remove this file manually (you may need to reboot to free file handles)

These warnings are probably safe to ignore. conda is attempting to remove packages in the python module's package cache but does not have permission to do so. You can prevent conda from attempting to remove those packages by explicitly setting the package cache path, like so:

CONDA_PKGS_DIRS=$HOME/.conda/pkgs conda clean -a

Cleaning up pip packages

pip packages installed inside a conda environment are removed when the conda environment is deleted. pip packages installed with pip install --user (outside a conda environment) are stored in $HOME/.local/<system>/<python module version>, so feel free to delete some or all of the directories there to free up space. This location is controlled by the environment variable PYTHONUSERBASE.
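To confirm where user-level installs go for the Python you are currently running, you can ask the standard library's site module. This is a minimal sketch; the helper name user_install_locations is illustrative.

```python
import os
import site


def user_install_locations():
    """Report the user base directory and user site-packages for this interpreter."""
    return {
        # None if the variable is not set in the current environment
        "PYTHONUSERBASE": os.environ.get("PYTHONUSERBASE"),
        "user_base": site.getuserbase(),
        "user_site": site.getusersitepackages(),
    }


if __name__ == "__main__":
    for key, value in user_install_locations().items():
        print(f"{key}: {value}")
```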

Error `break adjusted to free malloc space`

If you see this error

*** Error in `python': break adjusted to free malloc space: 0x0000010000000000 ***

it most likely means you should run module unload craype-hugepages2M and then rebuild your code and all dependent packages. If unloading this module doesn't help, please open a ticket so we can help you troubleshoot further.

What is `/opt/nersc/pymon` and why is it in `PYTHONPATH`?

You may have noticed that the PYTHONPATH environment variable includes /opt/nersc/pymon. Including that path enables measurement of Python module usage at NERSC, which allows us to make better-informed decisions about how to support our Python users. For more information, see Monitoring Scientific Python Usage on a Supercomputer.

Can I use my conda environment with JupyterLab?

Yes! Your conda environment can easily become a Jupyter kernel. If you would like to use your custom environment myenv in JupyterLab:

conda activate myenv
conda install ipykernel
python -m ipykernel install --user --name myenv --display-name MyEnv

Then when you log in to jupyter.nersc.gov you should see MyEnv listed as a kernel option.

For more information about using your kernel at NERSC please see our Jupyter docs.
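If MyEnv does not show up, it can help to check which kernel specs exist in your user directory. The sketch below reads each kernel.json and prints its display name. It assumes the default Linux location for kernels installed with --user, ~/.local/share/jupyter/kernels; that path can differ if variables such as JUPYTER_DATA_DIR are set. The function name list_user_kernels is illustrative.

```python
import json
from pathlib import Path

# Assumed default location for kernels installed with `--user` on Linux
DEFAULT_KERNELS_DIR = Path.home() / ".local" / "share" / "jupyter" / "kernels"


def list_user_kernels(kernels_dir=DEFAULT_KERNELS_DIR):
    """Map each kernel directory name to the display name in its kernel.json."""
    kernels = {}
    kernels_dir = Path(kernels_dir)
    if not kernels_dir.is_dir():
        return kernels
    for spec_file in sorted(kernels_dir.glob("*/kernel.json")):
        with open(spec_file) as handle:
            spec = json.load(handle)
        name = spec_file.parent.name
        kernels[name] = spec.get("display_name", name)
    return kernels


if __name__ == "__main__":
    for name, display_name in list_user_kernels().items():
        print(f"{name}: {display_name}")
```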

How can I fix my broken conda environment?

Conda environments are disposable. If something goes wrong, it is often faster and easier to build a new environment than to debug the old environment.

Can I install my own Anaconda Python "from scratch"?

Yes, you are welcome to build your own Python installation.

Can I use virtualenv at NERSC?

The virtualenv tool is not compatible with the conda tool used for maintaining Anaconda Python. This is not necessarily bad news, as conda is an excellent replacement for virtualenv and addresses many of its shortcomings. And of course, nothing prevents you from installing your own Python from source and then using virtualenv if you prefer.

Why does my `mpi4py` time out? Or why is it so slow?

Running mpi4py on a large number of nodes can become slow due to all the metadata that must move across our filesystems. You may experience timeouts that look like this:

srun: job 33116771 has been allocated resources
Mon Aug 3 18:24:50 2020: [PE_224]:inet_connect:inet_connect: connect failed after 301 attempts
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_inet_setup:inet_connect failed
Mon Aug 3 18:24:50 2020: [PE_224]:_pmi_init:_pmi_inet_setup (full) returned -1
[Mon Aug 3 18:24:50 2020] [c0-0c2s7n1] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1

Easy (but temporary) fix:

export PMI_MMAP_SYNC_WAIT_TIME=300

This doesn't fix the underlying problem; it just gives your job more time to start up.

Medium fix: move your software stack to /global/common/software

Hard (but most effective) fix: use mpi4py in a Shifter container

Can I use mpi4py.futures?

Yes, but only via MPICommExecutor. MPIPoolExecutor is not currently supported by Cray MPICH. You can read more on our Parallel Python page.
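For orientation, here is a minimal sketch of the MPICommExecutor pattern, in which the root rank submits tasks and the remaining ranks act as workers. The function square and the task range are placeholders. The import is wrapped in try/except only so the file can still be loaded on a machine without mpi4py; in a real job you would import it unconditionally.

```python
try:
    from mpi4py import MPI
    from mpi4py.futures import MPICommExecutor
except ImportError:
    # mpi4py is not installed here; main() reports this instead of failing
    MPI = None


def square(x):
    """Placeholder task; replace with your own function."""
    return x * x


def main():
    if MPI is None:
        print("mpi4py is not available in this environment")
        return
    with MPICommExecutor(MPI.COMM_WORLD, root=0) as executor:
        # Only the root rank receives an executor; the other ranks serve tasks
        if executor is not None:
            results = list(executor.map(square, range(8)))
            print(results)


if __name__ == "__main__":
    main()
```

A script like this would be launched on several ranks, for example with srun -n 4 python script.py, where script.py is whatever name you saved it under.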

Why is my code slow?

First, please review our brief overview of filesystem best practices at NERSC. Moving to Shifter or a different filesystem may substantially improve your performance. If this doesn't help, you can consider profiling your code.
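As a starting point for profiling, the standard library's cProfile and pstats modules can show where time is spent. This is a minimal sketch; slow_sum stands in for your own function, and profile_call is an illustrative helper name.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in workload: sum of squares computed in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the largest cumulative time
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_sum, 1_000_000)
    print(report)
```

You can also profile an entire script without modifying it by running python -m cProfile -s cumulative on it.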

Errors `unable to open file`/`unable to lock file`

Programs built against the Cray HDF5 or NetCDF libraries may generate error messages mentioning problems with "locking files", which are caused by a known issue on NERSC filesystems that use DVS. See the known issues section of the HDF5 docs page for more information.

How can I checkpoint my Python code?

Checkpointing your code can make your workflow more robust to:

System issues. If your job crashes because of a system issue, you will be able to restart the checkpointed calculation in a resubmitted job later and it can pick up where it left off.
User error. The most common use case here is that the calculation takes longer than the user expected when the job was submitted, and doesn't finish before the time limit.
Preemption. Some HPC systems offer preemptable queues, where jobs can be run with discount charging because they may be interrupted for higher priority jobs. If your code can be preempted because it can checkpoint, you can take advantage of discount charging or submit shorter jobs. The net effect may actually be faster throughput for your workflow.

This example repo demonstrates one simple way to add graceful error handling and checkpointing to a Python code. Note, mpi4py jobs should generally be run with srun on NERSC systems. For example:

srun -n 2 ./main.py

is suitable for checkpointing. For checkpointing to work, other Python jobs must be run with exec:

exec ./main.py

so that the SIGINT signal will be forwarded. (Without exec, Bash will not forward the signal to the Python process.) The InterruptHandler class in this example demonstrates how to catch SIGINT, checkpoint your work, and shut down if necessary.
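To illustrate the general idea, here is a minimal sketch of catching SIGINT and checkpointing at a safe point. It is not the code from the example repo: the names InterruptFlag, load_checkpoint, save_checkpoint, and run are hypothetical, and the JSON state with step and total fields is only a stand-in for your real calculation state.

```python
import json
import os
import signal
import tempfile


class InterruptFlag:
    """Context manager that records SIGINT instead of raising KeyboardInterrupt."""

    def __init__(self):
        self.interrupted = False

    def __enter__(self):
        self._previous = signal.signal(signal.SIGINT, self._handle)
        return self

    def __exit__(self, *exc_info):
        # Restore whatever handler was installed before
        signal.signal(signal.SIGINT, self._previous)
        return False

    def _handle(self, signum, frame):
        self.interrupted = True


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a checkpoint is never half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def run(path, n_steps, step_fn):
    """Resume from path, apply step_fn until n_steps, and stop early on SIGINT."""
    state = load_checkpoint(path)
    with InterruptFlag() as flag:
        while state["step"] < n_steps:
            step_fn(state)
            save_checkpoint(path, state)
            if flag.interrupted:
                break
    return state
```

Here step_fn is expected to update the state in place, including advancing state["step"]. Saving after every step keeps the sketch simple; a real code would likely checkpoint less often and always when the flag is set.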

Why can't I load the python module?

You may see the following message:

Lmod has detected the following error:  CONDA_PREFIX is already set...

This typically means you have already loaded a conda environment or have run conda init to initialize conda when your shell starts up. This may look something like the following in your ~/.bashrc file:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/global/common/software/nersc/pm-2022q3/sw/python/3.9-anaconda-2021.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/global/common/software/nersc/pm-2022q3/sw/python/3.9-anaconda-2021.11/etc/profile.d/conda.sh" ]; then
        . "/global/common/software/nersc/pm-2022q3/sw/python/3.9-anaconda-2021.11/etc/profile.d/conda.sh"
    else
        export PATH="/global/common/software/nersc/pm-2022q3/sw/python/3.9-anaconda-2021.11/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

You can run conda init --reverse to remove these lines, or delete them by hand, to stop conda from initializing when your shell starts up.
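If you are not sure whether your ~/.bashrc contains such a block, a short sketch like this one can report the line range between the two marker comments shown above. The function name find_conda_init_block is illustrative, and the sketch only reads the file.

```python
from pathlib import Path

START = "# >>> conda initialize >>>"
END = "# <<< conda initialize <<<"


def find_conda_init_block(text):
    """Return (first_line, last_line), 1-based, of the conda init block, or None."""
    start = None
    for index, line in enumerate(text.splitlines()):
        stripped = line.strip()
        if stripped == START:
            start = index
        elif stripped == END and start is not None:
            return start + 1, index + 1
    return None


if __name__ == "__main__":
    bashrc = Path.home() / ".bashrc"
    if bashrc.is_file():
        block = find_conda_init_block(bashrc.read_text(errors="replace"))
        if block is None:
            print("No conda initialize block found")
        else:
            print(f"conda initialize block on lines {block[0]}-{block[1]} of {bashrc}")
```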