
Using HDF5 in Python (h5py)

You can use h5py for either serial or parallel I/O.

If you would like to use h5py for serial I/O, you can load our default Python module, which already includes h5py, via module load python.

You can also conda install h5py into your custom conda environment.
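
If you only need serial I/O, a minimal sketch looks like the following (the file name my_serial_test.hdf5 is just an illustration):

import h5py
import numpy as np

# Write a small integer dataset to a new file
with h5py.File('my_serial_test.hdf5', 'w') as f:
    f.create_dataset('x', data=np.arange(10))

# Read the dataset back and print it
with h5py.File('my_serial_test.hdf5', 'r') as f:
    print(f['x'][:])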

For more general information about HDF5 at NERSC, please see this page.

Using h5py-parallel

If you would like to use h5py for parallel I/O, you will have to build h5py against mpi4py in your custom conda environment.

We provide directions below for building a conda environment with parallel-enabled h5py. These directions are based on those found here and here.

You will first need a conda environment with mpi4py built and installed for NERSC. You can follow our directions here, or you can try cloning our lazy-mpi4py conda environment, in which we have already built mpi4py for you:

module load python
conda create -n h5pyenv --clone lazy-mpi4py

Activate your environment

source activate h5pyenv

Load and configure your modules:

module load cray-hdf5-parallel
module swap PrgEnv-intel PrgEnv-gnu

Clone the h5py github repository:

cd $SCRATCH
git clone https://github.com/h5py/h5py
cd h5py

Configure your build environment (the exact path of the cc compiler wrapper depends on the craype version installed on the system; which cc will show it):

export HDF5_MPI="ON"
export CC=/opt/cray/pe/craype/2.6.2/bin/cc

Configure your h5py build:

python setup.py configure

The output should look like:

********************************************************************************
                       Summary of the h5py configuration

HDF5 include dirs: [
  '/opt/cray/pe/hdf5/1.10.5.2/GNU/8.2/include'
]
HDF5 library dirs: [
  '/opt/cray/pe/hdf5/1.10.5.2/GNU/8.2/lib'
]
     HDF5 Version: '1.10.5'
      MPI Enabled: True
 Rebuild Required: True

********************************************************************************

Now build:

python setup.py build

Once the build completes, you'll need to install h5py as a Python package:

pip install --no-binary=h5py h5py
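
To check that the installed h5py is parallel-enabled, you can inspect its build configuration from Python (a quick sanity check; it only reads build flags and does not perform any MPI communication):

import h5py

# A parallel-enabled build reports MPI support here
print(h5py.get_config().mpi)   # should print True
print(h5py.version.info)       # summary of the h5py/HDF5 versions used in the build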

Now we will test our parallel-enabled h5py conda environment on a compute node, since mpi4py will not work on login nodes. Get an interactive compute node:

salloc -N 1 -t 20 -C haswell -q interactive
module load python
source activate h5pyenv

We'll use the following test program, described in the h5py docs. Save it as test_h5pyparallel.py:

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank  # The process ID (integer 0-3 for 4-process run)

f = h5py.File('parallel_test.hdf5', 'w', driver='mpio', comm=MPI.COMM_WORLD)

dset = f.create_dataset('test', (4,), dtype='i')
dset[rank] = rank

f.close()

We can run this test with 4 MPI ranks:

srun -n 4 python test_h5pyparallel.py

Let's look at the file we wrote using h5dump parallel_test.hdf5. The output should look like this:

HDF5 "parallel_test.hdf5" {
GROUP "/" {
   DATASET "test" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 1, 2, 3
      }
   }
}
}

Great! Each of our 4 MPI ranks wrote part of this HDF5 file.
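
If you'd like to verify the contents from Python as well, a short serial read-back sketch (using the same file and dataset names as the test program above) looks like this:

import h5py

# Open the file written by the parallel test and print the dataset
with h5py.File('parallel_test.hdf5', 'r') as f:
    print(f['test'][:])   # expected output: [0 1 2 3]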