Local temporary file system in memory
The memory of compute nodes can also be used to store data through a file system-like interface. Most Linux systems mount part of the system memory as a tmpfs under the path /dev/shm/, which any user can write to, much like the /tmp/ directory. This kind of file system can help when the same files are accessed many times or when the files are very small, both of which are usually troublesome for parallel file systems.
Data stored in /dev/shm reduces the memory available to the OS and may cause the compute node to run out of memory (OOM). This can kill your processes, interrupt your job, and/or crash the compute node itself.
Each compute node architecture at NERSC has a different memory layout, so first inspect how much memory each architecture reserves for /dev/shm by running df -h /dev/shm in an interactive session (by default, the size limit of a tmpfs is half the physical RAM installed). Note that /dev/shm is a file system local to each node, so files cannot be shared across the nodes of a multi-node job.
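The inspection step can also be scripted, for instance at the top of a job script. The following is a minimal sketch, assuming a bash-compatible shell; `fs_avail_kib` is a hypothetical helper name introduced only for illustration.

```shell
# Sketch: print the space (in KiB) still available on the file system that
# backs a directory. fs_avail_kib is a hypothetical helper name.
fs_avail_kib() {
    local dir="${1:-/dev/shm}"
    # -P selects the POSIX output format (one line per file system);
    # -k reports sizes in 1024-byte blocks. Column 4 is the available space.
    df -Pk "$dir" | awk 'NR == 2 { print $4 }'
}

# Only query /dev/shm when it exists on this machine.
if [ -d /dev/shm ]; then
    echo "/dev/shm available: $(fs_avail_kib /dev/shm) KiB"
fi
```

The value reported is the tmpfs size limit minus what is already stored there. It does not guarantee that this much memory is actually free on the node.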
Since data is purged after every job completes, users cannot expect their data to persist across jobs, and must manually stage their data into /dev/shm before every execution and stage it out before the job completes. If this data movement involves many small files, the best approach is to create an archive containing all the files beforehand (e.g. on a DTN node, to avoid wasting precious compute time), then extract the archive into /dev/shm inside the job: this minimizes the number of accesses to small files on the parallel file systems and produces large contiguous file accesses instead.
For example, assume several small input files are needed to bootstrap your jobs, and are stored in your scratch directory at $SCRATCH/files/. Here’s how you could produce a compressed archive input.tar.gz (note that the $SCRATCH variable is not expanded on the DTN nodes, so the full path is spelled out):
ssh dtn03.nersc.gov
cd /global/cscratch1/sd/$USER/
tar -czf input.tar.gz files/
Now you can extract it into /dev/shm when inside a job.
Note that /dev/shm may already contain system directories, and writing among them may cause your process to misbehave. For this reason you may want to create a subdirectory of your own and extract your files there, when inside your job:
mkdir /dev/shm/$USER
tar -C /dev/shm/$USER -xf $SCRATCH/input.tar.gz
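Because extracting more data than the node can hold risks an out-of-memory condition, it may be worth checking the archive size before extracting. The following is a minimal sketch, assuming a bash-compatible shell and a gzip-compressed archive; `stage_in` is a hypothetical helper name, and the size estimate is approximate (very small files can occupy more space in memory than in the tar stream), so leave some headroom.

```shell
# Hypothetical helper: extract a .tar.gz archive into a private directory,
# but only if the uncompressed data appears to fit in the space available.
stage_in() {
    local archive="$1" dest="$2"
    local need_kib avail_kib
    mkdir -p "$dest" || return 1
    # Size of the uncompressed tar stream, rounded up to KiB.
    # This reads the archive once more, as a single large sequential read.
    need_kib=$(( ( $(gzip -dc "$archive" | wc -c) + 1023 ) / 1024 ))
    # Space still available on the file system backing the destination.
    avail_kib=$(df -Pk "$dest" | awk 'NR == 2 { print $4 }')
    if [ "$need_kib" -ge "$avail_kib" ]; then
        echo "stage_in: need ${need_kib} KiB, only ${avail_kib} KiB available in $dest" >&2
        return 1
    fi
    tar -C "$dest" -xzf "$archive"
}
```

Inside a job it could be called as `stage_in "$SCRATCH/input.tar.gz" "/dev/shm/$USER"`. Passing this check only means the data fits within the tmpfs limit, not that enough memory remains for the application itself.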
A similar approach is needed at stage-out, before the job completes, to preserve important files created by the job. For example, if a job created files in /dev/shm/$USER/, we may want to archive and compress them into a single file with:
cd /dev/shm/$USER/
tar -czf $SCRATCH/output_collection/output.tar.gz .
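The stage-out step can be wrapped in a function so that it is easy to run at the end of a job script. The following is a minimal sketch, assuming a bash-compatible shell; `stage_out` and the `.partial` suffix are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: archive and compress a directory tree into one file.
stage_out() {
    local src="$1" archive="$2"
    # Make sure the directory that will hold the archive exists.
    mkdir -p "$(dirname "$archive")" || return 1
    # Write under a temporary name first, then rename, so an interrupted run
    # never leaves a truncated file under the final archive name.
    tar -C "$src" -czf "$archive.partial" . && mv "$archive.partial" "$archive"
}

# In a job script, one option is to register the helper so it runs when the
# script exits (shown commented out, since the paths depend on your job):
# trap 'stage_out "/dev/shm/$USER" "$SCRATCH/output_collection/output.tar.gz"' EXIT
```

An EXIT trap runs when the script ends normally or on an error exit. It does not run if the node crashes or the shell is killed with SIGKILL, so it is not a substitute for checkpointing.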
Pay attention to the memory usage on the node: storing too much data in an in-memory tmpfs may force the kernel to kill running processes and/or cause the node to crash if not enough memory is left available.
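One way to keep an eye on this is to compare what you have stored with the memory the kernel estimates is still available. The following is a minimal sketch, assuming a bash-compatible shell on Linux and a private directory at /dev/shm/$USER; `dir_usage_kib` is a hypothetical helper name.

```shell
# Hypothetical helper: total size of a directory tree, in KiB.
dir_usage_kib() {
    du -sk "$1" | awk '{ print $1 }'
}

# Report the usage of the private directory next to MemAvailable, the
# kernel's estimate of memory usable without swapping (Linux /proc/meminfo).
if [ -d "/dev/shm/$USER" ] && [ -r /proc/meminfo ]; then
    echo "stored in /dev/shm/$USER: $(dir_usage_kib "/dev/shm/$USER") KiB"
    awk '/^MemAvailable:/ { print "MemAvailable: " $2 " KiB" }' /proc/meminfo
fi
```

Running this between the phases of a job gives a rough picture of how much headroom is left before the application's own memory needs are affected.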
It is also important to note that /dev/shm, being volatile memory, offers no fault tolerance: a node crash will cause the data to be lost. See also our documentation on Checkpointing for solutions.
If you're creating large archives (larger than about 1 GB), please consider striping the scratch directory where you will create the archive.
A similar solution is to use temporary XFS file systems on top of Lustre when using Shifter containers.