Data Virtualization Service (DVS)
The HPE Data Virtualization Service (DVS) is an I/O forwarding service that projects a parallel file system, such as our Community, Global Common, or Homes File Systems, to compute nodes. It offers increased stability by limiting the number of clients directly accessing the file system, and it can use aggressive caching to dramatically improve performance for read-intensive workloads.
At NERSC, DVS brokers the Community, Global Common, and Homes File Systems through 24 dedicated I/O nodes called Gateway Nodes. Each Gateway Node has two 16-core 3.0 GHz AMD Rome CPUs, 256 GB of memory, two Slingshot-11 NICs, two Mellanox CX6 VPI single-port HCAs, and two 480 GB SSDs.
The settings for each NERSC file system are tuned according to its purpose. Since Global Common is intended for complex software stacks that are read but not written during batch jobs, it is mounted read-only with an aggressive cache time. Additionally, it is configured so that file access can use any one of the 24 Gateway Nodes. When a job starts, each node is assigned a Gateway Node to use, which allows it to take advantage of both its local cache and the cache on the Gateway Node. Together these settings enable very fast read access to files at full system scale.
The Community and Homes File Systems need to be writable during a job, so the focus is on maintaining a consistent picture across the entire file system. The cache settings are less aggressive. Additionally, each file is assigned to a Gateway Node when it is created, so all future accesses will use that same node. For users who need to read large volumes of data from CFS, we do offer a read-only mount of CFS at /dvs_ro/cfs, which uses many of the same settings as Global Common and should offer better performance, especially if your code reads the same files over and over.
Best Practices for DVS Performance at Scale
While DVS has advantages, it also behaves differently from a directly mounted file system in ways that can cause issues, especially at scale. Following these best practices will help improve your I/O performance.
Install Your Code in the Right Place
For large scale jobs at NERSC, putting your code into a container will always be the most performant option.
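For instance, a job step can run entirely from a container image. Below is a rough sketch using Shifter; the image name, application path, and node count are hypothetical placeholders, so adapt them to your own setup:

```bash
#!/bin/bash
#SBATCH --image=docker:myrepo/my_app:latest
#SBATCH --nodes=128
#SBATCH --constraint=cpu
#SBATCH --time=30

# All of the application's libraries come from the container image,
# so compute nodes never have to read them from Homes, CFS, or Global Common.
# /app/my_app is a placeholder path inside the hypothetical image.
srun shifter /app/my_app
```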
If a container won't work for your use case and your software stack is very small, your next best option is to use the Slurm sbcast command to copy your executable and libraries to local disk on each node.
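A minimal sketch of what this might look like in a batch script; the executable and library names are placeholders:

```bash
#!/bin/bash
#SBATCH --nodes=128
#SBATCH --constraint=cpu
#SBATCH --time=30

# Broadcast the executable and its library to node-local /tmp on every compute node
# (my_app and libmydep.so are hypothetical placeholders).
sbcast --compress ./my_app /tmp/my_app
sbcast --compress ./libmydep.so /tmp/libmydep.so

# Point the dynamic linker at the node-local copies, then run them.
export LD_LIBRARY_PATH=/tmp:$LD_LIBRARY_PATH
srun /tmp/my_app
```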
If a container won't work for your use case and your software stack is very complex (e.g. a conda environment), then you should install into the Global Common File System.
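For example, a conda environment can be built directly under Global Common rather than in your home directory. This is a hedged sketch assuming the conda module is available; `<myproject>` is a placeholder for your project's software directory:

```bash
# Build the environment under Global Common instead of $HOME
# (<myproject> is a placeholder for your project directory).
module load conda
conda create --prefix /global/common/software/<myproject>/myenv python=3.11 numpy scipy

# Activate it in your batch job before running.
conda activate /global/common/software/<myproject>/myenv
```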
If you run into issues using Global Common, you could try the Scratch File System. However, keep in mind that Scratch is periodically purged, which may result in portions of your software stack being removed unexpectedly.
Read Your Data From the Right Place
If your job reads large volumes of data, the fastest file system will almost always be Perlmutter Scratch. However, if many of the processes in your job repeatedly read the same file (e.g. a configuration file), you may see a large speedup by using a read-only DVS mount. On Perlmutter, CFS has a corresponding read-only mount at /dvs_ro/cfs. We recommend using this mount for data that is read during a job but not actively changed. The DVS mount of this file system caches data for 30 seconds by default, so if the data is being changed, you may see unexpected results.
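In practice this is just a path substitution. A hedged example, with a made-up project and file name:

```bash
# Read-heavy input: use the read-only DVS mount instead of the writable CFS path.
#   /global/cfs/cdirs/<myproject>/inputs/config.yaml   <- writable mount
#   /dvs_ro/cfs/cdirs/<myproject>/inputs/config.yaml   <- read-only, cached mount
srun ./my_app --config /dvs_ro/cfs/cdirs/<myproject>/inputs/config.yaml
```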
Things to Avoid With DVS
Avoid ACLs
DVS is unable to cache extended attributes. Extended attributes let users attach metadata to files that is not interpreted by the file system itself. The most common kind of extended attribute is an ACL, which can be used to manage complex access permissions for files. Because DVS cannot cache these attributes, it must go back to the underlying file system every time it touches the file, which can be very slow, especially at large scale. We recommend not using ACLs on any files or directories you need to access at scale during your batch jobs.
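You can check for and remove ACLs with standard tools. A quick sketch, using a placeholder software directory:

```bash
# List any files carrying extended ACL entries under a hypothetical software directory.
getfacl --skip-base -R /global/common/software/<myproject>/myenv

# Remove all extended ACL entries (reverting to plain Unix permissions) if they are not needed.
setfacl -b -R /global/common/software/<myproject>/myenv
```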
Avoid Loading Libraries from Homes or CFS in Dynamically Linked Applications
DVS uses the inode number, a unique identifier for each file and directory, to assign files and directories in Global Homes and CFS to a Gateway Node. This means that every process in a job that accesses a given directory or file there must wait on a single Gateway Node to supply the data; for a dynamically linked application, where every process reads the same shared libraries at startup, this becomes an extreme bottleneck as scale increases. We recommend not running jobs larger than 10 nodes out of Global Homes or CFS. Keep in mind that this limit is aggregate across all of your running jobs: many separate single-node jobs that all start at once have the same effect as a single large job, since they will all be hung up waiting for the same Gateway Node.
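One quick way to spot this problem before launching at scale is to check where a dynamically linked executable resolves its libraries from; a hedged sketch, with a placeholder executable name:

```bash
# Show which shared libraries resolve to Global Homes or CFS
# (my_app is a placeholder for your executable).
ldd ./my_app | grep -E '/global/homes|/global/cfs'
```

Anything that shows up here is a candidate for moving into a container, Global Common, or an sbcast copy on local disk.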
Do Not Run Large-Scale Python Jobs Out of Homes or CFS
By default python prepends the current working directory to the python module search path (sys.path). This means that even if your entire python stack is installed in Global Common, you will still be bottlenecked by accessing your submit directory if you submit from Global Homes or CFS. You can get around this behavior by adding the -I flag to python.
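A minimal sketch of what this looks like in a job step (the script name is a placeholder). Note that -I runs python in isolated mode, which also ignores PYTHONPATH and user site-packages, so make sure your environment does not rely on them:

```bash
# Run python in isolated mode so the submit directory is not added to sys.path.
srun python -I ./my_script.py
```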
Please see our guide for running python at NERSC for more information about how best to scale up your python jobs.
Do Not Use File Locking
DVS doesn't support file locking. It's turned off by default for most codes at NERSC (including HDF5). If you do need to use any kind of file locking, please use Perlmutter Scratch.
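If you do hit locking errors from an HDF5-based code on a DVS mount, locking can usually be disabled explicitly through an environment variable; this is a hedged example (the executable name is a placeholder, and non-HDF5 codes will have their own mechanisms):

```bash
# Explicitly disable HDF5 file locking for this job (usually already the default at NERSC).
export HDF5_USE_FILE_LOCKING=FALSE
srun ./my_hdf5_app
```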
Do Not Use Memory Mapping (mmap)
DVS at NERSC doesn't support memory mapped file I/O. If you do need to use memory mapping, please use Perlmutter Scratch.