Lustre File Striping
Perlmutter uses Lustre as its $SCRATCH file system. For many applications a technique called file striping will increase I/O performance. File striping will primarily improve performance for codes doing serial I/O from a single node, or parallel I/O from multiple nodes writing to a single shared file, as with MPI-IO, parallel HDF5 or parallel NetCDF.
The Lustre file system is made up of an underlying set of I/O servers and disks called Object Storage Targets (OSTs). A file is said to be striped when its data is on multiple OSTs. Read and write operations on striped files access multiple OSTs concurrently, which increases the available I/O bandwidth. Selecting the best striping can be complicated: striping a file over too few OSTs will not take advantage of the system's available bandwidth, but striping over too many will cause unnecessary overhead and lead to a loss in performance. The default stripe count on Perlmutter's $SCRATCH is 1, meaning that each file is written to a single OST by default.
NERSC File Striping Recommendations
NERSC has provided striping command shortcuts based on file size and I/O pattern to simplify optimization on Perlmutter. The recommendations distinguish two I/O patterns:
- Shared file I/O: Either one processor does all the I/O for a simulation in serial or multiple processors write to a single shared file as with MPI-IO and parallel HDF5 or NetCDF
- File per process: Each process writes to its own file resulting in as many files as number of processes
File size | Single Shared-File I/O (command) | File per Process (command) |
---|---|---|
< 1 GB | keep default striping | keep default striping |
1 - 10 GB | stripe_small | keep default striping |
10 - 100 GB | stripe_medium | keep default striping |
100 GB - 1 TB | stripe_large | keep default striping |
> 1 TB | stripe_large | stripe_large |
These helper scripts will set the number of OSTs to stripe across to 8, 24, and 72 for stripe_small, stripe_medium and stripe_large, respectively. In all cases, the stripe size is 1 MB.
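The table above can be encoded as a small lookup, for example to pick a helper inside a job-preparation script. The following is only a sketch: the function name `recommend_stripe` and the pattern labels `shared` and `fpp` are hypothetical, 1 TB is assumed to mean 1024 GB, and a size that falls exactly on a boundary in the shared-file column is assumed to belong to the larger bracket.

```shell
# Hypothetical helper: map a file size (whole GB) and an I/O pattern to the
# recommended striping shortcut from the table above.
# Assumptions: 1 TB is treated as 1024 GB, and shared-file sizes that fall
# exactly on a boundary are assigned to the larger bracket.
recommend_stripe() {
    size_gb=$1      # expected file size in whole gigabytes
    pattern=$2      # "shared" (single shared file) or "fpp" (file per process)

    case "$pattern" in
        shared)
            if   [ "$size_gb" -lt 1 ];   then echo "default"
            elif [ "$size_gb" -lt 10 ];  then echo "stripe_small"
            elif [ "$size_gb" -lt 100 ]; then echo "stripe_medium"
            else                              echo "stripe_large"
            fi
            ;;
        fpp)
            # Only files larger than 1 TB get non-default striping.
            if [ "$size_gb" -gt 1024 ]; then echo "stripe_large"
            else                             echo "default"
            fi
            ;;
        *)
            echo "unknown I/O pattern: $pattern" >&2
            return 1
            ;;
    esac
}
```

With these assumptions, `recommend_stripe 50 shared` prints `stripe_medium`, while `recommend_stripe 50 fpp` prints `default`.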
Warning
Do not use a stripe count larger than stripe_large (72 OSTs). This will result in poor performance and can adversely affect the entire file system.
Striping must be set on a file before it is written. For example, for a file of about 10-100 GB in size, one could create an empty file and set its striping appropriately with the command:
stripe_medium output_file
This has to be done before running a job that will populate the file. Striping of a file cannot be changed once the file has been written to, aside from manually copying the existing file into a newly created (empty) file with the desired striping.
Files inherit the striping configuration of the directory in which they are created. Again, the desired striping must be set on the directory before creating the files (later changes of the directory striping are not inherited). When copying an existing striped file into a striped directory, the new copy will inherit the directory's striping configuration. This provides another approach to changing the striping of an existing file.
Inheritance of striping provides a convenient way to set the striping on multiple output files at once, if all such files are written to the same output directory. For example, if a job will produce multiple 10-100 GB output files in a known output directory, the striping of the latter can be configured before job submission:
mkdir output_directory
stripe_medium output_directory
Restriping an Existing File
To restripe an existing file you can either make a copy of it:
stripe_large tmp_my_big_file
cp my_big_file tmp_my_big_file
mv tmp_my_big_file my_big_file
If there are multiple files, you could create a directory with the desired striping and copy the files into it, to avoid repeating the above procedure for each file.
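The directory-based approach for several files could be sketched as follows. The function name `copy_into_striped_dir` and its argument order are hypothetical; the striping command is passed in as the first argument so that any of the helper scripts above can be used.

```shell
# Hypothetical helper: copy files into a fresh directory that was striped
# before any file was created in it, so that each copy inherits the striping.
# Usage: copy_into_striped_dir <stripe_command> <new_directory> <file>...
copy_into_striped_dir() {
    stripe_cmd=$1   # e.g. stripe_small, stripe_medium or stripe_large
    dest_dir=$2     # directory to create; it must not exist yet
    shift 2

    if [ -e "$dest_dir" ]; then
        echo "refusing to reuse existing path: $dest_dir" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    # Set the striping on the empty directory first ...
    "$stripe_cmd" "$dest_dir" || return 1
    # ... then copy, so that the new files pick up the directory's layout.
    for f in "$@"; do
        cp -- "$f" "$dest_dir/" || return 1
    done
}
```

A call such as `copy_into_striped_dir stripe_large restriped my_big_file_1 my_big_file_2` would then leave restriped copies in `restriped`; the originals are not touched and can be removed once the copies have been verified.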
The alternative is to use lfs_migrate, and let Lustre manage the migration:
lfs_migrate -c $STRIPE_COUNT -S 1M my_big_file
where $STRIPE_COUNT is a sensible number of OSTs according to the table above.
Check striping of files and directories
To obtain the number of OSTs a file or directory is striped on, you can use lfs getstripe, which works similarly to ls:
$ mkdir $SCRATCH/test-dir
$ stripe_medium $SCRATCH/test-dir
$ echo > $SCRATCH/test-dir/test-file.txt
$ lfs getstripe $SCRATCH/test-dir
/pscratch/sd/a/adele/test-dir
stripe_count: 24 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
/pscratch/sd/a/adele/test-dir/test-file.txt
lmm_stripe_count: 24
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 138
obdidx objid objid group
138 15345604 0xea27c4 0x19c0000403
... ... ... ...
If you only want to see the details of the directory itself and not its content, use lfs getstripe -d $directory.
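For directories with many files the full listing becomes long. A small filter can reduce it to one line per file; the sketch below uses a hypothetical function name, `summarize_stripe_counts`, and assumes the output layout shown above (a path line starting with `/`, followed by an `lmm_stripe_count` line).

```shell
# Hypothetical helper: read `lfs getstripe` output on stdin and print
# "<path> <stripe count>" for every file entry.
# Assumption: each file entry is a path line starting with "/" followed by
# an "lmm_stripe_count:" line, as in the sample output above.
summarize_stripe_counts() {
    awk '
        /^\//               { path = $0 }        # remember the latest path
        /lmm_stripe_count:/ { print path, $2 }   # report the count for it
    '
}
```

It would be used as `lfs getstripe $SCRATCH/test-dir | summarize_stripe_counts`. The directory's own default line (`stripe_count: ...`) is skipped on purpose, since it does not carry the `lmm_` prefix.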
Custom Lustre Striping
To set striping for a file or directory use the command lfs setstripe.
Each file and directory can have a separate striping pattern; a directory's striping setting can be overridden for a particular file by issuing the lfs setstripe command for individual files within that directory (or by using the commands introduced above). However, as noted above, the striping setting for a file must be set before it is created. If the striping settings for an existing directory are changed, the files need to be copied elsewhere and then copied back to the directory in order to inherit the new settings. The lfs setstripe syntax is:
$ lfs setstripe \
--size [stripe-size] \
--index [OST-start-index] \
--count [stripe-count] \
filename
Option | Description | Default |
---|---|---|
stripe-size | Number of bytes written to one OST before cycling to the next. Use multiples of 1 MB. The default has been the most successful. | 1 MB |
stripe-count | Number of OSTs a file exists on | 1 on Perlmutter |
OST-start-index | Starting OST. Default highly recommended | -1 (System follows a round robin procedure to optimize creation of files by all users.) |
As noted above, don't use a stripe count greater than 72, as it can reduce I/O performance due to the high number of metadata requests, and can negatively impact all users of the Lustre file system.
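One way to avoid exceeding this limit by accident is to validate the stripe count before calling lfs setstripe. The sketch below is not a NERSC-provided tool: the function name `print_setstripe_cmd` and the variable `MAX_STRIPE_COUNT` are hypothetical, it keeps the 1 MB stripe size used by the helper scripts, and it only prints the command (a dry run) rather than executing it. It uses the short options `-c` (stripe count) and `-S` (stripe size).

```shell
# Hypothetical wrapper: refuse stripe counts outside 1..72 and print the
# lfs setstripe command instead of running it (a dry run).
# Assumption: the 1 MB stripe size used by the helper scripts is kept.
MAX_STRIPE_COUNT=72

print_setstripe_cmd() {
    count=$1    # requested number of OSTs
    target=$2   # file or directory to stripe

    # Reject empty or non-numeric input before doing integer comparisons.
    case "$count" in
        ''|*[!0-9]*)
            echo "stripe count must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ "$count" -lt 1 ] || [ "$count" -gt "$MAX_STRIPE_COUNT" ]; then
        echo "stripe count $count is outside 1..$MAX_STRIPE_COUNT" >&2
        return 1
    fi

    echo "lfs setstripe -c $count -S 1M $target"
}
```

For example, `print_setstripe_cmd 24 output_file` prints `lfs setstripe -c 24 -S 1M output_file`, whereas a count of 100 is rejected with an error message.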