Denovo is the name of the new configuration of the system available exclusively to JGI users on the NERSC Mendel cluster. The previous configuration is referred to as "Old Genepool".
Logging into Denovo¶
Logging into Denovo is as simple as typing `ssh denovo` when logged into another NERSC system. JGI users are all enabled there by default, and because your $HOME is on a global filesystem, your files are in the same place as they were on Old Genepool and on Cori. However, Denovo does not have all the software modules that Old Genepool had, since we expect you to use tools like Shifter or Anaconda to manage software in a more robust manner whenever possible. There are tutorials on Shifter and Anaconda on the Training & Tutorials page, as well as tutorials on Slurm.
Using Slurm on Denovo¶
Like all NERSC systems, Denovo uses the open-source Slurm scheduler exclusively for its job scheduling. You can view NERSC's pages on Slurm, as well as the complete Slurm documentation on the SchedMD website.
| Genepool UGE command | Slurm equivalent | Description |
|---|---|---|
| `qsub yourscript.sh` | `sbatch yourscript.sh` | Submit the shell script "yourscript.sh" as a job |
| `qlogin` | `salloc` | Start an interactive session. In Slurm, `salloc` creates a node allocation and automatically logs you into the first of the nodes you were allocated. |
| `qs` | `squeue` | View jobs running on the cluster. Uses a cached dataset, and can be used in scripts. `squeue --help` will provide a full list of options. |
| `qhold $jobnumber` | `scontrol hold $jobnumber` | Hold a specific job |
| `qrls $jobnumber` | `scontrol release $jobnumber` | Release a specific job |
| `qhost` | `sinfo` | Get information on the configuration of the cluster. `sinfo --help` provides a full list of options. |
| | `scontrol show $ENTITY` | Also provides detailed information about various aspects of the cluster configuration, for example `scontrol show partitions` or `scontrol show nodes`. See `scontrol show --help` for a full list of options. |
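As a rough sketch, the commands in the table above combine into a session like the following. The script name is a placeholder, and the commands are guarded so the sketch only calls Slurm where it is actually installed:

```shell
#!/bin/bash
# Sketch of the UGE-to-Slurm workflow: submit, hold, release, inspect.
# "yourscript.sh" is a placeholder; guard so this runs only where Slurm
# and the script exist.
if command -v sbatch >/dev/null && [ -f yourscript.sh ]; then
  JOBID=$(sbatch --parsable yourscript.sh)   # like: qsub yourscript.sh
  scontrol hold "$JOBID"                     # like: qhold $JOBID
  scontrol release "$JOBID"                  # like: qrls $JOBID
  squeue --job "$JOBID"                      # like: qs
else
  echo "Slurm commands not available on this machine"
fi
```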
Guide to scheduler options¶
These options can be given on the submission command line or in the shell script header. You can find more useful information, including a command cheat sheet, in the Slurm documentation.
| UGE option | Slurm equivalent | Description |
|---|---|---|
| `-q $queue` | `-q $qos` | On NERSC systems you should request a QOS, and the scheduler will direct your jobs to the appropriate partition based on the QOS and resource requests. A QOS request is NOT necessary on Denovo, but is required on Cori. |
| | `-N $count` | Number of nodes requested. (In UGE, you would request a total number of CPUs and UGE would allocate an appropriate number of nodes to fill the request, so this option was not available.) |
| `-pe $options` | `-n $count` | Number of MPI tasks requested. Note that this is the total across all nodes, not per node; to request a per-node count, use `--ntasks-per-node`, and be careful because it multiplies with `-N`. |
| | `-c $count` | Number of CPUs per task. The number of CPUs per task multiplied by the number of tasks per node should not exceed the number of CPUs per node. |
| `-l h_rt=$seconds` | `-t hh:mm:ss` | Hard run-time limit. Note that in Slurm a bare number means minutes: `-t 30` requests 30 minutes of run time. |
| `-l mem_free=$value` | `--mem=$value` | Minimum amount of memory, with units. For example: `--mem=120G` |
| `-l ram.c=$value` | `--mem-per-cpu=$value` | Minimum amount of memory per CPU, with units. For example: `--mem-per-cpu=5G` |
| `-o $filename` | `-o $filename` | Standard output filename |
| `-e $filename` | `-e $filename` | Standard error filename |
| `-m abe` | `--mail-type=$events` | Send email message on events. In Slurm, `$events` can be BEGIN, END, FAIL, or ALL |
| `-M $emailaddress` | `--mail-user=$emailaddress` | Email event messages to `$emailaddress` |
| `-P $project` | `-A $project` | Project (account) under which to charge the job |
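As a sketch, the options above combine into a batch-script header like this. The file names and resource numbers are illustrative, not recommendations:

```shell
#!/bin/bash
#SBATCH -t 02:00:00          # 2-hour run-time limit
#SBATCH -n 1                 # one task
#SBATCH -c 8                 # 8 CPUs for that task
#SBATCH --mem=16G            # 16 GB of memory
#SBATCH -o myjob.%j.out      # %j expands to the job ID
#SBATCH -e myjob.%j.err
#SBATCH --mail-type=END

# Slurm sets SLURM_CPUS_PER_TASK inside the job; default to 8 so the
# script can also be exercised outside Slurm.
CPUS=${SLURM_CPUS_PER_TASK:-8}
echo "Using ${CPUS} CPUs on $(hostname)"
```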
Slurm is much more flexible than UGE, so we don't need to impose strict limits through lots of queues, an approach that leads to inefficient use of resources. To allow Slurm to schedule your jobs efficiently, and to make your batch jobs portable between Denovo and Cori, you should not target particular queues or partitions if you can avoid it. Instead, specify the number of CPUs, the amount of memory, and the runtime that you need (walltime, not multiplied by the number of CPUs).
The more accurately you specify those parameters, the faster your job will schedule, because the scheduler can find resources to match your requirements sooner than it otherwise might. Requesting overly long runtimes, in particular, will cause your jobs to be held in the period before scheduled maintenance, since the scheduler will assume there's not enough time to run your job before the machine goes down for maintenance. Run a few pilot jobs first if you're not sure what you need. Workflow managers can handle retries with adjusted limits in the case of outliers.
Since there is currently no per-user job-submission limit, one person could commandeer the entire cluster for an indefinite period, so we ask that you be responsible with your job submissions. If you use task arrays, you can limit the degree of concurrency, which lets you submit large numbers of jobs without monopolizing the cluster. For example, this command submits 1000 jobs but allows only 10 to run at any one time:
```
denovo> sbatch --array=1-1000%10 ...
```
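Within an array job, each task can pick its own input from its task ID. A minimal sketch, assuming hypothetical input files named `sample.N.fasta`:

```shell
#!/bin/bash
#SBATCH -t 00:10:00
#SBATCH --array=1-1000%10   # 1000 tasks, at most 10 running at once

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 so the
# script can also be exercised outside Slurm.
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}
INPUT="sample.${TASK_ID}.fasta"   # hypothetical input naming scheme
echo "Task ${TASK_ID} processing ${INPUT}"
```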
You can request exclusive use of a node with the `--exclusive` flag to the sbatch command. Slurm enforces job boundaries more rigidly than UGE, so jobs that attempt to use more memory or CPUs than they were allocated will be killed by the scheduler. Using `--exclusive` overrides the requested number of CPUs and amount of memory, since the job gets the whole node.
Jobs on Denovo have a 7-day walltime limit. There is no limit to the number of jobs that may be submitted at a given time.
You should take care to specify the amount of memory your job needs. By default, if you don't request memory, your job will go to one of the 128 GB nodes. To use 256 GB or more, you must specify the memory you need. The actual amount of memory available for jobs is slightly less than the total installed, since the operating system takes some of it; so if you request a full 128 GB with `--mem=128G`, your job will run on a 256 GB node. The `sinfo` command from the previous section lets you list the available memory in MB, and the current maximum limits are shown in the table below.
| Nominal system memory | Maximum available memory (MB) | Slurm option (default unit is MB; use K/M/G/T to specify) |
|---|---|---|
| 128 GB | 121042 | `--mem=121042` |
| 256 GB | 249239 | `--mem=249239` |
| 512 GB | 506097 | `--mem=506097` |
| 1 TB | 1012160 | `--mem=1012160` |
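One way to see these limits for yourself is an `sinfo` query with format specifiers. This is a sketch using standard `sinfo` format fields; the exact partition layout on Denovo may differ, and the command is guarded so the sketch also runs where Slurm is absent:

```shell
#!/bin/bash
# %P=partition, %c=CPUs per node, %m=memory per node (MB), %l=time limit
if command -v sinfo >/dev/null; then
  sinfo --format="%P %c %m %l"
else
  echo "sinfo not available on this machine"
fi
```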
Monitoring job progress¶
The squeue command is very flexible, and can give you a lot of information on currently running or queued jobs. For completed jobs, sacct offers much the same functionality. There are simple options to select the set of jobs by username, by partition, by job-ID etc, and options to specify the output information in ways that are easy for scripts to parse. For example, to see all my currently running or queued jobs, I can do this:
```
denovo> squeue --user wildish
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
33077 productio sleep.sh  wildish  R 10:50     1 mc1211
```
If you want to monitor your jobs from a script, you can produce output that is much easier for a script to parse:
```
denovo> squeue --noheader --user wildish --Format=jobid,state
33077    RUNNING
```
Or even simpler, if you know the ID of the job you want to monitor:
```
denovo> squeue --noheader --job 33077 --Format=state
RUNNING
```
For completed jobs, the sacct command is almost completely equivalent:
```
denovo> sacct --noheader --user wildish --format=jobid,state
33065      COMPLETED
33065.0    COMPLETED
33066      COMPLETED
33066.0    COMPLETED
33067      FAILED
33067.0    FAILED
33068      FAILED
33068.0    FAILED
33069      COMPLETED
33069.0    COMPLETED
```
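If you want to drive this from a workflow script, a bounded polling loop might look like the following sketch. The job ID is illustrative, the loop is capped at a few attempts, and it is guarded so it exits cleanly on systems without Slurm:

```shell
#!/bin/bash
# Return a job's state, or nothing once it has left the queue.
job_state() {
  squeue --noheader --job "$1" --Format=state 2>/dev/null | tr -d '[:space:]'
}

JOBID=33077   # illustrative job ID
if command -v squeue >/dev/null; then
  for attempt in 1 2 3; do
    STATE=$(job_state "$JOBID")
    [ -z "$STATE" ] && break    # job left the queue: check sacct instead
    echo "Job ${JOBID} is ${STATE}"
    sleep 30
  done
else
  echo "squeue not available on this machine"
fi
```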
Check the man pages for more details of each command's options (e.g. `man squeue`, `man sacct`).
Shifter on Denovo¶
Shifter on Denovo behaves much the same as on Cori or Edison. You do not need to load any modules for Shifter; it's in the default $PATH. Shifter implements the limited subset of the Docker functionality that can be safely supported at NERSC: it can run images and volume-mount filesystems for access, but specifying ports or environment variables is not supported, nor are more advanced features like linked containers. In general, you're expected to simply run a script or binary, with optional arguments, with input and output mapped to the global filesystems.
You can find more information on Shifter on the Using Shifter and Docker page; here are a few simple example commands.
Running an image interactively¶
Denovo currently supports running images on the login nodes. Cori and Edison do not. You should run on the login nodes only to debug or test images, not to run something that takes longer than a few minutes. Once you know that your container runs, please submit a batch job to run it, rather than use the login nodes.
```
denovo> shifter --image=registry.services.nersc.gov/jgi/hmmer:latest hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.1b2 (February 2015); https://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options] <hmmdb> <seqfile> [...]
```
You have to use the full name of the image, `registry.services.nersc.gov/jgi/hmmer:latest`, not just `hmmer` as you can with Docker.
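To read input from or write output to the global filesystems, Shifter's `--volume` flag maps a directory into the container. A minimal sketch, assuming a hypothetical scratch path (substitute your own directory), guarded so it only invokes Shifter where it is installed:

```shell
#!/bin/bash
# Bind a (hypothetical) scratch directory into the container at /data.
if command -v shifter >/dev/null; then
  shifter --image=registry.services.nersc.gov/jgi/hmmer:latest \
          --volume=/global/projectb/scratch/$USER:/data \
          hmmscan -h
else
  echo "shifter not available on this machine"
fi
```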
Running an image interactively on a batch node¶
If you need to run for more than a few minutes while debugging, you can get an interactive node and run there. Use the `salloc` command, then run shifter. `salloc` takes many of the same options that `sbatch` does: you can ask for more than one node, specify how long you want the allocation, and so on. `salloc` immediately logs you into the first node of your allocation.
```
denovo> salloc
salloc: Granted job allocation 33066
bash-4.1$ shifter --image=registry.services.nersc.gov/jgi/hmmer:latest hmmscan -h
# hmmscan :: search sequence(s) against a profile database
# HMMER 3.1b2 (February 2015); https://hmmer.org/
# Copyright (C) 2015 Howard Hughes Medical Institute.
# Freely distributed under the GNU General Public License (GPLv3).
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmscan [-options] <hmmdb> <seqfile> [...]
bash-4.1$ exit
salloc: Relinquishing job allocation 33066
denovo>
```
Running a shifter image in a batch script¶
Shifter is integrated with Slurm on Cori, Edison and Denovo, which means that you can tell Slurm to pull the shifter image you need, and keep the body of your script cleaner. For example, you can submit the following script to Slurm:
```
denovo> cat shifter.sh
#!/bin/bash
#SBATCH -t 00:01:00
#SBATCH --image=alpine:3.5

shifter cat /etc/os-release
```
and it will produce this output:
```
denovo> cat slurm-33075.out
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.5.2
PRETTY_NAME="Alpine Linux v3.5"
HOME_URL="https://alpinelinux.org"
BUG_REPORT_URL="https://bugs.alpinelinux.org"
```
So, in this example we specified the image in the #SBATCH directive and just used `shifter` in the script body to run a command from that container. That's a bit cleaner than having `shifter --image=...` sprinkled throughout the batch script, but we can go one step further. By specifying the image on the `sbatch` command line, and not in the script, we can make a script that works with several different versions of the container without change. This example simply reports what OS the container thinks it's running, and we can tell the script to run different containers at submit time:
```
denovo> cat shifter.sh
#!/bin/bash
#SBATCH -t 00:01:00

shifter cat /etc/os-release

denovo> sbatch --image=alpine:3.5 shifter.sh
Submitted batch job 33076
[...]
denovo> cat slurm-33076.out
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.5.2
PRETTY_NAME="Alpine Linux v3.5"
HOME_URL="https://alpinelinux.org"
BUG_REPORT_URL="https://bugs.alpinelinux.org"
```
Interactive nodes, 'gpint' replacements¶
We have a few nodes available as replacements for the group-specific gpint nodes, so-called 'dints'. These are configured exactly like the other login nodes but are not part of the round-robin login: you reach one by first connecting to a standard Denovo login node and then on to the dint. We can allocate these to groups for testing purposes, and once your group is happy that everything works, we will coordinate with you to migrate your gpints to dints. Let us know if you need one.
Interactive nodes for batch debugging¶
There are 5 nodes reserved for interactive work, specifically for debugging batch scripts, not for running long-term services or for exploratory analysis.
You can access them by requesting the interactive QOS:

```
denovo> salloc --qos=interactive --time 01:00:00
salloc: Granted job allocation 283255
salloc: Waiting for resource configuration
salloc: Nodes mc1535 are ready for job
```
Please use these nodes reasonably; don't stay logged in if you're not making proper use of them.