
Welcome Home: A Beginner's Guide to using NERSC

Welcome to NERSC! This guide is meant for all users, especially those with limited computing experience. In this guide we will walk through how to:

  • Log in to the system
  • Navigate directories and storage spaces
  • Access supported applications
  • Prepare and submit a job script

These items also represent a typical workflow for many NERSC users, especially those who use our systems for completing large calculations. Calculations may include simulations of physical phenomena, data pre- or post-processing, and other forms of data creation or analysis.

To explain some confusing or new concepts, we will use the analogy of a house. In this case, our house has an entryway after a locked door, many rooms including a kitchen and some bedrooms and closets, and attached to the house is a garage for parking cars or storage. Throughout this tutorial, we'll think about various high performance computing (HPC) components, systems, or actions through analogies of actions that take place in a house, like cooking in the kitchen (computing), storing items in the closets (storage options), and sharing common spaces like the living room with others (how to be a good NERSC citizen).

Before we enter the house, remember that you can always get help. Either look through the rest of our detailed documentation, or join the NERSC User Group (NUG) Slack workspace. If you are not able to find the answers you need in the NERSC documentation or by asking in the NUG Slack workspace, the next best way to ask for help is to submit a ticket. These tickets are routed to NERSC staff members, who will respond promptly (within four business hours) with advice or questions to begin investigating the issue. No question is too small - so feel free to submit a ticket if you need help!

Now, it's time to make yourself at home!

Unlocking the Front Door: Logging on to Perlmutter

The current flagship computing system operated and supported by NERSC is called Perlmutter. This system consists of thousands of nodes, which can be thought of as individual "computers." These nodes are connected to each other via a high-speed network, which enables our users to run calculations on multiple nodes at the same time. In fact, this is what makes systems like Perlmutter high performance computers and not just computers. HPC refers not only to the sophisticated hardware and software, but also to the connection between nodes that makes using multiple nodes for a single computation possible.

The node that you enter the system through is called a login node. As implied by the name, the login node is for logging in to the system. In other words, it is time to unlock the front door and gain access to the house.

Lock and Key: Log In Using a Terminal

In this next section, we will run through a step-by-step tutorial of the most important aspects of using an HPC system: logging in!

You will need an internet connection, your NERSC username and password, and your multi-factor authentication (MFA) app.

To prepare yourself, you can view a video that will step you through the process.

If you are unfamiliar with using a terminal, you can learn more through the HPC Carpentries introduction to Shell.

If you are familiar with using a terminal:

  1. Open a terminal and type: ssh <nerscusername>@perlmutter.nersc.gov and press the Enter key.

    If it is your first time signing on to Perlmutter, you will be asked to confirm the host's SSH key fingerprint before connecting. Verify the fingerprint by checking it against our fingerprint documentation page. If the displayed fingerprint matches the one on the NERSC webpage linked above, type yes and press the Enter key.

  2. Type in your password and the one-time password (OTP) which is obtained from the multi-factor authentication app. It will be a six-digit number that changes every thirty seconds. For example, if your password is p@ssw0rd! and the OTP is 919 595, type in: p@ssw0rd!919595 and press the Enter key.

    Congratulations, you are now logged on to Perlmutter!

Basic work such as editing or manipulating files and submitting computational jobs to the job scheduler can be done on a login node. Imagine taking off your shoes or hanging up your coat in the entryway of a house. But there is not enough room to sit down, read a book, or cook a meal. Similarly, you should not use a login node for computation. Before launching any memory- or compute-intensive processes, you will need to request and access a compute node.

Keypad Entry: Log In Using Jupyter

Another popular way to log in to Perlmutter is to use Jupyter, which is accessed from your favorite internet browser. Jupyter is often used for working on interactive notebooks, and enables logging in via username, password and OTP fields in your browser.

To log in via the NERSC Jupyter instance:

  1. Open an internet browser and visit jupyter.nersc.gov.

  2. Click the "Sign in" button. You will see a webpage for Federated Identity login. Select "NERSC" as the institution, then enter your username and password when prompted. Afterward you will be prompted for your one-time password (OTP), which is obtained from the multi-factor authentication app; it is a six-digit number that changes every thirty seconds. For example, if the app shows 919 595, type 919595 and press the Enter key.

  3. After entering your OTP, you will see a control panel with several buttons.

The JupyterHub Control Panel, which consists of a blue "start" button in each of five columns. The columns read "Login Node," "Shared GPU Node," "Exclusive CPU Node," "Exclusive GPU Node," and "Configurable Job."

If this is your first time using Jupyter at NERSC, we recommend you click the "start" button in the Perlmutter row under the Login Node column. Be aware that the other options will charge to your project once you click a button to launch the server. See the detailed description in the "Resources" section of the control panel, below each button, to understand when each option is the best for your needs. Selecting the "Login Node" option logs you onto Perlmutter, similar to connecting via SSH in a terminal on your local machine, except that you will find yourself in the Jupyter graphical environment.

Once your browser refreshes, you will see a file browser on the left and many buttons in the main part of the screen.

By clicking the "terminal" button, you will open a terminal running on Perlmutter. If you are unfamiliar with the terminal, you can simply use the file browser, which offers point-and-click options similar to those on your own laptop. You can use this file browser just like a Finder window or File Explorer; you can even download small files directly from Perlmutter to your laptop through the web browser.

More information about Jupyter can be found on the Jupyter.org webpage.

This House Has So Many Closets: Navigating Storage Systems

When you log in to a login node, you will be in your "home" directory. This is a small area that can be thought of as the entryway in a house. It is a good place to keep a few important things that you access often, but should not be used to store large, bulky items. In fact, you can't! The amount of storage space in your home directory is limited to 40 GB. Instead, it is better to make links (called "symbolic links" or "symlinks") to other storage areas on the system. Some parts of the system have high data transfer speeds, while others are slower but can store large amounts of data for a long time. You may use a combination of storage systems based on different needs.

Think of a home improvement project: when you are actively working on the project, you might want to keep tools and materials readily available inside the house. But once you are done with a project, you may want to put all the tools and materials away in the garage for long-term storage. If you need to access those items again, it would take some more time to go get them, but they will stay out of the way and stay safe from the weather in the garage.

You may also want to know who can access certain closets and storage spaces. We provide a detailed description of default file and folder permissions, as well as information on changing these permissions.

The "Scratch" Storage Space (accessed by cd $SCRATCH)

The fastest storage space and the most easily accessible while computing is the Scratch file system. However, this is not a permanent storage option - files here are purged if they have been unused for longer than the purge threshold. The Scratch system is intended to support intensive I/O for jobs that are being actively computed on the Perlmutter system. We recommend that you run your jobs, especially data intensive ones, from the Perlmutter Scratch file system. Example usage: running an application that produces several gigabytes or terabytes of simulation results, to be analyzed by a post-processing script.

The "Home" Storage Space (accessed by cd $HOME)

As mentioned, the home file space is the foyer to your house. It is permanent but limited in space. It is recommended to use this space just for storing things you need for quick reference. Example usage: storing template batch submission scripts, or a few Slurm output files.

The "Common" Storage Space (accessed by cd /global/common/software)

This is a kitchen cabinet, close to your computing needs! It is mounted read-only on compute nodes, meaning that while computing you can grab stuff but you can't write stuff (put stuff away). This is so you can use things quickly while computing. Install your software into this space (from a login node) so it is available to you when computing. Example usage: storing a compiled application or conda environment that needs to be accessed while running a parallelized job on compute nodes.

The "Community File System" Storage Space (accessed by cd $CFS/<your_project_name>)

This space is a large and medium-performance file system. Community directories are intended for sharing data within a group of researchers and for storing data that will be accessed in the medium term (i.e. 1 - 2 years). These closets are shared among all household (project) members. Example Usage: Storing data sets to be shared with project collaborators, storing data of varied type and size in organized folders after analysis is completed.

The "Archive" (HPSS) Storage Space (accessed by hsi)

A high-capacity tape archive, intended for long-term storage of inactive and important data that is accessible from all systems at NERSC. Transferring files to HPSS is best done by grouping small files together. You can learn more about this and how to use HPSS in our HPSS documentation. Example usage: archiving gigabytes or terabytes of data that was produced, analyzed, and used to write a paper that is now published and thus may not need to be accessed regularly.
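
For example, here is a minimal sketch of bundling a directory of results into a single archive in HPSS with the htar utility, and retrieving it later (the directory and archive names are hypothetical):

htar -cvf my_results.tar ./results/    # bundle the directory into a single HPSS archive
htar -xvf my_results.tar               # later: retrieve and extract the archive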

Three vertically stacked boxes represent, top to bottom, Scratch, CFS, and HPSS. Arrows indicate increasing data rates as you go up the stack and increasing storage space as you go down. Boxes for Home and Global Common are arranged horizontally below.

Time Travel with Snapshots!

NERSC stores "snapshots" of the Home, Common, and Community file systems for seven days, which allows you to revert to a previous
version of a file if needed. You can access these via: $HOME/.snapshots, /global/cfs/cdirs/<project>/.snapshots or /global/common/software/.snapshots or view which snapshots are available by running cd .snapshots in the desired directory.
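
For example, here is a minimal sketch of recovering an accidentally deleted file from a home-directory snapshot (the snapshot and file names are hypothetical):

ls $HOME/.snapshots                                       # list the available snapshots
cp $HOME/.snapshots/<snapshot_name>/my_script.sh $HOME/   # copy the file back into your home directory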

Terminal Shortcuts

When accessing directories (or folders), you can press the tab key on your keyboard to autocomplete the word or list all of the matching options. Use this often to avoid spelling mistakes!

If you are lost, you can type pwd, which will print out the full path to where you are currently within the directory structure. You can always cd <insert file path> to get from any one directory to another if the filepath is known.
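
For example, a short terminal session might look like this (the paths shown are illustrative, following the same pattern as the examples below):

pwd             # prints something like /global/homes/u/userid
cd $SCRATCH
pwd             # prints something like /pscratch/sd/u/userid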

Since you always land in your $HOME directory when you log in to Perlmutter, if you are going to be working in another directory often it can be useful to make a symbolic link, or symlink, to it. This is a quick way to "teleport" into the other directory without having to type the entire filepath.

The syntax for making a symlink is:

ln -s source_location alias

For example, suppose the files I am working on are in /pscratch/sd/u/userid/physics/project_3/experiment_3/. I could make the following symlink from my home directory:

ln -s /pscratch/sd/u/userid/physics/project_3/experiment_3/ pro3exp3

Now you can jump to this directory quickly by double-clicking the symlink in the Jupyter file browser, or by typing cd pro3exp3 in a terminal.

Cooking with Fire: Compute Nodes

While most houses have one kitchen, our house has thousands. And all of the kitchens (compute nodes) can be used simultaneously. Specifically, the following are available:

  • 1792 GPU-accelerated compute nodes (each with one CPU and four GPUs)
  • 3072 CPU-only compute nodes

At times you may need only 1-10 nodes, but any user can request a large fraction of the machine.

You may be hungry to begin using the fancy equipment that NERSC offers in its many kitchens. Each of the kitchens (compute nodes) are equipped with powerful cooking appliances for you to use. Let's go through these appliances!

  • The Stove: a CPU on a node is like a stove in the kitchen. This is where most of the various cooking tasks will take place. A burner on the stove is like a CPU core, the processor that processes data. A stove in one of Perlmutter’s kitchens (compute nodes) has 64 burners! Each burner can be used in a variety of ways; you can boil water, you can saute veggies, and you can simmer a stew. If you need to do all three tasks at the same time, you can divide the stove into burners that are stewing, burners that are sauteing, and burners that are boiling.

  • The Microwave: a GPU on a node is like a microwave in a kitchen. A microwave is really good at one function: making food hot by energizing all of its water molecules at once. It heats very quickly, but it only works well for foods that contain water. So if you have several water-bearing foods in your recipe, consider using the microwave - or in our case, microwaves! Compared to a CPU, a GPU can complete simple and repetitive tasks much faster because it breaks the task down into smaller components and finishes them in parallel. While microwaves may be commonplace in many kitchens, the GPU "microwaves" in Perlmutter are anything but commonplace. In fact, they are designed specifically for scientific computing. And while you may think a consumer-grade microwave heats food the same way as a professional-grade microwave, you clearly haven't tried ours yet!

  • The Countertop: While Perlmutter has thousands of kitchens, there is one large countertop that connects all kitchens and their appliances. The countertop in your kitchen is a great place to assemble the ingredients for your recipe and store the various supplies you need while cooking. This is similar to the Scratch storage space. In the Section “Navigating Storage Systems” you learned that $SCRATCH is the fastest storage space and is set up for use during active computing, which makes it comparable to the countertop in the kitchen. Besides being very fast for moving data, it is also massive! This means you can have all of your supplies (data) nearby for quick access while cooking (computing).

If you find that you need more space for active computation, you can submit a request for additional scratch space for a limited time.

If you have applications that could benefit from using large portions of the system, consulting our documentation on best practices for running jobs is a good way to ensure you will be able to run successfully!

A summary of the various types of nodes is shown in the following figure.

Summary of specifications for login nodes, CPU-only nodes, and GPU nodes

Household Calendar: Preparing and Submitting Slurm Job Scripts

Busy households with lots of activities sometimes need a calendar to coordinate their schedules. Because NERSC computing resources are shared, we also require scheduling to ensure everyone gets access to our kitchen resources.

Since Perlmutter is a shared system, there are hundreds of people logging on and running computations every day. This means that when you are ready to begin a computation, you may have to wait until resources become available.

There are two main methods for acquiring resources on Perlmutter and running a computation.

  • Requesting a compute node to use interactively.
  • Submitting your computation (called a "job") to the job scheduler.

These two methods are closely related, but for the purpose of this introductory guide, we will discuss them separately.

What is Slurm?

Slurm is a computational job scheduler, which helps us accommodate thousands of users like yourself who need to use different amounts of the system for their computations.

Slurm takes care of three key responsibilities:

  • Allocation of resources.
  • Executing and monitoring jobs.
  • Managing a queue of submitted jobs.

For example, when a user wants to use 100 nodes for 4 hours, they submit the job to Slurm, which then schedules it to run on the system as soon as possible. Why doesn't it run immediately? At any given time, hundreds or thousands of different jobs, each requiring anywhere from one to thousands of nodes for several hours, are already running on the system. Because Perlmutter consists of a finite number of nodes, the scheduler keeps track of what is available at what time, and squeezes jobs onto the system in an efficient way.

In order for Slurm to allocate resources efficiently, there are many pieces of information that you must provide. The most common pieces of information users provide are:

Slurm flag (long form)  Slurm flag (short form)  Description
--nodes                 -N                       Number of nodes requested
--ntasks                -n                       Total number of MPI ranks (tasks)
--account               -A                       Account (NERSC project) to charge
--qos                   -q                       Quality of service (QOS) to submit to
--constraint            -C                       Node architecture to use (cpu or gpu)
--time                  -t                       Walltime requested in HH:MM:SS
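
You will see these flags used with the salloc and sbatch commands later in this guide. As a preview, here is a minimal sketch that requests two CPU nodes for thirty minutes in the debug QOS (the account name is a placeholder for your own NERSC project, and the script name is hypothetical):

sbatch --nodes=2 --constraint=cpu --qos=debug --time=00:30:00 --account=<account> my_script.sh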

What is the Queue?

Just like the queue at the grocery store check out, a queue is the line in which your job waits until it is its turn to run. This is how large HPC facilities manage the abundance of jobs that need to run on the system.

Is There a Way to Cut in Line?

The way NERSC manages the queue of jobs is more complicated than a simple grocery store check-out line. A more apt analogy, which describes how jobs queue and can even cut in line sometimes, would be the security line at the airport. Most passengers have to wait in a long security line, but passengers who have been pre-screened enter the security line closer to the security screening point than regular passengers. Airport staff and pilots are given even more priority, entering the line at yet another location closest to the screening point. In this way, there are multiple ways that a qualified person could cut in line, but everyone must join the line and pass through screening to proceed.

This is precisely how NERSC manages jobs: offering different “qualities of service” (QOS) reflecting different priorities and constraints. If a job needs to be completed in time for a conference, you can put it in front of other jobs by submitting it to the premium QOS, ensuring that the job gets scheduled to run before similar jobs submitted to the regular QOS at the same time. This convenience comes at the cost of being charged more hours to run, but if you have an impending deadline the trade-off could be worth it!

If you just need to test a code snippet, you might want to use the debug QOS. Jobs submitted to the debug QOS tend to run very soon after being submitted, but have a short time limit.

Conversely, once you are ready to run your full computation, you should submit it to the regular QOS. While your job may wait longer in line (take longer to start), the time limit is 48 hours, which gives very large calculations sufficient time to complete.

To see the available QOSes, visit the QOSes and Charges page of the documentation.

Requesting a Compute Node to Use Interactively

If you need to run a computation for the purpose of development or debugging, you can request a single compute node or multiple compute nodes to be used interactively. Instead of submitting your job to the scheduler to run at a later time, you are asking Slurm to find nodes available now so you can use them in real time. Using a kitchen analogy: you are asking Slurm to find a stove or microwave that is available, so you can stand inside the kitchen and cook your meal. You may do this if you are experimenting with ingredients or recipes, so you want to be able to interact with the cooking for rapid iteration.

To begin an interactive job:

  1. Sign into Perlmutter using a terminal on your local machine.

  2. Type and enter salloc --nodes <*> --constraint <*> --qos <*> --time <*> --account <*>, filling in the appropriate options; see the table above for a description of these flags and the example shown after these steps.

  3. When your allocation begins, you can run executables or scripts. Often you will use the srun executable to launch your job.

    For example: srun hello would run the executable named "hello". You can also run commands without srun; for example, to start a MATLAB session you can type matlab (assuming the matlab module is loaded).

    Remember to relinquish your interactive job by typing exit if you finish your work before the time limit that you requested via the --time flag in the salloc command.
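
For example, here is a minimal sketch that requests one CPU node for thirty minutes using NERSC's interactive QOS (one of the QOSes listed on the QOSes and Charges page; the account name is a placeholder for your own NERSC project, and the executable name is hypothetical):

salloc --nodes 1 --constraint cpu --qos interactive --time 00:30:00 --account <account>
srun ./hello    # once the allocation starts, launch your executable
exit            # relinquish the nodes when you are done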

Submitting your Job to the Scheduler

If you have an executable or script that you want to execute, but you do not need to interactively run it, you can let Slurm know how to run it. Using a kitchen analogy: you hand off the ingredients and recipe to Slurm, so Slurm can cook up your meal and let you know when it is ready.

You can prepare your instruction script by opening a blank text file and providing the options Slurm needs at the top of the script. The script begins with a shell invocation, #!/bin/bash, which lets Slurm know what type of executable you are submitting. The next part of the script contains a series of directives that Slurm uses to understand what kind of computation to schedule. These directives are equivalent to the values specified on the command line when running an interactive job; for example, you still need to specify how many nodes you need, for how long, and in which QOS.

The following example is a fairly generic and basic job script. Based on the values specified to the directives, the computation will run in the debug QOS for up to five minutes on two CPU-only nodes.

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --constraint=cpu
#SBATCH --qos=debug
# time is given in minutes here (5 minutes)
#SBATCH --time=5
#SBATCH --nodes=2
# 2 tasks per node, each using 128 CPUs
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=128

module load <SupportedApplication>
export MY_ENVIRONMENT_VARIABLE=<number>
srun <prog.exe>

Another example of a basic script is shown below, in which we explicitly load all necessary modules, set environment variables, and activate conda environments. Loading modules and setting up the environment will be covered in the next section of this guide.

#!/bin/bash
#SBATCH --account=<account>
#SBATCH --constraint=gpu
#SBATCH --nodes=1
# time is given in minutes here (60 minutes)
#SBATCH --time=60
# Perlmutter GPU nodes have four GPUs each; request one task per GPU
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4

module load <SupportedApplication> <otherModule> <anotherModule>
source activate <condaEnvironment>
export MY_ENVIRONMENT_VARIABLE=<number> 

srun python <nameOfScript.py>

When you are satisfied with your job script, you can submit it to Slurm for scheduling by typing sbatch <nameofscript> in the terminal.
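
For example, with a hypothetical script name:

sbatch my_job_script.sh    # submit the script; Slurm prints the job ID
squeue -u $USER            # check the status of your pending and running jobs
scancel <jobid>            # cancel a job using the ID printed by sbatch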

A few things to notice in order to understand what is needed when submitting a job script:

  1. You must provide information about the architecture you want to use for the computation. Using our kitchen analogy, do you want to use the stove to heat your food, or the microwave? The constraint directive translates to the architecture, and there is no default value. This means that if you do not specify this value, you will encounter an error when submitting the script to Slurm.

  2. Anything that you need in order to run your application or script must be included in the recipe, or loaded by the Slurm script. Slurm will replicate your environment when running your script, so if you need a module to run your computation, you must either load it into your session before submitting the submission script, or include the commands to load the modules within the script itself.

You can learn more about scheduling jobs via Slurm in our detailed documentation.

NERSC also provides an interactive job script generator that can help you write your job script!

How does Jupyter Host the Kitchen?

There are a few ways to access compute resources through Jupyter. As shown earlier, there is a screen showing various session options, such as "login node" or "shared GPU node." You can simply click the button in the hub for the resource you require, and you will get access to that resource. The terminal within Jupyter is equivalent to an interactive session on a compute or login node.

Recipe Book: Accessing Supported Applications

While you are always welcome to bring new recipes into the house, our cupboards are home to many recipe books as well! In this case, a recipe refers to an application that you would like to run on the compute nodes, such as a simulation or large matrix calculation. Because many of our users are cooking similar types of projects, we provide and support several applications.

A full list of all of our supported applications can be found in our applications documentation.

We use the Lmod software to manage user environments and provide supported software. There are default modules that are loaded into your environment upon logging in, which are optimized for using Perlmutter. You can view these default modules by logging in to Perlmutter and typing module list in the terminal command line.

The default environment when you log in to Perlmutter is set up to facilitate getting the best performance on our system. In keeping with our analogy, the default modules that we provide are like special pots, pans, and cooking utensils that are optimized for our kitchen. If you need to find a specific module, you can use the module spider <name> command to check whether the module is available on our system. Occasionally we will update the default modules that are loaded into the user environment. This is sometimes done during system maintenance periods, and it can improve the system's performance or security. If you prefer an older version of a module that is no longer loaded by default, you can module unload <module> the version you no longer need and module load <module>/<version> the version you need, or use module swap <module> <module>/<version> to do both in one step.
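
As a quick reference, here are those module commands collected in one place (module and version names are placeholders):

module list                              # show the modules currently loaded
module spider <name>                     # check whether a module is available on the system
module unload <module>                   # remove a module from your environment
module load <module>/<version>           # load a specific version of a module
module swap <module> <module>/<version>  # unload and load in a single step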

If you need to prepare a new recipe, you may want to know what kinds of ingredients we have available in our kitchen. In order to best make use of our kitchen, you will need to write up your recipe using the special ingredients that we provide: compiler wrappers. For more information on how to use these wrappers, and how to develop your own programs for use on our systems, visit our documentation.
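
For instance, here is a minimal sketch of compiling simple single-file programs with the cc, CC, and ftn compiler wrappers provided by the Cray programming environment (the file and output names are hypothetical; see the documentation linked above for recommended flags and options):

cc  -O2 -o hello_c   hello.c      # C
CC  -O2 -o hello_cxx hello.cpp    # C++
ftn -O2 -o hello_f   hello.f90    # Fortran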

We recommend keeping track of the details of your user environment when you are preparing to run a computation or compile a new executable. Just like a chef will keep track of the ingredients and techniques they use while whipping up a delicious new recipe, while experimenting and working on your project you may lose track of which libraries and modules were necessary to make your application work best. This information can also help NERSC consultants in case you reach out to us to ask a question or request help. Knowing which modules and versions you usually use or need can help us help you faster!

Once you are ready to start cooking, keeping your recipes, ingredients, pots and pans, and other necessary items close by will make your meal cook faster. We therefore recommend that you move your software stack and executables into the space for your project within "global common software". This space, accessed at /global/common/software/, is mounted read-only on compute nodes, which allows software stored there to load very quickly and can improve the performance of your computation.

Sharing is Caring: Budget Your Compute Use!

An important part of peacefully coexisting in a house is sharing. The kitchen in a house is shared among all members of the house - just like Perlmutter! Because Perlmutter is a shared resource, there are several policies in place to ensure all of our users get time to cook up a meal.

First, each project is given an "allocation", which can be thought of as a bank account. The amount of "money" in this account is similar to a grocery budget and is decided by our primary funding source: the U.S. Department of Energy, Office of Science. However, the currency of this account is "CPU node hours" and "GPU node hours." Each project is given compute hours that members of the project can spend to run computations at NERSC. After you submit a job to Slurm (interactive or batch) and the job runs, Slurm deducts some number of hours from your allocation. The amount charged depends on several factors, such as which hardware was used, which QOS you chose, and how many nodes were used; a full description of charging is provided here.
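
As a rough illustration (the actual charge factors for each QOS and node type are listed in the documentation linked above): a job that runs on 10 CPU nodes for 2 hours in a QOS with a charge factor of 1 would be charged about 10 × 2 × 1 = 20 CPU node hours.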

Second, if you submit your job to the Slurm scheduler, only two jobs per user/QOS/project are able to be scheduled and run at a time. If you submit more than two jobs, those jobs remain "pending" in the queue until they are eligible for scheduling. This is to ensure that everyone's jobs are able to progress through the queue in a timely fashion.

Make Yourself at Home!

There are still many things to learn about our computational systems, but you have now been acquainted with the basics! Take your time with our documentation, as well as our NERSC YouTube Channel, which hosts videos of our regular New User Training events. These training events are aimed specifically at novice HPC users with limited computing experience and are offered three times a year, in February, June, and September; exact dates are announced via the weekly email and on the main NERSC website a month in advance. Additional training events, covering a broad range of topics relevant to scientific computing, are hosted by NERSC throughout the year and are open to all NERSC users.

We have several resources for you to get help: the NERSC Users Group, NERSC Users Slack, and our detailed documentation. In addition, we encourage all of our users to submit tickets if your questions are still unanswered! Ask for help or give us feedback by submitting a help ticket online.