AI Coding Tools at NERSC¶
This page collects practical guidance for using large language models (LLMs) and coding agents as part of a NERSC workflow.
The goal is not to let a model operate unsupervised. The goal is to use these tools to reduce repetitive work while keeping humans in control of correctness, performance, and security.
Scope and Responsibility
Users may use coding agents as user-directed development assistants for authorized NERSC project work. Users remain responsible for all prompts, commands, code, data access and modification, resource usage, software installation, and external network activity performed by or through the agent. Use of coding agents must conform to NERSC's Acceptable Use Policy: Agents must not be given prohibited data, must not bypass access controls, and must not perform security testing, prohibited software use, or activity outside the approved project scope.
This document provides general guidance, not guarantees about correctness, security, performance, or resource usage. Users are responsible for validating generated code, commands, configurations, and scientific results before using them in production or allocating large computing resources.
LLMs can produce incorrect or misleading outputs, including inefficient job configurations, invalid scientific conclusions, or unsafe commands. Overreliance on generated output without sufficient technical understanding may also reduce visibility into system behavior and degrade overall code or workflow quality.
What Are Coding Agents¶
Coding agents are programs that pair an LLM, which does the reasoning, with tools that let the model act, such as reading files, running commands, and editing code. They let users write code from natural-language prompts, a practice commonly called "vibe coding".
At NERSC, a more useful interpretation is coding in collaboration with an agent. You describe the task and its constraints, the model proposes code, commands, tests, or refactors, and then you run, inspect, and verify the result. The model can accelerate the work, but you still decide whether the output is correct and safe to keep. This is especially helpful for multi-step technical tasks where the model can read files, inspect logs, run commands, and iterate.
When It Helps¶
Coding agents are often useful when the task is concrete, repetitive, and easy to validate. At NERSC that often means:
- drafting or improving Slurm job scripts
- explaining failures from sacct, squeue, or log output
- translating shell workflows into Python
- cleaning up notebooks and analysis scripts
- writing README files and usage examples
- refactoring small utilities into more maintainable code
These tasks benefit from rapid iteration and clear feedback loops.
When To Be Careful¶
Coding agents are much less reliable when performance is critical and the code has not been benchmarked, when the model is guessing about MPI, GPU, or filesystem behavior, or when the request depends on exact NERSC policy or system behavior that was never provided in context. You should also use extra caution for anything involving credentials, tokens, sensitive research data, allocations, or production workflows. In those cases, the cost of a plausible but wrong answer is much higher.
Why CLI Agents Are Often a Good Fit¶
For NERSC work, command-line agents are often a better fit than pure chat interfaces because they can inspect repositories directly, read configuration files and logs, run shell commands, suggest or apply small patches, and work naturally with version control. This fits common NERSC workflows, which already depend heavily on the shell, job scripts, and text-based tooling.
Subagents and Specialized Workers¶
As these tools become more capable, it is increasingly common for one agent to delegate part of a task to a more specialized helper. You can think of these as subagents or worker agents. Instead of forcing one long conversation to do everything, the main agent can hand off a bounded subtask such as exploring a codebase, reviewing a patch, reading documentation, or analyzing a long error trace.
This pattern helps in two ways. First, it keeps the main session focused on the task that actually blocks progress. Second, it reduces context sprawl. A read-only explorer does not need the same instructions as a code-editing worker, and a reviewer does not need the same tool access as a debugger. When used well, subagents make multi-step work easier to structure and easier to audit.
In practice, good subagent tasks are narrow, concrete, and easy to evaluate. A helper agent should have a clear purpose, a bounded write scope if it is allowed to edit files, and only the tool access it actually needs. This is one reason that strong project instructions and clear file ownership matter even more in multi-agent workflows than in simple chat.
A Practical Workflow¶
The most reliable pattern is a short feedback loop:
- Start with a concrete task.
- Provide the exact files, logs, or command output involved.
- Ask for the smallest useful next step.
- Run and inspect the result.
- Iterate based on evidence.
This usually works better than asking for a large end-to-end solution in a single prompt.
Plan Mode and Task Decomposition¶
Tip
Explicit planning is especially important for difficult or multi-step tasks on shared HPC systems, where incorrect commands may waste allocation time, overload filesystems, or launch unintended jobs.
Many coding agents support some form of spec-driven development, where the agent gathers context, asks clarifying questions, and proposes an implementation strategy before making changes or running commands. Plan modes in coding agents are one example of this approach.
Related tooling can also include plugins, extensions, or external frameworks that guide the agent using written specifications, task breakdowns, or structured implementation plans.
Reviewing the proposed plan before execution gives you a chance to catch incorrect assumptions, unsafe actions, or missing constraints early. In practice, this often leads to better results, fewer unnecessary iterations, and safer operation on shared systems.
Prompting That Works¶
A good prompt usually names the target system, such as Perlmutter, explains the exact goal, points to the relevant files, includes the current script or configuration, and provides representative stderr, logs, or job output. It should also define what success looks like. In practice, it helps to ask for the smallest viable patch, explicit assumptions, ranked debugging hypotheses, and clear verification steps.
Weak Prompt¶
Help me run my code on NERSC.
Strong Prompt¶
I am running a Python MPI workflow on Perlmutter. Here is my Slurm script, module list, and stderr. The job hangs after initialization. Propose the smallest script changes to improve launch reliability and explain why.
Prompt Template for NERSC Tasks¶
Use a structure like this:
System: Perlmutter
Goal: Run a PyTorch training script on 2 GPU nodes
Files: train.py, env.sh, job.slurm
Problem: Job exits immediately after launch
Evidence: stderr, slurm output, module list, conda env
Constraint: Keep the current Python environment
Ask: Propose the smallest patch and list verification steps
This makes it much easier for the model to reason from real context instead of guessing.
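One way to keep prompts in this shape is to assemble them from the real files instead of retyping them. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, not a NERSC tool, and the 50-line log excerpt is an arbitrary choice.

```bash
# Hypothetical helper (not a NERSC tool): assemble a structured prompt
# from a batch script and a log file so the agent sees real evidence.
build_prompt() {
    local script="$1" log="$2" goal="$3"
    # Refuse to build a prompt from files that cannot be read.
    if ! { [ -r "$script" ] && [ -r "$log" ]; }; then
        echo "missing input file" >&2
        return 1
    fi
    printf 'System: Perlmutter\n'
    printf 'Goal: %s\n' "$goal"
    printf 'Files: %s\n' "$script"
    # Include only the tail of the log; 50 lines is an arbitrary cutoff.
    printf 'Evidence (last 50 lines of %s):\n' "$log"
    tail -n 50 "$log"
    printf 'Batch script:\n'
    cat "$script"
    printf 'Ask: Propose the smallest patch and list verification steps\n'
}
```

A call such as `build_prompt job.slurm slurm-12345678.out "Run train.py on 2 GPU nodes" > prompt.txt` produces a file you can review before handing it to the agent.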
Example: Drafting a Perlmutter GPU Job¶
For example, a user might ask an agent to draft a Perlmutter GPU batch script:
#!/bin/bash
#SBATCH --constraint=gpu
#SBATCH --gpus=4
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --account=<your_account>
#SBATCH --qos=regular
#SBATCH --job-name=prep
#SBATCH --output=slurm-%j.out
module load python
srun -n 4 python preprocess.py --input data.h5 --out "$SCRATCH/results"
An agent can help write this, but you still need to verify that the account and QOS are valid, that the requested resources match the workload, that the module setup is correct, that the launch pattern matches the application, and that the output path is appropriate for the workflow.
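Some of these checks can be partly automated before submission. The following is a minimal sketch, assuming long-form `#SBATCH --option=value` directives like those in the example above. `check_batch_script` is a hypothetical helper, and the list of options it treats as required is an assumption to adjust for your own workflow. It does not recognize short options such as `-A`.

```bash
# Hypothetical pre-submit check (not a NERSC tool): flag common problems in
# an agent-drafted batch script before it reaches the scheduler.
check_batch_script() {
    local script="$1" problems=0 opt
    # Directives this sketch treats as required; adjust for your workflow.
    for opt in account qos time nodes constraint; do
        if ! grep -Eq "^#SBATCH[[:space:]]+--${opt}[= ]" "$script"; then
            echo "missing: --${opt}"
            problems=$((problems + 1))
        fi
    done
    # Unfilled template placeholders such as <your_account>.
    if grep -Eq '^#SBATCH.*<[^>]+>' "$script"; then
        echo "placeholder left in an #SBATCH line"
        problems=$((problems + 1))
    fi
    # Succeed only when no problems were found.
    [ "$problems" -eq 0 ]
}
```

A text check like this cannot tell whether an account or QOS is actually valid. For that, `sbatch --test-only job.slurm` asks Slurm to validate the script and estimate a start time without submitting the job.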
For more background, see Running Jobs on Perlmutter and Example Jobs.
Example: Debugging With sacct¶
The model is often more useful when you give it evidence instead of a vague description.
For example:
perlmutter$ sacct -j 12345678 --format=JobID,JobName,Partition,Account,AllocTRES,State,ExitCode,Elapsed
A strong follow-up prompt would be:
Here is the sacct output for my failed Perlmutter job and my batch script. Explain what the state and exit code suggest, then propose the next two debugging steps.
That is much more actionable than "my job failed."
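When a job has many steps, it can help to trim the evidence to the steps that did not complete before pasting it into a prompt. This is a sketch under stated assumptions: `failed_steps` is a hypothetical filter, and it expects pipe-delimited input with a header line that includes `JobID`, `State`, and `ExitCode`, which is what `sacct --parsable2` produces when those fields are requested.

```bash
# Hypothetical filter: read `sacct --parsable2` output on stdin and print
# the steps whose State is not COMPLETED. Assumes the first line is the
# header and that it names the JobID, State, and ExitCode columns.
failed_steps() {
    awk -F'|' '
        NR == 1 {
            # Locate columns by name so the field order does not matter.
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $(col["State"]) != "COMPLETED" {
            print $(col["JobID"]), $(col["State"]), $(col["ExitCode"])
        }
    '
}
```

For example, `sacct -j 12345678 --parsable2 --format=JobID,JobName,State,ExitCode,Elapsed | failed_steps` leaves only the rows worth explaining.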
Common AI Mistakes on HPC Systems¶
Be especially skeptical when a model invents sbatch or srun flags, confuses login-node work with compute-node work, assumes pip install is always the right choice on a shared system, guesses the wrong module names or versions, mixes up $HOME, $SCRATCH, and project storage, suggests an MPI launch pattern that does not match your code, or assumes GPU access without the right Slurm constraints. These are common failure modes, not rare edge cases.
Slurm and Module Advice Needs Verification¶
When an agent suggests changes to a NERSC job, check whether the queue or QOS is valid, whether the node counts and GPU counts are consistent, whether the account is correct, whether the module names are real on this system, whether the launch command matches the application model, and whether the filesystem path is appropriate for the workload. If the model cannot answer those questions from real local context, it is guessing.
Good Engineering Makes Agents Better¶
Coding agents work better when your workflow is already disciplined. Breaking larger tasks into smaller steps, using Git commits between working states, running linters, compilers, and tests, and giving the model checks it can run itself all make the agent more reliable. It also helps to keep each session focused on one problem at a time. If the model can inspect the result, it can often help repair its own mistakes.
The same idea applies to multi-agent work. If you ask several helpers to work in parallel, you still need clear task boundaries, a sensible integration plan, and a human review step before combining their output. Parallelism can improve throughput, but it does not remove the need for judgment.
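Giving the model checks it can run itself can be as simple as one command that runs every check in order and stops at the first failure. The sketch below assumes each check is a shell command string. `run_checks` is a hypothetical helper, and the checks you pass to it depend entirely on your project.

```bash
# Hypothetical helper: run each check command in order and stop at the first
# failure, so an agent (or you) gets one clear signal per iteration.
run_checks() {
    local cmd
    for cmd in "$@"; do
        echo "check: $cmd"
        # Each check runs in its own shell so it cannot alter this session.
        if ! sh -c "$cmd"; then
            echo "FAILED: $cmd"
            return 1
        fi
    done
    echo "all checks passed"
}
```

For instance, `run_checks "bash -n job.slurm" "python -m pytest -q" && git commit -am "Working state"` records a commit only when the script parses and the tests pass. The pytest invocation here is an assumption about the project's test setup.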
Context Engineering¶
The less irrelevant context you include, the better the model usually performs. Useful habits include writing AGENTS.md or similar project instructions, documenting project structure and build steps, mentioning cluster-specific assumptions explicitly, and starting fresh sessions when switching tasks.
For example, you may point the model to Python-specific environment guidance such as Using Python at NERSC and Python FAQ and Troubleshooting.
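As a starting point for project instructions, the sketch below writes a short AGENTS.md into a project directory. Everything in the file body is an illustrative assumption rather than official NERSC wording, `write_agents_md` is a hypothetical helper, and the `<fill in>` entries are meant to be replaced with facts about your own project.

```bash
# Hypothetical helper: write a starter AGENTS.md into a project directory.
# The file contents are illustrative assumptions; edit them for your project.
write_agents_md() {
    local dir="${1:-.}"
    local file="$dir/AGENTS.md"
    # Never overwrite instructions that already exist.
    if [ -e "$file" ]; then
        echo "$file already exists" >&2
        return 1
    fi
    # The quoted delimiter keeps $SCRATCH and $HOME literal in the file.
    cat > "$file" <<'EOF'
# Project instructions for coding agents

- System: Perlmutter (NERSC)
- Project layout: <fill in>
- Build and test commands: <fill in>
- Do not run long or resource-intensive work on login nodes; use Slurm.
- Write job output under $SCRATCH, not $HOME.
- Ask before installing software or changing the Python environment.
EOF
}
```

Keeping the file short matters here: it is loaded as context, so every line competes with the task itself.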
Security and Privacy Considerations¶
When using external AI services, do not paste credentials, tokens, or private keys into prompts. Be careful with unpublished or sensitive research data, avoid unrestricted command execution in unsafe environments, and be cautious with third-party plugins, tools, and MCP servers. If you would not paste something into an external service manually, do not hand it to an agent by default.
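A quick scan before sharing a file can catch the most obvious cases. This is a minimal sketch: `looks_sensitive` is a hypothetical helper, and its two patterns are illustrative and far from exhaustive, so a clean result does not mean a file is safe to share.

```bash
# Hypothetical pre-flight check: succeed (exit 0) when a file contains text
# that looks like a credential. The patterns are illustrative, not exhaustive.
looks_sensitive() {
    grep -Eiq \
        -e '-----BEGIN [A-Z ]*PRIVATE KEY-----' \
        -e '(password|passwd|secret|token|api[_-]?key)[[:space:]]*[=:]' \
        "$1"
}
```

It can guard a paste or an upload, for example `if looks_sensitive env.sh; then echo "review env.sh before sharing"; fi`.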
Sandboxing and Approval Boundaries¶
For more autonomous workflows, sandboxing is one of the most important safety mechanisms. In this context, sandboxing means that the agent runs commands inside enforced technical boundaries rather than relying only on user trust. Those boundaries often control which files can be read or modified, whether the agent can use the network, and when it must stop and ask for approval.
This matters because autonomy without boundaries tends to create two different problems. One is approval fatigue, where the user is asked to approve so many routine actions that the safety signal becomes meaningless. The other is that an unrestricted agent can cause accidental damage much faster than a human can intervene. A well-designed sandbox gives the agent room to do routine work inside a known-safe area and forces it to pause when it needs to go beyond that area.
In practice, many tools converge on similar operating modes. A read-only mode is useful for inspection and review. A workspace-write mode is often the best default for local development because it allows editing inside a project boundary and running routine local commands. Full-access modes remove all of those technical constraints. If you truly want the agent to run autonomously with broad authority, full-access modes should only be used in isolated environments like containers that are configured to limit potential damage caused by the agent.
It is also important to separate sandboxing from approvals. These are related but not identical controls. The sandbox defines what is technically possible. The approval policy decides when the agent must stop and ask before crossing a boundary. If a tool can spawn subprocesses, run tests, call package managers, or launch build tools, those subprocesses should inherit the same sandbox rules. Otherwise the boundary is much weaker than it appears.
If you want stronger isolation than a local tool provides by default, there are also open and self-hosted options in this space. Examples include OpenSandbox, microsandbox, and E2B. These tools vary in design, but the general idea is the same: run untrusted or AI-generated code in a more strongly isolated environment than your everyday shell session.
Verification Is the Whole Game¶
Before trusting generated code or advice, ask whether the job submits, whether it runs to completion, whether tests pass, whether the output makes sense, whether performance is still acceptable, and whether the explanation matches the observed behavior. Trust should be proportional to evidence.
Recommendations for Perlmutter¶
Coding agents act on your behalf and have the same permissions that you do on NERSC systems. You should not allow a coding agent to do anything on NERSC systems that you yourself would not do. In most cases, you can launch the agent from a login node and let it run small programs as needed, for example when debugging, but you should not allow it to run long or resource-intensive jobs directly on the login node.
Because users have access to shared files and data, we recommend using workspace-write mode for vibe coding on Perlmutter. The idea is to keep the files you want the agent to manipulate inside a working directory that it can write to, and to keep any data or files you do not want it to modify or delete outside that directory. Most agents default to workspace-write mode, so all you have to do is launch the agent from inside the working directory. If the agent you are using does not use workspace-write mode by default, you can change it in its config file.
We recommend using working directories located in your $HOME or $SCRATCH directory on Perlmutter. If the files you want to manipulate are on $CFS, you should copy them to a fresh working directory on $HOME or $SCRATCH. Agents in workspace-write mode have read access everywhere you do, so they can read files on $CFS for necessary context but cannot make changes there without your permission. It is best practice to use version control in the working directory and commit often while vibe coding.
This strategy relies on workspace-write permissions to safeguard against malicious or erroneous agent behavior. If you want to safely run without approval prompts in full-access mode, you can use a container-based sandbox that has network access disabled and read-only mounts to data and files in $CFS needed for context.
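The working-directory setup described above can be scripted so that it is done the same way each time. The sketch below is one possible approach, not a NERSC-provided tool: `make_agent_workspace` and the `agent-work` directory name are hypothetical, and the base directory defaults to `$SCRATCH` only if that variable is set.

```bash
# Hypothetical helper: create a fresh working directory, copy in the files
# the agent may edit, and start version control so every change is reviewable.
make_agent_workspace() {
    local src="$1" base="${2:-${SCRATCH:-}}"
    [ -n "$base" ] || { echo "no base directory given" >&2; return 1; }
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    local work="$base/agent-work/$(basename "$src")"
    # Require a fresh directory rather than mixing with earlier work.
    if [ -e "$work" ]; then
        echo "already exists: $work" >&2
        return 1
    fi
    mkdir -p "$work" || return 1
    # Copy, never move: the originals (for example on $CFS) stay untouched.
    cp -R "$src/." "$work/" || return 1
    # Commit the starting state if git is available and configured.
    if command -v git >/dev/null 2>&1; then
        ( cd "$work" && git init -q && git add -A \
            && git commit -q -m "Starting state before agent edits" ) \
            >/dev/null 2>&1 \
            || echo "note: initial git commit not created" >&2
    fi
    # Print the new directory so the caller can cd into it.
    echo "$work"
}
```

After `cd "$(make_agent_workspace "$CFS/myproject/analysis")"`, where the path is a placeholder, launching the agent from that directory keeps its writes inside the copy.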
If You Only Remember Four Things¶
- Give the model real context.
- Ask for small, concrete steps.
- Verify every meaningful result.
- Do not outsource judgment.
NERSC-Specific Tools and Guidance¶
NERSC is actively exploring additional guidance and reusable agent configurations for common HPC workflows, including project-specific SKILLS.md files, prompt templates, and examples for working with Perlmutter, Slurm, modules, and job logs.
Check back as these resources become available. In the meantime, users should treat general-purpose LLM and coding-agent advice as a starting point, and validate all commands, configurations, and results against NERSC documentation and their own workflow requirements.
Related Resources¶
- Machine Learning at NERSC
- Training Launchers
- Training Libraries
- Hyperparameter Optimization
- Troubleshooting Jobs
- Running Jobs on Perlmutter
References¶
These external links are provided for convenience and may not remain available indefinitely.
- Original NERSC vibe-coding slide deck
- Expanded NERSC reference deck
- NERSC multi-GPU vibe-coding demo repository
- Claude subagents documentation
- Codex subagents documentation
- Claude sandboxing documentation
- Codex sandboxing documentation
- OpenSandbox project
- microsandbox documentation
- E2B sandbox platform
- Codex best practices
- Claude Code best practices