Current known issues

Cori login nodes degraded

Since the afternoon of October 9, 2020, some users have experienced intermittent difficulty with SSH connections to Cori login nodes. The issue is caused by certain processes hanging while interacting with the scratch filesystem, which leaves the node unable to return a timely login prompt.

NERSC is closely monitoring nodes for evidence of this problem and will reboot affected nodes as they are discovered. This is a temporary workaround until the root cause is identified and a fix is in place.

Users on impacted nodes will be notified via `wall` to give advance notice of a reboot when possible. The reboot will log out users and kill any running processes on that node (including xfer jobs).

The Cori load balancer has been reconfigured to route around these nodes once they are detected. If you experience this issue, please log out and log in again; it may take a few minutes for the problem to register and the load balancer to update.

Cori is back in service; mitigations for cscratch1 error conditions

Update: As of 2:48pm (PT) Thursday Oct 8, Cori is available for normal production use.

We believe the conditions that triggered the cscratch1 filesystem crash relate to striping files over many Lustre OSTs (see the notes about Lustre striping).

Warning

Please refrain from running jobs that use files striped over many OSTs. If you are using custom striping, limit the stripe count to 72.

Tip

The default striping of 1 is not expected to trigger the error.

Note

NERSC provides helper scripts stripe_small, stripe_medium and stripe_large to set striping appropriately for different sized files. None of these settings are expected to trigger the problem.
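As a sketch of applying this guidance, the commands below check and cap a directory's stripe count. `lfs getstripe` and `lfs setstripe` are standard Lustre commands (so they only work on a Lustre filesystem such as cscratch1), `stripe_medium` is one of the NERSC helper scripts mentioned above, and the directory path is an example, not a real location:

```shell
# Check the current stripe count of a directory (example path).
lfs getstripe -c $CSCRATCH/my_output_dir

# Cap the stripe count at 72 for files subsequently created in this
# directory, per the mitigation guidance above.
lfs setstripe -c 72 $CSCRATCH/my_output_dir

# Alternatively, let a NERSC helper script choose a safe stripe count:
stripe_medium $CSCRATCH/my_output_dir
```

Note that `lfs setstripe` on a directory affects only files created there afterwards; existing files keep their current striping.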

Danger

Until we have identified and fixed the root cause, a risk remains that the crash could recur. We remind users to be diligent about keeping a second copy of critical data.

The special test queues have been deleted, but users can still see jobs that were submitted to or ran in these queues with sacct, e.g. `sacct -q test`.

The original crash damaged some cscratch1 metadata; see "Possible impacts of the cscratch1 crash" below for details and recovery guidance.

Cori cscratch1 incident history

On Thursday Sept 24, 2020, at 11:55am PDT, a crash of the cscratch1 metadata server damaged the metadata server's filesystem journal and required a complete check and repair of scratch. This was a long process, and Cori was unavailable until shortly after 7pm on Friday evening.

On Sunday Sept 27 at 12:02pm PDT we experienced a similar crash. Starting Wednesday evening, Sept 30, Cori was placed in a non-production "debug" mode with the goal of reproducing and mitigating the cscratch1 crash without damaging files already on cscratch1.

On Friday Oct 2 at about 8:15am PDT, the cscratch1 crash was reproduced. Cori continued to operate in the special debug mode, but with cscratch1 unavailable everywhere while the filesystem was checked and repaired. From the collected data, NERSC was able to create a synthetic workload that quickly reproduces the error.

On Sunday Oct 4, around 5pm, we were able to reproduce the cscratch1 crash using the synthetic workload alongside workloads provided by NERSC users. Following this crash, cscratch1 remained unavailable from Monday through Wednesday.

On the evening of Wednesday Oct 7, Cori was removed from service and prepared for a return to normal production use, with mitigations in place for the conditions that we believe triggered the filesystem crash.

At 2:48pm PDT on Thursday Oct 8 Cori was returned to normal service.

Accessing jobs and data from the special debug period

During the debug period a directory $CSCRATCH/test_20200930/ was created for each user, and users were advised not to write to any cscratch1 location outside that directory. The directory still exists and users are free to access the data inside it. Note that filesystem crashes during the special debug period may have damaged directory information and file metadata (names, access times, locations, etc.) inside that directory, which could cause files to appear "lost".

The special queues created during the debug period have been removed, but job information is still available via sacct:

| For jobs submitted to: |  Use                                                              |
|------------------------|-------------------------------------------------------------------|
| `-q test`              | `sacct -S 2020-09-30 -E 2020-10-07 -q test`                       |
| `-q jgitest`           | `sacct -S 2020-09-30 -E 2020-10-07 -q jgitest`                    |
| `-q jgitest_shared`    | `sacct -S 2020-09-30 -E 2020-10-07 -q jgitest`                    |
| `-q test_login`        | `module load esslurm ; sacct -S 2020-09-30 -E 2020-10-07 -q test` |
| `-q test_cmem`         | `module load esslurm ; sacct -S 2020-09-30 -E 2020-10-07 -q test` |

Possible impacts of the cscratch1 crash

The original crash damaged some cscratch1 metadata, such as directory information and file access times. The actual data on cscratch1 OSTs was not damaged, and files on $HOME or CFS were also not affected. Users may see errors when accessing a damaged directory, or notice some files missing from that time. In many cases these missing files are recoverable; please open a ticket at https://help.nersc.gov for assistance.

Files on cscratch1 that were being read at the time of the crash may have had their access time reset to zero, i.e. Jan 1, 1970 (the Unix epoch). We will suspend purging of cscratch1 while we work to resolve these access-time issues.
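One way to check whether files in a directory were affected is to ask GNU find for files whose access time is at (or before) the epoch. A minimal, self-contained sketch, in which `SCAN_DIR` and the two demo files are stand-ins (on Cori you would point `SCAN_DIR` at a cscratch1 directory and skip the demo setup):

```shell
# Sketch: find files whose access time was reset to the Unix epoch.
# SCAN_DIR and the demo files are stand-ins for a cscratch1 directory.
SCAN_DIR="${SCAN_DIR:-$(mktemp -d)}"
touch -a -d "1970-01-01 00:00:00 UTC" "$SCAN_DIR/affected_file"  # demo only
touch "$SCAN_DIR/normal_file"                                    # demo only
# GNU find: '! -newerat DATE' matches files whose access time is NOT
# newer than DATE, i.e. here, access times at or near the epoch.
find "$SCAN_DIR" -type f ! -newerat "1970-01-02 00:00:00" -print
```

The find command prints only `affected_file`; `normal_file` has a current access time and is excluded.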

Filesystem operations not involving cscratch1 (such as reading or writing to $HOME or the Community File System) will not have been affected.

Getting help

If you experience difficulties not mentioned here, please let us know by opening a ticket at https://help.nersc.gov

Particle counts due to California fires may limit use of HPSS tape

We may need to pause the HPSS tape libraries due to high particle counts from the nearby fires, which would mean that new data cannot be written to tape. If you urgently need to back up data from Cori scratch, please use the Community File System or copy it to another site instead.
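A minimal sketch of such a backup using plain `cp`. The paths and demo file here are illustrative stand-ins so the snippet is runnable anywhere; on Cori, the source would live under $CSCRATCH and the destination under your project's Community File System directory:

```shell
# Illustrative backup of a directory tree; paths are demo stand-ins.
SRC="${SRC:-$(mktemp -d)/results}"    # stands in for a $CSCRATCH directory
DEST="${DEST:-$(mktemp -d)/backup}"   # stands in for a CFS project directory
mkdir -p "$SRC"
echo "sample data" > "$SRC/data.txt"  # demo content only
mkdir -p "$DEST"
# -a preserves permissions and timestamps; "$SRC/." copies the contents.
cp -a "$SRC/." "$DEST/"
ls "$DEST"
```

For large transfers, a resumable tool such as rsync or Globus may be preferable to a single `cp`, since an interrupted copy can then be restarted where it left off.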