Society is becoming increasingly reliant on computing systems such as data centres, and we are on the cusp of having fully autonomous vehicles. These systems typically use advanced commercial ‘off the shelf’ hardware, such as Graphics Processing Units (GPUs), that contains error checking and correcting codes to increase its reliability. The high computing capability of GPU acceleration is critical to the performance and energy efficiency of these systems, making GPU resilience a priority for safety-critical applications.
One cause of errors is single event effects induced by neutrons from cosmic rays; the resulting transient faults are known as soft errors. NVIDIA brought sample GPU devices to ChipIr to test the effect of neutrons on GPU memories and applications. A main focus of the investigation was GPU DRAM, as this memory is the most area-consuming and sensitive part of the GPU.
During their experiments, the researchers discovered that the high-energy beam was also causing unexpected intermittent errors in DRAM as a result of damage to the devices. They were able to separate these errors from the soft errors. Having isolated the impact of soft errors, they could determine which DRAM logic structures were most affected and design an error-correcting code that corrects more soft errors while simultaneously reducing the risk of silent data corruption.
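To make the trade-off between correction and silent data corruption concrete, the sketch below uses a textbook extended Hamming (8,4) code, which corrects any single bit flip and detects any double bit flip. It is an illustrative assumption only and is not the code designed in the NVIDIA study; the function names `encode`, `decode` and `flip` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits (int 0..15) into an 8-bit SEC-DED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # overall parity enables double-error detection
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2  # 0 when the overall parity check passes
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one, located at index `syndrome`
        # (syndrome 0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: an even number of flips.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1011)
single = decode(flip(word, 5))        # one upset: corrected back to the original data
double = decode(flip(word, 2, 6))     # two upsets: detected, flagged as uncorrectable
triple = decode(flip(word, 1, 2, 3))  # three upsets: "corrected" to the wrong data
```

In this sketch the triple flip is reported as corrected but returns different data, which is a small-scale instance of silent data corruption. A code matched to the error patterns actually observed in the memory aims to make such miscorrections less likely.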
“The ChipIR beamline has been extremely valuable for our studies into the effects of soft errors on NVIDIA GPUs. The stability and strength of the beam and the well-thought-out infrastructure have made it possible to run more experiments in less time.” Mike Sullivan, NVIDIA
Related publication: Characterizing and Mitigating Soft Errors in GPU DRAM, MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, 641–653. DOI: 10.1145/3466752.3480111