Commercial-Off-The-Shelf (COTS) devices are widely used in high-performance computing and in safety-critical applications such as prototype driverless cars because of their high performance, efficiency and low cost. However, the materials used to build these devices might contain boron-10 (10B), making them vulnerable to thermal neutron damage that can result in failures. This study, published in the Journal of Supercomputing, finds that the resulting failure rate may be high enough to severely impact the reliability of these devices.
High-energy neutrons are produced by the interaction of cosmic rays with the atmosphere, and their interaction with silicon chips is considered a main cause of faults in electronic devices. Thermal neutrons are low-energy neutrons (below 0.5 eV) that are produced by the interaction of high-energy neutrons with other materials, or by the emission of neutrons from nuclear decay. Like high-energy neutrons, thermal neutrons can also affect electronic devices. Unfortunately, evaluating the thermal neutron flux in a realistic environment is extremely challenging, as it depends on several factors, including weather conditions and surrounding materials.
Testing for high-energy neutron damage is common for device manufacturers, many of whom come to ISIS to use the ChipIr instrument for this purpose. However, the industry has considered thermal neutron damage to be much less likely, and has therefore not taken it into account.
At ISIS, the group were able to use the ALF/ROTAX beamline to expose the devices to thermal neutrons and the ChipIr beamline to expose them to high-energy neutrons, as shown by the neutron spectra (right). Paolo Rech, Associate Professor at the Institute of Informatics of the Federal University of Rio Grande do Sul, explains: “ISIS is a unique facility for this kind of evaluations, as it features both a high-energy and a thermal neutron beamline. This makes the testing a lot easier, as you can test exactly the same setup in the two beamlines inside the same facility!”
Chris Frost, beamline scientist on ChipIr, explains: “As the importance of thermal neutron damage becomes more apparent, we are developing new capabilities at ISIS to ensure our testing ability remains extensive and world leading.”
The impact of thermal neutrons on electronic devices depends on the isotope of boron present in their materials, as only devices containing 10B are susceptible to thermal neutron damage. Approximately 20% of naturally occurring boron is 10B, with the remainder being 11B. It is possible to solve this problem by using 'depleted' boron, which is primarily 11B. However, this is expensive and unjustified for COTS devices for user applications. This study finds that newer silicon chips are being manufactured in a way that incorporates high levels of boron into COTS devices that are candidates for supercomputing applications.
“We know that using natural boron can pose risks to the reliability of electronic devices, as they become susceptible to thermal neutrons,” explains Paolo. “However, we can't blame the silicon industry for using the cheaper natural boron in their devices for the user market, as reliability is a secondary aspect when compared to price and performance. The increased demand for computing efficiency in supercomputers and automotive systems make COTS devices attractive solutions, increasing the likelihood of damage from thermal neutrons being a problem.”
This investigation took six commercially available devices that are used in high-performance computing, and tested them under both high-energy and thermal neutron irradiation at ISIS. Whilst being irradiated, the devices were run under normal operating conditions and their performance was measured.
Samples being measured on ChipIr
Sample measurement on ROTAX
The experiments showed that all the devices were impacted by thermal neutrons, indicating the presence of 10B within them. Because thermal neutrons have a different energy from the neutrons that come directly from cosmic ray interactions with the atmosphere, they can interact differently with the materials inside the devices. The study found that different codes executed on the same device showed different sensitivities to high-energy and thermal neutrons, depending on how the code accesses the device memory and how it executes instructions.
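Beam tests of this kind are commonly summarised as a cross section: the number of observed errors divided by the neutron fluence the device received. The following is a minimal sketch of that calculation, not the study's own analysis; the function name and every number in it are illustrative assumptions.

```python
def cross_section_cm2(observed_errors: int, fluence_n_per_cm2: float) -> float:
    """Return errors per unit fluence, in cm^2.

    A larger value means the device is more sensitive to that beam.
    """
    if fluence_n_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return observed_errors / fluence_n_per_cm2


# Hypothetical counts for one code on one device, NOT data from the study.
sigma_high_energy = cross_section_cm2(observed_errors=120, fluence_n_per_cm2=1.0e11)
sigma_thermal = cross_section_cm2(observed_errors=30, fluence_n_per_cm2=1.0e11)

# A ratio above 1 means the device is more sensitive, per neutron,
# to the high-energy beam than to the thermal beam.
sensitivity_ratio = sigma_high_energy / sigma_thermal

print(f"high-energy cross section: {sigma_high_energy:.2e} cm^2")
print(f"thermal cross section:     {sigma_thermal:.2e} cm^2")
print(f"high-energy / thermal:     {sensitivity_ratio:.1f}")
```

Repeating the calculation for each code run on the same device would show how the two sensitivities vary with the workload.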
To understand the impact caused by the two types of neutrons, the group needed to know whether the background neutron flux is likely to be high enough for the resulting faults to affect device reliability. In contrast to high-energy neutrons, the rate of thermal neutrons passing through a device depends on its environment and on the presence of other materials close to the device.
The group created a neutron detector, and used it to measure the flux inside a building that replicated the conditions inside a typical data centre. They found that the rate of thermal neutrons, and therefore the failure rate of a device, was dependent on the physical layout of a machine room. It could also be impacted by the weather conditions: on a rainy day, the rate of thermal neutrons could double, causing a similar increase in the failure rate of the device.
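The link between flux and failure rate can be sketched with the standard relation that the expected error rate is the cross section multiplied by the flux. The example below is only an illustration under stated assumptions: the cross sections and flux values are hypothetical placeholders rather than measurements from the study, and only the factor of two for a rainy day is taken from the text above.

```python
HOURS_PER_FIT = 1e9  # FIT = failures per 10^9 device-hours


def fit_rate(cross_section_cm2: float, flux_n_per_cm2_per_h: float) -> float:
    """Expected failures per 10^9 hours: cross section times flux."""
    return cross_section_cm2 * flux_n_per_cm2_per_h * HOURS_PER_FIT


# Hypothetical placeholder values, NOT measurements from the study.
SIGMA_HIGH_ENERGY = 1.2e-9  # cm^2, assumed
SIGMA_THERMAL = 3.0e-10     # cm^2, assumed
FLUX_HIGH_ENERGY = 13.0     # neutrons/cm^2/h, assumed
FLUX_THERMAL_DRY = 4.0      # neutrons/cm^2/h, assumed dry-day value
RAINY_DAY_FACTOR = 2.0      # "could double" on a rainy day, per the article

high_energy = fit_rate(SIGMA_HIGH_ENERGY, FLUX_HIGH_ENERGY)
thermal_dry = fit_rate(SIGMA_THERMAL, FLUX_THERMAL_DRY)
thermal_rainy = fit_rate(SIGMA_THERMAL, FLUX_THERMAL_DRY * RAINY_DAY_FACTOR)

print(f"high-energy contribution:  {high_energy:.2f} FIT")
print(f"thermal, dry day:          {thermal_dry:.2f} FIT")
print(f"thermal, rainy day:        {thermal_rainy:.2f} FIT")
print(f"total, dry day:            {high_energy + thermal_dry:.2f} FIT")
print(f"total, rainy day:          {high_energy + thermal_rainy:.2f} FIT")
```

Because the rate is linear in the flux, doubling the thermal flux doubles the thermal contribution. How much the total changes depends on how large that contribution is relative to the high-energy one.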
This study can therefore be used to inform machine room designers, who could choose to schedule supercomputer tasks so that those requiring a higher level of reliability run on devices in locations, and under weather conditions, that reduce the likelihood of thermal neutron interaction. For driverless cars, the group notes that, even with shielding, the thermal neutron flux may be increased by interaction of the neutrons with the driver, passengers and liquids on board, such as the fuel in the tank.
“We are beginning to see devices that are not designed to be reliable being used in applications that require high reliability,” says Paolo. “This is totally acceptable, but this study shows that we need to carefully consider all variables, including thermal neutrons, before assuming that the device is boron-10 free.”
The full article is available at DOI: 10.1007/s11227-020-03324-9