During the Covid-19 pandemic, travel restrictions meant that most experiments at ISIS were led by ISIS staff, with the users operating the beamline remotely. For most ISIS instruments, the data is collected locally on the instrument's data acquisition system. This allows for immediate diagnostics and some analysis during the experiment, followed by the full scientific analysis later. With the hands-on expertise of the ISIS staff on site, and existing data connections, ISIS was able to conduct many experiments in this remote way.
However, for the ChipIr beamline, which tests the resilience of electronic chips to atmospheric neutrons, the situation is different: it is the samples themselves that collect large amounts of data whilst being bombarded by neutrons. Each device requires constant monitoring, reprogramming, rebooting, power-cycling and, sometimes, replacing if it fails completely. With multiple devices being tested simultaneously, a typical experiment requires someone from the experimental team to be present at all times.
Getting round this issue when travel restrictions made it impossible for external users to be present required rethinking the physical setup, data connections and methodologies. For the experiment that was the focus of this study, an international team were able to fully control the electronics remotely and simultaneously from their home institutions in Italy and Brazil, while the task of installing and managing the physical test setup at ISIS was carried out by the ChipIr team, which included industrial placement student Sujit Malde.
Naturally, for the users it was not the same as being present at ISIS, but the changes made so that the experiment could run in this manner still allowed the team to conduct a highly successful test. The group, led by Paolo Rech, who has recently joined the University of Trento in Italy, and including students Fernando Fernandes dos Santos and Rubens Rech, was able to study the sources of Silent Data Corruptions (SDCs) and Detected Unrecoverable Errors (DUEs) in NVIDIA GPUs and Google TPUs executing neural networks for object detection and other critical tasks.
Their study, published in IEEE Transactions on Nuclear Science and presented at the 2022 Design, Automation and Test in Europe Conference, highlights the remote working method. It sought to understand the causes of these SDCs and DUEs, and how the devices could be adapted to mitigate such errors when they do occur.
Using a virtual private network, the researchers could monitor and adjust the applications running on the different GPUs and TPUs remotely, turn the devices on and off, and perform power cycles and reboots when needed. This let them explore how their error-correcting code was able to reduce the error rate in some instances, but not all.
The forced experimental circumstances gave the ChipIr team the opportunity to do things a bit differently. Although future experiments will see users return to the beamline cabin, this method is likely to be used alongside an in-person presence to make future experiments easier by widening the experimental team beyond those who can travel to ISIS.
The users might even be able to make the most of being in different time zones and share the workload without having to miss any sleep!
The full paper on GPUs can be found at DOI: 10.1109/TNS.2022.3141341 and the paper on TPUs is available at DOI: 10.1109/TNS.2022.3142092.