Fault Tolerance in Space with Heterogeneous Hardware: Experiences from a 68-day CubeSat Deployment in LEO

Luca Mottola
2025-01-01

Abstract

We report on our experience deploying a CubeSat to study fault and error distributions against different fault tolerance schemes when using Commercial Off-The-Shelf (COTS) hardware in Low-Earth Orbit (LEO). Space radiation commonly causes faults in COTS hardware, such as bit flips in memory, which can lead to errors in a program’s execution. Fault tolerance techniques can prevent faults from turning into errors. Accurately quantifying the fault and error distributions is vital for choosing an appropriate fault tolerance scheme. We equip the CubeSat with heterogeneous hardware combining a regular System on a Chip (SoC) with programmable logic resources. Based on in-orbit experiments and post-processing of logs, we check the validity of two fault models. We find the single fault model to be valid, encouraging the use of techniques such as triple modular redundancy. We also demonstrate, however, that the single-bit error fault model is not valid, which means that common techniques such as Hamming (7,4) codes should not be used. We observe that most errors are short-lived, allowing simple re-executions to correct them. Contrary to intuition, we also conclude that floating-point encodings are more appropriate to build fault-tolerance schemes in our setting, because faults are easier to detect in them than in their integer counterparts. Our insights confirm existing findings in the literature, provide new ones, and offer a foundation to conceive fault tolerance schemes for COTS hardware deployments in space.
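The contrast between the two fault models can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`tmr_vote`, `hamming74_encode`, `hamming74_decode`) are hypothetical, and the code assumes the textbook Hamming (7,4) layout with parity bits at positions 1, 2, and 4. It shows that a bitwise majority vote over three replicas masks a fault in one replica regardless of how many bits it flips, whereas a Hamming (7,4) decoder silently miscorrects a codeword hit by two bit flips.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three replicas of the same value."""
    return (a & b) | (a & c) | (b & c)


def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the suspected flip
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    # One faulty replica with several flipped bits: the vote still recovers.
    value = 0xB2
    print(tmr_vote(value, value ^ 0x0F, value) == value)

    data = [1, 0, 1, 1]
    word = hamming74_encode(data)

    single = list(word)
    single[4] ^= 1  # one bit flip: within the single-bit error model
    print(hamming74_decode(single) == data)

    double = list(word)
    double[2] ^= 1  # two bit flips: outside the single-bit error model
    double[4] ^= 1
    print(hamming74_decode(double) == data)
```

Under these assumptions, the first two checks succeed and the third fails: the decoder "corrects" a third, unaffected bit and returns wrong data without signalling an error.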
2025
Proceedings of the International Conference on Embedded Wireless Sensor Systems (EWSN)
Files in this item:
ahmed25fault.pdf (open access)
Type: Post-Print (draft or Author’s Accepted Manuscript, AAM)
Size: 1.21 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1310206