Cross-modal fusion of monocular images and neuromorphic streams for 6D pose estimation of non-cooperative targets

Maestrini, Michele; Massari, Mauro; Di Lizia, Pierluigi
2025-01-01

Abstract

On-orbit servicing and active debris removal of non-cooperative spacecraft rely on autonomous navigation systems, particularly visual navigation, to perform orbiting and approach maneuvers around the target. Robust pose estimation is a critical technology for visual autonomous navigation, providing position and orientation measurements for navigation filters. Learning-based monocular vision methods have been widely adopted for spacecraft pose estimation, achieving superior performance compared to traditional approaches. However, extreme illumination in space can blur the target's texture, reducing the accuracy and robustness of conventional monocular vision systems. Unlike traditional cameras, event cameras (neuromorphic cameras) capture asynchronous brightness changes at the pixel level. These cameras exhibit remarkable attributes, including high dynamic range, low latency, and resistance to motion blur, making them well suited to high-dynamic-range scenes and high-speed motion. These features make event cameras an ideal complement to standard optical cameras for pose estimation. This work proposes a data fusion approach that integrates monocular images and event streams to estimate the pose of non-cooperative targets. An end-to-end architecture is designed, combining Convolutional Neural Networks and Transformers to extract detailed local features and global contextual information. The self-attention mechanism within the Transformer facilitates the alignment of cross-modal features, and the fusion framework effectively leverages the complementary properties of the two sensing modalities to improve model performance in challenging space environments. To support this research, the event-monocular-spacecraft dataset is generated as the first publicly available synthetic event-monocular dataset, containing RGB images under varied lighting conditions, corresponding event streams, and precise pose labels. Evaluation on the generated dataset demonstrates that the proposed fusion approach achieves an 87% high-precision estimation rate, a 5.4% improvement over the event-only model and a 9.5% improvement over the monocular-only model. To demonstrate robustness and effectiveness, the method is also evaluated on a publicly available real-world dataset and compared with other state-of-the-art pose estimation methods: it achieves an average translation error of 0.0535 m and an average angular error of 3.082°, whereas the compared method attains an average translation error of 0.151 m and an average angular error of 2.363°. Extensive experiments demonstrate that the proposed cross-modal fusion method significantly outperforms single-sensor approaches, achieving superior pose estimation accuracy and maintaining robust performance under extreme lighting conditions.
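The abstract describes the architecture only at a high level: CNN encoders for local features, a Transformer whose self-attention aligns the monocular and event tokens, and regression of the 6D pose. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the authors' implementation; the event voxel-grid input, the layer sizes, the learned modality embeddings, and the translation-plus-quaternion output heads are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of a CNN + Transformer cross-modal fusion network for
# 6-DoF pose regression. The event voxel-grid representation, layer sizes,
# and output heads are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class ConvBackbone(nn.Module):
    """Small CNN that turns an image-like tensor into a sequence of local-feature tokens."""

    def __init__(self, in_channels, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        f = self.net(x)                       # (B, dim, H/8, W/8)
        return f.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence


class FusionPoseNet(nn.Module):
    """Concatenates RGB and event tokens so Transformer self-attention can align
    the two modalities before regressing translation and rotation."""

    def __init__(self, event_bins=5, dim=256, heads=8, layers=4):
        super().__init__()
        self.rgb_enc = ConvBackbone(3, dim)            # monocular RGB branch
        self.evt_enc = ConvBackbone(event_bins, dim)   # event voxel-grid branch
        self.modality_emb = nn.Parameter(torch.zeros(2, dim))  # distinguishes modalities
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, layers)
        self.head_t = nn.Linear(dim, 3)   # translation (x, y, z)
        self.head_q = nn.Linear(dim, 4)   # rotation as a unit quaternion

    def forward(self, rgb, events):
        tok_rgb = self.rgb_enc(rgb) + self.modality_emb[0]
        tok_evt = self.evt_enc(events) + self.modality_emb[1]
        tokens = torch.cat([tok_rgb, tok_evt], dim=1)  # joint token sequence
        fused = self.transformer(tokens).mean(dim=1)   # pooled global feature
        t = self.head_t(fused)
        q = nn.functional.normalize(self.head_q(fused), dim=-1)
        return t, q


# Example forward pass on dummy data (batch of 2, 256x256 inputs, 5-bin event voxel grid).
model = FusionPoseNet()
t, q = model(torch.randn(2, 3, 256, 256), torch.randn(2, 5, 256, 256))
print(t.shape, q.shape)  # torch.Size([2, 3]) torch.Size([2, 4])
```

Concatenating the token sequences of both modalities before a shared Transformer is one common way to let self-attention perform cross-modal alignment; the paper may instead use cross-attention or fuse at a different stage. Likewise, the translation and angular errors quoted in the abstract are not defined there; the snippet below assumes the standard Euclidean and quaternion-geodesic definitions.

```python
# Assumed error metrics: Euclidean translation error (metres) and geodesic
# angular error (degrees) between predicted and ground-truth unit quaternions.
import numpy as np


def translation_error(t_pred, t_gt):
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))


def angular_error_deg(q_pred, q_gt):
    q_pred = np.asarray(q_pred) / np.linalg.norm(q_pred)
    q_gt = np.asarray(q_gt) / np.linalg.norm(q_gt)
    # |<q1, q2>| handles the double cover (q and -q describe the same rotation)
    d = np.clip(abs(np.dot(q_pred, q_gt)), -1.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(d)))


print(translation_error([0.0, 0.0, 10.0], [0.02, -0.03, 10.04]))  # ~0.054 m
print(angular_error_deg([1, 0, 0, 0], [0.9996, 0.027, 0, 0]))     # ~3.1 deg
```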
Cross-modal fusion
Event streams
Monocular images
Non-cooperative targets
Pose estimation
Transformer
Files in this product:

YISHW01-25.pdf: Publisher's version, Adobe PDF, 5.34 MB (restricted access)
YISHW_OA_01-25.pdf: Post-print (Author's Accepted Manuscript, AAM), Adobe PDF, 1.41 MB (under embargo until 16/05/2027)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1291205
Citations
  • PMC: not available
  • Scopus: 2
  • Web of Science (ISI): 0