Robust and efficient single-CNN-based spacecraft relative pose estimation from monocular images
Bechini, Michele; Lavagna, Michèle
2025-01-01
Abstract
Autonomous spacecraft relative navigation is crucial for future space missions, but achieving accurate and efficient pose estimation, especially for uncooperative targets, remains challenging. Despite recent strides in AI-based solutions and benchmark datasets, existing algorithms prioritize accuracy over computational efficiency and struggle to generalize from synthetic to real-world scenarios. This paper addresses these issues by proposing a novel pose estimation algorithm from monocular images based on a single multitasking convolutional neural network (CNN) performing both region of interest (ROI) estimation and keypoint regression. The proposed architecture leverages an outlier rejection scheme based on confidence scores to enhance robustness and reliability, while a check on the need for a second inference is introduced to improve computational efficiency and potentially enable higher navigation filter update rates. To improve the domain-gap bridging capabilities of the introduced pipeline, custom augmentation techniques are presented within this work, including a novel noise augmentation mimicking actual sensor noise. These augmentations have been used during model training and proved beneficial in enhancing the CNN's performance, setting the new highest score for object detection and keypoint regression on a benchmark dataset. The performance of the proposed architecture has been assessed using synthetic SPEED images. The outcomes of the analyses demonstrate that our architecture achieves high accuracy on synthetic SPEED images, with a mean translational error lower than 10 cm and a mean angular error of about 1.4 degrees, outperforming other more complex models while maintaining an execution time on the CPU of an Apple® Silicon™ M1 Pro processor of 60 ms to 131 ms, depending on whether one or two inferences are needed to retrieve the pose.
Further, the standard deviations of the errors are 11.5 cm and 1.0 degree for the translation and attitude errors, respectively, revealing the high precision of the proposed solution and the absence of strong outliers, especially for the relative attitude estimation, where the registered standard deviation is the lowest among the methods available in the literature. The evaluation using mock-up SPEED+ frames confirms the effectiveness of the introduced domain-gap reduction strategies, with performance metrics remaining competitive despite increased errors compared to synthetic images, mainly due to low illumination conditions. Notably, the paper also outlines the extension of the proposed architecture to more complex targets to prove the adaptability of the proposed approach. The outcomes of this analysis confirm that the introduced pipeline still achieves high accuracy in the relative pose estimation task even for targets with complex geometries and a high probability of keypoint occlusions, as in the case of Envisat. Conversely, for highly symmetric targets like VESPA, the performance of the keypoint regression degrades, leading to erroneous estimates due to the uncertainties in the retrieved keypoint locations.
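The confidence-gated outlier rejection and the single- versus double-inference check described in the abstract can be illustrated with a minimal numpy sketch. The threshold values, function names, and the rule of requiring a minimum number of confident keypoints for a PnP solution are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def filter_keypoints(keypoints, scores, conf_threshold=0.5):
    """Keep only the regressed keypoints whose confidence score
    exceeds the threshold (hypothetical outlier-rejection rule)."""
    mask = scores >= conf_threshold
    return keypoints[mask], scores[mask], mask

def needs_second_inference(mask, min_keypoints=6):
    """A PnP pose solution needs enough well-localized points; if too
    few keypoints survive the confidence gate, a second inference on
    the cropped ROI would be triggered (assumed decision rule)."""
    return int(mask.sum()) < min_keypoints

# Example: 11 regressed keypoints with per-point confidence scores.
rng = np.random.default_rng(0)
kps = rng.uniform(0, 1024, size=(11, 2))          # pixel coordinates
scores = np.array([0.9, 0.8, 0.2, 0.95, 0.7, 0.1,
                   0.85, 0.6, 0.3, 0.9, 0.75])

kept, kept_scores, mask = filter_keypoints(kps, scores)
print(kept.shape[0])                # 8 keypoints survive the gate
print(needs_second_inference(mask)) # False: enough points for PnP
```

When enough confident keypoints survive, the pose can be recovered with a single forward pass, which is what allows the lower end of the reported 60 ms to 131 ms execution-time range.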