Leveraging Self-supervised Pretraining Using Transformers for Enhanced Lung Nodule Detection in CT Scans
Jiaying Liu; Anna Corti; Valentina Corino; Luca Mainardi
2025-01-01
Abstract
Lung nodule detection is critical for early diagnosis of lung cancer, but remains challenging due to the nodules’ resemblance to normal tissues. Recent transformer-based approaches have made significant progress; however, their large number of parameters necessitates extensive annotated datasets to achieve robust and reliable results. To address this, we leverage state-of-the-art self-supervised training methods, specifically Masked Image Modeling, on a large domain-specific dataset of lung screening CTs, followed by fine-tuning on the annotated LUNA16 dataset. Our method achieves an AP of 82.63% and an mAP of 81.23%, outperforming the baseline nnDetection. The experiments demonstrate the effectiveness of pretraining, yielding a performance increase of 24.0% with the Video-ViT backbone and 4.1% with the Swin Transformer. Additionally, we examine the effect of RGB video pretraining and of architectural variations during both the pretraining and fine-tuning stages. This work highlights the potential of self-supervised learning for improving efficiency and accuracy in lung cancer screening. Code: github.com/vit-swin-lung-nodule-detection.
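To make the Masked Image Modeling objective mentioned above concrete, the following is a minimal NumPy sketch: a 3D CT volume is split into patches, a random fraction of patches is hidden, and the reconstruction loss is computed only on the hidden patches. The patch size, mask ratio, and the `reconstruct` callable are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

def patchify(volume, p):
    """Split a volume (D, H, W) into non-overlapping p**3 patches,
    returned as an array of shape (num_patches, p*p*p)."""
    d, h, w = volume.shape
    patches = volume.reshape(d // p, p, h // p, p, w // p, p)
    return patches.transpose(0, 2, 4, 1, 3, 5).reshape(-1, p ** 3)

def mim_loss(volume, reconstruct, mask_ratio=0.75, p=4, rng=None):
    """Masked Image Modeling objective: hide a random subset of patches,
    have the model predict all patches from the visible ones, and score
    MSE on the masked patches only."""
    rng = rng or np.random.default_rng(0)
    patches = patchify(volume, p)
    n_masked = int(mask_ratio * patches.shape[0])
    masked_idx = rng.choice(patches.shape[0], size=n_masked, replace=False)
    visible = patches.copy()
    visible[masked_idx] = 0.0            # stand-in for a learned mask token
    pred = reconstruct(visible)          # model output, same shape as patches
    return np.mean((pred[masked_idx] - patches[masked_idx]) ** 2)

# Toy check: an oracle returning the clean patches incurs zero loss,
# while returning the masked input does not.
vol = np.random.default_rng(1).standard_normal((16, 16, 16))
oracle_loss = mim_loss(vol, reconstruct=lambda v: patchify(vol, 4))   # → 0.0
identity_loss = mim_loss(vol, reconstruct=lambda v: v)                # > 0
```

In the actual pretraining setup, `reconstruct` would be the transformer backbone (Video-ViT or Swin) plus a lightweight decoder; only that encoder is then kept and fine-tuned for detection on LUNA16.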


