
Leveraging Self-supervised Pretraining Using Transformers for Enhanced Lung Nodule Detection in CT Scans

Liu, Jiaying; Corti, Anna; Corino, Valentina; Mainardi, Luca
2025-01-01

Abstract

Lung nodule detection is critical for early diagnosis of lung cancer, but remains challenging due to the nodules’ resemblance to normal tissues. Recent transformer-based approaches have made significant progress; however, their large number of parameters necessitates extensive annotated datasets to achieve robust and reliable results. To address this, we leverage state-of-the-art self-supervised training methods, specifically Masked Image Modeling, on a large domain-specific dataset of lung screening CTs, followed by fine-tuning on the annotated LUNA16 dataset. Our method achieves an AP of 82.63% and an mAP of 81.23%, outperforming the baseline nnDetection. The experiments demonstrate the effectiveness of pretraining, yielding an increase of 24.0% in performance on the Video-ViT backbone and 4.1% on the Swin Transformer. Additionally, we examine the effect of RGB video pretraining and architectural variations during both pretraining and fine-tuning stages. This work highlights the potential of self-supervised learning in improving efficiency and accuracy in lung cancer screening. Code: github.com/vit-swin-lung-nodule-detection.
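The Masked Image Modeling pretraining described above can be illustrated with a minimal sketch of MAE-style random patch masking. The patch size, crop size, and masking ratio below are illustrative assumptions, not the paper's reported settings:

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """MAE-style random masking: hide a fixed fraction of patches.

    Returns a boolean array where True marks a masked (hidden) patch;
    the encoder sees only the visible patches and the decoder is trained
    to reconstruct the masked ones.
    """
    rng = np.random.default_rng(rng)
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask

# Illustrative numbers: a 96x96x96 CT crop split into 16^3 voxel patches
# gives 6^3 = 216 patch tokens; a 0.75 ratio hides 162 of them.
mask = random_patch_mask(216, mask_ratio=0.75, rng=0)
```

During pretraining the reconstruction loss is computed only on the masked positions, which is what lets the pretext task scale to large unlabeled screening datasets before fine-tuning on LUNA16.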
Machine Learning in Medical Imaging. MLMI 2025. Lecture Notes in Computer Science
ISBN: 9783032095121; 9783032095138
Files in this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1311032
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a