Enhancing accuracy and explainability in colorectal lesion classification with attention-supervised Vision Transformers

Carlini L.;Lena C.;De Momi E.
2026-01-01

Abstract

Objective: Accurate assessment of colorectal lesion morphology during colonoscopy is essential for guiding treatment and estimating cancer risk. The Paris classification is widely adopted for this purpose but suffers from substantial inter-observer variability, while Vision Transformers (ViTs) can base their decisions on diffuse, off-lesion attention patterns that are hard to interpret. This study investigates whether directly supervising ViT attention maps with expert lesion annotations can simultaneously improve Paris classification performance and model explainability. Method: We propose a Lesion-Focused Attention Loss (GLFA), an attention-supervised pretraining objective that uses expert polyp bounding boxes to focus last-layer [CLS] attention on annotated lesion regions, followed by standard cross-entropy fine-tuning. GLFA is applied to six ViT architectures and evaluated on the public SUN dataset for binary (0-I vs. 0-II) and three-class (0-Ip, 0-Is, 0-IIa) Paris classification. Performance is assessed using frame-wise accuracy and the AttIn metric; we additionally perform an ablation study against a Grad-CAM consistency baseline. Results: Attention-supervised pretraining yields consistent gains in both accuracy and lesion-focused attention. Across the six ViTs, adding GLFA improves three-class accuracy by up to 7 percentage points. In a detailed ablation on ViT-B/16, GLFA outperforms a Grad-CAM consistency baseline by about 5-13 percentage points across the 2-class and 3-class tasks, and χ² tests confirm a significant association between high AttIn and correct predictions. Conclusion: Direct supervision of ViT attention with GLFA leverages expert knowledge to jointly boost Paris classification accuracy and spatial interpretability, and compares favourably with Grad-CAM-based explanation regularisation. The source code and dataset splits are publicly available at https://github.com/LucaCarlini/SUNDatasetPretraining.
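As a rough illustration of the idea summarised in the abstract (steering [CLS] attention toward the annotated bounding box), the following minimal sketch computes the share of attention mass that falls inside a box and turns it into a penalty. This is not the authors' GLFA formulation or their AttIn definition, neither of which is given here. The function names, the 224-pixel input, the 16-pixel patches, the (x_min, y_min, x_max, y_max) box convention, and the negative-log form of the loss are all assumptions made for illustration.

```python
import numpy as np


def box_to_patch_mask(box, image_size=224, patch_size=16):
    """Binary mask over the patch grid: 1 where a patch overlaps the box.

    box is (x_min, y_min, x_max, y_max) in pixels (assumed convention).
    """
    x0, y0, x1, y1 = box
    n = image_size // patch_size
    starts = np.arange(n) * patch_size
    ends = starts + patch_size
    cols = (ends > x0) & (starts < x1)   # patch columns touching the box
    rows = (ends > y0) & (starts < y1)   # patch rows touching the box
    return np.outer(rows, cols).astype(np.float64)


def attention_inside(cls_attn, mask, eps=1e-12):
    """Fraction of [CLS]-to-patch attention mass lying inside the box.

    cls_attn is a non-negative (n, n) grid, e.g. head-averaged last-layer
    attention from the [CLS] token to the patch tokens (assumed input).
    """
    attn = cls_attn / (cls_attn.sum() + eps)
    return float((attn * mask).sum())


def lesion_attention_loss(cls_attn, mask, eps=1e-12):
    """Hypothetical penalty: small when attention concentrates in the box."""
    return float(-np.log(attention_inside(cls_attn, mask) + eps))


if __name__ == "__main__":
    mask = box_to_patch_mask((0, 0, 112, 112))   # top-left quarter of the image
    uniform = np.ones((14, 14))                  # diffuse attention
    focused = mask.copy()                        # attention only inside the box
    print(attention_inside(uniform, mask))
    print(lesion_attention_loss(uniform, mask))
    print(lesion_attention_loss(focused, mask))
```

In this sketch, diffuse attention over the whole frame is penalised more than attention concentrated inside the annotated region, which is the qualitative behaviour the abstract describes; the actual objective is in the linked repository.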
Attention supervision
Colorectal lesion classification
Paris classification
Trustworthy AI
Vision transformers
Files in this record:
paper_expl.pdf (open access, publisher's version), Adobe PDF, 1.96 MB

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1308394