Enhancing accuracy and explainability in colorectal lesion classification with attention-supervised Vision Transformers
Carlini L.;Lena C.;De Momi E.
2026-01-01
Abstract
Objective: Accurate assessment of colorectal lesion morphology during colonoscopy is essential for guiding treatment and estimating cancer risk. The Paris classification is widely adopted for this purpose but suffers from substantial inter-observer variability, while Vision Transformers (ViTs) can base their decisions on diffuse, off-lesion attention patterns that are hard to interpret. This study investigates whether directly supervising ViT attention maps with expert lesion annotations can simultaneously improve Paris classification performance and model explainability. Method: We propose a Lesion-Focused Attention Loss (GLFA), an attention-supervised pretraining objective that uses expert polyp bounding boxes to focus last-layer [CLS] attention on annotated lesion regions, followed by standard cross-entropy fine-tuning. GLFA is applied to six ViT architectures and evaluated on the public SUN dataset for binary (0-I vs. 0-II) and three-class (0-Ip, 0-Is, 0-IIa) Paris classification. Performance is assessed using frame-wise accuracy and the AttIn metric; we additionally perform an ablation study against a Grad-CAM consistency baseline. Results: Attention-supervised pretraining yields consistent gains in both accuracy and lesion-focused attention. Across the six ViTs, adding GLFA improves three-class accuracy by up to 7 percentage points. In a detailed ablation on ViT-B/16, GLFA outperforms a Grad-CAM consistency baseline by roughly 5 to 13 percentage points across the binary and three-class tasks, and χ² tests confirm a significant association between high AttIn and correct predictions. Conclusion: Direct supervision of ViT attention with GLFA leverages expert knowledge to jointly boost Paris classification accuracy and spatial interpretability, and compares favourably with Grad-CAM-based explanation regularisation. The source code and dataset splits are publicly available at https://github.com/LucaCarlini/SUNDatasetPretraining.

| File | Description | Size | Format |
|---|---|---|---|
| paper_expl.pdf | Publisher's version (open access) | 1.96 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


