Deep Multimodal Fusion for 2D-3D Vineyard Scene Understanding

Usuelli M.; Sbrolli C.; Matteucci M.
2026-01-01

Abstract

This study addresses 3D scene understanding in vineyard agriculture through multimodal data fusion for robotic applications. We present an automated annotation pipeline designed to overcome dataset limitations in agricultural environments, enabling robust semantic perception. After benchmarking state-of-the-art architectures, we propose a multimodal framework that integrates zero-shot 2D segmentation with a fine-tuned 3D scene understanding model (CLIP2Scene) to improve semantic segmentation of sparse 3D point clouds. Experimental results show that combining 2D-3D semantic data with geospatial (GPS) information produces detailed, semantically enriched vineyard maps. Our work underscores the effectiveness of multimodal fusion in agricultural robotics and offers scalable solutions for precision farming. The approach improves operational efficiency and promotes sustainable practices by enabling data-driven insights in complex, unstructured environments.
2026
Lecture Notes in Computer Science
9783032101846
9783032101853
3D Scene Understanding
Agricultural Robotics
Multimodal Learning

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1309042