Deep Multimodal Fusion for 2D-3D Vineyard Scene Understanding
Usuelli M.;Sbrolli C.;Matteucci M.
2026-01-01
Abstract
This study addresses 3D scene understanding in vineyard agriculture through multimodal data fusion for robotic applications. We present an automated annotation pipeline designed to overcome dataset limitations in agricultural environments, enabling robust semantic perception. After benchmarking state-of-the-art architectures, we propose a multimodal framework that integrates zero-shot 2D segmentation with a fine-tuned 3D scene understanding model (CLIP2Scene) to improve semantic segmentation of sparse 3D point clouds. Experimental results highlight how combining 2D-3D semantic data with geospatial (GPS) information generates detailed, semantically enriched vineyard maps. Our work underscores the effectiveness of multimodal fusion in enhancing agricultural robotics, offering scalable solutions for precision farming. This approach not only improves operational efficiency but also promotes sustainable practices by enabling data-driven insights in complex, unstructured environments.
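The abstract does not say how 2D segmentation output is associated with 3D points, so the following is only a generic illustration of one common way to do it, not the authors' pipeline: each point is projected into the image with a pinhole camera model and takes the class label of the pixel it lands on. The function name `project_labels`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy mask are all assumptions made for this sketch; CLIP2Scene and the GPS-based mapping step are not modelled here.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def project_labels(
    points: Sequence[Point],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Optional[int]]:
    """Give each 3D point the label of the pixel it projects to.

    Assumptions (hypothetical, for illustration only):
    - points are already expressed in the camera frame, with z pointing forward;
    - mask[v][u] holds an integer class label for pixel column u, row v;
    - lens distortion is ignored (ideal pinhole camera).
    Points behind the camera or outside the image get None.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    labels: List[Optional[int]] = []
    for x, y, z in points:
        if z <= 0:
            # The point is behind the image plane, so it cannot be observed.
            labels.append(None)
            continue
        # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < width and 0 <= v < height:
            labels.append(mask[v][u])
        else:
            labels.append(None)
    return labels


if __name__ == "__main__":
    # Toy 4x4 segmentation mask with four class regions (labels 0..3).
    toy_mask = [
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [2, 2, 3, 3],
        [2, 2, 3, 3],
    ]
    toy_points = [(0.0, 0.0, 1.0), (-1.0, -1.0, 1.0), (0.0, 0.0, -1.0)]
    print(project_labels(toy_points, toy_mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0))
```

In a setting like the one described, the labels transferred this way could serve as automatic annotations or as a 2D prior alongside the predictions of a 3D model; how the two are actually combined in the paper is not stated in the abstract.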


