Deep Multimodal Fusion for 2D-3D Vineyard Scene Understanding

Usuelli, M.; Sbrolli, C.; Braccini, A.; Frosi, M.; Matteucci, M.

doi:10.1007/978-3-032-10185-3_13

This study addresses 3D scene understanding in vineyard agriculture through multimodal data fusion for robotic applications. We present an automated annotation pipeline designed to overcome dataset limitations in agricultural environments, enabling robust semantic perception. After benchmarking state-of-the-art architectures, we propose a multimodal framework that integrates zero-shot 2D segmentation with a fine-tuned 3D scene understanding model (CLIP2Scene) to improve semantic segmentation of sparse 3D point clouds. Experimental results highlight how combining 2D-3D semantic data with geospatial (GPS) information generates detailed, semantically enriched vineyard maps. Our work underscores the effectiveness of multimodal fusion in enhancing agricultural robotics, offering scalable solutions for precision farming. This approach not only improves operational efficiency but also promotes sustainable practices by enabling data-driven insights in complex, unstructured environments.
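The abstract does not say how the 2D and 3D semantics are combined. As an illustration only, and not the authors' method, the sketch below shows one generic way to carry 2D segmentation labels onto a sparse point cloud: project each 3D point into the image with a pinhole camera model and read the class at the pixel it lands on. The function name `transfer_labels`, the intrinsic matrix `K`, the toy mask, and the sample points are all assumptions made for this example.

```python
import numpy as np

def transfer_labels(points_cam, label_mask, K, unlabeled=-1):
    """Give each 3D point the class of the pixel it projects to.

    points_cam: (N, 3) points already expressed in the camera frame (assumed).
    label_mask: (H, W) integer class map from a 2D segmenter (assumed).
    K:          (3, 3) pinhole intrinsic matrix (assumed).
    """
    labels = np.full(len(points_cam), unlabeled, dtype=np.int64)

    # Points behind the camera cannot be seen, so they stay unlabeled.
    in_front = points_cam[:, 2] > 0

    # Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy.
    uvw = points_cam[in_front] @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Keep only projections that fall inside the image bounds.
    h, w = label_mask.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = label_mask[v[inside], u[inside]]
    return labels

# Toy data: a 48x64 mask whose right half is class 1 and left half class 0.
mask = np.zeros((48, 64), dtype=np.int64)
mask[:, 32:] = 1
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.1, 0.0, 1.0],    # projects into the right half
                   [-0.1, 0.0, 1.0],   # projects into the left half
                   [0.0, 0.0, -1.0],   # behind the camera
                   [5.0, 0.0, 1.0]])   # outside the image
labels = transfer_labels(points, mask, K)
print(labels.tolist())
```

Points that are behind the camera or outside the image keep the `unlabeled` value. A real pipeline would also need the LiDAR-to-camera extrinsics and some form of occlusion handling, both of which are left out here.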


2026-01-01



Year of publication: 2026
Book title: Lecture Notes in Computer Science
Series title: Lecture Notes in Computer Science
ISBN (International Standard Book Number): 9783032101846; 9783032101853
Keywords: 3D Scene Understanding; Agricultural Robotics; Multimodal Learning
Publication type: 02.1 Contribution in Volume


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1309042


RE.PUBLIC@POLIMI, research publications of the Politecnico di Milano
