FLARE: Fine-tuned large LAnguage models for Resource-Efficient action generation in robotics

Roveda, Loris
2025-01-01

Abstract

Despite recent progress in robotic manipulation, robots still face difficulties generating actions across new tasks, objects, and environments. While foundation models such as Large Language Models (LLMs) show potential in robotic learning, they exhibit several limitations in complex manipulation tasks. In addition, LLMs often depend on pre-trained actions or require reinforcement learning, and end-to-end robotic models demand vast amounts of data and computational power. Furthermore, building extensive multimodal datasets for real-world robotic applications is time-consuming, and training large foundation models is resource-intensive. This paper presents a framework that overcomes these challenges by employing an LLM fine-tuned with a Parameter-Efficient Fine-Tuning (PEFT) technique to tailor it to robotic tasks. Our approach requires no real-world data for fine-tuning: the training data are generated synthetically, without relying on images or multimodal inputs. This allows the LLM to directly produce generalized action plans in real-world settings, enabling the robot to perform seven tasks - including pick-and-place, stacking, lifting, and directional movements - after just a few hours of training on simulated data. By integrating a YOLO-based vision module for perception, our modular architecture achieves task success rates comparable to state-of-the-art robotic learning models on specific tasks. The primary advantages of our method are that it is trained entirely on synthetic data, provides exceptionally fast inference, and operates efficiently on a single commercial GPU for both training and inference. These features make the framework highly practical and accessible for industrial use, offering a cost-effective solution in terms of time and resources.
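
The abstract gives only a high-level description of the framework; the paper's actual code, base model, and hyperparameters are not reproduced here. Purely as an illustration of the kind of pipeline described, the sketch below fine-tunes a causal LLM with a LoRA-based PEFT adapter on synthetic "scene + command → action plan" pairs, where the scene description stands in for what a YOLO-based detector might report. The base model name, data format, action vocabulary, and training settings are all assumptions, not details taken from the paper.

```python
# Minimal illustrative sketch (assumptions, not the authors' released code):
# LoRA-based PEFT fine-tuning of a causal LLM on synthetic
# "scene + command -> action plan" pairs, small enough for a single GPU.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base model, not confirmed by the paper

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Attach LoRA adapters so only a small fraction of parameters is trained (PEFT).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# One synthetic training pair: object names/positions (as a YOLO detector could
# report them) plus a command, mapped to a structured action plan.
# The prompt format and action primitives are invented for this example.
examples = [{
    "prompt": ("Objects: red_cube (0.42, 0.10, 0.02); blue_cube (0.30, -0.05, 0.02)\n"
               "Command: stack the red cube on the blue cube.\nPlan:"),
    "plan": " pick(red_cube); place_on(blue_cube); release()",
}]

def tokenize(example):
    # Concatenate prompt and plan into one causal-LM sequence; labels mirror the inputs.
    text = example["prompt"] + example["plan"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=256, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

train_set = Dataset.from_list(examples).map(tokenize, remove_columns=["prompt", "plan"])

args = TrainingArguments(output_dir="flare_lora", per_device_train_batch_size=4,
                         num_train_epochs=3, learning_rate=2e-4, logging_steps=10)
Trainer(model=model, args=args, train_dataset=train_set).train()
```

At inference, a pipeline of this kind would presumably fill the same prompt template with live detections from the vision module and parse the generated plan string into robot motion primitives; the separation of perception (YOLO) and planning (fine-tuned LLM) is what the abstract refers to as the modular architecture.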
2025
58th CIRP Conference on Manufacturing Systems 2025
Keywords: foundation models in robotics; generative AI; large language models (LLMs); parameter-efficient fine-tuning (PEFT); pre-trained language models; robot learning; specialized LLMs
Files in this record:
1-s2.0-S2212827125005761-main (1).pdf (open access, Publisher's version, Adobe PDF, 827.32 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1294576
Citations
  • Scopus 0