FLARE: Fine-tuned large LAnguage models for Resource-Efficient action generation in robotics

Roveda, Loris
2025-01-01

Abstract

Despite recent progress in robotic manipulation, robots still face difficulties generating actions across new tasks, objects, and environments. While foundation models such as Large Language Models (LLMs) show potential in robotic learning, they exhibit several limitations in complex manipulation tasks. In addition, LLMs often depend on pre-trained actions or require reinforcement learning, and end-to-end robotic models demand vast amounts of data and computational power. Furthermore, building extensive multimodal datasets for real-world robotic applications is time-consuming, and training large foundation models is resource-intensive. This paper presents a framework that overcomes these challenges by employing an LLM fine-tuned with a Parameter-Efficient Fine-Tuning (PEFT) technique to tailor it to robotic tasks. Our approach requires no real-world data for fine-tuning: the training data are generated synthetically, without relying on images or multimodal inputs. This allows the LLM to directly produce generalized action plans in real-world settings, enabling the robot to perform seven tasks - including pick-and-place, stacking, lifting, and directional movements - after just a few hours of training on simulated data. By integrating a YOLO-based vision module for perception, our modular architecture achieves task success rates comparable to state-of-the-art robotic learning models on specific tasks. The primary advantages of our method are that it is trained entirely on synthetic data, provides exceptionally fast inference, and operates efficiently on a single commercial GPU for both training and inference. These features make the framework highly practical and accessible for industrial use, offering a cost-effective solution in terms of time and resources.
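
The abstract gives only a high-level description of the framework; the paper's actual code, base model, and hyperparameters are not reproduced here. Purely as an illustration of the kind of pipeline described, the sketch below fine-tunes a causal LLM with a LoRA-based PEFT adapter on synthetic "scene + command → action plan" pairs, where the scene description stands in for what a YOLO-based detector might report. The base model name, data format, action vocabulary, and training settings are all assumptions, not details taken from the paper.

```python
# Minimal illustrative sketch (assumptions, not the authors' released code):
# LoRA-based PEFT fine-tuning of a causal LLM on synthetic
# "scene + command -> action plan" pairs, small enough for a single GPU.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base model, not confirmed by the paper

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Attach LoRA adapters so only a small fraction of parameters is trained (PEFT).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# One synthetic training pair: object names/positions (as a YOLO detector could
# report them) plus a command, mapped to a structured action plan.
# The prompt format and action primitives are invented for this example.
examples = [{
    "prompt": ("Objects: red_cube (0.42, 0.10, 0.02); blue_cube (0.30, -0.05, 0.02)\n"
               "Command: stack the red cube on the blue cube.\nPlan:"),
    "plan": " pick(red_cube); place_on(blue_cube); release()",
}]

def tokenize(example):
    # Concatenate prompt and plan into one causal-LM sequence; labels mirror the inputs.
    text = example["prompt"] + example["plan"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=256, padding="max_length")
    enc["labels"] = enc["input_ids"].copy()
    return enc

train_set = Dataset.from_list(examples).map(tokenize, remove_columns=["prompt", "plan"])

args = TrainingArguments(output_dir="flare_lora", per_device_train_batch_size=4,
                         num_train_epochs=3, learning_rate=2e-4, logging_steps=10)
Trainer(model=model, args=args, train_dataset=train_set).train()
```

At inference, a pipeline of this kind would presumably fill the same prompt template with live detections from the vision module and parse the generated plan string into robot motion primitives; the separation of perception (YOLO) and planning (fine-tuned LLM) is what the abstract refers to as the modular architecture.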
2025
58th CIRP Conference on Manufacturing Systems 2025
Keywords: foundation models in robotics; generative AI; large language models (LLMs); parameter-efficient fine-tuning (PEFT); pre-trained language models; robot learning; specialized LLMs
Files in this record:
1-s2.0-S2212827125005761-main (1).pdf (open access, Publisher's version, Adobe PDF, 827.32 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1294576
Citations
  • Scopus 0