Graph Against the Machine: a Neuro-Symbolic Approach for Enhanced Video Question Answering

Fabio Lusha; Agnese Chiatti; Nico Catalano; Matteo Matteucci
2025-01-01

Abstract

Video Question Answering (VideoQA) is a key problem in advanced video understanding. The rise of Multimodal Large Language Models (MLLMs) has accelerated progress on VideoQA tasks. However, MLLMs can produce inconsistent outputs even for similar prompts, and they suffer from hallucinations and biases. In this position paper, we envisage a novel pipeline in which scene graphs representing the people, objects, and relationships in a video are injected into the MLLM prompt. We hypothesise that leveraging a symbolic representation of the video content can improve the accuracy and verifiability of MLLMs for VideoQA and reduce their latency.
2025
ANSyA 2025 Advanced Neuro-Symbolic Applications
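
To make the idea of injecting a scene graph into the prompt concrete, here is a minimal sketch in Python. It assumes the scene graph is available as (subject, predicate, object) triples tagged with a frame index and that it is serialised as plain text ahead of the question. The abstract does not specify a serialisation format, so the Triple class, the function names, and the prompt wording are hypothetical and not the authors' method. No MLLM is called; the sketch only builds the prompt string.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge observed at a given video frame (hypothetical schema)."""
    frame: int
    subject: str
    predicate: str
    obj: str


def scene_graph_to_text(triples: Iterable[Triple]) -> str:
    """Serialise triples as one line each, ordered by frame index."""
    ordered = sorted(triples, key=lambda t: t.frame)
    return "\n".join(
        f"[frame {t.frame}] ({t.subject}, {t.predicate}, {t.obj})" for t in ordered
    )


def build_prompt(triples: Iterable[Triple], question: str) -> str:
    """Place the serialised scene graph before the question (assumed prompt layout)."""
    graph_text = scene_graph_to_text(triples)
    return (
        "Scene graph extracted from the video:\n"
        f"{graph_text}\n\n"
        f"Question: {question}\n"
        "Answer using only the facts listed in the scene graph."
    )


# Illustrative data only; not taken from the paper.
example_triples = [
    Triple(12, "person", "holding", "cup"),
    Triple(3, "person", "sitting_on", "sofa"),
]
example_prompt = build_prompt(example_triples, "What is the person holding?")
print(example_prompt)
```

Because the graph is passed as text, each answer can be checked against the listed triples, which is one way to read the verifiability claim in the abstract.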
Files in this record:
paper_17.pdf (full paper manuscript, publisher's version; open access; Adobe PDF, 1.07 MB)


Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1308431