Improving Content-Based Data Product Retrieval in Federated Environments with LLM and Sampling

Falconi, Matteo; Plebani, Pierluigi
2025-01-01

Abstract

Data products have emerged as a powerful paradigm for managing data in both intra-enterprise and federated environments, providing structured data assets that include not only the data itself but also services, metadata, and access policies. However, a key challenge in federated environments is the discovery of relevant data products. Traditional discovery mechanisms are heavily dependent on metadata, which is often inconsistent, incomplete, or not standardized across organizations. This lack of metadata quality significantly limits the effectiveness of discovery, making it difficult for consumers to identify and retrieve the data they need. To address this challenge, we propose a content-based discovery framework that shifts the focus from metadata to the actual content of data products. Our approach uses sampling techniques to extract meaningful data representations and a tabular retrieval model for natural language queries. Directly interacting with data improves discovery accuracy, enabling effective data access in federated environments.
2025
Advanced Information Systems Engineering Workshops. CAiSE 2025
9783031949302
9783031949319
Files in this item:
CAiSE_25_Workshop-3.pdf
Under embargo until 18/12/2025
Version: Pre-Print (or Pre-Refereeing)
Size: 390.66 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1292658