Improving Content-Based Data Product Retrieval in Federated Environments with LLM and Sampling
Falconi, Matteo;Plebani, Pierluigi
2025-01-01
Abstract
Data products have emerged as a powerful paradigm for managing data in both intra-enterprise and federated environments, providing structured data assets that include not only the data itself but also services, metadata, and access policies. However, a key challenge in federated environments is the discovery of relevant data products. Traditional discovery mechanisms are heavily dependent on metadata, which is often inconsistent, incomplete, or not standardized across organizations. This lack of metadata quality significantly limits the effectiveness of discovery, making it difficult for consumers to identify and retrieve the data they need. To address this challenge, we propose a content-based discovery framework that shifts the focus from metadata to the actual content of data products. Our approach uses sampling techniques to extract meaningful data representations and a tabular retrieval model for natural language queries. Directly interacting with data improves discovery accuracy, enabling effective data access in federated environments.

| File | Size | Format |
|---|---|---|
| CAiSE_25_Workshop-3.pdf (Pre-Print / Pre-Refereeing; embargoed until 18/12/2025) | 390.66 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


