Supporting the Design of Data Preparation Pipelines

Sancricca C.; Cappiello C.
2022-01-01

Abstract

The availability of large amounts of data fosters a data-driven culture in which data are used and analyzed to support decision-making. However, data-based decisions are effective only if the input data sources are free of quality problems and biases. For this reason, the data preparation phase is crucial for guaranteeing appropriate output quality. There is strong evidence in the literature that data preparation is not simple: it is the most resource-consuming step in data analysis, and most of the time it is performed using a trial-and-error approach. Considering this, we aim to support users in the design of data preparation pipelines by identifying the most suitable data transformation and cleaning operations to apply and the order in which they should be executed. To achieve this goal, we conducted a series of experiments, using different datasets and ML algorithms, designed to assess the impact of different types of errors on the quality of the output. The idea is to develop a framework that provides users with guidelines recommending that the data quality issues with the highest negative impact be addressed first. A preliminary validation has confirmed that following the system's suggestions yields better results.
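The prioritization idea in the abstract can be illustrated with a minimal sketch. It assumes that a model score has already been measured on a clean dataset and again after injecting one error type at a time. The function name `rank_issues_by_impact`, the issue labels, and all numbers are hypothetical placeholders for illustration, not the authors' framework or their results.

```python
from typing import Dict, List, Tuple


def rank_issues_by_impact(
    baseline_score: float,
    degraded_scores: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Order data quality issues by how much each one lowers the model score.

    baseline_score: score of the model trained on the clean dataset.
    degraded_scores: score obtained after injecting one error type at a time.
    """
    impacts = {
        issue: baseline_score - score
        for issue, score in degraded_scores.items()
    }
    # Largest drop first: that issue is the one to address first.
    return sorted(impacts.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers chosen only to exercise the function, not measured results.
example_scores = {"missing_values": 0.70, "outliers": 0.80, "duplicates": 0.85}
ranking = rank_issues_by_impact(0.90, example_scores)
for issue, drop in ranking:
    print(f"{issue}: score drop {drop:.2f}")
```

Under these assumptions, the first entry of `ranking` is the issue a user would be advised to clean first. A fuller version would presumably repeat the measurement across datasets and ML algorithms, as the abstract describes.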
2022
Proceedings of the 30th Italian Symposium on Advanced Database Systems (SEBD 2022), Tirrenia (PI), Italy, June 19-22, 2022
Bias
Data Preparation
Data Quality
Decision-making
Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1233970