RE.PUBLIC@POLIMI: research publications of the Politecnico di Milano

Supporting the Design of Data Preparation Pipelines

Sancricca C.;Cappiello C.

2022-01-01

Abstract

The availability of large amounts of data facilitates the spread of a data-driven culture in which data are used and analyzed to support decision-making. However, data-based decisions are effective only if the input data sources are not affected by poor quality and biases. For this reason, the data preparation phase is crucial for guaranteeing appropriate output quality. There is strong evidence in the literature that data preparation is not simple: it is the most resource-consuming step in data analysis, and most of the time it is performed using a trial-and-error approach. Considering this, we aim to support users in the design of data preparation pipelines by identifying the most suitable data transformation and cleaning operations to apply and the order in which they should be executed. To achieve this goal, we conducted a series of experiments, using different datasets and ML algorithms, designed to assess the impact of different types of errors on the quality of the output. The idea is to develop a framework that provides users with guidelines recommending that the data quality issues with the highest negative impact be addressed first. A preliminary validation has confirmed that following the system's suggestions yields better results.

Year of publication: 2022

Book title: Proceedings of the 30th Italian Symposium on Advanced Database Systems (SEBD 2022), Tirrenia (PI), Italy, June 19-22, 2022

Series title: CEUR Workshop Proceedings

Keywords: Bias; Data Preparation; Data Quality; Decision-making

Appears in types: 04.1 Contribution in conference proceedings

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1233970
