
Engineering MLOps Pipelines With Data Quality: A Case Study on Tabular Datasets in Kaggle

Matteo Camilli
2025-01-01

Abstract

Ensuring high-quality data is crucial for the successful deployment of machine learning models and for sustaining the operational pipelines built around them. However, many practitioners do not currently use data quality checks or measurements as gateways for model construction and operationalization, indicating a need for greater awareness and adoption of such tools. In this study, we propose an approach for automating the construction of machine learning pipelines by means of (semi-)automated data quality checks. We focus on tabular data, the most widely used structured data format in such pipelines. Our work builds on a subset of quality metrics that are particularly relevant in machine learning operations (MLOps) pipelines, identified through our engagement with expert practitioners. From a cohort of similar tools, we selected Deepchecks, a well-known tool for conducting data quality checks, to evaluate the quality of datasets collected from Kaggle, a widely used platform for machine learning competitions and data science projects. We also analyze the main features Kaggle uses to rank its datasets and use them to validate the relevance of our approach. Our results show the potential of automated data quality checks to improve the efficiency and effectiveness of MLOps pipelines and their operation by decreasing the risk of introducing errors and biases into machine learning models in production.
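To illustrate the kind of gateway the abstract describes, the sketch below runs a few basic quality checks on a small tabular dataset with pandas: duplicate rows, per-column missing-value ratios, and constant columns. This is a minimal illustration of the general idea, not the paper's metric set or its Deepchecks-based pipeline; the sample DataFrame and the `basic_quality_checks` helper are hypothetical.

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame) -> dict:
    """Illustrative checks of the sort a data quality gateway might run."""
    return {
        # Number of fully duplicated rows (beyond the first occurrence).
        "duplicate_rows": int(df.duplicated().sum()),
        # Fraction of missing values per column.
        "missing_ratio": {c: float(df[c].isna().mean()) for c in df.columns},
        # Columns holding a single value, which carry no signal for a model.
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }


# Hypothetical sample standing in for a Kaggle-style tabular dataset.
df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "city": ["Rome", "Rome", "Milan", "Rome"],
    "flag": [1, 1, 1, 1],
})
report = basic_quality_checks(df)
```

A pipeline could fail fast (e.g. refuse to train) when such a report crosses thresholds, which is the "gateway" role data quality checks play in the approach described above.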
Year: 2025
Keywords: data quality; Kaggle; machine learning; MLOps
Files in this item:
J Software Evolu Process - 2025 - Pancini - Engineering MLOps Pipelines With Data Quality A Case Study on Tabular Datasets.pdf

Open access
Size: 761.27 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1311146
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: 0