
Engineering MLOps Pipelines With Data Quality: A Case Study on Tabular Datasets in Kaggle

Matteo Camilli
2025-01-01

Abstract

Ensuring high-quality data is crucial for the successful deployment of machine learning models and for sustaining the operational pipelines built around them. However, many practitioners do not currently use data quality checks or measurements as gateways for model construction and operationalization, indicating a need for greater awareness and adoption of such tools. In this study, we propose an approach for automating the construction of machine learning pipelines by means of (semi-)automated data quality checks. We focus on tabular data, the most widely used structured data format in such pipelines. Our work builds on a subset of quality metrics that are particularly relevant in machine learning operations (MLOps) pipelines, identified through our engagement with expert practitioners. From a cohort of similar tools, we selected Deepchecks, a well-known tool for conducting data quality checks, to evaluate the quality of datasets collected from Kaggle, a widely used platform for machine learning competitions and data science projects. We also analyze the main features Kaggle uses to rank its datasets and use them to validate the relevance of our approach. Our results show the potential of automated data quality checks to improve the efficiency and effectiveness of MLOps pipelines and their operation by decreasing the risk of introducing errors and biases into machine learning models in production.
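To illustrate the kind of gateway the abstract describes, the sketch below runs a few basic quality checks on a small tabular dataset with pandas: duplicate rows, per-column missing-value ratios, and constant columns. This is a minimal illustration of the general idea, not the paper's metric set or its Deepchecks-based pipeline; the sample DataFrame and the `basic_quality_checks` helper are hypothetical.

```python
import pandas as pd


def basic_quality_checks(df: pd.DataFrame) -> dict:
    """Illustrative checks of the sort a data quality gateway might run."""
    return {
        # Number of fully duplicated rows (beyond the first occurrence).
        "duplicate_rows": int(df.duplicated().sum()),
        # Fraction of missing values per column.
        "missing_ratio": {c: float(df[c].isna().mean()) for c in df.columns},
        # Columns holding a single value, which carry no signal for a model.
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }


# Hypothetical sample standing in for a Kaggle-style tabular dataset.
df = pd.DataFrame({
    "age": [25, 25, None, 40],
    "city": ["Rome", "Rome", "Milan", "Rome"],
    "flag": [1, 1, 1, 1],
})
report = basic_quality_checks(df)
```

A pipeline could fail fast (e.g. refuse to train) when such a report crosses thresholds, which is the "gateway" role data quality checks play in the approach described above.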
Year: 2025
Keywords: data quality; Kaggle; machine learning; MLOps
Files in this item:
J Software Evolu Process - 2025 - Pancini - Engineering MLOps Pipelines With Data Quality A Case Study on Tabular Datasets.pdf

Open access
Size: 761.27 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1311146
Citations
  • PMC: ND
  • Scopus: 0
  • Web of Science: 0