Exploring the Influence of Data Characteristics on Machine Learning Outcomes

Camilla Sancricca; Cinzia Cappiello
2025-01-01

Abstract

Data-centric AI emphasizes that high-quality input data is essential for reliable and accurate machine-learning results. To this end, traditional data quality assessment and improvement systems can help detect and address data errors, inconsistencies, and missing values. However, recent literature has shown that factors beyond standard data quality issues can also compromise the performance of machine-learning applications. These factors relate to characteristics of the datasets themselves, such as their structure and their statistical and ethical aspects (e.g., possible bias or unfairness). This paper presents the results of a literature survey and proposes a quality model for data-centric AI. The model covers the data characteristics that may undermine the execution of a machine-learning pipeline, together with their related metrics. Validation experiments show how these characteristics affect the performance of various classification algorithms, highlighting the model’s relevance and applicability. We believe the proposed model can support the development of novel AI systems by helping data scientists assess the suitability of input data for specific machine-learning analyses.
Year: 2025
Published in: Enterprise, Business-Process and Information Systems Modeling
ISBN: 9783031953965; 9783031953972
Keywords: Data Quality; Data-centric AI; Machine Learning

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1295825