Exploring the Influence of Data Characteristics on Machine Learning Outcomes
Camilla Sancricca;Cinzia Cappiello
2025-01-01
Abstract
Data-centric AI highlights the importance of high-quality input data in machine learning, as it is essential for achieving reliable and accurate results. To this end, traditional data quality assessment and improvement systems can help detect and address data errors, inconsistencies, or missing values. However, recent literature has demonstrated that factors beyond standard data quality issues can also compromise the performance of machine-learning applications. These factors relate to the characteristics of the datasets under consideration, such as their structure or their statistical and ethical aspects (e.g., possible biases or unfairness). This paper presents the results of a literature survey and proposes a quality model for data-centric AI. The model includes all the data characteristics that may undermine the execution of a machine-learning pipeline, together with their related metrics. Validation experiments demonstrate how these characteristics affect the performance of various classification algorithms, highlighting the model's relevance and applicability. We believe the proposed model can support the development of novel AI systems by helping data scientists assess the suitability of input data for specific machine-learning-based analyses.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


