Exploring the Influence of Data Characteristics on Machine Learning Outcomes

Sancricca, Camilla; Castiglione, Pasquale; Cappiello, Cinzia

doi:10.1007/978-3-031-95397-2_14

Data-centric AI emphasizes that high-quality input data is essential for achieving reliable and accurate machine-learning results. To this end, traditional data quality assessment and improvement systems can help detect and address data errors, inconsistencies, and missing values. However, recent literature has shown that factors beyond standard data quality issues can also compromise the performance of machine-learning applications. These factors relate to the characteristics of the datasets under consideration, such as their structure and their statistical and ethical aspects (e.g., possible biases or unfairness). This paper presents the results of a literature survey and proposes a quality model for data-centric AI. The model covers the data characteristics that may undermine the execution of a machine-learning pipeline, together with their related metrics. Validation experiments show how these characteristics affect the performance of various classification algorithms, highlighting the model’s relevance and applicability. We believe the proposed model can support the development of novel AI systems by helping data scientists assess the suitability of input data for specific machine-learning-based analyses.

2025-01-01

Year of publication: 2025
Book title: Enterprise, Business-Process and Information Systems Modeling
Series title: Lecture Notes in Business Information Processing
ISBN (International Standard Book Number): 9783031953965; 9783031953972
Keywords: Data Quality; Data-centric AI; Machine Learning
Publication type: 04.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1295825

RE.PUBLIC@POLIMI, research publications of the Politecnico di Milano
