RE.PUBLIC@POLIMI Research publications of the Politecnico di Milano

Comparison of Multi-site Neuroimaging Data Harmonization Techniques for Machine Learning Applications

Sampaio, Inês W.; Tassi, Emma; Bellani, Marcella; Benedetti, Francesco; Poletti, Sara; Spalletta, Gianfranco; Piras, Fabrizio; Bianchi, Anna Maria; Brambilla, Paolo; Maggioni, Eleonora

2023-01-01

Abstract

Multi-site datasets have become widely accessible to the research community, and their use in machine learning (ML) analyses is valued because it increases statistical power and improves model generalization. Nonetheless, variability associated with the data sources can confound ML models, so site effects should be removed beforehand in a data processing stage. In the present study, we explore multi-site harmonization from an ML analysis perspective. Using a multi-site neuroimaging dataset composed of healthy controls and subjects with bipolar disorder, we compared the efficacy of site harmonization based on linear regression and the ComBat model, applied either to the entire dataset or within the cross-validation (CV) framework used to evaluate ML models. We then trained a support vector machine (SVM) model for diagnosis classification and analyzed the impact of the harmonization strategies on its performance. The diagnosis classification AUC-ROC was comparable across harmonization strategies. This evidence indicates that CV-based ComBat effectively harmonizes multi-center data while avoiding information leakage into the test sets, supporting the use of this strategy in ML analyses. © 2023 IEEE.
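The CV-adapted idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the simpler linear-regression variant (not ComBat), does not preserve biological covariates such as age or diagnosis, and runs on synthetic data. All function names (fit_site_effects, remove_site_effects, harmonize_within_cv, site_mean_spread) are hypothetical. The point it shows is that site effects are estimated on the training folds only and then applied to the held-out fold, so no test-set information enters the harmonization step.

```python
import numpy as np


def one_hot_sites(sites, site_levels):
    """Dummy-code site labels against a fixed list of site levels."""
    return (np.asarray(sites)[:, None] == np.asarray(site_levels)[None, :]).astype(float)


def fit_site_effects(X_train, sites_train, site_levels):
    """Estimate per-feature site offsets from training data only.

    Regressing features on a full set of site dummies (no intercept) gives
    per-site means; the offset is the site mean minus the mean of site means.
    Assumes every site appears in the training data.
    """
    D = one_hot_sites(sites_train, site_levels)
    beta, *_ = np.linalg.lstsq(D, X_train, rcond=None)  # (n_sites, n_features)
    return beta - beta.mean(axis=0, keepdims=True)


def remove_site_effects(X, sites, site_levels, effects):
    """Subtract previously estimated site offsets from each sample."""
    return X - one_hot_sites(sites, site_levels) @ effects


def harmonize_within_cv(X, sites, n_folds=5, seed=0):
    """Return held-out features harmonized with train-fold estimates only."""
    rng = np.random.default_rng(seed)
    sites = np.asarray(sites)
    levels = np.unique(sites)
    # Site-stratified fold assignment so each site is present in every training split.
    fold_of = np.empty(len(sites), dtype=int)
    for s in levels:
        idx = rng.permutation(np.flatnonzero(sites == s))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    X_h = np.empty_like(X, dtype=float)
    for k in range(n_folds):
        test = fold_of == k
        effects = fit_site_effects(X[~test], sites[~test], levels)
        # In a full pipeline, the training fold would be harmonized with the
        # same offsets and a classifier fitted on it before scoring this fold.
        X_h[test] = remove_site_effects(X[test], sites[test], levels, effects)
    return X_h


def site_mean_spread(X, sites):
    """Largest gap between site means across all features."""
    sites = np.asarray(sites)
    means = np.stack([X[sites == s].mean(axis=0) for s in np.unique(sites)])
    return float((means.max(axis=0) - means.min(axis=0)).max())


# Synthetic example: three sites with additive offsets on four features.
rng = np.random.default_rng(42)
sites = np.repeat(["A", "B", "C"], 60)
offsets = {"A": 0.0, "B": 2.0, "C": -1.5}
X = rng.normal(size=(180, 4)) + np.array([offsets[s] for s in sites])[:, None]
X_h = harmonize_within_cv(X, sites)

print("site-mean spread before:", site_mean_spread(X, sites))
print("site-mean spread after: ", site_mean_spread(X_h, sites))
```

Fitting the same offsets on the entire dataset before splitting would correspond to the "entire dataset" strategy; the fold-wise version above trades a slightly noisier estimate for a test set that never influences the correction.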

Publication year: 2023
Book title: 20th International Conference on Smart Technologies, Proceedings
ISBN (International Standard Book Number): 978-1-6654-6397-3
Keywords: Confounders; Harmonization; Machine Learning; Multi-centric data; ComBat
Appears in types: 04.1 Contributo in Atti di convegno (conference proceedings contribution)

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1259412
