An extensive study of C-SMOTE, a Continuous Synthetic Minority Oversampling Technique for Evolving Data Streams

Bernardo, Alessio; Della Valle, Emanuele
2022-01-01

Abstract

Streaming Machine Learning (SML) studies algorithms that update their models in a single pass over an unbounded and often non-stationary flow of data. Online class imbalance learning is a branch of SML that combines the challenges of both class imbalance and concept drift. In this paper, we investigate the binary classification problem by rebalancing an imbalanced stream of data in the presence of concept drift, accessing one sample at a time. We propose an extensive comparative study of the Continuous Synthetic Minority Oversampling Technique (C-SMOTE), inspired by the popular sampling technique SMOTE, used as a meta-strategy pipelined with SML classification algorithms. We benchmark C-SMOTE pipelines on both synthetic and real data streams containing different types of concept drift, different imbalance levels, and different class distributions. We provide statistical evidence that models learnt with C-SMOTE pipelines improve minority-class performance with respect to both the baseline models and state-of-the-art methods. We also perform a sensitivity analysis to assess the impact of C-SMOTE on majority-class performance for the three types of concept drift and several class distributions. Moreover, we present a computational cost analysis in terms of time and memory consumption.
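
To make the meta-strategy concrete, the sketch below shows, under stated assumptions, how a SMOTE-style rebalancing wrapper can be pipelined in front of an incremental classifier that is trained one sample at a time. It is an illustrative simplification, not the authors' C-SMOTE implementation (which is released for the MOA framework); the names StreamSMOTEWrapper, window_size, and k_neighbors are assumptions introduced here, and binary 0/1 labels are assumed.

```python
# Minimal sketch (assumption-laden): online, SMOTE-style rebalancing wrapper
# pipelined in front of an incremental scikit-learn classifier.
import numpy as np
from sklearn.linear_model import SGDClassifier


class StreamSMOTEWrapper:
    """Keeps a sliding window of minority-class samples and, for each new
    minority sample, trains the base learner on an extra synthetic sample
    obtained by interpolating towards a stored minority neighbour."""

    def __init__(self, base_learner, window_size=100, k_neighbors=5, seed=42):
        self.base = base_learner
        self.window_size = window_size        # minority samples to retain
        self.k = k_neighbors                  # neighbours eligible for interpolation
        self.rng = np.random.default_rng(seed)
        self.minority_window = []             # recent minority-class samples
        self.counts = {0: 0, 1: 0}            # per-class counters (binary labels assumed)

    def learn_one(self, x, y):
        x = np.asarray(x, dtype=float)
        self.counts[y] += 1
        minority = min(self.counts, key=self.counts.get)
        # Always train on the real sample first.
        self.base.partial_fit(x.reshape(1, -1), [y], classes=[0, 1])
        if y == minority and len(self.minority_window) >= 1:
            # SMOTE-style interpolation between the new sample and one of its
            # k nearest stored minority neighbours.
            dists = [np.linalg.norm(x - c) for c in self.minority_window]
            order = np.argsort(dists)[: self.k]
            neighbour = self.minority_window[order[self.rng.integers(len(order))]]
            synthetic = x + self.rng.random() * (neighbour - x)
            self.base.partial_fit(synthetic.reshape(1, -1), [minority])
        if y == minority:
            self.minority_window.append(x)
            self.minority_window = self.minority_window[-self.window_size:]

    def predict_one(self, x):
        return self.base.predict(np.asarray(x, dtype=float).reshape(1, -1))[0]


# Example usage: train on a toy stream with roughly 5% minority-class samples.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = StreamSMOTEWrapper(SGDClassifier())
    for _ in range(1000):
        y = int(rng.random() < 0.05)
        x = rng.normal(loc=2.0 * y, scale=1.0, size=3)
        model.learn_one(x, y)
```

In this sketch the wrapper is classifier-agnostic: any learner exposing partial_fit/predict can be plugged in, mirroring the idea of pipelining the rebalancing step with different SML classification algorithms.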
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1202064
Citations
  • PMC: N/A
  • Scopus: 17
  • Web of Science (ISI): 6