Addressing data scarcity in ML-based failure-cause identification in optical networks through generative models
Musumeci F.
2025-01-01
Abstract
We consider the issue of data scarcity with class imbalance in failure-cause identification for optical fiber systems using Machine Learning (ML) techniques. We use an open dataset comprising real Optical Time-Domain Reflectometer (OTDR) traces gathered in an artificial setup spanning tens of kilometers, consistent with a long-haul network. Whilst ML methods have shown satisfactory results in automating failure-cause identification in optical fiber networks, the solutions generally depend strongly on the available labeled datasets and require extensive data for training and validation. However, in failure management for optical networks, building a valuable dataset with sufficiently informative samples is generally hard because failures, by nature, occur infrequently. In addition, data labeling is time- and resource-intensive for domain experts. We therefore seek to mitigate these issues by exploring two generative models, namely a conditional Generative Adversarial Network (cGAN) and a conditional Variational Autoencoder (cVAE), to balance the number of failure samples in a multiclass dataset. To balance the dataset with accurate synthetic data across the different failure causes, we adopt generative models that are conditioned on the failure class, the signal-to-noise ratio (SNR) level of the trace, and the maximum amplitude of the signal. These approaches are compared to the Synthetic Minority Over-sampling Technique (SMOTE). We evaluate each approach by training an autoencoder classifier on the resulting dataset and testing it against three holdout datasets. Results show that, with the cGAN and cVAE, failure-cause identification can be improved by more than 5% in terms of global accuracy compared to the imbalanced dataset; in particular, for scarcely represented failure classes, our generative models improve the F1 scores by over 50%.

File | Size | Format | |
---|---|---|---|
Healy_oft2025.pdf (open access; Post-Print, DRAFT or Author's Accepted Manuscript, AAM) | 784.29 kB | Adobe PDF | |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.