Workflow Characterization of a Big Data System Model for Healthcare Through Multiformalism

Dolci T.; Azzalini F.; Piantella D.; Barbierato E.; Gribaudo M.
2023-01-01

Abstract

The development of technologies such as cloud computing, IoT, and social networks has caused the amount of data generated daily to grow at an incredible rate, giving rise to the Big Data trend. Big Data has also emerged in healthcare, thanks to the introduction of new tools that produce massive amounts of structured and unstructured data. For this reason, medical institutions are moving towards data-driven healthcare, with the goal of leveraging this data to support clinical decision-making through suitable information systems. This brings the need to evaluate the performance of these systems. One commonly used technique is modeling, which consists of evaluating a model of the system under analysis without actually implementing the system. However, an adequate performance assessment of Big Data systems requires data with a variety of volumes and speeds, which is not available because of the sensitivity of healthcare data. While in other fields this problem is usually solved with synthetic data generators, in healthcare such generators are few and not specialized for performance evaluation. Therefore, this work focuses on the creation of a synthetic data generator for evaluating the performance of a Big Data system model for healthcare. The reference dataset used to build the generator is MIMIC-III, which contains the digital health records of thousands of patients collected over multiple years. First, we analyze the dataset, adopting multiple distribution fitting techniques (e.g., phase-type fitting) to model the temporal distribution of the data. Then, we develop a generator structured as a multi-module library to allow the customization of each component; specifically, we propose a multiformalism model to reproduce patient behavior inside the hospital. Finally, we test the generator by evaluating the performance in different scenarios. Through these experiments, we show the granular control that the generator offers over the synthetic data produced, and the simplicity with which it can be adapted to different uses.
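
The abstract mentions phase-type fitting of temporal distributions followed by synthetic data generation. As a rough illustration of that general idea only, the sketch below applies a textbook two-moment fit (an Erlang-k distribution when the squared coefficient of variation is at most 1, a balanced-means two-phase hyperexponential otherwise) to a list of inter-arrival times and then samples synthetic arrival timestamps from the fitted model. This is not the authors' generator or fitting procedure: the function names, the `max_phases` cap, and the input values are all hypothetical, and the paper may use a different fitting technique.

```python
import random
import statistics


def fit_phase_type(samples, max_phases=50):
    """Two-moment phase-type fit (illustrative, not the paper's method).

    Erlang-k matches the mean exactly and the variance approximately
    (its squared CV is 1/k); the hyperexponential matches both moments.
    """
    mean = statistics.fmean(samples)
    scv = statistics.pvariance(samples) / mean**2  # squared coefficient of variation
    if scv <= 1.0:
        # Low variability: k exponential phases in series (Erlang-k).
        k = max_phases if scv == 0 else min(max_phases, max(1, round(1.0 / scv)))
        return {"kind": "erlang", "k": k, "rate": k / mean}
    # High variability: two parallel exponential phases with balanced means.
    p = 0.5 * (1.0 + ((scv - 1.0) / (scv + 1.0)) ** 0.5)
    return {"kind": "h2", "p": p,
            "rate1": 2.0 * p / mean, "rate2": 2.0 * (1.0 - p) / mean}


def sample_interval(model, rng):
    """Draw one inter-arrival time from the fitted model."""
    if model["kind"] == "erlang":
        return sum(rng.expovariate(model["rate"]) for _ in range(model["k"]))
    rate = model["rate1"] if rng.random() < model["p"] else model["rate2"]
    return rng.expovariate(rate)


def synthetic_arrivals(model, n, seed=0):
    """Generate n increasing arrival timestamps starting from time 0."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n):
        t += sample_interval(model, rng)
        arrivals.append(t)
    return arrivals


# Hypothetical inter-arrival times (e.g., hours between admissions).
observed_gaps = [1.0, 1.0, 3.0, 3.0]
model = fit_phase_type(observed_gaps)
timestamps = synthetic_arrivals(model, n=5, seed=42)
```

Scaling the fitted rates up or down is one simple way such a sketch could vary the speed of the synthetic stream while keeping the shape of the distribution.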
Year: 2023
Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISBN: 978-3-031-43184-5, 978-3-031-43185-2
Keywords: Big Data; healthcare data; performance evaluation; synthetic data generation

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1257604