HEALER: A Data Lake Architecture for Healthcare

Manco C.; Dolci T.; Azzalini F.; Barbierato E.; Gribaudo M.; Tanca L.
2023-01-01

Abstract

With the growth of the Internet of Things and the rapid progress of social networks, everything appears to generate data. The ever-increasing number of connected devices is accompanied by a growing volume of data, produced at an ever-increasing rate, and this massive flow includes data types that are difficult to process using standard database techniques. One of the most critical scenarios is healthcare, whose activities need to store and manage a variety of data types (reports written in natural language, medical images, genomic data, and waveforms of vital signs) that do not have a well-defined structure. To benefit from this large amount of complex data, Data Lakes have recently emerged as a solution that provides central storage and flexible analysis for all types of data. However, no single Data Lake architecture fits all possible scenarios, since the architecture depends heavily on the application domain and, so far, there are no Data Lake architectures that support the specific needs of the healthcare domain. This work proposes HEALER, a Data Lake architecture that effectively performs data ingestion, data storage, and data access, with the aim of providing a single central repository for the efficient storage of different types of healthcare data. The architecture also enables the analysis and querying of the data, which can be loaded into the Data Lake regardless of their format and type. To verify the effectiveness of the architecture, a proof-of-concept of HEALER has been developed that allows the ingestion of various data, performs waveform processing to make the waveforms more interpretable to researchers and analysts, grants access to the saved data, and allows the analysis of natural language reports. Finally, we studied the performance of the system in each of its main phases: ingestion, processing, data access, and analysis. The results lead us to some important considerations to be taken into account when using and configuring the system components.

Year: 2023
Venue: CEUR Workshop Proceedings
Keywords: Apache NiFi; Data Lakes; Hadoop Distributed File System; medical data; waveforms

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1257598