Context-aware Data Quality Assessment for Big Data

Danilo Ardagna;Cinzia Cappiello;Monica Vitali
2018-01-01

Abstract

Big data has changed the way we collect and analyze data. The amount of available information is constantly growing, and organizations rely more and more on data analysis to achieve a competitive advantage. However, such an amount of data creates real value only if combined with quality: good decisions and actions result from correct, reliable, and complete data. In such a scenario, methods and techniques for data quality assessment can support the identification of suitable data to process. While numerous assessment methods have been proposed for traditional databases, the big data scenario calls for new algorithms that deal with novel requirements related to variety, volume, and velocity. In particular, in this paper we highlight that dealing with heterogeneous sources requires an adaptive approach able to trigger suitable quality assessment methods on the basis of the data type and the context in which the data are to be used. Furthermore, we show that in some situations it is not possible to evaluate the quality of the entire dataset due to performance and time constraints. For this reason, we suggest focusing the data quality assessment on only a portion of the dataset and accounting for the consequent loss of accuracy by introducing a confidence factor as a measure of the reliability of the quality assessment procedure. We propose a methodology to build a data quality adapter module that selects the best configuration for the data quality assessment based on the user's main requirements: time minimization, confidence maximization, and budget minimization. Experiments are performed on real data gathered from a smart city case study.
Keywords: Big data; Data quality
Files in this record:
FutureGeneration.pdf: open access, pre-print (pre-refereeing) version, Adobe PDF, 6.55 MB

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1057520