RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

Context-aware Data Quality Assessment for Big Data

Danilo Ardagna; Cinzia Cappiello; Walter Samà; Monica Vitali

2018-01-01

Abstract

Big data has changed the way in which we collect and analyze data. In particular, the amount of available information is constantly growing, and organizations rely more and more on data analysis to achieve competitive advantage. However, such an amount of data can create real value only if combined with quality: good decisions and actions are the result of correct, reliable, and complete data. In such a scenario, methods and techniques for data quality assessment can support the identification of suitable data to process. While numerous assessment methods have been proposed for traditional databases, in the big data scenario new algorithms have to be designed to deal with novel requirements related to variety, volume, and velocity. In particular, in this paper we highlight that dealing with heterogeneous sources requires an adaptive approach able to trigger the suitable quality assessment methods on the basis of the data type and the context in which the data are to be used. Furthermore, we show that in some situations it is not possible to evaluate the quality of the entire dataset due to performance and time constraints. For this reason, we suggest focusing the data quality assessment on only a portion of the dataset and taking into account the consequent loss of accuracy by introducing a confidence factor as a measure of the reliability of the quality assessment procedure. We propose a methodology to build a data quality adapter module, which selects the best configuration for the data quality assessment based on the user's main requirements: time minimization, confidence maximization, and budget minimization. Experiments are performed on real data gathered from a smart city case study.

Publication year: 2018
Journal title: FUTURE GENERATION COMPUTER SYSTEMS
Keywords: Big data; Data quality
Appears in types: 01.1 Journal Article

Files in this item:

File	Size	Format
FutureGeneration.pdf (open access; Pre-Print or Pre-Refereeing)	6.55 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1057520
