Data fusion with source authority and multiple truth

Azzalini F.;Piantella D.;Tanca L.
2019-01-01

Abstract

The abundance of data available on the Web makes it increasingly likely that different sources contain (partially or completely) different values for the same item. Data Fusion is the problem of discovering the true values of a data item when two entities representing it have been found and their values differ. Recent studies have shown that relying only on majority voting to find the true value of an object may give wrong results for up to 30% of the data items, since false values spread very easily because data sources frequently copy from one another. Therefore, the problem must be solved by assessing the quality of the sources and giving more importance to the values coming from trusted sources. State-of-the-art Data Fusion systems define source trustworthiness on the basis of the accuracy of the provided values and of the dependence on other sources. In this paper we propose an improved algorithm for Data Fusion that extends existing methods based on accuracy and correlation between sources by also taking into account source authority, defined on the basis of the knowledge of which sources copy from which ones. Our method has been designed to also work well in the multi-truth case, that is, when a data item can have multiple true values. Preliminary experimental results on a multi-truth real-world dataset show that our algorithm outperforms previous state-of-the-art approaches.
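
The contrast between majority voting and trust-weighted voting described in the abstract can be illustrated with a toy sketch. This is not the algorithm proposed in the paper: the source names, claimed values, and trust scores below are hypothetical, and the trust scores are simply assumed as given rather than estimated from accuracy, copying, or authority.

```python
from collections import defaultdict


def majority_vote(claims):
    """Return the value claimed by the largest number of sources.

    claims: dict mapping a source name to the value it provides.
    """
    counts = defaultdict(int)
    for value in claims.values():
        counts[value] += 1
    return max(counts, key=counts.get)


def weighted_vote(claims, trust):
    """Return the value with the highest total trust among its sources.

    trust: dict mapping a source name to an assumed trust score.
    """
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += trust[source]
    return max(scores, key=scores.get)


# Hypothetical scenario: s2 and s3 copy a false value from s1,
# while s4 and s5 independently provide the true value.
claims = {"s1": "Rome", "s2": "Rome", "s3": "Rome", "s4": "Milan", "s5": "Milan"}

# Assumed trust scores: copiers and their inaccurate origin are down-weighted.
trust = {"s1": 0.3, "s2": 0.1, "s3": 0.1, "s4": 0.9, "s5": 0.8}

majority_winner = majority_vote(claims)         # the copied value wins on count
weighted_winner = weighted_vote(claims, trust)  # the trusted sources win on weight
```

In this toy setting, copying inflates the count of the false value, so the two functions disagree. A multi-truth variant would have to return a set of values per item instead of a single winner.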
2019
CEUR Workshop Proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1207989