A Deep-Learning-Based Blocking Technique for Entity Linkage
F. Azzalini;L. Tanca
2021-01-01
Abstract
Data integration must now often handle noisy data, including attribute values written in natural language, such as product descriptions or book reviews. Entity Linkage identifies the records that refer to the same real-world object. To reduce the size of the problem, modern Entity Linkage methods partition the initial search space into “Blocks” of records that are similar according to some metric, so that only records within the same block need to be compared; this greatly reduces the overall complexity of the algorithm. We propose a Blocking strategy that, unlike traditional methods, aims to capture the semantic properties of the data by means of recent Deep Learning frameworks. This paper is mainly inspired by a recent work on Entity Linkage whose authors were among the first to investigate the application of tuple embeddings to data integration problems. We extend their method by adopting an unsupervised approach: our blocking model is trained on an external corpus and then applied to new datasets, following a “transfer learning” paradigm. This choice is motivated by the fact that, in most data integration scenarios, no training data is actually available. With this semi-automatic approach to blocking, our model, once trained on an external corpus, can be applied directly to any data integration problem. We tested our system on six popular datasets and compared its performance against five traditional blocking algorithms. The results show that our deep-learning-based blocking solution outperforms standard blocking algorithms, especially on textual and noisy data.
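The general idea of blocking over record vectors can be illustrated with a minimal sketch. It is not the model described in the paper: the bag-of-words `embed` function is a toy stand-in for learned tuple embeddings, and the names `embed`, `cosine`, `candidate_pairs` and the threshold value are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def embed(record: str) -> Counter:
    """Toy stand-in for a learned tuple embedding: a bag-of-words vector."""
    return Counter(record.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def candidate_pairs(records: list[str], threshold: float = 0.5) -> set[tuple[int, int]]:
    """Keep only the record pairs whose vectors are similar enough.

    Only these pairs would be passed on to the (expensive) matching step.
    This toy version scans all pairs for clarity; a practical system would
    avoid that scan, for example with hashing or a nearest-neighbour index.
    """
    vectors = [embed(r) for r in records]
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    }


records = [
    "apple iphone 12 64gb black",
    "iphone 12 apple 64gb",
    "harry potter book review",
    "review of harry potter book",
]
print(sorted(candidate_pairs(records)))  # [(0, 1), (2, 3)]
```

In this example only two of the six possible record pairs survive as candidates, which is the reduction in comparisons that blocking is meant to provide.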