
Blocking Techniques for Entity Linkage: A Semantics-Based Approach

Fabio Azzalini, Songle Jin, Letizia Tanca
2020

Abstract

Nowadays, data integration must often manage noisy data, frequently containing attribute values written in natural language, such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same real-world object. To reduce the dimension of the problem, modern Entity Linkage methods partition the initial search space into “blocks” of records that can be considered similar according to some metric, then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, differently from traditional methods, aim at capturing the semantic properties of the data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples into blocks, one of them adopts approximate nearest-neighbour algorithms, while the other uses dimensionality-reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus and then apply them directly to new datasets in an unsupervised fashion. This choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The results demonstrate that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
Keywords: Data integration · Entity linkage · Blocking · Deep learning
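The two-phase pipeline described in the abstract — embed records as real-valued vectors, then block them either by nearest-neighbour lookup or by dimensionality reduction plus clustering — can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: TF-IDF vectors replace the learned tuple/sentence embeddings, exact k-NN (scikit-learn's `NearestNeighbors`) stands in for an approximate nearest-neighbour index, and truncated SVD plus k-means stand in for the dimensionality-reduction-and-clustering variant. The toy `records` list is invented for illustration.

```python
# Hedged sketch of embedding-based blocking (NOT the paper's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

# Toy records: three real-world entities, each described twice with noise.
records = [
    "apple iphone 11 smartphone 64gb black",
    "iphone 11 64 gb apple black smartphone",
    "samsung galaxy s10 phone 128gb",
    "galaxy s10 samsung 128 gb",
    "dell xps 13 laptop 512gb ssd",
    "xps 13 dell notebook 512 gb",
]

# Phase 1: map each record to a real-valued vector (TF-IDF here,
# standing in for tuple/sentence embeddings).
vectors = TfidfVectorizer().fit_transform(records)

# Phase 2a: neighbourhood-based blocking -- each record's candidate block
# is its k nearest neighbours in embedding space (exact search here,
# standing in for an approximate nearest-neighbour index).
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
_, neighbour_blocks = knn.kneighbors(vectors)

# Phase 2b: reduce dimensionality, then cluster; each cluster is a block.
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(vectors)
cluster_blocks = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
```

With either strategy, matching record pairs should land in the same block, so that the expensive pairwise entity-linkage comparisons run only within blocks rather than over the full cartesian product.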
Files in this item:
Azzalini2021_Article_BlockingTechniquesForEntityLin.pdf
Open access — Publisher’s version
Size: 1.91 MB · Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1166612
Citations
  • Scopus: 7
  • Web of Science: 2