
Blocking Techniques for Entity Linkage: A Semantics-Based Approach

Fabio Azzalini, Songle Jin, Letizia Tanca
2020

Abstract

Nowadays, data integration must often manage noisy data, frequently containing attribute values written in natural language, such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same real-world object. To reduce the dimension of the problem, modern Entity Linkage methods partition the initial search space into “blocks” of records that can be considered similar according to some metric, then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, differently from traditional methods, aim at capturing the semantic properties of the data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples into blocks, one of them adopts approximate nearest-neighbour algorithms, while the other uses dimensionality-reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus and then apply them directly to new datasets in an unsupervised fashion. This choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The results demonstrate that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
Keywords: Data integration · Entity linkage · Blocking · Deep learning
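The two-phase pipeline described in the abstract — embed records as real-valued vectors, then block them either by nearest-neighbour lookup or by dimensionality reduction plus clustering — can be sketched as follows. This is an illustrative stand-in, not the authors' implementation: TF-IDF vectors replace the learned tuple/sentence embeddings, exact k-NN (scikit-learn's `NearestNeighbors`) stands in for an approximate nearest-neighbour index, and truncated SVD plus k-means stand in for the dimensionality-reduction-and-clustering variant. The toy `records` list is invented for illustration.

```python
# Hedged sketch of embedding-based blocking (NOT the paper's actual code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

# Toy records: three real-world entities, each described twice with noise.
records = [
    "apple iphone 11 smartphone 64gb black",
    "iphone 11 64 gb apple black smartphone",
    "samsung galaxy s10 phone 128gb",
    "galaxy s10 samsung 128 gb",
    "dell xps 13 laptop 512gb ssd",
    "xps 13 dell notebook 512 gb",
]

# Phase 1: map each record to a real-valued vector (TF-IDF here,
# standing in for tuple/sentence embeddings).
vectors = TfidfVectorizer().fit_transform(records)

# Phase 2a: neighbourhood-based blocking -- each record's candidate block
# is its k nearest neighbours in embedding space (exact search here,
# standing in for an approximate nearest-neighbour index).
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)
_, neighbour_blocks = knn.kneighbors(vectors)

# Phase 2b: reduce dimensionality, then cluster; each cluster is a block.
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(vectors)
cluster_blocks = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(reduced)
```

With either strategy, matching record pairs should land in the same block, so that the expensive pairwise entity-linkage comparisons run only within blocks rather than over the full cartesian product.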
Files in this item:
Azzalini2021_Article_BlockingTechniquesForEntityLin.pdf
Open access — Publisher’s version
Size: 1.91 MB · Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1166612
Citations
  • Scopus: 7
  • Web of Science: 2