BeRTo: An Efficient Spark-Based Tool for Linking Business Registries in Big Data Environments
Colombo, Andrea; Invernici, Francesco
2024-01-01
Abstract
Linking entities across different datasets is a crucial task for modern businesses. However, aligning entities becomes challenging when common identifiers are missing: the process must then rely on string-based attributes, such as names or addresses, which harms matching precision. At the same time, powerful general-purpose record linkage tools require users to clean and pre-process the initial data, which creates a bottleneck in the data integration activity and places a burden on users. Furthermore, scalability has become a relevant issue in modern big data environments, where large volumes of data arrive daily from external sources. This work presents BeRTo, a novel record linkage tool that addresses the problem of linking a specific type of data source: business registries, which contain information about companies and corporations. Although its domain-specific design limits its usability in other contexts, BeRTo reaches a new frontier in both precision and scalability, the latter because it is built on Spark. By integrating the pre-processing and cleaning steps in the same tool, it provides a user-friendly end-to-end pipeline that requires users only to input the raw data and set their preferred configuration, allowing them to favor either recall or precision.
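To make the idea of string-based linkage with a configurable precision/recall trade-off concrete, here is a minimal sketch in plain Python. It is not BeRTo's implementation: the legal-form list, the function names `normalize`, `similarity`, and `link`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. It also compares every pair of records, which does not scale; the abstract attributes BeRTo's scalability to Spark.

```python
import re
from difflib import SequenceMatcher

# Hypothetical set of legal-form tokens to strip; not taken from BeRTo.
LEGAL_FORMS = {"inc", "ltd", "llc", "spa", "srl", "gmbh", "corp", "co", "plc"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and remove legal-form tokens."""
    # Remove dots first so that "S.p.A." collapses to the single token "spa".
    lowered = name.lower().replace(".", "")
    tokens = re.sub(r"[^a-z0-9 ]", " ", lowered).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two cleaned company names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link(left: list[str], right: list[str], threshold: float) -> list[tuple[str, str, float]]:
    """Return (left, right, score) pairs whose similarity meets the threshold.

    A higher threshold favors precision; a lower one favors recall.
    """
    pairs = []
    for l_name in left:
        for r_name in right:
            score = similarity(l_name, r_name)
            if score >= threshold:
                pairs.append((l_name, r_name, score))
    return pairs


registry_a = ["ACME S.p.A.", "Rossi Costruzioni S.r.l."]
registry_b = ["Acme SPA", "Rossi Costruzione SRL", "Bianchi Ltd"]

strict = link(registry_a, registry_b, threshold=0.95)  # precision-oriented
loose = link(registry_a, registry_b, threshold=0.90)   # recall-oriented
```

With the strict threshold only the exact match after cleaning survives, while the looser threshold also accepts the pair that differs by one character. The threshold here plays the role of the user-set configuration mentioned in the abstract.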
Attached file: DATA_2024_33_CR.pdf (publisher's version, open access, Adobe PDF, 304.62 kB)