Article

Training Data Selection for Record Linkage Classification

Journal

SYMMETRY-BASEL
Volume 15, Issue 5

Publisher

MDPI
DOI: 10.3390/sym15051060

Keywords

record linkage; unsupervised random forest; similarity measure; training data

This paper presents a new approach to record linkage that focuses on creating high-quality training data. The approach uses an unsupervised random forest as a similarity measure to generate a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with the top and imbalanced construction being the most effective. Random forest with this construction produced an F-1 score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach improved the F-1 score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage.
This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs the unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with both balanced (symmetry) and imbalanced (asymmetry) distributions tested. The top and imbalanced construction was found to be the most effective in producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, and random forest with the top and imbalanced construction produced an F-1-score comparable to probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, the proposed approach using random forests and the top and imbalanced construction improved the F-1-score by 1% and recall by 6.45% compared to existing record linkage methods. By emphasising the creation of high-quality training data, this new approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
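
The abstract reports its gains as F-1-score and recall. For readers unfamiliar with these measures in a record linkage setting, the following is a minimal sketch of how they are computed from pair-level counts. The class name `LinkageCounts`, its field names, and the example counts are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageCounts:
    """Pair-level outcome counts for one linkage run (hypothetical container)."""
    true_matches: int    # predicted match, really a match
    false_matches: int   # predicted match, really a non-match
    missed_matches: int  # predicted non-match, really a match


def precision(c: LinkageCounts) -> float:
    """Share of predicted matches that are real matches."""
    predicted = c.true_matches + c.false_matches
    return c.true_matches / predicted if predicted else 0.0


def recall(c: LinkageCounts) -> float:
    """Share of real matches that were found."""
    actual = c.true_matches + c.missed_matches
    return c.true_matches / actual if actual else 0.0


def f1_score(c: LinkageCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only, not results from the paper.
example = LinkageCounts(true_matches=90, false_matches=10, missed_matches=30)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"f1={f1_score(example):.3f}")
```

Neither measure uses the count of correctly rejected non-match pairs. In record linkage the candidate pairs are typically dominated by non-matches, so plain accuracy can look high even when many true matches are missed, which is one reason F-1-score and recall are the usual headline figures.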
