Multilayer heuristics based clustering framework (MHCF) for author name disambiguation

Journal

SCIENTOMETRICS
Volume 126, Issue 9, Pages 7637-7678

Publisher

SPRINGER
DOI: 10.1007/s11192-021-04087-7

Keywords

Author name disambiguation (AND); Name ambiguity; Digital library (DL); Unsupervised machine learning approach; Clustering distinct authors

Author name ambiguity is a significant challenge for digital libraries and scholarly data search engines, affecting the accuracy of the authorship data they provide. Traditional solutions are complex and feature dependent, and they fail to effectively disambiguate authors who share a name but differ in citation counts. The proposed multi-layer heuristics-based clustering framework addresses this issue by using global and structure-aware features and by incorporating contextual information to group similar publications. Experimental results show that the framework outperforms existing approaches, achieving an overall pF1 of 93.3% with only three features.
Author name ambiguity is a nontrivial problem for digital libraries and scholarly data search engines, affecting the reliability of the authorship data they provide. Most existing solutions are complex, inflexible, and feature dependent; they target specific scenarios, rely on keyword-based similarities, and are ineffective at disambiguating authors who have fewer citations than others (with more publications) sharing the same name. This calls for a flexible name disambiguation framework that is simple, generic, and context aware, and that can effectively disambiguate authors who share a name but have varying numbers of citations. In this paper we propose a multi-layer heuristics-based clustering framework (MHCF). Global and structure-aware features are used to group publications together using our proposed Research2vec model. Unlike many heuristics-based multilayer approaches, our framework uses features with greater discriminating power, applied incrementally according to our proposed feature rank, to minimize false positives after each merge. Also, unlike other similar approaches, our framework uses contextual information to group similar publications rather than matching identical keywords. We have carefully evaluated our framework on three different datasets against two word-embedding-based approaches, two heuristics-based approaches, two hybrid approaches, and one graph-based approach. The results show that our framework outperforms all of them: MHCF-G (+5% pF1), MHCF-GL (+10% pF1), MDC (+12% pF1), HHC (+32% pF1), SAND-1 (+31%), SAND-2 (+22%), and GFAD (+18%). Our solution is also evaluated on our newly proposed dataset, 'CustAND', which covers more than 11 of the most discriminating features that are not available together in current AND datasets. The experimental results on the CustAND collection show that our framework achieves an overall pF1 of 93.3% with only three features, which further demonstrates its effectiveness.
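
The abstract reports its results as pF1, which in the author name disambiguation literature usually denotes pairwise F1: the harmonic mean of precision and recall computed over pairs of publications placed in the same cluster. The following is a minimal sketch of that metric, assuming the standard pairwise definition; the function names and the toy labels are illustrative and are not taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, whose labels match."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(true_authors, predicted_clusters):
    """Compute pairwise precision, recall and F1 for one ambiguous name.

    true_authors[i] is the real author of publication i;
    predicted_clusters[i] is the cluster id assigned to publication i.
    """
    if len(true_authors) != len(predicted_clusters):
        raise ValueError("label lists must have the same length")

    true_pairs = same_cluster_pairs(true_authors)
    pred_pairs = same_cluster_pairs(predicted_clusters)
    correct = len(true_pairs & pred_pairs)

    # Guard against empty pair sets (e.g. every publication is a singleton).
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example (hypothetical data): five publications, two real authors.
truth = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
precision, recall, f1 = pairwise_f1(truth, predicted)
print(precision, recall, f1)
```

In the toy example, one publication by author "a" is wrongly grouped with author "b", so only two of the four predicted pairs are correct and only two of the four true pairs are recovered. Whether the paper averages pF1 per ambiguous name or over all pairs in a dataset is not stated in the abstract.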
