Proceedings Paper

A Data Cleaning Method for CiteSeer Dataset

WEB INFORMATION SYSTEMS ENGINEERING - WISE 2016, PT I (2016)

Journal

WEB INFORMATION SYSTEMS ENGINEERING - WISE 2016, PT I

Volume 10041, Pages 35-49

Publisher

SPRINGER INTERNATIONAL PUBLISHING AG

DOI: 10.1007/978-3-319-48740-3_3

Keywords

Scholarly data; Record linkage; Data cleaning; Identification

Abstract

CiteSeer is considered the first academic search engine and has been serving data for almost twenty years. Recently, CiteSeer graciously made all of its data public, including raw PDF files, text transformed from the PDFs, and metadata extracted from the text. Numerous efforts have been made to improve the accuracy of the metadata extraction, but the problem is inherently challenging and errors are abundant. In this paper, we propose an innovative record-linkage-based method for data cleaning, which uses two new matching algorithms to significantly improve the cleaning performance for the CiteSeer dataset. One is an enhanced matching algorithm for local datasets; the other is developed for online datasets. Experimental results show that 48.1% of the wrong metadata entries can be corrected by our method in total, and that the improvement is more than 539% compared to existing state-of-the-art data cleaning methods.
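
The abstract does not describe the two matching algorithms, so the following is only a minimal sketch of the general record-linkage idea behind this kind of cleaning: an extracted, possibly erroneous metadata entry is matched against a cleaner reference dataset, and the matched record is used to correct it. The function names `normalize` and `best_match`, the 0.9 similarity threshold, the use of Python's `difflib`, and the toy records are all assumptions made for illustration and are not taken from the paper.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def best_match(noisy_title, reference, threshold=0.9):
    """Return the most similar reference record, or None if below threshold.

    Hypothetical helper: a plain string-similarity match, not the paper's
    enhanced local or online matching algorithm.
    """
    key = normalize(noisy_title)
    best, best_score = None, 0.0
    for record in reference:
        score = SequenceMatcher(None, key, normalize(record["title"])).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None


# Toy reference dataset (illustrative only).
reference = [
    {"title": "A Data Cleaning Method for CiteSeer Dataset", "year": 2016},
    {"title": "A Made-Up Title Used Only for This Example", "year": 2000},
]

# An extracted entry with a typo in the title and a missing year.
extracted = {"title": "A Data Cleaning Methd for CiteSeer Dataset.", "year": None}

match = best_match(extracted["title"], reference)
# Overwrite the extracted fields with the linked reference record, if any.
cleaned = dict(extracted, **match) if match else extracted
```

In this sketch an entry with no sufficiently similar reference record is left unchanged rather than corrected, which keeps a bad link from making the metadata worse.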
