☆ 4.4 Article

Distributed Representations of Tuples for Entity Resolution

PROCEEDINGS OF THE VLDB ENDOWMENT (2018)

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

卷 11, 期 11, 页码 1454-1467

出版社

ASSOC COMPUTING MACHINERY

DOI: 10.14778/3236187.3236198

关键词

类别

Computer Science, Information Systems Computer Science, Theory & Methods

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Despite the efforts in 70+ years in all aspects of entity resolution (ER), there is still a high demand for democratizing ER - by reducing the heavy human involvement in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representations of words (a.k.a. word embeddings), we present a novel ER system, called DEEPER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human efforts). We use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (ILSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. We propose a locality sensitive hashing (LSH) based blocking approach that takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DEEPER outperforms existing solutions.

Distributed Representations of Tuples for Entity Resolution

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Distributed Representations of Tuples for Entity Resolution

期刊

PROCEEDINGS OF THE VLDB ENDOWMENT

出版社

ASSOC COMPUTING MACHINERY

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文