Efficient binary embedding of categorical data using BinSketch

DATA MINING AND KNOWLEDGE DISCOVERY (2022)

Journal

DATA MINING AND KNOWLEDGE DISCOVERY

Volume 36, Issue 2, Pages 537-565

Publisher

SPRINGER

DOI: 10.1007/s10618-021-00815-y

Keywords

Dimensionality reduction; Sketching; Feature hashing; Clustering; Categorical data

Categories

Computer Science, Artificial Intelligence; Computer Science, Information Systems

Abstract

This paper presents a dimensionality reduction algorithm for categorical datasets, which constructs low-dimensional binary sketches from high-dimensional categorical vectors and approximates the Hamming distance between any two original vectors. The approach is particularly useful for sparse datasets and has been rigorously analyzed and experimentally validated.

In this work, we present a dimensionality reduction (sketching) algorithm for categorical datasets. Our proposed sketching algorithm, Cabin, constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm, Cham, computes a close approximation of the Hamming distance between any two original vectors from their sketches alone. The minimum sketch dimension that Cham requires for a good estimate depends, in theory, only on the sparsity of the data points, which makes the approach useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world data sets, including one with over a million dimensions. We show that the Cabin and Cham duo is a fast and accurate approach for tasks such as RMSE, all-pair similarity, and clustering when compared with working on the full dataset and with other dimensionality reduction techniques.
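The general idea of estimating a Hamming distance from short binary sketches can be illustrated with a small sketch in Python. This is not the paper's Cabin or Cham: it covers binary vectors only, omits any categorical-to-binary encoding, and swaps in a generic OR-folding sketch with an occupancy-inversion estimator. The function names (`make_bucket_map`, `sketch`, `estimate_weight`, `estimate_hamming`) and all parameter values are hypothetical, chosen purely for illustration.

```python
import math
import random


def make_bucket_map(d, n, seed=0):
    """Randomly assign each of the d original dimensions to one of n sketch bits."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(d)]


def sketch(ones, bucket_map):
    """OR-fold a sparse binary vector (given as the indices of its 1s) into an int bitset."""
    bits = 0
    for i in ones:
        bits |= 1 << bucket_map[i]
    return bits


def popcount(bits):
    """Number of set bits in a non-negative int."""
    return bin(bits).count("1")


def estimate_weight(set_bits, n):
    """Invert E[set bits] = n * (1 - (1 - 1/n)**a) to estimate the number of 1s, a."""
    if set_bits >= n:
        raise ValueError("sketch is saturated; use a larger n")
    return math.log(1.0 - set_bits / n) / math.log(1.0 - 1.0 / n)


def estimate_hamming(sketch_u, sketch_v, n):
    """Estimate Hamming(u, v) = 2*|u OR v| - |u| - |v| using only the two sketches.

    The sketch of (u OR v) equals sketch_u | sketch_v because OR-folding
    commutes with bitwise OR.
    """
    w_u = estimate_weight(popcount(sketch_u), n)
    w_v = estimate_weight(popcount(sketch_v), n)
    w_union = estimate_weight(popcount(sketch_u | sketch_v), n)
    return 2.0 * w_union - w_u - w_v


# Illustrative data: 100,000 dimensions, 100 ones per vector, 60 of them shared,
# so the true Hamming distance is 40 + 40 = 80.
d, n = 100_000, 4096
bucket_map = make_bucket_map(d, n, seed=3)
indices = random.Random(7).sample(range(d), 140)
u, v = indices[:100], indices[40:]
print("estimated Hamming distance:", estimate_hamming(sketch(u, bucket_map), sketch(v, bucket_map), n))
```

In this toy setting the sketch length `n` is chosen relative to the number of 1s per vector rather than the ambient dimension `d`, which mirrors the abstract's point that the required sketch size depends on sparsity.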
