Journal
DATA MINING AND KNOWLEDGE DISCOVERY
Volume 37, Issue 1, Pages 67-109
Publisher
SPRINGER
DOI: 10.1007/s10618-022-00869-6
Keywords
k-means; GPGPU; Data mining; Clustering
The paper presents ASB K-means, a fast and memory-efficient GPU-based algorithm for exact k-means clustering. The algorithm can handle datasets larger than GPU memory, and it outperforms standard GPU-based k-means implementations in both speed and memory efficiency.
The k-means algorithm is widely used for clustering, compressing, and summarizing vector data. We present a fast and memory-efficient GPU-based algorithm for exact k-means, Asynchronous Selective Batched K-means (ASB K-means). Unlike most GPU-based k-means algorithms, which require loading the whole dataset onto the GPU, the amount of GPU memory required to run our algorithm can be chosen to be much smaller than the size of the whole dataset. Thus, our algorithm can cluster datasets whose size exceeds the available GPU memory. The algorithm works in a batched fashion and applies the triangle inequality in each k-means iteration to omit a data point if its membership assignment, i.e., the cluster it belongs to, remains unchanged. This significantly reduces the number of data points that need to be transferred between the CPU's RAM and the GPU's global memory, enabling the algorithm to process large datasets very efficiently. Our algorithm can be substantially faster than a GPU-based implementation of standard k-means even when the standard algorithm is feasible because the whole dataset fits into GPU memory. Experiments show that ASB K-means can run up to 15× faster than a standard GPU-based implementation of k-means, and it also outperforms the GPU-based k-means implementation in NVIDIA's open-source RAPIDS machine learning library on all the datasets used in our experiments.
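The triangle-inequality filter the abstract describes can be illustrated with a Hamerly-style bound check: each point keeps an upper bound on the distance to its own center and a lower bound on the distance to every other center, and when the bounds prove the assignment cannot change, the point is skipped. The sketch below is a hypothetical CPU/numpy illustration of that filtering idea only; the paper's actual algorithm additionally batches points and overlaps CPU–GPU transfers asynchronously, which is not shown here.

```python
import numpy as np

def filtered_kmeans(X, k, n_iter=20, seed=0):
    """Exact k-means with a triangle-inequality filter (Hamerly-style bounds).
    Illustration of the skipping idea from ASB K-means, not the full algorithm:
    points whose bounds prove their cluster cannot change are not re-examined,
    which in the GPU setting means they need not be transferred at all."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    # Initial exact assignment; derive the two bounds from true distances.
    D = np.linalg.norm(X[:, None] - C[None], axis=2)
    assign = D.argmin(1)
    rows = np.arange(len(X))
    upper = D[rows, assign]            # upper bound: distance to own center
    D[rows, assign] = np.inf
    lower = D.min(1)                   # lower bound: distance to 2nd-closest
    for _ in range(n_iter):
        # Update step: recompute each center (keep old one if cluster empties).
        newC = np.vstack([X[assign == j].mean(0) if np.any(assign == j) else C[j]
                          for j in range(k)])
        shift = np.linalg.norm(newC - C, axis=1)   # how far each center moved
        C = newC
        # Triangle inequality keeps the bounds valid after the centers move.
        upper += shift[assign]
        lower -= shift.max()
        # Only points whose bounds overlap need exact distances recomputed.
        active = upper >= lower
        if active.any():
            Da = np.linalg.norm(X[active, None] - C[None], axis=2)
            r = np.arange(Da.shape[0])
            j = Da.argmin(1)
            assign[active] = j
            upper[active] = Da[r, j]
            Da[r, j] = np.inf
            lower[active] = Da.min(1)
    return assign, C
```

Because the filter only skips points whose assignment provably cannot change, the result is identical to standard Lloyd's iterations from the same initialization; the saving is in how many points must be touched (and, in the batched GPU setting, transferred) per iteration.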