☆ 4.7 Article Proceedings Paper

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS (2009)

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

卷 6, 期 2, 页码 344-352

出版社

IEEE COMPUTER SOC

DOI: 10.1109/TCBB.2007.70272

关键词

Pattern recognition; clustering algorithm; genome application; parallel processing

类别

Biochemical Research Methods Computer Science, Interdisciplinary Applications Mathematics, Interdisciplinary Applications Statistics & Probability

向作者/读者索取更多资源

Protocol

社区支持

Reagent

社区支持

摘要

Large sets of bioinformatical data provide a challenge in time consumption while solving the cluster identification problem, and that is why a parallel algorithm is so needed for identifying dense clusters in a noisy background. Our algorithm works on a graph representation of the data set to be analyzed. It identifies clusters through the identification of densely intraconnected subgraphs. We have employed a minimum spanning tree (MST) representation of the graph and solve the cluster identification problem using this representation. The computational bottleneck of our algorithm is the construction of an MST of a graph, for which a parallel algorithm is employed. Our high-level strategy for the parallel MST construction algorithm is to first partition the graph, then construct MSTs for the partitioned subgraphs and auxiliary bipartite graphs based on the subgraphs, and finally merge these MSTs to derive an MST of the original graph. The computational results indicate that when running on 150 CPUs, our algorithm can solve a cluster identification problem on a data set with 1,000,000 data points almost 100 times faster than on single CPU, indicating that this program is capable of handling very large data clustering problems in an efficient manner. We have implemented the clustering algorithm as the software CLUMP.

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

出版社

IEEE COMPUTER SOC

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

Parallel Clustering Algorithm for Large Data Sets with Applications in Bioinformatics

期刊

IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS

出版社

IEEE COMPUTER SOC

关键词

类别

向作者/读者索取更多资源

Protocol

Reagent

作者

我是这篇论文的作者

评论

主要评分

次要评分

新颖性

重要性

科学严谨性

评价这篇论文

推荐

导出引文

分享论文