Article

Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data

Journal

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
Volume 103, Issue 483, Pages 1281-1293

Publisher

AMER STATISTICAL ASSOC
DOI: 10.1198/016214508000000454

Keywords

Clustering; High-dimension low-sample data; k-means; Microarray gene expression data; p value; Statistical significance

Funding

  1. National Science Foundation (NSF) [DMS 0747575, DMS 0606577]
  2. National Institutes of Health [K12 RR023248]
  3. U.S. Environmental Protection Agency [RD-83272001]

Clustering methods provide a powerful tool for the exploratory analysis of high-dimension, low-sample size (HDLSS) data sets, such as gene expression microarray data. A fundamental statistical issue in clustering is which clusters are "really there", as opposed to being artifacts of the natural sampling variation. We propose SigClust as a simple and natural approach to this fundamental statistical problem. In particular, we define a cluster as data coming from a single Gaussian distribution and formulate the problem of assessing statistical significance of clustering as a testing procedure. This Gaussian null assumption allows direct formulation of p values that effectively quantify the significance of a given clustering. HDLSS covariance estimation for SigClust is achieved by a combination of invariance principles, together with a factor analysis model. The properties of SigClust are studied. Simulated examples, as well as an application to a real cancer microarray data set, show that the proposed method works remarkably well for assessing significance of clustering. Some theoretical results also are obtained.
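
The testing procedure described in the abstract can be illustrated with a rough Python sketch. It is not the authors' implementation, and several pieces are assumptions that the abstract does not state: the test statistic is taken to be a 2-means cluster index (within-cluster sum of squares divided by total sum of squares), the null Gaussian uses the raw sample covariance eigenvalues in place of the factor analysis model mentioned above, and the p value is a plain Monte Carlo proportion. The function names `two_means_cluster_index` and `sigclust_sketch` are hypothetical.

```python
import numpy as np


def two_means_cluster_index(X, n_init=5, n_iter=50, rng=None):
    """Smallest within-cluster / total sum-of-squares ratio over 2-means runs.

    Assumption: this ratio serves as the clustering-strength statistic.
    Smaller values indicate a stronger two-cluster split.
    """
    rng = np.random.default_rng(rng)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    best = np.inf
    for _ in range(n_init):
        # Start Lloyd's algorithm from two distinct observations.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():
                break  # one cluster is empty; abandon this start
            new_centers = np.array(
                [X[labels == k].mean(axis=0) for k in (0, 1)]
            )
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum() / total_ss)
    return best


def sigclust_sketch(X, n_sim=100, seed=0):
    """Monte Carlo p value for H0: the rows of X come from one Gaussian.

    Simplification (assumption): null eigenvalues are the sample covariance
    eigenvalues, not estimates from a factor analysis model.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    observed = two_means_cluster_index(X, rng=rng)

    # The statistic is unchanged by shifts and rotations, so the null data
    # can be simulated in the eigenvector basis with a diagonal covariance.
    singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    eigvals = singular_values ** 2 / (n - 1)

    null_stats = np.empty(n_sim)
    for b in range(n_sim):
        Z = rng.standard_normal((n, eigvals.size)) * np.sqrt(eigvals)
        null_stats[b] = two_means_cluster_index(Z, rng=rng)

    # Proportion of null statistics at least as extreme (small) as observed.
    p_value = (1 + np.sum(null_stats <= observed)) / (1 + n_sim)
    return observed, p_value


if __name__ == "__main__":
    demo_rng = np.random.default_rng(1)
    # Toy HDLSS data: 20 samples, 50 dimensions, two shifted groups.
    group_a = demo_rng.standard_normal((10, 50)) + 3.0
    group_b = demo_rng.standard_normal((10, 50)) - 3.0
    ci, p = sigclust_sketch(np.vstack([group_a, group_b]))
    print(f"cluster index = {ci:.3f}, p value = {p:.4f}")
```

A small p value suggests that the observed split is stronger than a single Gaussian with the same covariance spectrum would typically produce. Using raw sample eigenvalues is only a convenience here; with fewer samples than dimensions they are a poor estimate of the true spectrum, which is presumably why the paper introduces a dedicated covariance estimation step.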
