Are clusters found in one dataset present in another dataset?

BIOSTATISTICS (2007)

Journal

BIOSTATISTICS

Volume 8, Issue 1, Pages 9-31

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/biostatistics/kxj029

Keywords

breast cancer subtypes; cluster validation; in-group proportion; prediction accuracy

Categories

Mathematical & Computational Biology; Statistics & Probability

Funding

DIVISION OF HEART AND VASCULAR DISEASES [N01HV028183] Funding Source: NIH RePORTER
NHLBI NIH HHS [N01-HV-28183] Funding Source: Medline

Abstract

In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be reproducible and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the in-group proportion (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare four versions of the validation procedure, all of which use the IGP but differ in the way the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and that one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is available in a package called clusterRepro through The Comprehensive R Archive Network (http://cran.r-project.org).
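The abstract names the in-group proportion but does not spell out how it is computed. The following is a minimal sketch, assuming the IGP of a cluster is the fraction of new samples classified to that cluster whose nearest neighbour in the new dataset is classified to the same cluster, and assuming Pearson correlation is used both for nearest-centroid classification and for nearest-neighbour search. The function names, the toy data, and the handling of flat profiles are illustrative assumptions, not the clusterRepro API.

```python
from math import sqrt
from typing import Dict, List, Sequence

Profile = Sequence[float]


def pearson(x: Profile, y: Profile) -> float:
    """Centered Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def classify(samples: Sequence[Profile], centroids: Sequence[Profile]) -> List[int]:
    """Assign each new sample to the centroid it correlates with most strongly."""
    return [
        max(range(len(centroids)), key=lambda k: pearson(s, centroids[k]))
        for s in samples
    ]


def in_group_proportion(samples: Sequence[Profile], labels: Sequence[int]) -> Dict[int, float]:
    """Per cluster: share of members whose nearest neighbour has the same label."""
    hits: Dict[int, int] = {}
    sizes: Dict[int, int] = {}
    for i, s in enumerate(samples):
        # Nearest neighbour = the other sample with the highest correlation.
        nn = max(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: pearson(s, samples[j]),
        )
        sizes[labels[i]] = sizes.get(labels[i], 0) + 1
        if labels[nn] == labels[i]:
            hits[labels[i]] = hits.get(labels[i], 0) + 1
    return {u: hits.get(u, 0) / sizes[u] for u in sizes}


# Hypothetical toy data: two centroids from the original dataset and four
# samples from a new dataset (rows are samples, columns are genes).
centroids = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
new_data = [
    [1.1, 2.0, 2.9, 4.2],
    [0.9, 2.2, 3.1, 3.8],
    [4.1, 2.9, 2.2, 1.0],
    [3.8, 3.1, 1.9, 1.2],
]

labels = classify(new_data, centroids)
igp = in_group_proportion(new_data, labels)

# A deliberately poor labelling, to show the IGP dropping for both clusters.
igp_bad = in_group_proportion(new_data, [0, 1, 1, 1])
```

With well-separated toy data every sample's nearest neighbour shares its label, so both clusters score an IGP of 1; mislabelling one sample lowers the IGP of both affected clusters. The validation procedure in the abstract compares such observed values against a null distribution, which this sketch does not attempt to reproduce.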
