Article

SATe-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees

SYSTEMATIC BIOLOGY (2012)

Journal

SYSTEMATIC BIOLOGY

Volume 61, Issue 1, Pages 90-106

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/sysbio/syr095

Keywords

Alignment; maximum likelihood; phylogenetics; SATe

Category

Evolutionary Biology

Funding

German Science Foundation (Deutsche Forschungsgemeinschaft)
US National Science Foundation [ITR-0331453 (CIPRES), ATOL-0733029, ATOL-0732920, EIA-0303609]
Guggenheim Foundation
Microsoft Research New England

Abstract

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted, but currently only SATe estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATe algorithm that improves upon SATe (which we now call SATe-I) in terms of speed and of phylogenetic and alignment accuracy. SATe-II uses a different divide-and-conquer strategy than SATe-I and so produces smaller, more closely related subsets than SATe-I; as a result, SATe-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATe-I. Generally, SATe is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATe-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATe-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATe's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense: for all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATe-II and SATe-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.
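
The intuition behind the first line of evidence can be illustrated with a small numerical sketch. This is not the proof given in the paper, only a toy example under stated assumptions: the two four-taxon trees, their branch lengths, and the helper names (`jc69_prob`, `partial`, `site_likelihood`) are all hypothetical. The code computes a single-site Jukes-Cantor likelihood with Felsenstein's pruning algorithm, treating a gap as missing data (every state is allowed at that leaf). For an alignment column containing only one non-gap character, the likelihood comes out as 0.25 on both trees, so such a column carries no information about the tree; a fully populated column, by contrast, scores differently on the two trees.

```python
import math

NUCS = "ACGT"

def jc69_prob(same, t):
    """Jukes-Cantor transition probability over a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial(node, column):
    """Felsenstein pruning: conditional likelihoods [A, C, G, T] at a node.

    A node is either a leaf name (str) or a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        ch = column[node]
        if ch == "-":
            return [1.0] * 4  # gap treated as missing data: any state allowed
        return [1.0 if n == ch else 0.0 for n in NUCS]
    result = [1.0] * 4
    for child, t in node:
        child_l = partial(child, column)
        for i in range(4):
            result[i] *= sum(jc69_prob(i == j, t) * child_l[j] for j in range(4))
    return result

def site_likelihood(tree, column):
    """Site likelihood with uniform (Jukes-Cantor) root frequencies."""
    return sum(0.25 * x for x in partial(tree, column))

# Two hypothetical topologies with arbitrary branch lengths.
tree1 = (((("a", 0.1), ("b", 0.3)), 0.2), ((("c", 0.05), ("d", 0.4)), 0.2))
tree2 = (((("a", 1.0), ("c", 0.01)), 0.5), ((("b", 2.0), ("d", 0.7)), 0.1))

# A column with a single non-gap character versus a fully populated column.
single = {"a": "G", "b": "-", "c": "-", "d": "-"}
full = {"a": "G", "b": "G", "c": "T", "d": "T"}

sparse = [site_likelihood(t, single) for t in (tree1, tree2)]
dense = [site_likelihood(t, full) for t in (tree1, tree2)]

print("single-character column:", sparse)  # tree-independent
print("fully populated column:", dense)    # depends on the tree
```

An alignment that spreads characters across columns dominated by gaps therefore yields a likelihood score that cannot distinguish between candidate trees, which is consistent with the abstract's conclusion that the ML criterion itself is not what drives SATe's accuracy.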
