Journal
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
Volume 15, Issue 4, Pages 1231-1238
Publisher
IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2015.2509997
Keywords
Phylogenomics; phylogenetics; population genomics; sequence data; data formatting; chromoplot; multiple sequence alignment
Funding
- US National Science Foundation [MCB-112705]
- Indiana University Genetics, Molecular and Cellular Sciences Training Grant [T32-GM007757]
Rapid progress in the fields of phylogenomics and population genomics has driven increases in both the size of multigenomic datasets and the number and complexity of genome-wide analyses. We present the Multisample Variant Format (MVF), specifically designed to store multiple sequence alignments for phylogenomic and population genomic analysis. The signature feature of MVF is a distinctive encoding of aligned sites with specific biological information content (e.g., invariant, low-coverage). This biological pattern-based encoding of sequence data allows for rapid filtering and quality control of data and speeds up computation for many analyses. Similar to other modern formats, MVF has a simple data structure and a flexible header structure to accommodate project metadata, allowing it to also serve as an effective data publication and sharing format. We also propose several variants of the MVF format to accommodate protein and codon alignments, quality scores, and a mix of de novo and reference-aligned data. Using the MVFtools package, MVF files can be converted from other common sequence formats. MVFtools completes tasks ranging from simple transformation and filtering operations to complex genome-wide visualizations in only a few minutes, even on large datasets. In addition to presenting MVF and MVFtools, we also discuss how the broader concept of using biological principles and patterns to inform sequence data encoding applies both to MVF and to other existing data formats.
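The idea of encoding aligned sites by their biological pattern can be illustrated with a minimal Python sketch. This is not the MVF specification and does not use MVFtools: the function names `encode_site`, `decode_site`, and `is_invariant` are hypothetical, and the only pattern handled is the invariant site, which is collapsed to a single character. It assumes each alignment column is given as a string with one character per sample.

```python
# Illustrative sketch only: a toy pattern-based site encoding,
# NOT the actual MVF encoding rules.

def encode_site(column: str) -> str:
    """Collapse an alignment column into a compact code.

    An invariant column (every sample has the same character) is
    stored as that single character; any other column is kept whole.
    """
    if len(set(column)) == 1:
        return column[0]
    return column


def decode_site(code: str, n_samples: int) -> str:
    """Expand a code back into a full column for n_samples samples."""
    if len(code) == 1:
        return code * n_samples
    return code


def is_invariant(code: str) -> bool:
    """An encoded site is invariant exactly when it is one character long."""
    return len(code) == 1


# Three aligned sites across four samples (made-up example data).
columns = ["AAAA", "ACAA", "GGGG"]
encoded = [encode_site(c) for c in columns]

# Filtering for variable sites needs only a length check on each code,
# with no per-sample comparison.
variable_sites = [code for code in encoded if not is_invariant(code)]
```

In this sketch the pattern is readable from the shape of the code itself, which is the property the abstract credits for rapid filtering and quality control.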