Article

False gene and chromosome losses in genome assemblies caused by GC content variation and repeats

Journal

GENOME BIOLOGY
Volume 23, Issue 1, Pages -

Publisher

BMC
DOI: 10.1186/s13059-022-02765-0

Keywords

Genomics; Gene structure; GC content; Genomic dark matter; Annotation

Funding

  1. Marine Biotechnology Program of the Korea Institute of Marine Science and Technology Promotion (KIMST) - Ministry of Ocean and Fisheries (MOF) [20180430]
  2. National Research Foundation of Korea (NRF) - Korea government (MSIT) [2021R1A2C2094111]
  3. Howard Hughes Medical Institute (HHMI)
  4. Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health [1ZIAHG200398]
  5. Wellcome grant [WT207492, WT206194]
  6. National Institute of Neurological Disorders and Stroke within the National Institutes of Health (NIH/NINDS) [R03 NS115145]
  7. National Research Foundation of Korea [2021R1A2C2094111] Funding Source: Korea Institute of Science & Technology Information (KISTI), National Science & Technology Information Service (NTIS)

This study evaluates how new vertebrate genome reference assemblies improve on previous assemblies of the same species. The authors find that up to 11% of genomic sequence is missing from the previous assemblies, and that the new reference assemblies reveal regulatory landscapes and protein-coding sequences that were previously underestimated.

Background: Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements.

Results: Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna's hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5′-proximal promoters and 5′ exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies.

Conclusions: Our findings reveal novel regulatory landscapes and protein-coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.
