4.8 Article

A sensitive repeat identification framework based on short and long reads

Journal

NUCLEIC ACIDS RESEARCH
Volume 49, Issue 17, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/nar/gkab563

Keywords

-

Funding

  1. National Natural Science Foundation of China [62002388, 61772557]
  2. NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization [U1909208]
  3. Hunan Provincial Science and Technology Program [2018wk4001]
  4. 111 Project [B18059]
  5. King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) [BAS/1/1624-01, FCC/1/1976-18-01, FCC/1/1976-23-01, FCC/1/1976-25-01, FCC/1/1976-26-01, REI/1/0018-01-01, REI/1/4216-01-01, REI/1/4437-01-01, REI/1/4473-01-01, URF/1/4352-01-01, REI/1/4742-01-01, URF/1/4098-01-01]

LongRepMarker is a novel framework for precisely marking long repeats in genomes based on global de novo assembly and k-mer based multiple sequence alignment. By introducing barcode linked reads and unique k-mers, it achieves better efficiency and accuracy in identifying repeats, outperforming existing methods in experimental results.
Numerous studies have shown that repetitive regions in genomes play indispensable roles in the evolution, inheritance and variation of living organisms. However, most existing methods cannot identify repeats satisfactorily in terms of both accuracy and size, because NGS reads are too short to identify long repeats, whereas SMS (single-molecule sequencing) long reads have high error rates. In this study, we present a novel identification framework, LongRepMarker, based on global de novo assembly and k-mer based multiple sequence alignment, for precisely marking long repeats in genomes. The major characteristics of LongRepMarker are as follows: (i) by introducing barcode linked reads and SMS long reads to assist the assembly of all short paired-end reads, it can identify repeats to a greater extent; (ii) by finding the overlap sequences between assemblies or chromosomes, it locates repeats faster and more accurately; (iii) by using multi-alignment unique k-mers rather than high-frequency k-mers to identify repeats in overlap sequences, it obtains repeats more comprehensively and stably; (iv) by applying a parallel alignment model based on the multi-alignment unique k-mers, it greatly improves the efficiency of data processing; and (v) by adopting corresponding identification strategies, it can identify structural variations that occur between repeats. Comprehensive experimental results show that LongRepMarker achieves more satisfactory results than existing de novo detection methods (https://github.com/BioinformaticsCSU/LongRepMarker).
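
To make the idea in characteristic (iii) more concrete, the following is a minimal toy sketch, not LongRepMarker's actual implementation. It assumes that a "multi-alignment unique k-mer" can be approximated as a distinct k-mer that matches at two or more positions of a sequence, and it merges the positions covered by such k-mers into candidate repeat intervals. The function names `kmer_positions` and `candidate_repeat_intervals` and the `min_hits` parameter are hypothetical, and the sketch ignores reverse complements, sequencing errors and the assembly steps described in the abstract.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each distinct k-mer to the list of start positions where it occurs."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def candidate_repeat_intervals(seq, k, min_hits=2):
    """Return merged half-open intervals covered by k-mers seen at least min_hits times.

    Assumption: a k-mer with several exact matches stands in for a
    'multi-alignment' k-mer; real tools align against assemblies instead.
    """
    covered = sorted(
        (start, start + k)
        for starts in kmer_positions(seq, k).values()
        if len(starts) >= min_hits
        for start in starts
    )
    merged = []
    for start, end in covered:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


if __name__ == "__main__":
    # Two copies of ACGTTGCA separated by a non-repeated spacer.
    toy = "ACGTTGCA" + "GGGCCCTA" + "ACGTTGCA"
    print(candidate_repeat_intervals(toy, k=4))
```

On this toy input the two copies of `ACGTTGCA` are reported as the intervals (0, 8) and (16, 24), while the spacer between them is left unmarked because none of its 4-mers occurs more than once.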
