Article

Finding long tandem repeats in long noisy reads

BIOINFORMATICS (2021)

Journal

BIOINFORMATICS

Volume 37, Issue 5, Pages 612-621

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/bioinformatics/btaa865

Categories

Biochemical Research Methods; Biotechnology & Applied Microbiology; Computer Science, Interdisciplinary Applications; Mathematical & Computational Biology; Statistics & Probability

Funding

Japan Agency for Medical Research and Development (GRIFIN) [16H06279]

Summary

This study presents an efficient algorithm that utilizes the characteristics of repetitive sequences to address the issue of high error rates in long-read sequencing technologies. The algorithm first estimates regions that may contain tandem repeats, and then assembles the k-mers of these regions into a consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results demonstrate that the proposed algorithm outperforms the widely used Tandem Repeats Finder in terms of sensitivity.

Abstract

Motivation: Long tandem repeat expansions of more than 1000 nt have been suggested to be associated with diseases, but remain largely unexplored in individual human genomes because read lengths have been too short. New long-read sequencing technologies can produce single reads of 10 000 nt or more that can span such repeat expansions, but these long reads have high error rates of 10-20%, which complicates the detection of repetitive elements. Moreover, most traditional algorithms for finding tandem repeats are designed to find short tandem repeats (< 1000 nt) and cannot effectively handle the high error rate of long reads in a reasonable amount of time.

Results: Here, we report an efficient algorithm for solving this problem that takes advantage of the length of the repeat. Namely, a long tandem repeat has hundreds or thousands of approximate copies of the repeated unit, so despite the error rate, many short k-mers will be error-free in many copies of the unit. We exploited this characteristic to develop a method that first estimates regions that could contain a tandem repeat by analyzing the k-mer frequency distributions of fixed-size windows across the target read, and then assembles the k-mers of a putative region into the consensus repeat unit by greedily traversing a de Bruijn graph. Experimental results indicated that the proposed algorithm largely outperformed Tandem Repeats Finder, a widely used program for finding tandem repeats, in terms of sensitivity.
