Article

Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution

Journal

NUCLEIC ACIDS RESEARCH
Volume 50, Issue 14, Pages -

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/nar/gkac326


Funding

  1. Guangdong Provincial Academician Workstation of BGI Synthetic Genomics [2017B090904014]
  2. Program of Shanghai Academic Research Leader [20XD1401100]
  3. Program for Outstanding Medical Academic Leader [2019LJ01]


LOGO (Language of Genome) is a self-attention-based pre-trained language model that learns bidirectional representations of the unlabelled human reference genome. It performs well at interpreting non-coding regions, with advantages in tasks such as promoter identification and enhancer-promoter interaction prediction.
Interpretation of the non-coding genome remains an unsolved challenge in human genetics due to the impracticality of exhaustively annotating biochemically active elements under all conditions. Deep-learning-based computational approaches have emerged recently to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters, a substantially lighter architecture, that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised benchmark model, DeepSEA, and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base resolution.
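The core mechanism the abstract describes, bidirectional self-attention over genome tokens with a masked-token self-supervision objective, can be sketched minimally as follows. This is an illustrative NumPy toy, not LOGO's actual implementation: the k-mer size, embedding dimension, random weights, and zeroed-embedding masking are placeholder assumptions, and a real model would learn these parameters and use a dedicated mask token.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention. Every position
    attends to every other, giving the bidirectional context that a
    masked-token objective relies on."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

seq = "ACGTACGTACGTACGT"
tokens = kmers(seq)                       # 11 overlapping 6-mers
d = 8                                     # toy embedding dimension
vocab = {t: i for i, t in enumerate(sorted(set(tokens)))}
E = rng.normal(size=(len(vocab), d))      # toy embedding table
X = np.stack([E[vocab[t]] for t in tokens])

# "Mask" one token by zeroing its embedding; attention lets the model
# reconstruct it from flanking context on both sides.
X_masked = X.copy()
X_masked[5] = 0.0

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
H = self_attention(X_masked, Wq, Wk, Wv)
print(H.shape)  # (11, 8): one contextual vector per k-mer position
```

Because the attention scores connect every position to both its upstream and downstream neighbours, the contextual vector at the masked position is computed entirely from bidirectional flanking sequence, which is the property the self-supervised pre-training exploits.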
