Article

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TPAMI.2021.3095381

Keywords

Proteins; Training; Amino acids; Task analysis; Databases; Computational modeling; Three-dimensional displays; Computational biology; high performance computing; machine learning; language modeling; deep learning

Funding

  1. Software Campus 2.0 (TUM) through the German Ministry for Research and Education (BMBF)
  2. Alexander von Humboldt Foundation through the German Ministry for Research and Education (BMBF)
  3. Deutsche Forschungsgemeinschaft [DFG-GZ: RO1320/4-1]
  4. NVIDIA
  5. National Research Foundation of Korea [2019R1A6A1A10073437, NRF-2020M3A9G7103933]
  6. Seoul National University
  7. Google Cloud
  8. Google Cloud Research Credits Program under a COVID-19 HPC Consortium grant
  9. DOE Office of Science User Facility [DE-AC05-00OR22725]
  10. TPU pods under TensorFlow Research Cloud grant
  11. National Research Foundation of Korea [2019R1A6A1A10073437], via the Korea Institute of Science & Technology Information (KISTI) and the National Science & Technology Information Service (NTIS)

Abstract

Computational biology and bioinformatics provide valuable data for training language models originally developed in natural language processing. In this study, six models were trained on protein sequence data, and the resulting embeddings served as exclusive input for per-residue and per-protein prediction tasks, demonstrating advantages over methods that rely on evolutionary information.
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of sub-cellular location (ten-state accuracy Q10=81%) and membrane versus water-soluble proteins (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
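As a concrete illustration of how the released embeddings can serve as exclusive input for downstream predictors, the sketch below extracts per-residue ProtT5 embeddings and mean-pools them into a per-protein vector using the Hugging Face transformers library. This is a minimal sketch, not the authors' exact pipeline: the checkpoint name Rostlab/prot_t5_xl_uniref50, the space-separated amino-acid preprocessing, and the toy sequence are assumptions based on common usage of the released models rather than details taken from this record.

# Minimal sketch (assumptions noted above), not the authors' exact pipeline:
# extract ProtT5 embeddings and pool them per protein.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Assumed checkpoint name for the released ProtT5-XL model trained on UniRef50.
CHECKPOINT = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(CHECKPOINT, do_lower_case=False)
model = T5EncoderModel.from_pretrained(CHECKPOINT)
model.eval()

# Toy sequence for illustration only (hypothetical, not from the paper).
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# Common ProtT5 preprocessing: map rare amino acids to X, separate residues by spaces.
prepped = " ".join(re.sub(r"[UZOB]", "X", sequence))

batch = tokenizer([prepped], padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])

# Per-residue (per-token) embeddings: one vector per amino acid,
# dropping the end-of-sequence token appended by the T5 tokenizer.
per_residue = out.last_hidden_state[0, : len(sequence)]   # shape (L, 1024)

# Per-protein embedding via mean pooling over residues, as used for the
# sub-cellular location and membrane/water-soluble predictions.
per_protein = per_residue.mean(dim=0)                      # shape (1024,)

print(per_residue.shape, per_protein.shape)

In this sketch, the per-residue matrix would feed a per-token classifier (e.g., for secondary structure), while the pooled vector would feed a per-protein classifier; any such downstream heads are the reader's own additions and not part of the released models.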
