期刊
IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
卷 19, 期 3, 页码 1772-1781出版社
IEEE COMPUTER SOC
DOI: 10.1109/TCBB.2020.3044230
关键词
Proteins; Annotations; Databases; Task analysis; Neural networks; Tools; Random forests; Sequence analysis; protein function prediction; neural network; convolutional neural network; random forest
This paper addresses the increasing demand for automated protein function prediction by developing an ensemble system that utilizes machine learning models to predict protein functions based on their sequences. The proposed system demonstrates competitive performance in the CAFA3 evaluation.
Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence. We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance ranking among top-10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models together with the data pre-processing and feature generation tools are publicly available as an open source software at https://github.com/TurkuNLP/CAFA3.
作者
我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。
推荐
暂无数据