Article

Learning a functional grammar of protein domains using natural language word embedding techniques

Journal

PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS
Volume 88, Issue 4, Pages 616-624

Publisher

WILEY
DOI: 10.1002/prot.25842

Keywords

function prediction; machine learning; protein domains; semantic embedding; word2vec

Funding

  1. Biotechnology and Biological Sciences Research Council [BB/M011712/1]
  2. BBSRC [BB/M011712/1] Funding Source: UKRI

In this paper, using Word2vec, a widely used natural language processing method, we demonstrate that protein domains may have a learnable implicit semantic meaning in the context of their functional contributions to the multi-domain proteins in which they are found. Word2vec is a group of models that produce semantically meaningful embeddings of words or tokens in a fixed-dimension vector space. In this work, we treat multi-domain proteins as sentences in which domain identifiers are the tokens, playing the role of words. Using all InterPro (Finn et al. 2017) Pfam domain assignments, we observe that the embedding could be used to suggest putative GO assignments for Pfam (Finn et al. 2016) domains of unknown function.
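
The sentence-and-word framing in the abstract can be illustrated with a minimal sketch. It only shows how multi-domain proteins might be turned into "sentences" of domain identifiers and then into the (target, context) pairs that a skip-gram style Word2vec model is trained on. The abstract does not state the Word2vec variant, the window size, or the preprocessing used, so the skip-gram choice, the `window` parameter, the function names `to_sentences` and `skipgram_pairs`, and the toy identifiers (`DOM_A` and so on) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy input: protein ID -> ordered list of domain identifiers.
# The identifiers are placeholders, not real Pfam accessions.
proteins: Dict[str, List[str]] = {
    "protA": ["DOM_A", "DOM_B", "DOM_C"],
    "protB": ["DOM_B", "DOM_D"],
    "protC": ["DOM_E"],  # single-domain protein: carries no domain context
}


def to_sentences(domain_map: Dict[str, List[str]]) -> List[List[str]]:
    """Treat each multi-domain protein as one 'sentence' of domain tokens."""
    return [domains for domains in domain_map.values() if len(domains) > 1]


def skipgram_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for tokens within `window` positions.

    `window` is an assumed hyperparameter; the abstract does not give one.
    """
    pairs: List[Tuple[str, str]] = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


sentences = to_sentences(proteins)
pairs = [pair for s in sentences for pair in skipgram_pairs(s, window=1)]

for target, context in pairs:
    print(target, "->", context)
```

Pairs like these are what an embedding model would consume: domains that tend to share neighbours across many proteins end up close together in the vector space, which is the property the abstract relies on when suggesting GO terms for domains of unknown function.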
