Text mining-based word representations for biomedical data analysis and protein-protein interaction networks in machine learning tasks

PLOS ONE (2021)

Journal

PLOS ONE

Volume 16, Issue 10

Publisher

Public Library of Science

DOI: 10.1371/journal.pone.0258623

Category

Multidisciplinary Sciences

Funding

German Ministry of Education and Research (Bundesministerium für Bildung und Forschung, BMBF) project iDDSEM MyPathSem [031L0024A+B]

Smart Summary

Biomedical and life science literature plays an important role in publishing experimental results, and the rapid growth of new publications has increased the amount of scientific knowledge represented in free text. Techniques that extract this knowledge using the word2vec approach have been shown to help scientists discover new relationships between biological entities. The study generated word vector representations from a large corpus of PubMed abstracts and demonstrated the utility of word2vec embeddings for biomedical analysis through validation experiments. Gene-gene networks derived from the embeddings were used to train Graph-Convolutional Neural Networks, and the resulting models performed well in tasks such as predicting metastatic events in breast cancer, which supports the usefulness of the generated word embeddings for constructing biological networks.

Abstract
Biomedical and life science literature is an essential way to publish experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and answering biological questions. Making use of the word2vec approach, we generated word vector representations based on a corpus consisting of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step consisted of substituting synonymous terms with their preferred terms in biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train Graph-Convolutional Neural Networks (Graph-CNNs) on large breast cancer gene expression data and on other cancer datasets. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities, such as known PPIs, shared signaling pathways and cellular functions, or narrower disease ontology groups, correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed sufficiently well on the metastatic event prediction tasks compared to other networks. This performance was good enough to validate the utility of our generated word embeddings in constructing biological networks. Word representations produced by text mining algorithms like word2vec are therefore able to capture biologically meaningful relations between entities. Our generated embeddings are publicly available at https://github.com/genexplain/Word2vec-based-Networks/blob/main/README.md.
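
To make two steps mentioned in the abstract more concrete (substituting synonyms with preferred terms, and deriving a gene-gene network from embedding similarity), here is a minimal Python sketch. It is not the authors' pipeline: the synonym table, the toy vectors, the function names and the idea of keeping an edge when cosine similarity reaches a fixed threshold are all assumptions made for illustration. The abstract does not say how the networks were actually extracted.

```python
import math
import re
from itertools import combinations

# Hypothetical synonym table (assumption): synonym (lowercase) -> preferred term.
SYNONYMS = {"her2": "ERBB2", "neu": "ERBB2", "p53": "TP53"}

# Toy 2-dimensional vectors (assumption); real word2vec vectors are learned
# from a corpus and have many more dimensions.
TOY_EMBEDDINGS = {
    "ERBB2": [1.0, 0.0],
    "EGFR": [2.0, 0.0],
    "TP53": [0.0, 1.0],
}


def normalize_terms(text, synonyms):
    """Tokenize text and replace each known synonym with its preferred term."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    return [synonyms.get(token.lower(), token) for token in tokens]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_gene_network(embeddings, threshold):
    """Return (gene_a, gene_b, similarity) for every pair at or above threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(embeddings), 2):
        similarity = cosine_similarity(embeddings[gene_a], embeddings[gene_b])
        if similarity >= threshold:
            edges.append((gene_a, gene_b, similarity))
    return edges


if __name__ == "__main__":
    print(normalize_terms("HER2 and p53 interact", SYNONYMS))
    print(build_gene_network(TOY_EMBEDDINGS, threshold=0.5))
```

In this sketch the threshold controls how dense the resulting network is. An edge list of this kind could then be turned into an adjacency matrix and supplied as prior knowledge to a graph-based model, which is the role the abstract describes for the embedding-derived networks.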
