4.6 Article

ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed

Journal

Publisher

FRONTIERS MEDIA SA
DOI: 10.3389/fmolb.2021.620475

Keywords

protein stability; text mining; document classification; automated literature mining; thermodynamic data

Funding

  1. Ministero Istruzione, Universita e Ricerca [201744NR8S]


In recent years, the growth in DNA sequencing and protein mutagenesis studies has generated a large amount of variation data. Manual curation of such data from the literature is time-consuming and costly, which prompted the development of ThermoScan, a tool for extracting thermodynamic data on protein stability from full-text articles. ThermoScan's text-mining approach returns accurate predictions and outperforms other text-mining algorithms that analyze only publication abstracts.
In recent years, the increasing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. The collection of such data has been essential for developing and assessing tools that predict the impact of protein variants at the functional and structural levels. Nevertheless, collecting manually curated data from the literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on thermodynamic data extracted from the literature. In the past, such data were deposited in the ProTherm database, which has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from the literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text-mining approach for identifying relevant thermodynamic data on protein stability in full-text articles. The method relies on a regular expression that searches for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts.
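The regular-expression scoring idea described in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the term lists, the weights, the threshold, and the function names `count_matches`, `score`, and `is_candidate` are hypothetical and are not taken from ThermoScan itself, whose actual vocabulary and scoring formula are not given here.

```python
import re

# Hypothetical term groups (assumptions for illustration only):
# conceptual words, thermodynamic variables, and units of measure.
CONCEPT = r"(?:unfolding|denaturation|thermal stability|protein stability|free energy)"
VARIABLE = r"(?:ΔΔG|ΔG|ΔH|ΔCp|Tm)"  # longer alternatives first so ΔΔG is not split
UNIT = r"(?:kcal/mol|kJ/mol|°C)"

PATTERNS = {
    "concept": re.compile(CONCEPT, re.IGNORECASE),
    "variable": re.compile(VARIABLE),  # case-sensitive: "Tm" should not match "tm"
    "unit": re.compile(UNIT),
}

# Hypothetical weights: variables and units count more than generic concept words.
WEIGHTS = {"concept": 1.0, "variable": 2.0, "unit": 2.0}


def count_matches(text):
    """Count non-overlapping matches of each term group in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}


def score(text):
    """Weighted sum of match counts; a stand-in for an empirical score."""
    counts = count_matches(text)
    return sum(WEIGHTS[name] * n for name, n in counts.items())


def is_candidate(text, threshold=5.0):
    """Flag a document as likely to report stability data (threshold is assumed)."""
    return score(text) >= threshold


if __name__ == "__main__":
    sample = "Thermal denaturation gave a Tm of 65 °C and a ΔG of 5.2 kcal/mol."
    print(count_matches(sample), score(sample), is_candidate(sample))
```

In a semi-automatic workflow of this kind, documents above the threshold would still go to a curator for confirmation. The weights and threshold would need to be tuned on a labeled set, as the abstract describes for the ProTherm-derived publications.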
