Article

Integrating text mining into the MGI biocuration workflow

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION (2009)

Journal

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/database/bap019

Category

Mathematical & Computational Biology

Funding

National Institutes of Health: National Human Genome Research Institute [HG000330, HG002273, HG003622]
National Institute of Child Health and Human Development [HD033745]
National Cancer Institute [CA089713]
University of Maine Graduate School of Biomedical Sciences
National Science Foundation [0221625]
National Human Genome Research Institute [HG000330]

Abstract

A major challenge for functional and comparative genomics resource development is the extraction of data from the biomedical literature. Although text mining for biological data is an active research field, few applications have been integrated into production literature curation systems such as those of the model organism databases (MODs). Not only are most available biological natural language processing (bioNLP) and information retrieval and extraction solutions difficult to adapt to existing MOD curation workflows, but many also have high error rates or are unable to process documents available in the formats preferred by scientific journals. In September 2008, Mouse Genome Informatics (MGI) at The Jackson Laboratory initiated a search for dictionary-based text mining tools that we could integrate into our biocuration workflow. MGI has rigorous document triage and annotation procedures designed to identify appropriate articles about mouse genetics and genome biology. We currently screen approximately 1000 journal articles a month for Gene Ontology terms, gene mapping, gene expression, phenotype data and other key biological information. Although we do not foresee that curation tasks will ever be fully automated, we are eager to implement named entity recognition (NER) tools for gene tagging that can help streamline our curation workflow and simplify gene indexing tasks within the MGI system. Gene indexing is an MGI-specific curation function that involves identifying which mouse genes are being studied in an article, then associating the appropriate gene symbols with the article reference number in the MGI database. Here, we discuss our search process, performance metrics and success criteria, and how we identified a short list of potential text mining tools for further evaluation. We provide an overview of our pilot projects with NCBO's Open Biomedical Annotator and Fraunhofer SCAI's ProMiner. In doing so, we prove the potential for the further incorporation of semi-automated processes into the curation of the biomedical literature.
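
The gene indexing task described in the abstract (find the genes an article discusses, then attach their symbols to the article's reference number) can be illustrated with a minimal dictionary-based tagger. This is only a sketch under stated assumptions and does not represent how MGI, ProMiner or the Open Biomedical Annotator work internally. The dictionary entries, the function names, and the reference identifier `REF:0001` are hypothetical. The abstract mentions performance metrics without naming them, so the use of precision and recall here is also an assumption.

```python
import re

# Hypothetical mini-dictionary: synonym (lowercased) -> official gene symbol.
# Entries are illustrative only and are not drawn from the MGI database.
GENE_DICTIONARY = {
    "pax6": "Pax6",
    "small eye": "Pax6",
    "trp53": "Trp53",
    "p53": "Trp53",
    "kit": "Kit",
}


def tag_genes(text, dictionary=GENE_DICTIONARY):
    """Return the set of official symbols whose synonyms occur in text."""
    found = set()
    # Case-insensitive matching is a simplification made for this sketch.
    lowered = text.lower()
    for synonym, symbol in dictionary.items():
        # Word boundaries keep "kit" from matching inside "kitchen".
        if re.search(r"\b" + re.escape(synonym) + r"\b", lowered):
            found.add(symbol)
    return found


def index_article(reference_id, text, index):
    """Associate the tagged gene symbols with an article reference number."""
    index[reference_id] = tag_genes(text)
    return index


def precision_recall(predicted, curated):
    """Compare the tagger's output against a curator's gene list."""
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = "Small eye mice carry a mutation in Pax6; p53 was unaffected."
    index = index_article("REF:0001", sentence, {})
    print(index)
    # Suppose a curator indexed only Pax6 for this article.
    print(precision_recall(index["REF:0001"], {"Pax6"}))
```

In this toy case the tagger also reports Trp53, which a curator might not index because the gene is only mentioned in passing. That kind of over-tagging is one reason a semi-automated workflow would keep a curator in the loop.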
