BC4GO: a full-text corpus for the BioCreative IV GO task

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION (2014)

Journal

DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/database/bau074

Funding

Intramural Research Program of the NIH, National Library of Medicine
USDA ARS
National Human Genome Research Institute at the US National Institutes of Health [HG004090, HG002223, HG002273]
National Science Foundation [ABI-1062520, ABI-1147029, DBI-0850319]
MRC [G1000968] Funding Source: UKRI
Medical Research Council [G1000968] Funding Source: researchfish
ARS [813447, ARS-0424655] Funding Source: Federal RePORTER
Direct For Biological Sciences
Div Of Biological Infrastructure [1062520] Funding Source: National Science Foundation
Div Of Biological Infrastructure
Direct For Biological Sciences [0850319] Funding Source: National Science Foundation

Abstract

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy comparable with that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measure. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ~10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need for full-text articles when text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
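
The strict and relaxed agreement figures for evidence sentence selection can be illustrated with a minimal sketch of a pairwise F1 computation between two annotators. The abstract does not define the matching criteria, so the sketch assumes that "strict" means identical passage boundaries and "relaxed" means any character overlap; the `Span` representation, the function names and the example offsets are all hypothetical and are not taken from the BC4GO corpus or its tooling.

```python
from typing import Set, Tuple

# Hypothetical passage representation: (article id, start offset, end offset).
Span = Tuple[str, int, int]


def overlaps(x: Span, y: Span) -> bool:
    """True if two passages are in the same article and share at least one character."""
    return x[0] == y[0] and x[1] < y[2] and y[1] < x[2]


def pairwise_f1(a: Set[Span], b: Set[Span], relaxed: bool = False) -> float:
    """F1 agreement, treating annotator `a` as reference and `b` as response.

    Assumed criteria (not stated in the abstract):
    strict  = passages must have identical boundaries;
    relaxed = passages only need to overlap.
    """
    if not a or not b:
        return 0.0
    if relaxed:
        matched_b = sum(any(overlaps(s, t) for t in a) for s in b)
        matched_a = sum(any(overlaps(s, t) for t in b) for s in a)
    else:
        matched_a = matched_b = len(a & b)
    precision = matched_b / len(b)
    recall = matched_a / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example passages for two annotators of the same article.
annotator_a = {("doc1", 0, 50), ("doc1", 100, 150), ("doc1", 200, 260)}
annotator_b = {("doc1", 0, 50), ("doc1", 120, 170)}

strict_f1 = pairwise_f1(annotator_a, annotator_b)
relaxed_f1 = pairwise_f1(annotator_a, annotator_b, relaxed=True)
```

With these made-up passages, the strict score is about 0.4 (one exact match) and the relaxed score about 0.8 (the partially overlapping passage also counts), which mirrors why the relaxed IAA is higher than the strict one.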
