BC4GO: a full-text corpus for the BioCreative IV GO task

Publisher

OXFORD UNIV PRESS
DOI: 10.1093/database/bau074

Funding

  1. Intramural Research Program of the NIH, National Library of Medicine
  2. USDA ARS
  3. National Human Genome Research Institute at the US National Institutes of Health [HG004090, HG002223, HG002273]
  4. National Science Foundation [ABI-1062520, ABI-1147029, DBI-0850319]
  5. MRC [G1000968] Funding Source: UKRI
  6. Medical Research Council [G1000968] Funding Source: researchfish
  7. ARS [813447, ARS-0424655] Funding Source: Federal RePORTER
  8. Division of Biological Infrastructure [1062520] Funding Source: National Science Foundation
  9. Directorate for Biological Sciences [0850319] Funding Source: National Science Foundation

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy comparable with that of humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measure. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ~10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
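As an illustration of how an F1-based IAA for evidence sentence selection can be computed, the following is a minimal sketch and not the authors' evaluation code. It assumes each annotation is a hypothetical (first_sentence, last_sentence) index pair, that strict matching means identical spans, and that relaxed matching means overlapping spans; the abstract does not define either criterion, so both definitions are assumptions, and the example data are made up.

```python
def f1_agreement(a, b, match):
    """F1 between two annotators' selections, treating `a` as the reference."""
    if not a or not b:
        return 0.0
    # Precision: share of b's spans that match some span from a.
    precision = sum(any(match(x, y) for y in a) for x in b) / len(b)
    # Recall: share of a's spans that match some span from b.
    recall = sum(any(match(x, y) for y in b) for x in a) / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def strict(x, y):
    # Assumed strict criterion: spans must be identical.
    return x == y


def relaxed(x, y):
    # Assumed relaxed criterion: inclusive spans share at least one sentence.
    return x[0] <= y[1] and y[0] <= x[1]


# Made-up spans for two annotators on one article.
annotator_a = [(3, 3), (10, 11), (20, 20)]
annotator_b = [(3, 3), (11, 12), (30, 30)]

strict_f1 = f1_agreement(annotator_a, annotator_b, strict)
relaxed_f1 = f1_agreement(annotator_a, annotator_b, relaxed)
print(f"strict F1 = {strict_f1:.3f}, relaxed F1 = {relaxed_f1:.3f}")
```

Under these assumptions the relaxed score can never be lower than the strict one, since every identical pair of spans also overlaps, which is consistent with the ordering of the two figures reported above.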
