4.6 Article

A method for determining the number of documents needed for a gold standard corpus

Journal

JOURNAL OF BIOMEDICAL INFORMATICS
Volume 45, Issue 3, Pages 460-470

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE
DOI: 10.1016/j.jbi.2011.12.010

Keywords

Natural language processing; Gold standard corpus; Sampling; Capture probability

Funding

  1. Office of the Vice President for Research and Graduate Studies
  2. Clinical & Translations Sciences Institute of Michigan State University

Ask authors/readers for more resources

The unstructured narratives in medicine have been increasingly targeted for content extraction using the techniques of natural language processing (NLP). In most cases, these efforts are facilitated by creating a manually annotated set of narratives containing the ground truth; commonly referred to as a gold standard corpus. This corpus is used for modeling, fine-tuning, and testing NLP software as well as providing the basis for training in machine learning. Determining the number of annotated documents (size) for this corpus is important, but rarely described; rather, the factors of cost and time appear to dominate decision-making about corpus size. In this report, a method is outlined to determine gold standard size based on the capture probabilities for the unique words within a target corpus. To demonstrate this method, a corpus of dictation letters from the Michigan Pain Consultant (MPC) clinics for pain management are described and analyzed. A well-formed working corpus of 10,000 dictations was first constructed to provide a representative subset of the total, with no more than one dictation letter per patient. Each dictation was divided into words and common words were removed. The Poisson function was used to determine probabilities of word capture within samples taken from the working corpus, and then integrated over word length to give a single capture probability as a function of sample size. For these MPC dictations, a sample size of 500 documents is predicted to give a capture probability of approximately 0.95. Continuing the demonstration of sample selection, a provisional gold standard corpus of 500 documents was selected and examined for its similarity to the MPC structured coding and demographic data available for each patient. It is shown that a representative sample, of justifiable size, can be selected for use as a gold standard. (C) 2012 Elsevier Inc. All rights reserved.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.6
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available