A method for determining the number of documents needed for a gold standard corpus

JOURNAL OF BIOMEDICAL INFORMATICS (2012)

Journal

JOURNAL OF BIOMEDICAL INFORMATICS

Volume 45, Issue 3, Pages 460-470

Publisher

ACADEMIC PRESS INC ELSEVIER SCIENCE

DOI: 10.1016/j.jbi.2011.12.010

Keywords

Natural language processing; Gold standard corpus; Sampling; Capture probability

Funding

Office of the Vice President for Research and Graduate Studies
Clinical & Translational Sciences Institute of Michigan State University

Abstract

The unstructured narratives in medicine have been increasingly targeted for content extraction using the techniques of natural language processing (NLP). In most cases, these efforts are facilitated by creating a manually annotated set of narratives containing the ground truth, commonly referred to as a gold standard corpus. This corpus is used for modeling, fine-tuning, and testing NLP software, as well as providing the basis for training in machine learning. Determining the number of annotated documents (size) for this corpus is important but rarely described; rather, the factors of cost and time appear to dominate decision-making about corpus size. In this report, a method is outlined to determine gold standard size based on the capture probabilities for the unique words within a target corpus. To demonstrate this method, a corpus of dictation letters from the Michigan Pain Consultant (MPC) clinics for pain management is described and analyzed. A well-formed working corpus of 10,000 dictations was first constructed to provide a representative subset of the total, with no more than one dictation letter per patient. Each dictation was divided into words, and common words were removed. The Poisson function was used to determine probabilities of word capture within samples taken from the working corpus, and then integrated over word length to give a single capture probability as a function of sample size. For these MPC dictations, a sample size of 500 documents is predicted to give a capture probability of approximately 0.95. Continuing the demonstration of sample selection, a provisional gold standard corpus of 500 documents was selected and examined for its similarity to the MPC structured coding and demographic data available for each patient. It is shown that a representative sample, of justifiable size, can be selected for use as a gold standard. (C) 2012 Elsevier Inc. All rights reserved.
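The capture-probability idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the formulas, so the sketch assumes that a word found in d of the N working-corpus documents is captured by a sample of n documents with Poisson probability 1 - exp(-n·d/N), and it combines the per-word values with a document-frequency-weighted average instead of the paper's integration step. All function names are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs, stop_words=frozenset()):
    """Count, for each unique word, how many documents contain it."""
    df = Counter()
    for text in docs:
        # Use a set so each word is counted at most once per document.
        words = {w for w in text.lower().split() if w not in stop_words}
        df.update(words)
    return df


def capture_probability(doc_freq, corpus_size, sample_size):
    """Poisson probability that a word appears at least once in the sample.

    Assumption: the expected number of sampled documents containing the
    word is lam = sample_size * doc_freq / corpus_size.
    """
    lam = sample_size * doc_freq / corpus_size
    return 1.0 - math.exp(-lam)


def mean_capture_probability(df, corpus_size, sample_size):
    """Average capture probability over words, weighted by document frequency.

    The weighting is an illustrative choice, not taken from the paper.
    """
    total = sum(df.values())
    weighted = sum(
        d * capture_probability(d, corpus_size, sample_size)
        for d in df.values()
    )
    return weighted / total


def smallest_sample_size(df, corpus_size, target=0.95):
    """Return the smallest sample size whose mean capture probability
    reaches the target, or None if no size up to corpus_size does."""
    for n in range(1, corpus_size + 1):
        if mean_capture_probability(df, corpus_size, n) >= target:
            return n
    return None
```

Under these assumptions, calling `smallest_sample_size(document_frequencies(docs, stop_words), len(docs))` on a working corpus would give a candidate gold standard size for a 0.95 target. The value it returns depends on the weighting chosen here and should not be expected to reproduce the 500-document figure reported for the MPC dictations.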
