A de-identifier for medical discharge summaries

ARTIFICIAL INTELLIGENCE IN MEDICINE (2008)

Journal

ARTIFICIAL INTELLIGENCE IN MEDICINE

Volume 42, Issue 1, Pages 13-35

Publisher

ELSEVIER

DOI: 10.1016/j.artmed.2007.10.001

Keywords

automatic de-identification of narrative patient records; local lexical context; local syntactic context; dictionaries; sentential global context; syntactic information for de-identification

Funding

NATIONAL INSTITUTE OF BIOMEDICAL IMAGING AND BIOENGINEERING [R01EB001659] Funding Source: NIH RePORTER
NATIONAL LIBRARY OF MEDICINE [U54LM008748] Funding Source: NIH RePORTER

Abstract

Objective: Clinical records contain significant medical information that can be useful to researchers in various disciplines. However, these records also contain personal health information (PHI) whose presence limits the use of the records outside of hospitals. The goal of de-identification is to remove all PHI from clinical records. This is a challenging task because many records contain foreign and misspelled PHI; they also contain PHI that are ambiguous with non-PHI. These complications are compounded by the linguistic characteristics of clinical records. For example, medical discharge summaries, which are studied in this paper, are characterized by fragmented, incomplete utterances and domain-specific language; they cannot be fully processed by tools designed for lay language.

Methods and results: In this paper, we show that we can de-identify medical discharge summaries using a de-identifier, Stat De-id, based on support vector machines and local context (F-measure = 97% on PHI). Our representation of local context aids de-identification even when PHI include out-of-vocabulary words and even when PHI are ambiguous with non-PHI within the same corpus. Comparison of Stat De-id with a rule-based approach shows that local context contributes more to de-identification than dictionaries combined with hand-tailored heuristics (F-measure = 85%). Comparison with two well-known named entity recognition (NER) systems, SNoW (F-measure = 94%) and IdentiFinder (F-measure = 36%), on five representative corpora shows that when the language of documents is fragmented, a system with a relatively thorough representation of local context can be a more effective de-identifier than systems that combine (relatively simpler) local context with global context. Comparison with a Conditional Random Field De-identifier (CRFD), which utilizes global context in addition to the local context of Stat De-id, confirms this finding (F-measure = 88%) and establishes that strengthening the representation of local context may be more beneficial for de-identification than complementing local with global context. (C) 2007 Elsevier B.V. All rights reserved.
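The F-measure figures quoted in the abstract combine precision and recall into a single score. The abstract does not describe the exact evaluation protocol, so the following is only a minimal sketch under stated assumptions: tokens are labeled as binary PHI / non-PHI, scoring is per token, and precision and recall are weighted equally (beta = 1). The function name `f_measure` and the toy labels are hypothetical and are not taken from the paper.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F) for binary PHI labels.

    Assumption: gold and predicted are equal-length sequences of
    truthy (PHI) / falsy (non-PHI) token labels. This is an
    illustrative token-level scorer, not the paper's evaluation code.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # PHI correctly flagged
    fp = sum(1 for g, p in pairs if not g and p)    # non-PHI wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)    # PHI missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f = (1.0 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Toy example (hypothetical labels): 1 = PHI token, 0 = non-PHI token.
gold_labels = [1, 1, 0, 0, 1]
system_labels = [1, 0, 0, 1, 1]
precision, recall, f1 = f_measure(gold_labels, system_labels)
print(f"precision={precision:.3f} recall={recall:.3f} F={f1:.3f}")
```

For de-identification, recall is often the more safety-critical component, since a missed PHI token is a privacy leak; raising `beta` above 1 in this sketch would weight recall more heavily.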
