Article; Data Paper

Health record hiccups - 5,526 real-world time series with change points labelled by crowdsourced visual inspection

GIGASCIENCE (2023)

Journal

GIGASCIENCE

Volume 12

Publisher

OXFORD UNIV PRESS

DOI: 10.1093/gigascience/giad060

Keywords

time series; change point detection; anomalies; data quality


Automated Summary

This study generated 5,526 time series from 8 EHR datasets and used crowdsourced visual inspection to label suspicious-looking change points. The consensus labels were validated against labels produced by an experienced data scientist, and the resulting collection can be used to validate automated change point detection methods in real-world settings.

Abstract

Background: Large routinely collected data such as electronic health records (EHRs) are increasingly used in research, but the statistical methods and processes used to check such data for temporal data quality issues have not moved beyond manual, ad hoc production and visual inspection of graphs. With the prospect of EHR data being used for disease surveillance via automated pipelines and public-facing dashboards, automation of data quality checks will become increasingly valuable.

Findings: We generated 5,526 time series from 8 different EHR datasets and engaged >2,000 citizen-science volunteers to label the locations of all suspicious-looking change points in the resulting graphs. Consensus labels were produced using density-based clustering with noise, with validation conducted using 956 images containing labels produced by an experienced data scientist. Parameter tuning was done against 670 images and performance calculated against 286 images, resulting in a final sensitivity of 80.4% (95% CI, 77.1%-83.3%), specificity of 99.8% (99.7%-99.8%), positive predictive value of 84.5% (81.4%-87.2%), and negative predictive value of 99.7% (99.6%-99.7%). In total, 12,745 change points were found within 3,687 of the time series.

Conclusions: This large collection of labelled EHR time series can be used to validate automated methods for change point detection in real-world settings, encouraging the development of methods that can successfully be applied in practice. It is particularly valuable since change point detection methods are typically validated using synthetic data, so their performance in real-world settings cannot be assumed to be comparable. While the dataset focusses on EHRs and data quality, it should also be applicable in other fields.
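
To make the consensus-labelling step more concrete, the following is a minimal sketch of density-based clustering with noise (DBSCAN-style) applied to the positions that several volunteers marked on one time series. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` and `min_samples` values, and the choice of the cluster median as the consensus location are all hypothetical, and the abstract does not state the parameters or the summary statistic actually used.

```python
from statistics import median


def dbscan_1d(positions, eps, min_samples):
    """Assign each position a cluster id, or -1 for noise (DBSCAN in one dimension)."""
    n = len(positions)
    # Neighbourhood of each point: indices within eps (includes the point itself).
    neighbours = [
        [j for j in range(n) if abs(positions[i] - positions[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_samples points in its neighbourhood.
    is_core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outwards from this unvisited core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    # Only core points keep expanding the cluster;
                    # border points are labelled but not expanded.
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels


def consensus_change_points(positions, eps=3.0, min_samples=3):
    """Reduce volunteer-marked positions to one consensus location per cluster.

    eps and min_samples are hypothetical values chosen for illustration;
    using the median as the consensus location is also an assumption.
    """
    labels = dbscan_1d(positions, eps, min_samples)
    clusters = {}
    for pos, lab in zip(positions, labels):
        if lab != -1:  # isolated marks are treated as noise and dropped
            clusters.setdefault(lab, []).append(pos)
    return sorted(median(members) for members in clusters.values())


# Hypothetical marks (x-axis positions) from several volunteers on one graph.
marks = [10, 11, 12, 40, 41, 41, 42, 77]
consensus = consensus_change_points(marks)
```

With these made-up marks, the two dense groups each yield one consensus change point, while the single mark at 77 has too few neighbours and is discarded as noise. In practice, `eps` and `min_samples` would be tuned against expert labels, as the abstract describes for the 670 tuning images.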
