☆ 4.7 Review

Classifying injury narratives of large administrative databases for surveillance-A practical approach combining machine learning ensembles and human review

ACCIDENT ANALYSIS AND PREVENTION (2017)

Journal

ACCIDENT ANALYSIS AND PREVENTION

Volume 98, Pages 359-371

Publisher

PERGAMON-ELSEVIER SCIENCE LTD

DOI: 10.1016/j.aap.2016.10.014

Keywords

Injury; Narrative text; Injury surveillance; Cause of injury; Machine learning

Categories

Ergonomics; Public, Environmental & Occupational Health; Social Sciences, Interdisciplinary; Transportation

Funding

Liberty Mutual Research Institute for Safety

Abstract

Injury narratives are now available in real time and include useful information for injury surveillance and prevention. However, manual classification of the cause or events leading to injury found in large batches of narratives, such as workers' compensation claims databases, can be prohibitive. In this study we compare the utility of four machine learning algorithms (Naive Bayes single-word and bi-gram models, Support Vector Machine, and Logistic Regression) for classifying narratives into Bureau of Labor Statistics Occupational Injury and Illness event-leading-to-injury classifications for a large workers' compensation database. These algorithms are known to do well classifying narrative text and are fairly easy to implement with off-the-shelf software packages such as Python. We propose human-machine learning ensemble approaches which maximize the power and accuracy of the algorithms for machine-assigned codes and allow for strategic filtering of rare, emerging or ambiguous narratives for manual review. We compare human-machine approaches based on filtering on the prediction strength of the classifier vs. agreement between algorithms. Regularized Logistic Regression (LR) was the best performing algorithm alone. Using this algorithm and filtering out the bottom 30% of predictions for manual review resulted in high accuracy (overall sensitivity/positive predictive value of 0.89) of the final machine-human coded dataset. The best pairings of algorithms included Naive Bayes with Support Vector Machine, whereby the triple ensemble NB_SW = NB_BI-GRAM = SVM had very high performance (0.93 overall sensitivity/positive predictive value, with high sensitivity and positive predictive values across both large and small categories), leaving 41% of the narratives for manual review. Integrating LR into this ensemble mix improved performance only slightly.
For large administrative datasets we propose incorporation of methods based on human-machine pairings such as we have done here, utilizing readily available off-the-shelf machine learning techniques and resulting in only a fraction of narratives that require manual review. Human-machine ensemble methods are likely to improve performance over total manual coding. © 2016 The Authors. Published by Elsevier Ltd.
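
The two filtering strategies compared in the abstract (agreement between algorithms vs. prediction strength of a single classifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the event codes, and the strength values below are hypothetical, and "bottom 30%" is assumed to mean the 30% of narratives with the lowest prediction strength.

```python
def filter_by_agreement(predictions_by_model):
    """Auto-code a narrative only when every model assigns the same code.

    predictions_by_model: list of equal-length lists, one list per classifier
    (for example NB single-word, NB bi-gram, SVM). Hypothetical helper.
    Returns (auto_coded, manual_review): a dict index -> code, and a list of
    indices on which the models disagree.
    """
    auto_coded, manual_review = {}, []
    for i, codes in enumerate(zip(*predictions_by_model)):
        if len(set(codes)) == 1:      # all models agree on this narrative
            auto_coded[i] = codes[0]
        else:                         # any disagreement goes to a human coder
            manual_review.append(i)
    return auto_coded, manual_review


def filter_by_strength(predictions, strengths, review_fraction=0.30):
    """Send the lowest-strength fraction of predictions to manual review.

    strengths: one score per prediction, e.g. the classifier's predicted
    probability for its top class (an assumption of this sketch).
    """
    n_review = round(len(predictions) * review_fraction)
    weakest_first = sorted(range(len(predictions)), key=lambda i: strengths[i])
    manual_review = sorted(weakest_first[:n_review])
    skip = set(manual_review)
    auto_coded = {i: p for i, p in enumerate(predictions) if i not in skip}
    return auto_coded, manual_review


# Hypothetical predictions for four narratives from three classifiers.
nb_single_word = ["fall", "struck_by", "fall", "overexertion"]
nb_bigram = ["fall", "struck_by", "overexertion", "overexertion"]
svm = ["fall", "fall", "fall", "overexertion"]

auto, manual = filter_by_agreement([nb_single_word, nb_bigram, svm])
print("agreement filter:", auto, manual)

# Hypothetical single-classifier predictions with made-up strengths.
lr_codes = ["fall", "struck_by", "fall", "overexertion", "fall"]
lr_strengths = [0.95, 0.40, 0.88, 0.55, 0.99]
auto, manual = filter_by_strength(lr_codes, lr_strengths, review_fraction=0.40)
print("strength filter:", auto, manual)
```

In both cases the machine-assigned codes and the manually reviewed narratives would then be merged into the final coded dataset. The review fraction is the tuning knob: a larger fraction raises the manual workload in exchange for fewer low-confidence machine codes.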
