Article

Construction of Genealogical Knowledge Graphs From Obituaries: Multitask Neural Network Extraction System

Journal

Publisher

JMIR PUBLICATIONS, INC
DOI: 10.2196/25670

Keywords

genealogical knowledge graph; EHR; information extraction; genealogy; neural network

Funding

  1. National Key Research and Development Program of China [2018YFC0910404]
  2. National Natural Science Foundation of China [61772409]
  3. Innovative Research Group of the National Natural Science Foundation of China [61721002]
  4. Innovation Research Team of the Ministry of Education, Project of China Knowledge Centre for Engineering Science and Technology [IRT_17R86]

Researchers utilized online obituary data to construct genealogical knowledge graphs, successfully extracting and assembling family relationship data through a multitask neural network model, providing more comprehensive and accurate support for biomedical research.
Background: Genealogical information, such as that found in family trees, is essential for biomedical research on topics such as disease heritability and risk prediction. Researchers have used information on policyholders and their dependents in medical claims data, and emergency contacts in electronic health records (EHRs), to infer family relationships at a large scale. We have previously demonstrated that online obituaries can be a novel data source for building more complete and accurate family trees.
Objective: To supplement EHR data with family relationships for biomedical research, we built an end-to-end information extraction system that uses a multitask-based artificial neural network model to construct genealogical knowledge graphs (GKGs) from online obituaries. GKGs are enriched family trees with detailed information including age, gender, death and birth dates, and residence.
Methods: Building on a predefined family relationship map consisting of 4 types of entities (eg, people's names, residence, birth date, and death date) and 71 types of relationships, we curated a corpus containing 1700 online obituaries from the metropolitan area of Minneapolis and St Paul in Minnesota. We also used data augmentation to generate additional synthetic data and alleviate data scarcity for rare family relationships. A multitask-based artificial neural network model was then built to simultaneously detect names, extract relationships between them, and assign attributes (eg, birth dates and death dates, residence, age, and gender) to each individual. Finally, we assembled related GKGs into larger ones by identifying people who appear in multiple obituaries.
Results: Our system achieved satisfactory precision (94.79%), recall (91.45%), and F1 measure (93.09%) in 10-fold cross-validation. We also constructed 12,407 GKGs, the largest of which comprises 4 generations and 30 people.
Conclusions: In this work, we discussed the significance of GKGs for biomedical research, presented a new version of a corpus with a predefined family relationship map and augmented training data, and proposed a multitask deep neural system to construct and assemble GKGs. The results show that our system can extract GKGs from obituaries and demonstrate the potential of enriching EHR data for further genetic research. We share the source code and system with the scientific community on GitHub, without the corpus, to protect privacy.
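The assembly step described in the Methods (joining GKGs that share a person) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how people are matched across obituaries, so the sketch assumes each person is already reduced to a hashable key (here a hypothetical name and birth year pair), and it uses a standard union-find structure to group graphs that share any key. The function name `merge_graphs` and the sample people are invented for illustration.

```python
from collections import defaultdict


def merge_graphs(graphs):
    """Merge per-obituary graphs that share at least one person.

    `graphs` is a list of sets of person keys. The key format is an
    assumption of this sketch, not something specified in the abstract.
    Returns a list of merged sets, ordered by first contributing graph.
    """
    parent = list(range(len(graphs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so output order is stable.
            parent[max(ri, rj)] = min(ri, rj)

    # Link every graph to the first graph in which each person was seen.
    first_seen = {}
    for idx, people in enumerate(graphs):
        for person in people:
            if person in first_seen:
                union(idx, first_seen[person])
            else:
                first_seen[person] = idx

    # Collect the people of all graphs that ended up under the same root.
    merged = defaultdict(set)
    for idx, people in enumerate(graphs):
        merged[find(idx)] |= people
    return [merged[root] for root in sorted(merged)]


# Made-up example: the first two obituaries share one person.
obituaries = [
    {("Mary Olson", 1931), ("John Olson", 1955)},
    {("John Olson", 1955), ("Anna Olson", 1980)},
    {("Peter Berg", 1940)},
]
families = merge_graphs(obituaries)
```

In this example the first two graphs collapse into one three-person family and the third stays separate. A real system would need a more careful matching rule than exact key equality, since names and dates in obituaries are often incomplete or ambiguous.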
