Article

Comparison of biomedical relationship extraction methods and models for knowledge graph creation

Journal

JOURNAL OF WEB SEMANTICS
Volume 75

Publisher

ELSEVIER
DOI: 10.1016/j.websem.2022.100756

Keywords

Knowledge graphs; Information extraction; Machine learning; Natural language processing; Text mining; Text-to-text model; Linked data; Transformers; PubMedBERT; T5; SciFive


Abstract
Biomedical research is growing at such an exponential pace that scientists, researchers, and practitioners are no longer able to cope with the amount of published literature in the domain. The knowledge presented in the literature needs to be systematized in such a way that claims and hypotheses can be easily found, accessed, and validated. Knowledge graphs can provide such a framework for semantic knowledge representation from literature. However, in order to build a knowledge graph, it is necessary to extract knowledge as relationships between biomedical entities and to normalize both entities and relationship types. In this paper, we present and compare several rule-based and machine learning-based methods (Naive Bayes and Random Forests as examples of traditional machine learning, and DistilBERT-, PubMedBERT-, T5-, and SciFive-based models as examples of modern deep learning transformers) for scalable relationship extraction from biomedical literature and for integration into knowledge graphs. We examine how resilient these methods are to unbalanced and fairly small datasets. Our experiments show that transformer-based models handle both small datasets (owing to pre-training on large corpora) and unbalanced datasets well. The best-performing model was the PubMedBERT-based model fine-tuned on balanced data, with a reported F1-score of 0.92. The DistilBERT-based model followed with an F1-score of 0.89, while running faster and with lower resource requirements. BERT-based models performed better than T5-based generative models.

(c) 2022 Elsevier B.V. All rights reserved.
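As a minimal sketch of the kind of rule-based relationship extraction the abstract contrasts with the transformer-based models: a trigger word between two recognized entities yields a (subject, relation, object) triple that can be loaded into a knowledge graph. The patterns, trigger words, and entity list below are hypothetical illustrations, not the paper's actual rules.

```python
import re

# Hypothetical trigger-word patterns mapping surface verbs to
# normalized relationship types (illustrative only).
PATTERNS = {
    "inhibits": re.compile(r"\binhibits?\b"),
    "activates": re.compile(r"\bactivates?\b"),
    "treats": re.compile(r"\btreats?\b"),
}

def extract_triples(sentence, entities):
    """Return (subject, relation, object) triples when two previously
    recognized entities flank a matched trigger word."""
    # Locate every occurrence of each known entity in the sentence.
    found = sorted(
        (m.start(), e)
        for e in entities
        for m in re.finditer(re.escape(e), sentence)
    )
    triples = []
    for relation, pattern in PATTERNS.items():
        for trigger in pattern.finditer(sentence):
            # Nearest entity before the trigger is the subject,
            # nearest entity after it is the object.
            subjects = [e for pos, e in found if pos < trigger.start()]
            objects = [e for pos, e in found if pos > trigger.end()]
            if subjects and objects:
                triples.append((subjects[-1], relation, objects[0]))
    return triples

print(extract_triples("Aspirin inhibits COX-1 in platelets.",
                      ["Aspirin", "COX-1"]))
# [('Aspirin', 'inhibits', 'COX-1')]
```

Such hand-written patterns are transparent and cheap to run, but, as the experiments indicate, they are harder to scale across relation types than fine-tuned transformer classifiers.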

