4.7 Article

Preliminary exploration of topic modelling representations for Electronic Health Records coding according to the International Classification of Diseases in Spanish

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 204

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2022.117303

Keywords

Multi-label classification; Document classification; Electronic Health Records; ICD classification; Topic models; Partially labelled Dirichlet allocation

Funding

  1. Spanish Ministry of Science and Innovation [DOTT-HEALTH/PAT-MED PID2019-106942RB-C31]
  2. European Commission (FEDER)
  3. Basque Government, Spain [IXA IT-1343-19, PRE-2019-1-0158]

In this study, we focused on classifying Spanish Electronic Health Records (EHR) based on the International Classification of Diseases (ICD) using Topic Models. We found that Topic Models offer a suitable alternative approach for Spanish clinical text mining when there are limited resources available. Specifically, we explored two methods, Latent Dirichlet Allocation (LDA) and Partially Labelled Latent Dirichlet Allocation (PLDA), and found that PLDA is able to discover topics associated with the ICD, making it a versatile representation for EHRs. Compared to supervised categorization approaches, LDA and PLDA provide an interpretable approach that can be associated with ICDs.
In this work, we address the classification of Electronic Health Records (EHR) in Spanish according to the International Classification of Diseases (ICD). We employ Topic Models, which represent each document as a probability distribution over topics and thus offer a low-dimensional representation of documents.
The current trend is to turn to embedding-based text representations, but these approaches require large amounts of textual data. We found Topic Models to be a suitable alternative for dealing with the few resources available for Spanish clinical text mining. Moreover, they are interpretable and support explainable artificial intelligence (XAI).
We explored two methods: Latent Dirichlet Allocation (LDA) and Partially Labelled Latent Dirichlet Allocation (PLDA), a supervised variant of the former. We assessed the results attained in Spanish against an analogous task in English as a reference. Evaluation methods were applied directly to the representation, with metrics to determine topic coherence and the relationship between topics and ICD labels.
We learned that PLDA was able to discover topics associated with the ICD. This finding means that the representation itself can reveal ICD codes prior to classification. The representation was also used as predictive features for a conventional classifier to show its competence in a downstream task. We conclude that, in a context lacking big data, PLDA emerges as a versatile candidate able to offer a competitive representation of EHRs.
While other works are primarily concerned with supervised categorization and do not pay attention to the representation, LDA and PLDA offer an interpretable approach that can be associated with ICDs. Moreover, compared with works that employ LDA, we demonstrate how its supervised version, PLDA, can be more intuitive, as it shows a closer relation with the ICDs.
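To make the idea of a document represented as a distribution over topics concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is not the authors' implementation. The function name `gibbs_lda`, the hyperparameter values, and the toy Spanish tokens are all illustrative assumptions. The optional `allowed` argument restricts each document to the topics tied to its labels, which is a simplified stand-in for the label supervision used in Labeled LDA and PLDA, not PLDA itself.

```python
import random


def gibbs_lda(docs, num_topics, allowed=None, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenised documents.

    allowed: optional list with one entry per document giving the topic ids
    that document may use (hypothetical label-to-topic mapping); None means
    every document may use every topic, i.e. plain unsupervised LDA.
    Returns one topic distribution (list of floats) per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    if allowed is None:
        allowed = [list(range(num_topics)) for _ in docs]

    n_dk = [[0] * num_topics for _ in docs]            # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                             # tokens per topic
    z = []                                             # topic of each token

    # Random initial assignment, respecting the allowed topics.
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.choice(allowed[d])
            assignments.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi = word_id[w]
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wi] -= 1
                n_k[k] -= 1
                # Conditional probability of each allowed topic (unnormalised).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wi] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in allowed[d]
                ]
                k = rng.choices(allowed[d], weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wi] += 1
                n_k[k] += 1

    # Smoothed doc-topic distributions; topics outside `allowed` stay at 0.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * len(allowed[d])
        row = [0.0] * num_topics
        for t in allowed[d]:
            row[t] = (n_dk[d][t] + alpha) / denom
        theta.append(row)
    return theta


# Invented toy documents, not data from the study.
docs = [
    "fiebre tos fiebre disnea".split(),
    "tos disnea fiebre tos".split(),
    "fractura dolor fractura yeso".split(),
    "dolor yeso fractura dolor".split(),
]
theta = gibbs_lda(docs, num_topics=2)
labelled = gibbs_lda(docs, num_topics=2, allowed=[[0], [0], [1], [1]])
```

Each row of `theta` is a low-dimensional feature vector that could be passed to a conventional multi-label classifier. In the `labelled` run every topic is bound to one hypothetical label in advance, which illustrates why a label-supervised topic model yields topics that can be read directly against the codes.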
