4.5 Article

Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

Journal

JOURNAL OF BIG DATA
Volume 8, Issue 1

Publisher

SPRINGERNATURE
DOI: 10.1186/s40537-021-00528-5

Keywords

Data mining; Epidemiology; Knowledge engineering; Ontologies; Text classification

Funding

  1. Norwegian Agency for Development Cooperation-Health Informatics Training and Research in East Africa for Improved Health Care (NORAD-HITRAIN) project


Twitter and social media can serve as important sources for disease surveillance data, but the messiness of tweets poses challenges for information extraction. Most systems rely on simple keyword matching, leading to potential false positives, and solutions for multilingual scenarios often lack semantic context. The paper experimentally examines different text classification approaches for epidemiological surveillance on the social web and compares the impact of different input representations on performance.
Twitter and social media as a whole have great potential as a source of disease surveillance data; however, the general messiness of tweets presents several challenges for standard information extraction methods. Most deployed systems rely on simple keyword matching and do not distinguish between relevant and irrelevant keyword mentions. This makes them susceptible to false positives, because keyword volume can be influenced by social phenomena unrelated to disease occurrence. Furthermore, most solutions are intended for a single language, and those meant for multilingual scenarios do not incorporate semantic context. In this paper, we experimentally examine different approaches for classifying text for epidemiological surveillance on the social web. In addition, we offer a systematic comparison of the impact of different input representations on performance. Specifically, we compare continuous representations against one-hot encoding for word-based units, class-based (ontology-based) units, and subword units in the form of byte pair encodings. We also establish the desirable performance characteristics for multilingual semantic filtering approaches and offer an in-depth discussion of the implications for end-to-end surveillance.
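
To make the input representations named in the abstract more concrete, the following is a minimal toy sketch of byte pair encoding (learning merges from a tiny word list and applying them to a new word), alongside a one-hot vector for a symbol index. It is not the paper's implementation: the function names `learn_bpe`, `apply_bpe` and `one_hot`, the example words, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges byte pair merges from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def _merge(symbols, pair):
    """Fuse each left-to-right occurrence of pair in symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Segment a word into subword units by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1 at index."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Hypothetical toy corpus; real tweets would be far noisier.
merges = learn_bpe(["flu", "flu", "fever", "flue"], num_merges=2)
units = apply_bpe("fluffy", merges)
```

In this toy setting an unseen word such as "fluffy" still shares the learned unit "flu" with the training words, which is the property that makes subword units attractive for noisy, multilingual text. A one-hot encoding would assign each unit an index in a fixed vocabulary, whereas a continuous representation would map the same index to a dense learned vector.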
