Article

RHMD: A Real-World Dataset for Health Mention Classification on Reddit

Journal

IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS
Volume 10, Issue 5, Pages 2325-2334

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCSS.2022.3186883

Keywords

Social networking (online); Diseases; Blogs; Frequency modulation; Annotations; Task analysis; Labeling; Health mention (HM) classification; public health surveillance; social media

On social media, people often use disease and symptom words for purposes other than describing their own health, which can introduce biases into data-driven public health applications. This study presents RHMD, a new dataset of 10,015 manually annotated Reddit posts labeled with four categories. The study also provides a comprehensive performance analysis of baseline methods. The release of the dataset is expected to facilitate the development of new methods for detecting health mentions in user-generated text.
People on social media share their thoughts and experiences using disease and symptom words for purposes other than mentioning their health, which can introduce biases in data-driven public health applications. To advance health mention classification (HMC) research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD is composed of 10,015 manually annotated Reddit posts that include 15 common disease or symptom terms and are labeled with four labels: personal health mentions (HMs), nonpersonal HMs, figurative HMs, and hyperbolic HMs. Empirical evaluation using recently proposed methods demonstrates the challenge of labeling user-generated text across these four types. The contributions of this work include the public release of a robustly annotated Reddit dataset (RHMD) for HM tasks and a comprehensive performance analysis of baseline methods. We expect that the release of the dataset and the evaluations will help facilitate the development of new methods for detecting HMs in user-generated text. The dataset is available at.
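The four-label scheme described in the abstract can be sketched in code. The following is a minimal illustration, not the authors' data format: the `HealthMention` enum, the `Post` record, and the `label_distribution` helper are hypothetical names, and the example posts are invented placeholders that reflect one reading of the label names rather than the dataset's annotation guidelines. The terms used are not necessarily among the dataset's 15 terms.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class HealthMention(Enum):
    """The four labels named in the abstract (identifiers are assumptions)."""
    PERSONAL = "personal"
    NONPERSONAL = "nonpersonal"
    FIGURATIVE = "figurative"
    HYPERBOLIC = "hyperbolic"


@dataclass(frozen=True)
class Post:
    """Hypothetical record: post text, the matched term, and its label."""
    text: str
    term: str
    label: HealthMention


def label_distribution(posts):
    """Return the fraction of posts per label, or {} for an empty input."""
    total = len(posts)
    if total == 0:
        return {}
    counts = Counter(post.label for post in posts)
    return {label: counts[label] / total for label in HealthMention}


# Invented placeholder posts, one per label; not taken from RHMD.
EXAMPLES = [
    Post("I was diagnosed with asthma last year.", "asthma",
         HealthMention.PERSONAL),
    Post("This article explains how asthma inhalers work.", "asthma",
         HealthMention.NONPERSONAL),
    Post("Traffic in this city is a cancer.", "cancer",
         HealthMention.FIGURATIVE),
    Post("I almost had a heart attack when I saw the bill.", "heart attack",
         HealthMention.HYPERBOLIC),
]

if __name__ == "__main__":
    for label, share in label_distribution(EXAMPLES).items():
        print(f"{label.value}: {share:.2f}")
```

Checking the label distribution in this way is a common first step before training a classifier, since skewed classes affect which evaluation metric is appropriate.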
