Article

An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm

Journal

EXPERT SYSTEMS WITH APPLICATIONS
Volume 203

Publisher

PERGAMON-ELSEVIER SCIENCE LTD
DOI: 10.1016/j.eswa.2022.117384

Keywords

Arabic corpus; Topic modeling; Multi-label annotation; Genetic algorithm; Latent Dirichlet allocation; Crowdsourcing


Every day the world produces an enormous amount of textual data. This unstructured text is of little use unless it is labeled with a combination of categories, keywords, and tags. Humans cannot annotate such massive volumes, and with a growing divide between the data produced daily and the data annotated, the only alternative is to automate the process. Automatic annotation saves resources in terms of both time and cost. Multi-label annotation associates each document with multiple relevant labels.

This paper proposes an unsupervised model that automatically annotates a corpus with multiple labels. The model is based on multi-label topic modeling and a genetic algorithm (GA): topic modeling extracts the hidden topics from the text, and the GA searches for the optimal number of topics. We tuned the hyperparameters of the topic model using two different training methods, variational Bayes and Gibbs sampling. Class imbalance in a corpus can distort the result of topic modeling, with the majority class dominating the minority class; we overcome this problem using a partitioning method.

Although the proposed model was developed for an Arabic dataset, it is language neutral. We tested it on three large Arabic corpora and three large English social media datasets. Since ours is the first work to tackle multi-label annotation for Arabic, we needed a reference against which to compare the model: for the Arabic corpora, we compared the automatic annotations against human annotations collected through crowdsourcing (whose labeling was checked for quality). The analysis shows an agreement between machine and human annotation of 79.30%. For the English datasets, the results are also quite competitive.
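The GA-driven search for the number of topics described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: in the paper the fitness of a candidate topic count would come from an LDA model trained with variational Bayes or Gibbs sampling, whereas here `coherence` is a hypothetical toy surrogate (peaking at K = 12) so the example stays self-contained; all function names and GA settings are assumptions.

```python
import random

def coherence(k):
    # Toy stand-in for a topic-coherence score of an LDA model with k topics.
    # A real fitness function would train LDA on the corpus and score it.
    return -abs(k - 12)

def run_ga(fitness, k_min=2, k_max=50, pop_size=20, generations=30, seed=0):
    """Search for the topic count k in [k_min, k_max] maximizing `fitness`."""
    rng = random.Random(seed)
    # Initial population: random candidate topic counts.
    pop = [rng.randint(k_min, k_max) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = (a + b) // 2            # crossover: midpoint of two parents
            if rng.random() < 0.2:          # mutation: small random shift
                child += rng.randint(-3, 3)
            children.append(min(max(child, k_min), k_max))
        pop = survivors + children
    return max(pop, key=fitness)

best_k = run_ga(coherence)
```

Because the search space is one-dimensional here, a grid search would also work; the GA becomes attractive when, as in the paper, each fitness evaluation (training a topic model) is expensive and the range of candidate topic counts is large.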

