4.6 Article

Integrating Text Classification into Topic Discovery Using Semantic Embedding Models

期刊

APPLIED SCIENCES-BASEL
卷 13, 期 17, 页码 -

出版社

MDPI
DOI: 10.3390/app13179857

关键词

deep learning; natural language processing; topic discovery; text classification

向作者/读者索取更多资源

Topic discovery is the process of identifying main ideas in large volumes of text data. Current models often involve text preprocessing and can generate general topics. However, integrating automatic text classification before the topic discovery process can generate specific topics with coherent semantic relationships. This paper presents an approach that combines text classification and topic discovery on English textual data.
Topic discovery involves identifying the main ideas within large volumes of textual data. It indicates recurring topics in documents, providing an overview of the text. Current topic discovery models receive the text, with or without pre-processing, including stop word removal, text cleaning, and normalization (lowercase conversion). A topic discovery process that receives general domain text with or without processing generates general topics. General topics do not offer detailed overviews of the input text, and manual text categorization is tedious and time-consuming. Extracting topics from text with an automatic classification task is necessary to generate specific topics enriched with top words that maintain semantic relationships among them. Therefore, this paper presents an approach that integrates text classification for topic discovery from large amounts of English textual data, such as 20-Newsgroups and Reuters Corpora. We rely on integrating automatic text classification before the topic discovery process to obtain specific topics for each class with relevant semantic relationships between top words. Text classification performs a word analysis that makes up a document to decide what class or category to identify; then, the proposed integration provides latent and specific topics depicted by top words with high coherence from each obtained class. Text classification accomplishes this with a convolutional neural network (CNN), incorporating an embedding model based on semantic relationships. Topic discovery over categorized text is realized with latent Dirichlet analysis (LDA), probabilistic latent semantic analysis (PLSA), and latent semantic analysis (LSA) algorithms. An evaluation process for topic discovery over categorized text was performed based on the normalized topic coherence metric. The 20-Newsgroups corpus was classified, and twenty topics with the ten top words were identified for each class. The normalized topic coherence obtained was 0.1723 with LDA, 0.1622 with LSA, and 0.1716 with PLSA. The Reuters Corpus was also classified, and twenty and fifty topics were identified. A normalized topic coherence of 0.1441 was achieved when applying the LDA algorithm, obtaining 20 topics for each class; with LSA, the coherence was 0.1360, and with PLSA, it was 0.1436.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.6
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据