Article

News Text Topic Clustering Optimized Method Based on TF-IDF Algorithm on Spark

CMC-COMPUTERS MATERIALS & CONTINUA (2020)

Journal

CMC-COMPUTERS MATERIALS & CONTINUA

Volume 62, Issue 1, Pages 217-231

Publisher

TECH SCIENCE PRESS

DOI: 10.32604/cmc.2020.06431

Keywords

News text topic clustering; Spark platform; CountVectorizer algorithm; TF-IDF algorithm; latent Dirichlet allocation model

Categories

Computer Science, Information Systems; Materials Science, Multidisciplinary

Funding

Science Research Projects of Hunan Provincial Education Department [18A174, 18C0262]
National Natural Science Foundation of China [61772561]
Key Research & Development Plan of Hunan Province [2018NK2012, 2019SK2022]
Degree & Postgraduate Education Reform Project of Hunan Province [209]
Postgraduate Education and Teaching Reform Project of Central South Forestry University [2019JG013]

Abstract

Text topic clustering is slow on stand-alone architectures when applied to big data. Taking news text as the research object, this paper proposes an LDA (Latent Dirichlet Allocation) text topic clustering algorithm based on the Spark big data platform. Because the TF-IDF (term frequency-inverse document frequency) algorithm under Spark maps words to indexes irreversibly, the mapped word indexes cannot be traced back to the original words. This paper therefore proposes an optimized TF-IDF method under Spark that ensures the text words can be restored. First, text features are extracted by the proposed TF-IDF algorithm combined with CountVectorizer; the features are then input to the LDA topic model for training; finally, the text topic clusters are obtained. Experimental results show that, for large data samples, running on Spark improves the processing speed of LDA topic model clustering. In addition, compared with an LDA topic model that takes word frequencies as input, the proposed model achieves lower perplexity.
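The difference between an irreversible and a reversible word-to-index mapping can be illustrated with a small sketch. This is not the paper's implementation and does not use Spark: it is plain Python under the assumption that the irreversibility comes from hash-based term indexing, which the abstract does not spell out. The toy documents, the bucket count, and the helper names (`hashed_index`, `build_vocabulary`, `tfidf_vectors`, `top_terms`) are all hypothetical. The IDF weight uses the smoothed form log((N + 1) / (df + 1)).

```python
import math
import zlib
from collections import Counter

# Toy tokenized "news" documents (made up for illustration).
docs = [
    ["market", "stocks", "rise"],
    ["team", "wins", "match"],
    ["stocks", "fall", "market", "market"],
]

# Hash-based indexing: the index is computed from the term alone and no
# vocabulary is stored, so an index cannot be mapped back to a word.
NUM_BUCKETS = 4  # deliberately tiny so that collisions are guaranteed here


def hashed_index(term):
    return zlib.crc32(term.encode("utf-8")) % NUM_BUCKETS


# Vocabulary-based indexing: keep an explicit list of terms, so that
# vocabulary[i] recovers the word behind index i.
def build_vocabulary(documents):
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    # The ordering is an arbitrary deterministic choice for this sketch.
    vocabulary = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return vocabulary, doc_freq


def tfidf_vectors(documents, vocabulary, doc_freq):
    n_docs = len(documents)
    index_of = {term: i for i, term in enumerate(vocabulary)}
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # Sparse vector: term index -> raw count * IDF weight.
        vectors.append({index_of[t]: c * idf[t] for t, c in counts.items()})
    return vectors


def top_terms(vector, vocabulary, k=2):
    # Rank indices by weight, then translate them back into words.
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [vocabulary[i] for i, _ in ranked[:k]]


vocabulary, doc_freq = build_vocabulary(docs)
vectors = tfidf_vectors(docs, vocabulary, doc_freq)
```

Because the vocabulary list is retained, any index that appears in a weighted vector (or, later, in a topic's term distribution) can be turned back into its word with `vocabulary[i]`, which is the property the abstract says the hashed mapping lacks.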
