Article

Improved TFIDF in big news retrieval: An empirical study

PATTERN RECOGNITION LETTERS (2017)

Journal

PATTERN RECOGNITION LETTERS

Volume 93, Pages 113-122

Publisher

ELSEVIER SCIENCE BV

DOI: 10.1016/j.patrec.2016.11.004

Keywords

Big news; Term weighting; Two-stage learning; News classification; News clustering

Categories

Computer Science, Artificial Intelligence

Funding

Ministry of Science and Technology in Taiwan [MOST 104-2410-H-275-007-MY3]

Abstract

Thomson Reuters news articles are an important data source that has given rise to several inspiring applications of text classification and clustering. The best-known term weighting approach, the term frequency-inverse document frequency (TFIDF) method, is often used to assign term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, and the most prominent terms (e.g., refugee) therefore appear frequently across a large collection of news stories. When term weights are measured via the TFIDF method, such weights are heavily compromised once the collection of news is sufficiently large. Because TFIDF treats the most important terms as noise and thus assigns them lower weights, it is vulnerable to bias, and news retrieval without the most important terms is difficult and ineffective. We thus present a new distance-based term weighting method that overcomes this bias by considering a basic characteristic of big news, i.e., collections that include large amounts of news: each news article is either similar to or different from the others. Not all news articles should be considered to contribute equally to the weighting of a particular term. In this study, the weight of a particular term is assessed based on its distance in an article to other instances of the same term, and this weight is highly sensitive to whether similar articles cause a term to occur and to whether different articles cause a term to disappear. The most important terms are thus delivered in large news corpora when studying similarities between news stories. In addition, we create a two-stage learning algorithm to refine the term weights, and we develop an intelligent model that applies our term weighting method to Reuters news analyses based upon classification and clustering problems. The experimental results show that our methods perform better than TFIDF in news classification and clustering. (C) 2016 Elsevier B.V. All rights reserved.
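
The bias described in the abstract can be reproduced with a small sketch of textbook TFIDF. This is not the paper's distance-based method, whose formula is not given here. It assumes the common variant tf(t, d) * log(N / df(t)) with whitespace tokenization, and the four headlines are invented purely for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Textbook TF-IDF, tf(t, d) * log(N / df(t)); an assumed baseline variant."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / df[term])
                for term, count in tf.items()
            }
        )
    return weights


# Hypothetical headlines: "refugee" is the topical term and occurs in every one.
docs = [
    "refugee crisis grows in europe",
    "refugee camps open near border",
    "refugee quota talks stall",
    "refugee numbers rise as markets fall",
]
weights = tfidf(docs)

for term in ("refugee", "crisis"):
    print(term, round(weights[0][term], 3))
```

In this toy corpus "refugee" occurs in every document, so its inverse document frequency is log(4/4) = 0 and its weight vanishes, while the incidental term "crisis" receives a positive weight. That is the kind of down-weighting of prominent terms that the abstract attributes to TFIDF on large news collections.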
