Article

Improved TFIDF in big news retrieval: An empirical study

Journal

PATTERN RECOGNITION LETTERS
Volume 93, Issue -, Pages 113-122

Publisher

ELSEVIER SCIENCE BV
DOI: 10.1016/j.patrec.2016.11.004

Keywords

Big news; Term weighting; Two-stage learning; News classification; News clustering

Funding

  1. Ministry of Science and Technology in Taiwan [MOST 104-2410-H-275-007-MY3]

Thomson Reuters news articles are regarded as an integral data source and have given rise to several inspiring applications of text classification and clustering. The best-known term weighting approach, the term frequency-inverse document frequency (TFIDF) method, is often used to assign the term weights that support such applications. Thomson Reuters reports pertinent incoming news (e.g., the refugee crisis in Europe) over a given period of time, so the most prominent terms (e.g., refugee) appear frequently across a large collection of news stories. When term weights are measured with the TFIDF method, the weights of such terms are heavily compromised once the collection of news is sufficiently large. Because TFIDF effectively treats the most important terms as noise and therefore assigns them lower weights, it is vulnerable to bias, and news retrieval that cannot rely on the most important terms is difficult and ineffective. We thus present a new distance-based term weighting method that overcomes this bias by exploiting a basic characteristic of big news, that is, of very large news collections: each news article is either similar to or different from the others. Not all news articles should be considered to contribute equally to the weighting of a particular term. In this study, the weight of a particular term is assessed based on its distance in an article to other instances of the same term, and this weight is highly sensitive to whether similar articles cause a term to occur and whether different articles cause a term to disappear. The most important terms are thus surfaced in large news corpora by studying similarities between news stories. In addition, we create a two-stage learning algorithm to refine the term weights, and we develop an intelligent model that applies our term weighting method to Reuters news analysis for classification and clustering problems.
The experimental results show that our methods outperform TFIDF in news classification and clustering. © 2016 Elsevier B.V. All rights reserved.
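
The bias the abstract attributes to TFIDF can be illustrated with a small numerical sketch. It uses the textbook formulation (relative term frequency multiplied by log(N / df)), which is an assumption here, since the abstract does not state which TFIDF variant the authors use. The corpus size, document length, and counts are hypothetical values chosen only for illustration, and the sketch does not implement the paper's distance-based method.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)


def tfidf(term_count: int, doc_len: int, num_docs: int, doc_freq: int) -> float:
    """Relative term frequency multiplied by the IDF above."""
    tf = term_count / doc_len
    return tf * idf(num_docs, doc_freq)


# Hypothetical corpus of 10,000 news stories (illustrative numbers only).
NUM_DOCS = 10_000

# A prominent term (e.g. "refugee") used 5 times in a 100-word story,
# but present in 8,000 of the 10,000 stories.
prominent_weight = tfidf(term_count=5, doc_len=100, num_docs=NUM_DOCS, doc_freq=8_000)

# An incidental term used once in the same story, present in only 10 stories.
incidental_weight = tfidf(term_count=1, doc_len=100, num_docs=NUM_DOCS, doc_freq=10)

print(f"prominent term weight:  {prominent_weight:.4f}")
print(f"incidental term weight: {incidental_weight:.4f}")
```

Under these assumed numbers the incidental term receives the higher weight even though the prominent term occurs five times as often in the story, which is the effect the abstract describes for large news collections.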
