Article

BTM: Topic Modeling over Short Texts

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Volume 26, Issue 12, Pages 2928-2941

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TKDE.2014.2313872

Keywords

Short text; topic model; biterm; online algorithm; content analysis

Funding

  1. 973 Program of China [2014CB340401, 2012CB316303]
  2. 863 Program of China [2012AA011003]
  3. National Natural Science Foundation of China [61232010, 61173064, 61202213, 61203298]
  4. National Key Technology R&D Program [2012BAH39B04]


Short texts are popular on today's web, especially with the emergence of social media. Inferring topics from large-scale short texts has become a critical but challenging problem for many content analysis tasks. Conventional topic models such as latent Dirichlet allocation (LDA) and probabilistic latent semantic analysis (PLSA) learn topics from document-level word co-occurrences by modeling each document as a mixture of topics, so their inference suffers from the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a novel approach to short text topic modeling, referred to as the biterm topic model (BTM). BTM learns topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the corpus, which makes inference effective by drawing on rich corpus-level information. To cope with large-scale short text data, we further introduce two online algorithms for BTM for efficient topic learning. Experiments on real-world short text collections show that BTM discovers more prominent and coherent topics and significantly outperforms state-of-the-art baselines. We also demonstrate the appealing performance of the two online BTM algorithms in both time efficiency and topic learning.
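
To make the notion of a biterm concrete, the following is a minimal Python sketch of the preprocessing step only: it turns each short text into unordered word pairs and pools them over the corpus. It is not the authors' implementation and does not perform BTM inference. The function names `extract_biterms` and `corpus_biterm_counts`, the lowercase whitespace tokenization, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text.

    Assumption: the whole short text is treated as a single context,
    so any two of its tokens form a biterm. Each pair is sorted so that
    ("a", "b") and ("b", "a") count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def corpus_biterm_counts(docs):
    """Pool biterms over all documents into corpus-level counts."""
    counts = Counter()
    for doc in docs:
        # Hypothetical tokenizer: lowercase and split on whitespace.
        counts.update(extract_biterms(doc.lower().split()))
    return counts


# Toy corpus, invented for illustration only.
docs = ["apple iphone release", "iphone release date", "apple store"]
biterm_counts = corpus_biterm_counts(docs)

for biterm, n in biterm_counts.most_common(3):
    print(biterm, n)
```

Each three-word text yields only three word pairs, which illustrates how sparse document-level co-occurrence is in short texts. Pooling the pairs across the corpus is what lets a pair such as ("iphone", "release") accumulate evidence from more than one document.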
