Automated Phrase Mining from Massive Text Corpora

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2018)

Journal

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Volume 30, Issue 10, Pages 1825-1837

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TKDE.2018.2812203

Keywords

Automatic phrase mining; phrase mining; distant training; part-of-speech tag; multiple languages

Funding

U.S. Army Research Lab. [W911NF-09-2-0053]
National Science Foundation [IIS-1320617, IIS 16-18481]
NIGMS by the trans-NIH Big Data to Knowledge (BD2K) initiative [1U54GM114838]
Google

Abstract

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to perform unsatisfactorily on text corpora of new domains and genres without extra, expensive adaptation. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
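
The abstract's idea of replacing human labeling with a general knowledge base can be illustrated with a small sketch. This is not AutoPhrase's actual implementation; it only shows, under simplifying assumptions, how frequent n-grams might be split into a positive pool (present in the knowledge base) and a noisy unlabeled pool. The function names, the `min_count` and `max_len` parameters, and the tiny `kb` set standing in for Wikipedia titles are all hypothetical.

```python
from collections import Counter

def candidate_ngrams(tokens, max_len=3):
    """Yield every contiguous n-gram (as a tuple) with 2 <= n <= max_len."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def distant_labels(docs, knowledge_base, min_count=2, max_len=3):
    """Split frequent n-grams into a positive pool (found in the knowledge
    base) and a noisy unlabeled pool (everything else).

    Hypothetical helper for illustration; whitespace tokenization and a
    raw frequency threshold are simplifying assumptions.
    """
    counts = Counter()
    for doc in docs:
        counts.update(candidate_ngrams(doc.lower().split(), max_len))
    # Keep only n-grams that occur at least min_count times in the corpus.
    frequent = {" ".join(gram) for gram, c in counts.items() if c >= min_count}
    positives = frequent & knowledge_base
    unlabeled = frequent - knowledge_base
    return positives, unlabeled

docs = [
    "support vector machine is a popular model",
    "we train a support vector machine on text",
]
kb = {"support vector machine"}  # hypothetical stand-in for Wikipedia titles
positives, unlabeled = distant_labels(docs, kb)
print(sorted(positives))
print(sorted(unlabeled))
```

In a sketch like this, the unlabeled pool still mixes good and bad phrases, which is why such labels are noisy and would need a downstream classifier that tolerates that noise.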
