Article

Automatic Multilingual Stopwords Identification from Very Small Corpora

Journal

ELECTRONICS
Volume 10, Issue 17

Publisher

MDPI
DOI: 10.3390/electronics10172169

Keywords

natural language processing; machine learning; stopword identification

This paper focuses on stopword identification, proposing a novel method based on term and document frequency, and an automatic cutoff strategy for selecting stopwords in small corpora. These methods are generic, fully automatic, and do not require prior linguistic knowledge.
Tools for Natural Language Processing rely on linguistic resources that are language-specific. Because such resources are complex to build, many languages lack them, so learning them automatically from sample texts would be a desirable solution. This usually requires huge training corpora, which are not available for many local languages and jargons that lack a wide literature. This paper focuses on stopwords, i.e., terms in a text that do not contribute to conveying its topic or content. It provides two main, inter-related and complementary methodological contributions: (i) it proposes a novel approach based on term and document frequency to rank candidate stopwords, which also works on very small corpora (even single documents); and (ii) it proposes an automatic cutoff strategy to select the best candidates in the ranking, thus addressing one of the most critical problems in stopword identification practice. Notable features of these approaches are that (i) they are generic and applicable to different languages, (ii) they are fully automatic, and (iii) they do not require any prior linguistic knowledge. Extensive experiments show that both are extremely effective and reliable. The former outperforms all comparable approaches in the state of the art, both in terms of performance (Precision stays at 100% or nearly so for a large portion of the top-ranked candidate stopwords, while Recall is quite close to the maximum reachable in theory) and in terms of smooth behavior (Precision is monotonically decreasing and Recall is monotonically increasing, allowing the experimenter to choose the preferred balance). The latter is more flexible than existing solutions in the literature, requiring just one parameter that is intuitively related to the desired balance between Precision and Recall.
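To make the idea of ranking candidate stopwords by term and document frequency more concrete, the following is a minimal Python sketch. It does not reproduce the paper's scoring formula or its cutoff strategy, neither of which is given in the abstract. The score used here (relative term frequency multiplied by the fraction of documents containing the term), the whitespace tokenization, and the names `rank_candidate_stopwords` and `precision_recall_at_k` are all assumptions made purely for illustration.

```python
from collections import Counter


def rank_candidate_stopwords(documents):
    """Rank terms from most to least stopword-like.

    ASSUMPTION: the score is relative term frequency times the fraction
    of documents containing the term. This is an illustrative stand-in,
    not the formula proposed in the paper.
    """
    # Naive lowercase whitespace tokenization (also an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    n_docs = len(tokenized)

    # Term frequency over the whole corpus.
    tf = Counter(term for tokens in tokenized for term in tokens)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = {
        term: (tf[term] / total_tokens) * (df[term] / n_docs)
        for term in tf
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda term: (-scores[term], term))


def precision_recall_at_k(ranking, gold_stopwords, k):
    """Precision and Recall of the top-k candidates against a gold list."""
    top_k = ranking[:k]
    hits = sum(1 for term in top_k if term in gold_stopwords)
    precision = hits / k if k else 0.0
    recall = hits / len(gold_stopwords) if gold_stopwords else 0.0
    return precision, recall


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the dog ate the bone",
        "a bird saw the cat",
    ]
    gold = {"the", "a", "on"}  # hypothetical gold-standard stopword list
    ranking = rank_candidate_stopwords(corpus)
    for k in (1, 3, 5):
        print(k, precision_recall_at_k(ranking, gold, k))
```

Evaluating Precision and Recall at increasing values of k shows the trade-off that any cutoff on the ranking has to settle: a small k favors Precision, a large k favors Recall.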
