Article

Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling

Journal

INFORMATION PROCESSING & MANAGEMENT
Volume 57, Issue 4

Publisher

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2020.102263

Keywords

Text classification pipelines; Pre-processing; Meta-features; Selective sampling; Sparsification; Experimental evaluation

Funding

  1. CAPES
  2. CNPq
  3. Finep
  4. Fapemig
  5. Mundiale
  6. Astrein
  7. project InWeb
  8. project MASWeb

A text classification pipeline is a sequence of tasks performed to classify documents into a set of predefined categories. The pre-processing phase of such a pipeline (before training) involves different ways of transforming and manipulating the documents for the subsequent learning phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The first step, distance-based Meta-Feature (MF) generation, aims to reduce the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step, sparsification, makes the MF representation less dense to reduce training costs and noise. The third step, selective sampling (SS), removes rows (documents) from the matrix obtained in the previous step by carefully selecting the best documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness when compared to the original TF-IDF (up to 52%) and embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster on some datasets). Other main contributions of our work include a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with introducing these new steps into the pipeline, as well as a comprehensive comparative experimental evaluation of many alternatives in terms of representations, approaches, etc.
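The three pre-processing steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify how the meta-features, the sparsification rule, or the sampling criterion are computed. The sketch assumes cosine similarity to per-class centroids as the distance-based meta-features, a fixed threshold for sparsification, and a smallest-margin heuristic for selective sampling. All function names, the toy matrix `X`, and the threshold value are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def class_centroids(X, y):
    """Mean vector of the training documents of each class, keyed by label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, val in enumerate(x):
            acc[j] += val
        counts[label] = counts.get(label, 0) + 1
    # Sorted labels give a stable column order for the meta-feature matrix.
    return {c: [s / counts[c] for s in sums[c]] for c in sorted(sums)}

def meta_features(X, centroids):
    """Step 1 (assumed form): one column per class, the similarity to its centroid."""
    return [[cosine(x, centroids[c]) for c in centroids] for x in X]

def sparsify(M, threshold):
    """Step 2 (assumed form): zero out meta-feature values below a threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in M]

def select_samples(M, keep):
    """Step 3 (assumed heuristic): keep the `keep` rows with the smallest gap
    between their two largest meta-feature values. Needs at least two columns."""
    def margin(row):
        top = sorted(row, reverse=True)
        return top[0] - top[1]
    ranked = sorted(range(len(M)), key=lambda i: margin(M[i]))
    return sorted(ranked[:keep])

# Toy term-document matrix: 4 documents, 4 terms, 2 classes (hypothetical data).
X = [[1, 0, 0, 1],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 1, 1, 1]]
y = ["a", "a", "b", "b"]

centroids = class_centroids(X, y)
mf = meta_features(X, centroids)       # 4 terms reduced to 2 meta-features
sparse_mf = sparsify(mf, threshold=0.3)
selected = select_samples(sparse_mf, keep=2)
print(selected)
```

In this toy run the representation shrinks from one column per term to one column per class, which mirrors the dimensionality reduction the abstract attributes to the MF step. The sampling criterion shown here is only one plausible choice and would need to be replaced by whatever selection strategy the paper actually evaluates.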
