4.7 Article

Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling

期刊

出版社

ELSEVIER SCI LTD
DOI: 10.1016/j.ipm.2020.102263

关键词

Text classification pipelines; Pre-processing; Meta-features; Selective sampling; Sparsification; Experimental evaluation

资金

  1. CAPES
  2. CNPq
  3. Finep
  4. Fapemig
  5. Mundiale
  6. Astrein
  7. project InWeb
  8. project MASWeb

向作者/读者索取更多资源

Text Classification pipelines are a sequence of tasks needed to be performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involve different ways of transforming and manipulating the documents for the next (learning) phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The distance-based Meta-Features (MFs) generation step aims at reducing the dimensionality of the original termdocument matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step is a sparsification one aimed at making the MF representation less dense to reduce training costs and noise. The third step is a selective sampling (SS) aimed at removing lines (documents) of the matrix obtained in the previous step, by carefully selecting the best documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness when compared to the original TF-IDF (up to 52%) and embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster in some datasets). Other main contributions of our work include a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with the introduction of these new steps into the pipeline as well as a comprehensive comparative experimental evaluation of many alternatives in terms of representations, approaches, etc.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.7
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据