Article

Deep Neural Network Self-Distillation Exploiting Data Representation Invariance

Journal

IEEE Transactions on Neural Networks and Learning Systems

Publisher

IEEE (Institute of Electrical and Electronics Engineers, Inc.)
DOI: 10.1109/TNNLS.2020.3027634

Keywords

Training; Nonlinear distortion; Data models; Neural networks; Knowledge engineering; Network architecture; Generalization error; network compression; representation invariance; self-distillation (SD)

Funding

  1. Major Project for New Generation of Artificial Intelligence (AI) [2018AAA0100400]
  2. National Natural Science Foundation of China (NSFC) [61836014, 61721004]
  3. Ministry of Science and Technology of China


This article proposes an elegant self-distillation mechanism that directly obtains high-accuracy models without the need for an assistive model. It learns data representation invariance and effectively reduces the generalization error for various network architectures, surpassing existing model distillation methods with little extra training effort.
To obtain small networks with high accuracy, most existing methods either apply compression techniques such as low-rank decomposition and pruning to compress a trained large model into a small network, or transfer knowledge from a powerful large model (teacher) to a small network (student). Despite their success in generating small models of high performance, the dependence on accompanying assistive models complicates the training process and increases memory and time costs. In this article, we propose an elegant self-distillation (SD) mechanism to obtain high-accuracy models directly, without going through an assistive model. Inspired by invariant recognition in the human visual system, we posit that different distorted instances of the same input should possess similar high-level data representations; the network can therefore learn data representation invariance across different distorted versions of the same sample. Specifically, in our SD-based learning algorithm, a single network uses the maximum mean discrepancy (MMD) metric to enforce global feature consistency and the Kullback-Leibler (KL) divergence to constrain posterior class probability consistency across the different distorted branches. Extensive experiments on the MNIST, CIFAR-10/100, and ImageNet data sets demonstrate that the proposed method effectively reduces the generalization error for various network architectures, such as AlexNet, VGGNet, ResNet, Wide ResNet, and DenseNet, and outperforms existing model distillation methods with little extra training effort.
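The abstract describes two consistency terms computed across distorted branches of the same input: an MMD penalty on high-level features and a KL penalty on class posteriors. The snippet below is a minimal sketch of how such an SD loss could be composed, assuming a PyTorch-style model that returns (features, logits); the Gaussian-kernel MMD estimator, the loss weights, and the softmax temperature are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of a self-distillation (SD) loss: one network sees several
# distorted views of the same batch and is encouraged to keep its features
# (via MMD) and class posteriors (via KL divergence) consistent across views.
# Names, weights, and the RBF-kernel MMD are illustrative assumptions.
import torch
import torch.nn.functional as F


def gaussian_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between feature batches x, y of shape [N, D]."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances -> Gaussian (RBF) kernel.
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2.0 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean()


def self_distillation_loss(model, images, labels, distortions,
                           lambda_mmd=0.1, lambda_kl=1.0, temperature=3.0):
    """
    model       : callable returning (features, logits) for a batch (assumed API)
    distortions : list of callables producing distorted views of `images`
    """
    feats, logits = [], []
    for distort in distortions:
        f, z = model(distort(images))
        feats.append(f)
        logits.append(z)

    # Supervised cross-entropy on every distorted branch.
    loss = sum(F.cross_entropy(z, labels) for z in logits)

    # Pairwise consistency terms across distorted branches.
    for i in range(len(distortions)):
        for j in range(i + 1, len(distortions)):
            # Global feature consistency via MMD.
            loss = loss + lambda_mmd * gaussian_mmd(feats[i], feats[j])
            # Posterior class-probability consistency via KL divergence
            # at a softened temperature.
            log_p_i = F.log_softmax(logits[i] / temperature, dim=1)
            p_j = F.softmax(logits[j] / temperature, dim=1)
            loss = loss + lambda_kl * F.kl_div(log_p_i, p_j, reduction="batchmean")
    return loss
```

The single network thus acts as both teacher and student: each distorted branch regularizes the others, so no separate assistive model is needed during training.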
