4.6 Article

Adadb: Adaptive Diff-Batch Optimization Technique for Gradient Descent

Journal

IEEE Access
Volume 9, Pages 99581-99588

Publisher

IEEE - Institute of Electrical and Electronics Engineers Inc.
DOI: 10.1109/ACCESS.2021.3096976

Keywords

Convergence; Friction; Optimization; Neural networks; Training; Stochastic processes; Signal processing algorithms; Machine learning; gradient descent; image classification


Gradient descent is widely used in deep neural networks but suffers from slow convergence, which methods such as momentum, Adam, diffGrad, and AdaBelief try to improve. This paper introduces a new optimization technique, adadb, that addresses shortcomings of these methods and increases the convergence rate.
Gradient descent is the workhorse of deep neural networks, but it converges slowly. A common remedy is momentum, which effectively increases the learning rate of gradient descent. Several recent methods control this momentum to steer optimization toward a global minimum, including Adam, diffGrad, and AdaBelief. Adam scales the momentum down by dividing it by the square root of the moving average of squared past gradients (the second moment). A sudden drop in the second moment often makes the update overshoot the minimum and settle at the nearest minimum instead. diffGrad mitigates this in Adam by applying a friction coefficient based on the difference between the current gradient and the immediately preceding one; however, this friction further reduces the momentum and slows convergence. AdaBelief instead adapts the step size according to the belief in the current gradient direction. Another well-known route to faster convergence is to increase the batch size adaptively. This paper proposes a new optimization technique, adaptive diff-batch (adadb), that removes the overshooting problem of Adam and the slow convergence of diffGrad, and combines these fixes with adaptive batch sizing to further increase the convergence rate. The proposed technique computes the friction coefficient from the past three gradient differences rather than the single difference used in diffGrad, together with a condition that decides when the friction is applied. The proposed technique outperforms the Adam, diffGrad, and AdaBelief optimizers on synthetic complex non-convex functions and on real-world datasets.
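As a rough illustration of the structure the abstract describes, the sketch below combines an Adam-style update with a friction coefficient computed from the last three gradient differences, applied only under a gating condition. It is plain NumPy acting on a single scalar parameter; the class name AdadbSketch, the sigmoid form of the friction, and the sign-change gate are assumptions made for illustration rather than the formulas from the paper, and the adaptive batch-size schedule is omitted because the abstract does not specify it.

import numpy as np

class AdadbSketch:
    """Scalar sketch of an Adam-style update with a diffGrad-like friction
    term computed from the last three gradient differences (assumed form)."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0      # first moment (momentum), as in Adam
        self.v = 0.0      # second moment, as in Adam
        self.grads = []   # short gradient history for the friction term
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        # Adam-style moment estimates with bias correction.
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)

        # Keep the last four gradients so that three consecutive
        # differences are available (diffGrad uses only the latest one).
        self.grads = (self.grads + [grad])[-4:]
        diffs = [abs(b - a) for a, b in zip(self.grads, self.grads[1:])]

        # Assumed friction: a sigmoid of the mean of the three differences,
        # applied only when the gradient has changed sign. The sign-change
        # gate is a hypothetical stand-in for the paper's condition, meant
        # to damp oscillation near a minimum without slowing steady descent.
        friction = 1.0
        if len(diffs) == 3 and np.sign(grad) != np.sign(self.grads[-2]):
            friction = 1.0 / (1.0 + np.exp(-np.mean(diffs)))

        return param - self.lr * friction * m_hat / (np.sqrt(v_hat) + self.eps)

Feeding this update a mini-batch gradient whose batch size grows on some schedule would supply the "adaptive batch" half of the method; that schedule is not described in the abstract, so it is left out of the sketch.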

