Convergence Analysis of Stochastic Gradient Descent with Adaptive Learning Rates: A Mathematical Framework
DOI: https://doi.org/10.63163/jpehss.v4i1.1072

Keywords: Stochastic Gradient Descent, Adaptive Learning Rates, Neural Networks, Machine Learning, Deep Learning, Non-Convex Optimization

Abstract
Neural networks are rapidly growing and shaping the technology industry, and deep neural networks in particular have been employed in a wide variety of AI applications. Stochastic Gradient Descent (SGD) is one of the core algorithms for training deep neural networks, and SGD variants with adaptive learning rates are the optimizers of choice for this task. Despite their widespread popularity, the convergence properties of these adaptive methods remain imperfectly understood. This study provides a mathematical framework for analyzing the convergence of adaptive variants of SGD, including AdaGrad, RMSprop, and Adam. The work focuses on establishing convergence rates under various assumptions on the objective function, including the non-convex settings typical of deep learning. Our analysis highlights the role of second-moment accumulation in variance reduction and derives explicit error bounds. We show that, under suitable conditions, adaptive methods attain an O(1/√T) convergence rate for non-convex objectives and O(1/T) for strongly convex functions. Alongside the proofs, we conduct numerical experiments that validate the theoretical findings. The results provide mathematical justification for design choices in adaptive optimizers and inform better approaches to hyperparameter tuning.
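
For orientation, the minimal sketch below illustrates the kind of adaptive update with second-moment accumulation that the abstract refers to, using an Adam-style step on a toy strongly convex quadratic. The function name `adam_step`, the hyperparameter defaults, and the toy objective are illustrative assumptions for exposition, not the paper's implementation or experimental setup.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: first- and second-moment accumulation with bias correction."""
    m = beta1 * m + (1 - beta1) * grad             # first moment (running mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2        # second moment (running uncentered variance)
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
    return theta, m, v

# Toy usage on the strongly convex quadratic f(theta) = ||theta||^2 / 2,
# whose exact gradient is theta (a stochastic minibatch gradient in practice).
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 501):
    grad = theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
print(theta)  # approaches the minimizer at the origin
```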