Optimizers in Deep Learning: A Comparative Study and Analysis

Authors: Sujay Bashetty, Kalyan Raja , Sahiti Adepu, Ajeet Jain

DOI Link: https://doi.org/10.22214/ijraset.2022.48050

Abstract

Machine learning has enormously contributed towards optimization techniques with new ways for optimization algorithms. These approaches in deep learning have wide applications with resurgence of novelty starting from Stochastic Gradient Descent to convex and non-convex ones. Selecting an optimizer is a vital choice in deep learning as it determines the training speed and final performance predicted by the DL model. The complexity further increases with growing deeper due to hyper-parameter tuning and as the data sets become larger. In this work, we analyze most popular and widely optimizers algorithms empirically. The augmenting behaviors of these are tested on MNIST, Auto Encoder data sets. We compare them pointing out their similarities, differences and likelihood of their suitability for a given applications. Recent variants of optimizers are highlighted. The article focuses on their critical role and pinpoints which one would be a better option while making a trade-off.

Introduction

I. INTRODUCTION

Deep learning (DL) algorithms are essential in statistical computations because of their efficiency as data sets grow in size. Interestingly, one of the pillars of DL is the mathematical tactics of the optimization process that make decisions based on previously invisible data. This is achieved through carefully chosen parameters for a given learning problem (an intuitive near-optimal solution). The hyper–parameters are the parameters of a learning algorithm and not of a given model. Evidently, the inspiration is to look forward to the optimizing algorithm which works well and predict accurately [1, 2, 3, 4]. Many people have worked on text classification in ML because of the fundamental problem of learning from examples. Similarly, speech and image recognition have been dealt with great success and accuracy – yet offers the place for new improvements. In achieving higher goals, use of various optimizing techniques involving convexity principles are much more cited [5, 6, 7] now a days and using logistic and other regression techniques. Moreover, the Stochastic Gradient Descent (SGD) has been very popular over last many years, but also suffers from ill-conditioning and also taking more time to compute for larger data sets. In some cases, it also requires hyper-parameter tuning and different learning rates.

II. BACKGROUND

DL has produced a strong trace in all fields of engineering exercises and has generated acute interest due to its secrecy to natural cognition. Machine literacy( ML) has turn a base while addressing real world challenges like healthcare, social networking and behaviour analysis, econometry, SCM to mention a many. also, we've intelligent products and services, e.g., speech recognition, computer vision, anomaly discovery, game playing and numerous. The ever changing tools and ways of ML have a widening impact to diagnose conditions, independent vehicle driving, amped pictures, smart delegated systems and further in channel as intelligent products. Shoveling into the history, reveals that the ground work began with optimizer and regularization methodologies — starting from Gradient Descent( GD) to Stochastic GD to Momentum grounded optimizers[8] also, the convex and non-convex proposition of optimization is covered compactly and one can relate more on these motifs in cited references[9, 10].

The composition layout a thoughtful process to answer utmost of the postdating questions and harangues the issues and challenges. A many material are:

How to scale the optimal interpretation? What are those attributes which compares them using colorful ways in deep literacy?
How non-convex styles could be metamorphosed into convex styles?
How secondary free algorithms are getting significance while busting computational time ?
How hyperactive- parameter tuning is suited for optimization ways?

III. OPTIMIZATION AND ROLE OF OPTIMIZER IN DL

In optimization, the algorithm performance is judged by the fact that” how close is the expected output with desired one”? This is achieved by loss function of the network [1, 3, 4]. It takes the predicted value and compares with the true target, and calculates the difference— suggestive of our performance on this specific data set, as depicted in Fig. 1.

The optimizer with considered loss sore utilizes that in order to keep the value of the weights therefore leading lower loss score iteratively. This adaptation is the event performed by ‘optimizer’, which augments what's conventionally known as back propagation algorithm [11, 12, 13, 14, 15].

A. Optimization Issues

The cruciality's of optimization issues in DL are fairly complex, and a pictorial representation is in Fig.2 with recitation as in Fig

(i) Making the algorithm starts run and converging to a realistic result.

(ii) Making the algorithm to assemble presto and speed up confluence rate.

(iii) Icing confluence with a dumpy valuation – like global minimum.

???????B. Stochastic GD Optimization

Ironically, SGD nearly succeed the grade of amini -batch tagged at arbitrary. While training a network, we estimate the grade using a suitable loss function. At a replication ‘k’, the grade will be streamlined consequently. Hence, the computation for ‘m’ exemplifications input from the training set having y (i) as target, is:

Here ‘η’ (eta) is getting rate. Further, the literacy rate is of consummate significance as the consequence of an update at ‘kth’ - replication is directed by this. For case, if η = 0.01 (certifiably small to small), also putatively more number of replication updates will be needed for confluence. On the negative, if η = 0.5 or further, also in this case the recent updates shall be largely dependent on the recent case. ultimately, an egregious wise decision is to elect(choose) it arbitrary by trial — this is one veritably important hyperactive- parameter tuning in DL systems. On the resemblant side of it, yet another way could be ‘choose one among several literacy rates’ which give lowest loss value.

???????C. Stocastic Gradient Descent With Momentum

From the antedating section and paragraphs, it’s egregious that SGD has a trouble to get towards global optima, and has a tendency to get stuck into original minima, as depicted in Fig. 3. also, lower values of grade or noisy bones

Can produce another problem of evaporating grade issue! To overcome this problem, the system involving instigation (a principle espoused from drugs) is espoused to accelerate the process of literacy. The instigation system aims to resolve 2 veritably important issues

(i) friction in SGD

(ii) friction when working Hessian Matrix for poor exertion

The system takes the brevity of running moving average by incorporating former update in the recent change as if there's a instigation due to antedating updates.

The instigation- grounded SGD will meet briskly with reducing oscillations. To achieve this, we use another hyperactive- parameter ‘ν’ known as haste. This hyperactive- parameter tells the speed and of course the direction by which its moves in the given parameter space. Generally, ‘ν’ is set as negative of grade value of exponential decaying normal. Moving further on, we'd bear yet one further hyperactive- parameter α (nascence) α ? (0, 1), known as instigation parameter and its donation is to find how presto the former grade exponentially decays. The new (streamlined) values are computes as:

From equation( 2) it's egregious that the haste vector ‘ ν ’ keeps on adding the grade values. also, for a bigger value of α( nascence) relative to ?, the grade affects the current direction more from former replication. The generally used values of α from0.5 to0.99. Despite being so intuitive and nice fashion, the limitation of this algorithm is fresh parameter addition and redundant computations involved.

???????D. Various Optimizers IN DL

The presently available optimizers with their process frame are compactly described with their relative graces and limitations. Each bone has some tricks or the other and an sapience of those will an exemplary study progression.

ADAGRAD

The simplest of optimizing algorithms to begin with is AdaGrad, where the algorithm’s names itself suggest, the algorithm adapts, i.e., stoutly changes the literacy rate with model’s parameters Then, for parameters whose partial outgrowth are advanced (larger) for them drop their corresponding literacy rate mainly. Contrary to this suspicion, the algorithm takes equally to where derivations are lower. A natural question to ask is ‘why one needs different literacy rates’? So, to negotiate these characteristics, AdaGrad employs square value of the grade vector using a variable ‘r’ for grade accumulation, as stated in following equation

Then ? driver implieselement-wise addition of vectors. As can be inferred from above equations, when ‘ r ’ is close to a ‘ near- zero ’ value, the term in the denominator shouldn't be estimated as ‘ NaN = Not A Number ’ and therefore the term δ helps to avoid this to be. Also, the term ‘ ? ‘ stands for global literacy rate.

2. RMSPROP

The modified interpretation of AdaGrad is RMSProp – Root Mean Square Proportional [16]. In order to palliate the problems of AdaGrad, then we recursively define a decaying normal of all once slants. By doing so, the flowing exponential shifting normal at each time step depends only on the normal of former and current slants. It performs better in thenon-convex setting as well with same characteristics features. Comparison wise, AdaGrad contracts the literacy rate according to the entire history of the squared grade whereas RMSProp exploits an exponentially decaying normal to discard history from the extreme history similar that it can meet snappily after finding a convex coliseum. The equation to apply is:

3. ADAM

Adam( Adaptive momentum) one majorly used optimization algorithms in DL and joins the heuristic of both the momentum and RMSProp and interestingly been designed for deep neural nets [17]. This algorithmic fashion has the squared grade point of AdaGrad and to scale the learning rate similar to RMSProp and point of momentum using moving average. The fine algorithm calculates individual learning rate for each parameter using a term called ‘first moment’ (analogous to velocity vector) and ‘alternate moment’ (analogous to acceleration vector). A many salient features are:

— Momentum term is in- built as an estimate of first- order moment

— In-built bias correction while estimating for first and alternate order moments eventually called as initialization at origin( start point)

— Update moving exponential pars of grade ‘ mt ’ and square grade ‘ ut ’ with hyperactive- parameters ρ1 and ρ2( in original paper by the authors, they're denoted by β1 and β2) as these control.

These moving averages are estimates of the mean (first moment) and non-central variance (second moment) of the gradient.

Continuing the pipeline process at time "t", the various estimates are:

Several salient points emerge as they can be analyzed visually using different plots from the experimental analysis:

Each optimizer has its vividness and Adam and Yogi Optimizer’s loss gradually decreases with increasing accuracy with number of epochs. Others are either saturated or unable to generalize.
Except Adam and Yogi, the remaining one’s performances on this data set is monotonous and nothing indicative performance improvement is seen
Selecting a Particular Optimizer: Choose a well-understood optimizer with default learning rates and other parameters settings. Try changing these parameters in iterations (epochs) and see the loss (accuracy). Subsequently, shift towards other similar-featured optimizer and observe the changes. This is indeed an exhaustive process!

VI. FUTURE SCOPE

We have provided intuitive way of reasoning based upon the experimental dataset. Moreover, various optimizers can be employed to test on different data sets and thus provide a valuable insight to select a particular one. Mostly, researchers’ guess rely on past experience or earlier cited proven examples. Furthermore, the entire ML and DL literature is slender with merits and demerits as compelling reasons. Also, getting an overview of their criticalities and understanding the reasons for choice makes a footing platform in ML [37,38,39]. Importantly, the optimizer and their intricacies area provide lot of scope for exploration and results could be agnostic in terms of accuracy of the model and eventually their performance.

References

[2] Bishop, C.M., Neural Network for Pattern Recognition, Clarendon Press, USA 1995 [3] François Chollet, Deep Learning with Python, Manning Pub., 1st Ed, NY, USA, 2018 [4] Ajeet K. Jain, Dr. PVRD Prasad Rao and Dr. K Venkatesh Sharma;”A Perspective Analysis of Regularization and Optimization Techniques in Machine Learning”,Computational Analysis and Understanding of Deep Learning or Medical Care: Principles, Methods and Applications\". CUDLMC 2020 , Wiley-Scrivener, April/May 2021 [5] John Paul Mueller and Luca Massaron, Deep Learning for Dummies, John Wiley, 019 [6] Josh Patterson and Adam Gibson, Deep Learning: APractitioner’s Approach, O’Reilly Pub. Indian Edition, 2017 [7] Ajeet K. Jain, Dr.PVRD Prasad Rao , Dr. K. Venkatesh Sharma, Deep Learning with Recursive Neural Network for Temporal Logic Implementation, International Journalof Advanced Trends in Computer Science and Engineering, Volume 9, No.4, July – August 2020, pp 6829-6833 [8] Srivasatava et al. http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf [9] Dimitri P. Bertsekas, Convex Optimization Theory, Athena Scientific Pub., MIT Press, USA 2009 [10] Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, USA 2004 [11] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551 [12] Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors.arXiv:1207.0580, 2012 [13] Glorot, X. and Bengio, Y., Understanding the difficulty of training deep feed forward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256. (2010) [14] Glorot, X., Bordes, A., and Bengio, Y, Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323. 2011. [15] Zeiler, M. and Fergus, R. , Stochastic pooling for regularization of deep convolutional neural networks. In Proceedings of the International Conference on Learning Representations ,ICLR, 2013Fabian Latorre, Paul Rolland and Volkan Cevher,Lipschitz Constant Estimation Of Neural Networks Via Sparse Polynomial Optimization, ICLR 2020 [16] D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980, 2014 [17] Manzil Zaheer, et al., Adaptive Methods for Nonconvex Optimization, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada

Copyright

Copyright © 2022 Sujay Bashetty, Kalyan Raja , Sahiti Adepu, Ajeet Jain. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Download Paper

Paper Id : IJRASET48050

Publish Date : 2022-12-10

ISSN : 2321-9653

Publisher Name : IJRASET

DOI Link : Click Here