We focus on the optimization problem
\min_{x \in \mathbb{R}^n} f(x). \qquad (1)
Problems of this form arise in various fields of applied science. Deterministic algorithms for solving problem (1) have by now been extensively investigated (Nesterov (2018)). Meanwhile, with the emergence of new fields of interest (Vapnik (1999)), one often has to deal with the stochastic formulation of problem (1):
f(x) = \mathbb{E}_{\xi \sim \mathcal{D}}\left[ f_{\xi}(x) \right], \qquad (2)
where ξ is a random variable. Stochastic formulations come up when we do not have access to (or do not want to use, due to the high computational cost) the deterministic value of f(x), for example in statistical supervised learning theory (Shalev-Shwartz and Ben-David (2014)). In this case, x is a machine learning model with n weights/parameters (it can be a regression model or a deep neural network), D represents an unknown data distribution, f_ξ is the loss of model x on the data point ξ, and f is the generalization error. We assume that we do not know the distribution D from which the data come, but only have samples ξ from D, while we still want a model that fits the whole of D. Deterministic methods are no longer suitable; we need stochastic modifications of classical methods. One of the most popular such methods is stochastic gradient descent (SGD) (Robbins and Monro (1951); Nemirovski et al. (2009); Vaswani et al. (2019); Gorbunov et al. (2020)).
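For concreteness, the following is a minimal sketch of fixed-step SGD applied to problem (2). The function name sgd, the oracle grad_sample, and all parameter names are illustrative choices rather than notation from this paper; grad_sample is assumed to return an unbiased stochastic gradient built from a sample ξ ∼ D.

```python
import numpy as np

def sgd(grad_sample, x0, step_size, n_iters, rng=None):
    """Fixed-step SGD: x_{t+1} = x_t - step_size * g(x_t, xi_t).

    grad_sample(x, rng) is assumed to return an unbiased estimate of
    the gradient of f at x, e.g. the gradient of the loss on a single
    data point xi drawn from D.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        g = grad_sample(x, rng)   # stochastic gradient at the current point
        x = x - step_size * g     # fixed step size
    return x
```

With a fixed step size, the iterates of this scheme do not converge exactly but settle into a noise ball around the solution, which is precisely the regime compared in this paper.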
In this paper we compare different stochastic gradient methods in the fixed step size regime. In contrast to deterministic methods, stochastic methods in this setting do not converge to an exact solution of the problem: after reaching a certain accuracy, they begin to oscillate around the solution. One therefore faces a trade-off between the rate of convergence and the limiting noise level near the solution. We propose to compare stochastic gradient methods by their rate of convergence with the desired limiting noise level fixed. We use SGD, an efficient method of convex optimization, as a reasonable baseline for comparison. We also introduce a new algorithm, called Polynomial Lookahead or, alternatively, Polynomial Dynamic Modification (PDM), which is inspired by the Lookahead algorithm already studied in the literature: Lookahead was first introduced by Zhang et al. (2019) and analysed in the quadratic stochastic case, and was later studied in the non-convex regime by Wang et al. (2020). These techniques allow one to approach the solution more efficiently and to combat the limiting radius of oscillation around the solution.
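Since PDM itself is not specified in this excerpt, the sketch below shows only the baseline Lookahead wrapper of Zhang et al. (2019) around fixed-step SGD: the fast weights take k inner SGD steps, after which the slow weights are pulled toward them with interpolation coefficient alpha. Names such as lookahead_sgd and grad_sample are illustrative, not taken from the paper.

```python
import numpy as np

def lookahead_sgd(grad_sample, x0, inner_step, k, alpha, n_outer, rng=None):
    """Lookahead (Zhang et al., 2019) on top of fixed-step SGD.

    Fast weights theta take k SGD steps starting from the slow
    weights phi; then phi is interpolated toward theta:
        phi <- phi + alpha * (theta_k - phi),  with 0 < alpha <= 1.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    phi = np.asarray(x0, dtype=float).copy()     # slow weights
    for _ in range(n_outer):
        theta = phi.copy()                       # fast weights start at phi
        for _ in range(k):                       # k inner SGD steps
            theta -= inner_step * grad_sample(theta, rng)
        phi += alpha * (theta - phi)             # slow-weight interpolation
    return phi
```

The interpolation step damps the oscillation of the fast weights around the solution, which is what makes the fixed-step comparison against plain SGD meaningful.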
We introduced a new algorithm called Polynomial Dynamic Modification and showed that it achieves a faster convergence rate on the quadratic stochastic problem than SGD and Lookahead, with the desired noise level near the solution being fixed. Moreover, we prove linear convergence of PDM up to some limiting noise level on the class of strongly convex, twice-differentiable functions.
1. C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1-27, 2011.
2. E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680-690. PMLR, 2020.
3. J. Wang, V. Tantia, N. Ballas, and M. Rabbat. Lookahead converges to stationary points of smooth non-convex functions. 2020.
4. M. R. Zhang, J. Lucas, G. Hinton, and J. Ba. Lookahead optimizer: k steps forward, 1 step back. 2019.
5. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009. doi: 10.1137/070704277.
6. Y. Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
7. H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951. doi: 10.1214/aoms/1177729586. URL https://doi.org/10.1214/aoms/1177729586.
8. S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
9. V. Vapnik. The nature of statistical learning theory. Springer Science & Business Media, 1999.
10. S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of sgd for overparameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1195-1204. PMLR, 2019.