Abstract: Stochastic Gradient Descent (SGD) has become one of the most popular optimization methods for training machine learning models on massive datasets. However, SGD suffers from two main ...