
Gradient Descent Optimization Algorithms


Introduction
#

Gradient descent is the fundamental optimization algorithm powering most machine learning models. Understanding its variants and improvements is crucial for effective model training.

Basic Gradient Descent
#

The core idea is to iteratively move the parameters in the direction of steepest descent, i.e. along the negative gradient, to minimize a loss function.

Update Rule:

θ_{t+1} = θ_t - α∇J(θ_t)

Where:

  • θ: Model parameters
  • α: Learning rate
  • ∇J(θ): Gradient of loss function J with respect to θ
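
For concreteness, here is a minimal sketch of this rule applied to the one-dimensional quadratic J(θ) = θ², whose gradient is 2θ (the function name and numbers are illustrative, not from the original notes):

def gradient_descent_1d(theta0=5.0, learning_rate=0.1, steps=50):
    """Minimize J(theta) = theta**2 with the plain update rule above."""
    theta = theta0
    for t in range(steps):
        grad = 2 * theta                      # dJ/dtheta for J(theta) = theta**2
        theta = theta - learning_rate * grad  # theta_{t+1} = theta_t - alpha * grad
    return theta

print(gradient_descent_1d())  # prints a value very close to the minimizer 0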

Gradient Descent Variants
#

1. Batch Gradient Descent (BGD)
#

Uses the entire dataset to compute gradients at each step.

Pros:

  • Converges to the global minimum for convex loss functions (given a suitable learning rate)
  • Stable, low-variance gradient estimates

Cons:

  • Computationally expensive for large datasets
  • Memory intensive

import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    
    for i in range(iterations):
        # Forward pass
        h = X.dot(theta)
        
        # Compute cost
        cost = (1/(2*m)) * np.sum((h - y)**2)
        cost_history.append(cost)
        
        # Compute gradient
        gradient = (1/m) * X.T.dot(h - y)
        
        # Update parameters
        theta = theta - learning_rate * gradient
    
    return theta, cost_history
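
As a usage sketch (my addition; the synthetic data, seed, and variable names are assumptions, not from the original notes), the batch version can be exercised on a small linear-regression problem:

# Synthetic data for y ≈ 4 + 3x, with a bias column prepended to X
rng = np.random.default_rng(0)
x_raw = 2 * rng.random(100)
y = 4 + 3 * x_raw + rng.normal(0, 0.5, 100)
X = np.column_stack([np.ones(100), x_raw])

theta, cost_history = batch_gradient_descent(X, y, np.zeros(2), learning_rate=0.1, iterations=1000)
print(theta)                               # approximately [4, 3]
print(cost_history[-1] < cost_history[0])  # True: the loss decreased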

2. Stochastic Gradient Descent (SGD)
#

Updates parameters using one training example at a time.

Pros:

  • Fast updates
  • Can escape local minima due to noise
  • Memory efficient

Cons:

  • High variance in updates
  • May not converge to exact minimum

def stochastic_gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    
    for i in range(iterations):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        for j in range(m):
            xi = X_shuffled[j:j+1]
            yi = y_shuffled[j:j+1]
            
            # Forward pass
            h = xi.dot(theta)
            
            # Compute gradient for single example
            gradient = xi.T.dot(h - yi)
            
            # Update parameters
            theta = theta - learning_rate * gradient
        
        # Track the full-dataset cost once per epoch so convergence can be monitored
        cost = (1/(2*m)) * np.sum((X.dot(theta) - y)**2)
        cost_history.append(cost)
    
    return theta, cost_history

3. Mini-Batch Gradient Descent
#

Strikes a balance between BGD and SGD by computing gradients on small batches of training examples.

Pros:

  • Reduced variance compared to SGD
  • Computationally efficient
  • Can leverage vectorization

Cons:

  • Additional hyperparameter (batch size)

def mini_batch_gradient_descent(X, y, theta, learning_rate, batch_size, iterations):
    m = len(y)
    
    for i in range(iterations):
        # Shuffle data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        # Process mini-batches
        for j in range(0, m, batch_size):
            end_idx = min(j + batch_size, m)
            X_batch = X_shuffled[j:end_idx]
            y_batch = y_shuffled[j:end_idx]
            
            # Forward pass
            h = X_batch.dot(theta)
            
            # Compute gradient for the batch (use the actual batch length; the last batch may be smaller)
            gradient = (1/len(X_batch)) * X_batch.T.dot(h - y_batch)
            
            # Update parameters
            theta = theta - learning_rate * gradient
    
    return theta

Advanced Optimization Algorithms
#

1. Momentum
#

Adds a velocity term to accelerate gradients in persistent directions and dampen oscillations.

class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None
    
    def update(self, params, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients
        params += self.velocity
        return params

2. RMSprop
#

Adapts learning rates based on recent gradient magnitudes.

class RMSpropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None
    
    def update(self, params, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(params)
        
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * gradients**2
        params -= self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        return params

3. Adam (Adaptive Moment Estimation)
#

Combines momentum and RMSprop by tracking both the first moment (an exponential moving average of the gradients) and the second moment (an exponential moving average of the squared gradients), with bias correction for the early time steps.

class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step
    
    def update(self, params, gradients):
        self.t += 1
        
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        
        # Update biased second moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2
        
        # Compute bias-corrected estimates
        m_corrected = self.m / (1 - self.beta1**self.t)
        v_corrected = self.v / (1 - self.beta2**self.t)
        
        # Update parameters
        params -= self.learning_rate * m_corrected / (np.sqrt(v_corrected) + self.epsilon)
        return params
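
As a quick sanity check (my own sketch, not from the original notes), all three optimizer classes above expose the same update(params, gradients) interface and can be compared on a simple quadratic bowl:

def quadratic_grad(params):
    # Gradient of J(w) = 0.5 * (w1**2 + 10 * w2**2), an elongated bowl with its minimum at (0, 0)
    return np.array([params[0], 10 * params[1]])

for opt in (MomentumOptimizer(learning_rate=0.05),
            RMSpropOptimizer(learning_rate=0.05),
            AdamOptimizer(learning_rate=0.05)):
    params = np.array([5.0, 5.0])
    for _ in range(200):
        params = opt.update(params, quadratic_grad(params))
    print(type(opt).__name__, params)  # each optimizer drives params toward (0, 0)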

Learning Rate Scheduling
#

1. Step Decay
#

def step_decay(epoch, initial_lr, drop_rate=0.5, epochs_drop=10):
    return initial_lr * (drop_rate ** np.floor(epoch / epochs_drop))

2. Exponential Decay
#

def exponential_decay(epoch, initial_lr, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)

3. Cosine Annealing
#

def cosine_annealing(epoch, initial_lr, max_epochs):
    return initial_lr * (1 + np.cos(np.pi * epoch / max_epochs)) / 2
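
In practice a schedule is evaluated once per epoch and the resulting rate is handed to the optimizer. The loop below is a sketch of that pattern; train_one_epoch is a hypothetical placeholder, not a function defined in this article:

initial_lr = 0.1
max_epochs = 50

for epoch in range(max_epochs):
    lr = cosine_annealing(epoch, initial_lr, max_epochs)
    # train_one_epoch(model, data, lr)   # hypothetical training step using the scheduled rate
    if epoch % 10 == 0:
        print(f"epoch {epoch:2d}: lr = {lr:.4f}")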

Practical Tips
#

1. Learning Rate Selection
#

  • Start with a learning rate around 0.001 for Adam and 0.01 for SGD
  • Use a learning rate range test (learning rate finder) to identify a good range; a sketch follows below
  • Monitor the loss curves: a rate that is too high shows a diverging or wildly oscillating loss, one that is too low shows a very slow, almost flat decrease
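
A learning rate range test can be sketched roughly as follows (my own illustration of the idea; loss_fn and gradients_fn are hypothetical callables standing in for one forward/backward pass on a batch): increase the rate geometrically and note where the loss starts to climb.

def lr_range_test(loss_fn, gradients_fn, theta, lr_min=1e-6, lr_max=1.0, steps=100):
    """Try geometrically increasing learning rates; return (lr, loss) pairs."""
    lrs = np.geomspace(lr_min, lr_max, steps)
    history = []
    for lr in lrs:
        theta = theta - lr * gradients_fn(theta)   # one SGD-style step per candidate rate
        history.append((lr, loss_fn(theta)))
    return history  # choose a rate somewhat below the point where the loss curve turns upward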

2. Batch Size Considerations
#

  • Larger batches: More stable gradients, better GPU utilization
  • Smaller batches: More updates per epoch, regularization effect
  • Common sizes: 32, 64, 128, 256

3. Convergence Monitoring
#


def check_convergence(loss_history, patience=10, min_delta=1e-6):
    if len(loss_history) < patience:
        return False
    
    recent_losses = loss_history[-patience:]
    improvement = recent_losses[0] - recent_losses[-1]
    
    return improvement < min_delta
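
This check can serve as an early-stopping test inside a training loop, e.g. (a usage sketch; the geometrically decaying loss is a stand-in for a real per-epoch loss):

loss_history = []
for epoch in range(1000):
    loss = 0.5 ** epoch               # stand-in for the real epoch loss
    loss_history.append(loss)
    if check_convergence(loss_history):
        print(f"Loss plateaued at epoch {epoch}")
        break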

Common Issues & Solutions
#

1. Vanishing/Exploding Gradients
#

  • Solution: Gradient clipping (sketched below), proper weight initialization, batch normalization
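
Gradient clipping by global norm, mentioned above, takes only a few lines (a generic NumPy sketch, not tied to any particular framework):

def clip_by_global_norm(gradients, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(gradients)
    if norm > max_norm:
        gradients = gradients * (max_norm / norm)
    return gradients

g = np.array([300.0, -400.0])                # exploded gradient with norm 500
print(clip_by_global_norm(g, max_norm=5.0))  # [ 3. -4.], rescaled to norm 5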

2. Slow Convergence
#

  • Solution: Learning rate scheduling, momentum, advanced optimizers

3. Oscillations Around Minimum
#

  • Solution: Reduce learning rate, use adaptive optimizers

4. Getting Stuck in Local Minima
#

  • Solution: Add noise (SGD), use momentum, restart with different initialization

Comparison Table
#

Algorithm      | Memory | Convergence     | Hyperparameters | Best For
---------------|--------|-----------------|-----------------|----------------
SGD            | Low    | Noisy           | Learning rate   | Simple problems
SGD + Momentum | Low    | Fast            | LR + momentum   | Most problems
RMSprop        | Medium | Adaptive        | LR + decay rate | RNNs
Adam           | Medium | Fast + Adaptive | LR + betas      | Default choice
AdaGrad        | Medium | Adaptive        | Learning rate   | Sparse features


Last Updated: January 2024
Next Topics: Second-order optimization methods, Natural gradients
