Introduction#
Gradient descent is the fundamental optimization algorithm powering most machine learning models. Understanding its variants and improvements is crucial for effective model training.
Basic Gradient Descent#
The core idea is to iteratively move in the direction of steepest descent to minimize a loss function.
Update Rule:
θ_{t+1} = θ_t - α∇J(θ_t)
Where:
- θ: Model parameters
- α: Learning rate
- ∇J(θ): Gradient of loss function J with respect to θ
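To make the update rule concrete, here is a minimal sketch that applies it to a one-dimensional quadratic loss J(θ) = (θ - 3)^2; the loss, starting point, and iteration count are illustrative assumptions, not tied to any particular model.

```python
# Minimal sketch of the update rule on J(theta) = (theta - 3)**2.
# The loss, starting point, and iteration count are illustrative assumptions.
def gradient_descent_1d(theta=0.0, learning_rate=0.1, iterations=100):
    for _ in range(iterations):
        grad = 2 * (theta - 3)                 # dJ/dtheta
        theta = theta - learning_rate * grad   # theta <- theta - alpha * gradient
    return theta

print(gradient_descent_1d())  # converges toward the minimizer theta = 3
```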
Gradient Descent Variants#
1. Batch Gradient Descent (BGD)#
Uses the entire dataset to compute gradients at each step.
Pros:
- Converges to the global minimum for convex loss functions, given a suitable learning rate
- Stable gradient estimates
Cons:
- Computationally expensive for large datasets
- Memory intensive
```python
import numpy as np

def batch_gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    for i in range(iterations):
        # Forward pass
        h = X.dot(theta)
        # Compute cost (mean squared error)
        cost = (1 / (2 * m)) * np.sum((h - y) ** 2)
        cost_history.append(cost)
        # Compute gradient over the full dataset
        gradient = (1 / m) * X.T.dot(h - y)
        # Update parameters
        theta = theta - learning_rate * gradient
    return theta, cost_history
```
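A quick usage sketch on a small synthetic linear-regression problem (the data and hyperparameters below are illustrative assumptions, not from the original):

```python
# Hypothetical usage on synthetic data: a bias column plus one noisy feature.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
true_theta = np.array([2.0, -1.5])
y = X.dot(true_theta) + 0.1 * rng.normal(size=100)

theta0 = np.zeros(2)
theta, costs = batch_gradient_descent(X, y, theta0, learning_rate=0.1, iterations=500)
print(theta)  # should end up close to [2.0, -1.5]
```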
2. Stochastic Gradient Descent (SGD)#
Updates parameters using one training example at a time.
Pros:
- Fast updates
- Can escape local minima due to noise
- Memory efficient
Cons:
- High variance in updates
- May not converge to the exact minimum unless the learning rate is decayed
```python
def stochastic_gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []
    for i in range(iterations):
        # Shuffle data at the start of each pass
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        for j in range(m):
            xi = X_shuffled[j:j + 1]
            yi = y_shuffled[j:j + 1]
            # Forward pass for a single example
            h = xi.dot(theta)
            # Compute gradient from that single example
            gradient = xi.T.dot(h - yi)
            # Update parameters
            theta = theta - learning_rate * gradient
        # Track the full-dataset cost once per pass
        cost = (1 / (2 * m)) * np.sum((X.dot(theta) - y) ** 2)
        cost_history.append(cost)
    return theta, cost_history
```
3. Mini-Batch Gradient Descent#
Balances between BGD and SGD by using small batches of training examples.
Pros:
- Reduced variance compared to SGD
- Computationally efficient
- Can leverage vectorization
Cons:
- Additional hyperparameter (batch size)
```python
def mini_batch_gradient_descent(X, y, theta, learning_rate, batch_size, iterations):
    m = len(y)
    for i in range(iterations):
        # Shuffle data at the start of each pass
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        # Process mini-batches
        for j in range(0, m, batch_size):
            end_idx = min(j + batch_size, m)
            X_batch = X_shuffled[j:end_idx]
            y_batch = y_shuffled[j:end_idx]
            # Forward pass
            h = X_batch.dot(theta)
            # Average gradient over the batch; use the actual batch length,
            # since the final batch may be smaller than batch_size
            gradient = (1 / len(X_batch)) * X_batch.T.dot(h - y_batch)
            # Update parameters
            theta = theta - learning_rate * gradient
    return theta
```
Advanced Optimization Algorithms#
1. Momentum#
Adds a velocity term to accelerate gradients in persistent directions and dampen oscillations.
```python
class MomentumOptimizer:
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = None

    def update(self, params, gradients):
        if self.velocity is None:
            self.velocity = np.zeros_like(params)
        # Exponentially decaying accumulation of past gradient steps
        self.velocity = self.momentum * self.velocity - self.learning_rate * gradients
        params += self.velocity
        return params
```
2. RMSprop#
Adapts learning rates based on recent gradient magnitudes.
```python
class RMSpropOptimizer:
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = None

    def update(self, params, gradients):
        if self.cache is None:
            self.cache = np.zeros_like(params)
        # Exponentially decaying average of squared gradients
        self.cache = self.decay_rate * self.cache + (1 - self.decay_rate) * gradients**2
        # Scale the step by the root of that average
        params -= self.learning_rate * gradients / (np.sqrt(self.cache) + self.epsilon)
        return params
```
3. Adam (Adaptive Moment Estimation)#
Combines momentum and RMSprop by keeping track of both first and second moments of gradients.
```python
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None  # First moment
        self.v = None  # Second moment
        self.t = 0     # Time step

    def update(self, params, gradients):
        self.t += 1
        if self.m is None:
            self.m = np.zeros_like(params)
            self.v = np.zeros_like(params)
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        # Update biased second moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * gradients**2
        # Compute bias-corrected estimates
        m_corrected = self.m / (1 - self.beta1**self.t)
        v_corrected = self.v / (1 - self.beta2**self.t)
        # Update parameters
        params -= self.learning_rate * m_corrected / (np.sqrt(v_corrected) + self.epsilon)
        return params
```
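For reference, here is a brief usage sketch of the three optimizer classes above on an illustrative quadratic loss; the target vector, learning rates, and step count are assumptions for demonstration only.

```python
# Hypothetical usage: minimize J(theta) = ||theta - target||^2 with each optimizer.
target = np.array([1.0, -2.0, 0.5])

for opt in (MomentumOptimizer(0.1), RMSpropOptimizer(0.05), AdamOptimizer(0.05)):
    theta = np.zeros(3)
    for _ in range(200):
        gradients = 2 * (theta - target)  # gradient of the illustrative quadratic loss
        theta = opt.update(theta, gradients)
    print(type(opt).__name__, theta)      # each should approach [1.0, -2.0, 0.5]
```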
Learning Rate Scheduling#
1. Step Decay#
```python
def step_decay(epoch, initial_lr, drop_rate=0.5, epochs_drop=10):
    return initial_lr * (drop_rate ** np.floor(epoch / epochs_drop))
```
2. Exponential Decay#
```python
def exponential_decay(epoch, initial_lr, decay_rate=0.96):
    return initial_lr * (decay_rate ** epoch)
```
3. Cosine Annealing#
```python
def cosine_annealing(epoch, initial_lr, max_epochs):
    return initial_lr * (1 + np.cos(np.pi * epoch / max_epochs)) / 2
```
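These schedules are typically queried once per epoch and the resulting rate passed to the optimizer; a small illustrative sketch (the base rate and epoch count are assumptions):

```python
# Hypothetical comparison of the three schedules over 30 epochs.
initial_lr, max_epochs = 0.1, 30
for epoch in range(max_epochs):
    lrs = (
        step_decay(epoch, initial_lr),
        exponential_decay(epoch, initial_lr),
        cosine_annealing(epoch, initial_lr, max_epochs),
    )
    if epoch % 10 == 0:
        print(epoch, [round(lr, 4) for lr in lrs])
```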
Practical Tips#
1. Learning Rate Selection#
- Start with 0.001 for Adam, 0.01 for SGD
- Use a learning rate finder to identify a workable range (see the sketch after this list)
- Monitor loss curves for signs of too high/low learning rates
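The learning rate finder mentioned above is commonly run as a short range test: sweep the rate upward over a handful of updates, record the loss, and choose a rate below the point where the loss starts to climb. A minimal sketch under that interpretation (the candidate range, loss, and helper name are assumptions):

```python
# Hypothetical learning-rate range test: sweep the rate upward while training briefly
# and watch where the loss starts to diverge. The loss and sweep bounds are illustrative.
def lr_range_test(loss_fn, grad_fn, theta0, lr_min=1e-4, lr_max=10.0, steps=60):
    lrs = np.geomspace(lr_min, lr_max, steps)  # exponentially increasing candidate rates
    losses = []
    theta = theta0.copy()
    for lr in lrs:
        theta = theta - lr * grad_fn(theta)    # one update at this candidate rate
        losses.append(loss_fn(theta))
    return lrs, losses                         # inspect where the loss starts to climb

# Example: quadratic bowl J(theta) = ||theta||^2; the loss climbs once lr exceeds ~1.0
lrs, losses = lr_range_test(lambda t: np.sum(t**2), lambda t: 2 * t, np.ones(3))
```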
2. Batch Size Considerations#
- Larger batches: More stable gradients, better GPU utilization
- Smaller batches: More updates per epoch, regularization effect
- Common sizes: 32, 64, 128, 256
3. Convergence Monitoring#
```python
def check_convergence(loss_history, patience=10, min_delta=1e-6):
    # Report convergence when the loss has improved by less than min_delta
    # over the last `patience` recorded values.
    if len(loss_history) < patience:
        return False
    recent_losses = loss_history[-patience:]
    improvement = recent_losses[0] - recent_losses[-1]
    return improvement < min_delta
```
Common Issues & Solutions#
1. Vanishing/Exploding Gradients#
- Solution: Gradient clipping (see the sketch after this list), proper weight initialization, batch normalization
2. Slow Convergence#
- Solution: Learning rate scheduling, momentum, advanced optimizers
3. Oscillations Around Minimum#
- Solution: Reduce learning rate, use adaptive optimizers
4. Getting Stuck in Local Minima#
- Solution: Add noise (SGD), use momentum, restart with different initialization
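Gradient clipping from item 1 can be sketched as rescaling the gradient by its global norm; the threshold and helper name below are illustrative assumptions.

```python
# Hypothetical gradient clipping by global norm: rescale the gradient when its
# norm exceeds a threshold, leaving its direction unchanged.
def clip_by_norm(gradients, max_norm=1.0):
    norm = np.linalg.norm(gradients)
    if norm > max_norm:
        gradients = gradients * (max_norm / norm)
    return gradients

# e.g. apply before any of the update rules above:
# theta = theta - learning_rate * clip_by_norm(gradient)
```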
Comparison Table#
| Algorithm | Memory | Convergence | Hyperparameters | Best For |
|---|---|---|---|---|
| SGD | Low | Noisy | Learning rate | Simple problems |
| SGD + Momentum | Low | Fast | LR + momentum | Most problems |
| RMSprop | Medium | Adaptive | LR + decay rate | RNNs |
| Adam | Medium | Fast + Adaptive | LR + betas | Default choice |
| AdaGrad | Medium | Adaptive | Learning rate | Sparse features |
Further Reading#
- An Overview of Gradient Descent Optimization Algorithms
- Adam: A Method for Stochastic Optimization
- Understanding Learning Rates and How It Improves Performance
Last Updated: January 2024
Next Topics: Second-order optimization methods, Natural gradients