Complete Guide to PPO, RNN, and Actor-Critic Methods
Understanding the algorithms behind modern reinforcement learning
Table of Contents
- Reinforcement Learning Foundations
- Actor-Critic Methods
- Recurrent Neural Networks (RNNs)
- Proximal Policy Optimization (PPO)
- Combining All Three: PPO + RNN + Actor-Critic
- Implementation Details
- Practical Applications
- Summary
Reinforcement Learning Foundations
The RL Problem
Reinforcement Learning is about training agents to make decisions in an environment to maximize cumulative reward.
Key Components:
- Agent: The decision maker (your robot/AI)
- Environment: The world the agent operates in
- State (s): Current situation description
- Action (a): What the agent can do
- Reward (r): Feedback signal
- Policy (π): Strategy for choosing actions
Goal: Find the optimal policy π* that maximizes expected cumulative reward.
Policy vs Value Methods
- Policy Methods: Directly learn what action to take
- Value Methods: Learn how good states/actions are, then act greedily
- Actor-Critic: Combines both approaches
Actor-Critic Methods
Core Concept
Actor-Critic uses two neural networks working together:
Actor (Policy Network)
- Role: Decides what action to take
- Input: Current state
- Output: Action probabilities (discrete actions) or the parameters of an action distribution (continuous actions)
- Goal: Maximize expected reward
Critic (Value Network)
- Role: Evaluates how good the current state is
- Input: Current state
- Output: Value estimate V(s)
- Goal: Accurately predict future rewards
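As a concrete reference, here is a minimal PyTorch-flavored sketch of the two networks. The layer sizes and names are illustrative assumptions, not tied to any particular codebase:

import torch.nn as nn

class Actor(nn.Module):
    # Maps a state to a distribution over discrete actions
    def __init__(self, state_size, num_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # action logits; apply softmax for probabilities

class Critic(nn.Module):
    # Maps a state to a scalar value estimate V(s)
    def __init__(self, state_size, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_size, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)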
Why Actor-Critic Works
Traditional Policy Gradient Problem:
∇J(θ) = E[∇log π(a|s) × R]
Here R is the return (the total reward collected over the trajectory), which has high variance, making learning slow and unstable.
Actor-Critic Solution:
∇J(θ) = E[∇log π(a|s) × A(s,a)]
Where A(s,a) = Q(s,a) - V(s) is the advantage. For example, if Q(s,a) = 5 and V(s) = 3, then A(s,a) = 2: action a is expected to earn 2 more reward than the policy's average behavior from state s.
Advantage Function Benefits:
- Lower Variance: Subtracting baseline V(s) reduces noise
- Better Signal: Tells us how much better/worse an action is than average
- Faster Learning: More stable gradients
Actor-Critic Algorithm
# Pseudocode
for episode in episodes:
    state = env.reset()
    done = False
    while not done:
        # Actor: choose action
        action_probs = actor(state)
        action = sample(action_probs)

        # Environment step
        next_state, reward, done = env.step(action)

        # Critic: evaluate states
        value = critic(state)
        next_value = critic(next_state) if not done else 0

        # Compute advantage (one-step TD error)
        advantage = reward + gamma * next_value - value

        # Update networks
        actor_loss = -log(action_probs[action]) * advantage
        critic_loss = advantage ** 2
        optimize(actor_loss, critic_loss)

        state = next_state
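In a real PyTorch implementation the two losses are usually combined into one backward pass, and the advantage is detached so the actor's gradient does not flow through the critic. A minimal sketch of a single update step, assuming Actor and Critic modules like the ones sketched earlier and a shared optimizer (names and shapes are illustrative):

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

def update_step(actor, critic, optimizer, state, action,
                reward, next_state, done, gamma=0.99):
    # state/next_state: (batch, state_size) tensors; action/reward: (batch,) tensors
    value = critic(state)
    with torch.no_grad():
        next_value = torch.zeros_like(value) if done else critic(next_state)
        target = reward + gamma * next_value        # bootstrapped value target
    advantage = target - value

    dist = Categorical(logits=actor(state))
    # Detach the advantage: the actor should not backprop through the critic
    actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
    critic_loss = F.mse_loss(value, target)

    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()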
Recurrent Neural Networks (RNNs)
Why RNNs in RL?
The Problem: Many RL environments are partially observable
- Agent doesn't see complete state
- Need to remember past information
- Must infer hidden state from observation history
Examples:
- Robot with limited sensors
- Game with fog of war
- Financial markets with delayed information
RNN Architecture
Basic RNN:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y
GRU (Gated Recurrent Unit) - More Stable:
z_t = σ(W_z * [h_{t-1}, x_t])              # Update gate
r_t = σ(W_r * [h_{t-1}, x_t])              # Reset gate
h̃_t = tanh(W * [r_t * h_{t-1}, x_t])       # Candidate state
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t      # Final state
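To make the gating concrete, here is a small runnable check of a single GRU step using PyTorch's built-in cell (the sizes are arbitrary, chosen only for illustration):

import torch
import torch.nn as nn

input_size, hidden_size, batch = 8, 16, 4
cell = nn.GRUCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)        # current observation x_t
h_prev = torch.zeros(batch, hidden_size)    # previous hidden state h_{t-1}
h_t = cell(x_t, h_prev)                     # gated update producing h_t
print(h_t.shape)                            # torch.Size([4, 16])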
RNN in Actor-Critic
Modified Architecture:
class ActorRNN:
    def __init__(self):
        self.gru = GRU(input_size, hidden_size)
        self.policy_head = Linear(hidden_size, num_actions)

    def forward(self, obs, hidden_state):
        # Process observation through RNN
        rnn_out, new_hidden = self.gru(obs, hidden_state)
        # Generate action probabilities
        action_probs = softmax(self.policy_head(rnn_out))
        return action_probs, new_hidden
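In use, the hidden state returned at one timestep is fed back in at the next, which is what gives the policy memory. A small illustrative loop with random observations standing in for a real environment (all sizes and shapes are assumptions):

import torch
import torch.nn as nn

obs_size, hidden_size, num_actions, batch = 8, 32, 4, 1
gru = nn.GRU(obs_size, hidden_size)
policy_head = nn.Linear(hidden_size, num_actions)

hidden = torch.zeros(1, batch, hidden_size)        # initial hidden state
for t in range(5):                                 # 5 timesteps of a rollout
    obs = torch.randn(1, batch, obs_size)          # (seq_len=1, batch, obs_size)
    rnn_out, hidden = gru(obs, hidden)             # hidden carries memory forward
    action_probs = torch.softmax(policy_head(rnn_out), dim=-1)
    action = torch.distributions.Categorical(action_probs).sample()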
Key Benefits:
- Memory: Maintains context across timesteps
- Temporal Patterns: Can learn sequences and timing
- Partial Observability: Handles incomplete information
Proximal Policy Optimization (PPO)
The Problem PPO Solves
Policy Gradient Challenge:
- Small policy updates → slow learning
- Large policy updates → unstable, catastrophic failures
- Need "just right" update size
Previous Solutions:
- TRPO: Complex, computationally expensive
- A3C: Requires many parallel workers
PPO Core Innovation
Clipped Surrogate Objective:
r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)                       # Probability ratio
L^CLIP(θ) = E_t[min(r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t)]
What This Means:
- r_t(θ): Ratio of the new to the old probability of the action actually taken, i.e. how much the policy changed for that action
- clip(r_t, 1-ε, 1+ε): Limits the ratio to the [1-ε, 1+ε] range, so the objective gives no extra reward for moving the policy further than that
- Typical ε: 0.2 (keeps the ratio between 0.8 and 1.2, roughly a 20% change in action probability)
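A minimal PyTorch-flavored sketch of this objective, assuming log-probabilities and advantages have already been computed as tensors:

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(θ) computed from log-probabilities for numerical stability
    ratios = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # Negative sign: minimizing this loss maximizes the clipped objective
    return -torch.min(surr1, surr2).mean()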
PPO Algorithm
# PPO Pseudocode
for iteration in training_iterations:
    # 1. Collect trajectories with current policy
    trajectories = []
    for env in parallel_envs:
        trajectory = collect_trajectory(env, policy, steps=2048)
        trajectories.append(trajectory)

    # 2. Compute advantages
    for traj in trajectories:
        values = critic(traj.states)
        advantages = compute_gae(traj.rewards, values, gamma=0.99)
        traj.advantages = advantages

    # 3. Update policy for multiple epochs
    for epoch in range(4):  # typically 4 epochs
        for batch in mini_batches(trajectories):
            # Compute probability ratios
            old_probs = batch.action_probs
            new_probs = policy(batch.states)
            ratios = new_probs / old_probs

            # Clipped surrogate loss
            surr1 = ratios * batch.advantages
            surr2 = clip(ratios, 1 - 0.2, 1 + 0.2) * batch.advantages
            policy_loss = -min(surr1, surr2).mean()

            # Value loss
            value_loss = ((critic(batch.states) - batch.returns) ** 2).mean()

            # Total loss
            total_loss = policy_loss + 0.5 * value_loss

            # Update networks
            optimizer.step(total_loss)
PPO Advantages
Simplicity:
- Easy to implement
- Few hyperparameters
- Stable across different environments
Efficiency:
- Reuses collected data multiple times
- No need for many parallel workers
- Good sample efficiency
Stability:
- Prevents catastrophic policy updates
- Reliable convergence
- Works well out-of-the-box
Combining All Three: PPO + RNN + Actor-Critic
Architecture Overview
class PPOActorCriticRNN:
    def __init__(self):
        # Shared RNN backbone
        self.rnn = GRU(obs_size, hidden_size)
        # Actor head
        self.actor_head = Linear(hidden_size, num_actions)
        # Critic head
        self.critic_head = Linear(hidden_size, 1)

    def forward(self, obs, hidden_state):
        # Process through RNN
        rnn_out, new_hidden = self.rnn(obs, hidden_state)
        # Generate outputs
        action_logits = self.actor_head(rnn_out)
        value = self.critic_head(rnn_out)
        return action_logits, value, new_hidden
Training Loop
def train_ppo_rnn_actor_critic():
    model = PPOActorCriticRNN()

    for iteration in range(num_iterations):
        # 1. Collect trajectories
        trajectories = []
        hidden_states = initialize_hidden_states()

        for step in range(rollout_length):
            # Forward pass
            action_logits, values, hidden_states = model(
                observations, hidden_states
            )

            # Sample actions
            actions = sample_actions(action_logits)

            # Environment step
            next_obs, rewards, dones = env.step(actions)

            # Store transition
            trajectories.append({
                'obs': observations,
                'actions': actions,
                'rewards': rewards,
                'values': values,
                'action_logits': action_logits,
                'hidden_states': hidden_states,
                'dones': dones
            })

            # Reset hidden states on episode end
            hidden_states = reset_hidden_on_done(hidden_states, dones)
            observations = next_obs

        # 2. Compute advantages using GAE
        advantages = compute_gae(trajectories)

        # 3. PPO updates
        for epoch in range(ppo_epochs):
            for batch in create_mini_batches(trajectories):
                # Recreate forward pass
                action_logits, values, _ = model(
                    batch.obs, batch.hidden_states
                )

                # PPO loss computation
                policy_loss = compute_ppo_loss(
                    action_logits, batch.actions,
                    batch.old_action_logits, batch.advantages
                )
                value_loss = compute_value_loss(values, batch.returns)
                total_loss = policy_loss + 0.5 * value_loss

                # Update
                optimizer.zero_grad()
                total_loss.backward()
                optimizer.step()
Key Implementation Details
Hidden State Management:
- Reset hidden states on episode boundaries
- Carry hidden states across timesteps within episodes
- Handle variable-length episodes properly
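A common way to implement the reset is to multiply the hidden state by a "not done" mask after each environment step; a minimal sketch (the tensor shapes are illustrative assumptions):

import torch

def reset_hidden_on_done(hidden, dones):
    # hidden: (num_layers, num_envs, hidden_size); dones: (num_envs,) tensor of 0/1 flags
    mask = (1.0 - dones.float()).view(1, -1, 1)
    return hidden * mask   # zero the hidden state only for finished episodes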
Advantage Computation:
- Use GAE (Generalized Advantage Estimation)
- Account for episode boundaries in GAE computation
- Normalize advantages for stability
Mini-batch Creation:
- Preserve temporal structure within episodes
- Handle variable-length sequences
- Efficient padding/masking
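One hedged sketch of what this can look like in practice: split each trajectory into fixed-length chunks, pad the last chunk, and keep a mask so padded steps are ignored in the loss (chunk_len and the field names are assumptions, not from any library):

def make_sequence_batches(trajectory, chunk_len=128):
    # trajectory: list of per-step dicts ('obs', 'actions', ...) in time order
    batches = []
    for start in range(0, len(trajectory), chunk_len):
        chunk = trajectory[start:start + chunk_len]
        pad = chunk_len - len(chunk)
        mask = [1.0] * len(chunk) + [0.0] * pad      # 0 marks padded steps
        chunk = chunk + [chunk[-1]] * pad            # repeat last step as padding
        batches.append({
            'steps': chunk,
            'mask': mask,
            'initial_hidden_index': start,           # where to fetch the stored hidden state
        })
    return batches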
Implementation Details
Generalized Advantage Estimation (GAE)
Problem: The simple one-step advantage A = r + γV(s') - V(s) leans heavily on the critic's estimate (low variance but high bias), while full Monte Carlo returns are unbiased but high variance.
Solution: GAE blends n-step advantage estimates to trade off bias and variance:
GAE(λ) = (1-λ)[A₁ + λA₂ + λ²A₃ + ...]

where the n-step advantage estimates are:
A₁ = r + γV(s') - V(s)
A₂ = r + γr' + γ²V(s'') - V(s)
A₃ = r + γr' + γ²r'' + γ³V(s''') - V(s)
λ Parameter:
- λ = 0: Low variance, high bias (just 1-step)
- λ = 1: High variance, low bias (Monte Carlo)
- λ = 0.95: Good balance (commonly used)
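In code, GAE is usually computed backwards through time with the recursion Â_t = δ_t + γλ(1 - done_t)Â_{t+1}. A minimal NumPy sketch, assuming one rollout of length T and a bootstrap value for the state after the last step:

import numpy as np

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # rewards, values: float arrays of length T; dones: 1.0 where the episode ended at that step
    # last_value: critic estimate for the state after the final step (bootstrap)
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values            # targets for the value loss
    return advantages, returns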
Hyperparameter Guidelines
PPO Hyperparameters:
- Clip ratio (ε): 0.2
- PPO epochs: 4
- Mini-batch size: 64-256
- Learning rate: 3e-4
- GAE λ: 0.95
- Discount γ: 0.99
RNN Hyperparameters:
- Hidden size: 128-512
- Sequence length: 128-512
- Gradient clipping: 0.5-1.0
Training Setup:
- Parallel environments: 8-2048
- Rollout length: 2048 steps
- Total timesteps: 10M-100M
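Collected into a single configuration for reference; the specific values are illustrative picks from the ranges above, not settings from any particular library:

ppo_rnn_config = {
    # PPO
    'clip_ratio': 0.2,
    'ppo_epochs': 4,
    'mini_batch_size': 256,
    'learning_rate': 3e-4,
    'gae_lambda': 0.95,
    'gamma': 0.99,
    # RNN
    'hidden_size': 256,
    'sequence_length': 128,
    'max_grad_norm': 0.5,
    # Training setup
    'num_envs': 16,
    'rollout_length': 2048,
    'total_timesteps': 50_000_000,
}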
Practical Applications
Robotics (Like Your Humanoid)
Why PPO + RNN + Actor-Critic:
- Partial Observability: Robot sensors don't see everything
- Temporal Patterns: Walking requires coordinated sequences
- Stability: PPO prevents dangerous policy updates
- Continuous Control: Actor-Critic handles continuous actions
Game AI
Applications:
- Real-time strategy games (partial information)
- Fighting games (combo sequences)
- Racing games (temporal control)
Financial Trading
Use Cases:
- Market making (memory of recent trades)
- Portfolio management (long-term dependencies)
- Risk management (stable policy updates)
Natural Language Processing
Applications:
- Dialogue systems (conversation context)
- Text generation (sequential dependencies)
- Machine translation (sequence-to-sequence)
Summary
Actor-Critic: Combines policy learning (actor) with value estimation (critic) for lower variance and more stable learning.
RNN: Adds memory to handle partial observability and temporal patterns crucial for real-world applications.
PPO: Ensures stable policy updates through clipping, making the training process reliable and efficient.
Together: They create a powerful, stable, and practical algorithm for complex sequential decision-making problems like humanoid locomotion.
The beauty of this combination is that each component solves a fundamental challenge in RL:
- Actor-Critic → variance reduction
- RNN → temporal dependencies
- PPO → stable updates
This is why it's become the go-to method for many modern RL applications, including the humanoid walking code you analyzed!