
Complete Guide to PPO, RNN, and Actor-Critic Methods

Understanding the algorithms behind modern reinforcement learning


Reinforcement Learning Foundations

The RL Problem

Reinforcement Learning is about training agents to make decisions in an environment to maximize cumulative reward.

Key Components:

  • Agent: The decision maker (your robot/AI)
  • Environment: The world the agent operates in
  • State (s): Current situation description
  • Action (a): What the agent can do
  • Reward (r): Feedback signal
  • Policy (π): Strategy for choosing actions

Goal: Find the optimal policy π* that maximizes expected cumulative reward.
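
To make this loop concrete, here is a minimal interaction loop with a random policy, written against the Gymnasium API (the gymnasium package and the CartPole-v1 environment are assumptions for the example; RL replaces the random choice with a learned policy π):

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()

done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()   # random action; RL learns π instead
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                # the cumulative reward we want to maximize
    done = terminated or truncated

print("Episode return:", total_reward)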

Policy vs Value Methods

  • Policy Methods: Directly learn what action to take
  • Value Methods: Learn how good states/actions are, then act greedily
  • Actor-Critic: Combines both approaches

Actor-Critic Methods

Core Concept

Actor-Critic uses two neural networks working together:

Actor (Policy Network)

  • Role: Decides what action to take
  • Input: Current state
  • Output: Action probabilities or action values
  • Goal: Maximize expected reward

Critic (Value Network)

  • Role: Evaluates how good the current state is
  • Input: Current state
  • Output: Value estimate V(s)
  • Goal: Accurately predict future rewards
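
As a concrete picture of these two networks, here is a minimal PyTorch sketch assuming a flat state vector and a discrete action space (the sizes are placeholders):

import torch.nn as nn

state_size, num_actions = 4, 2   # placeholder sizes, e.g. a CartPole-scale problem

# Actor: maps a state to action probabilities
actor = nn.Sequential(
    nn.Linear(state_size, 64), nn.Tanh(),
    nn.Linear(64, num_actions), nn.Softmax(dim=-1),
)

# Critic: maps a state to a scalar value estimate V(s)
critic = nn.Sequential(
    nn.Linear(state_size, 64), nn.Tanh(),
    nn.Linear(64, 1),
)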

Why Actor-Critic Works

Traditional Policy Gradient Problem:

∇J(θ) = E[∇log π(a|s) × R]

The reward R has high variance, making learning slow and unstable.

Actor-Critic Solution:

∇J(θ) = E[∇log π(a|s) × A(s,a)]

Where A(s,a) = Q(s,a) - V(s) is the advantage.

Advantage Function Benefits:

  • Lower Variance: Subtracting baseline V(s) reduces noise
  • Better Signal: Tells us how much better/worse an action is than average
  • Faster Learning: More stable gradients

Actor-Critic Algorithm

# Pseudocode
for episode in episodes:
    state, done = env.reset(), False
    
    while not done:
        # Actor: choose an action
        action_probs = actor(state)
        action = sample(action_probs)
        
        # Environment step
        next_state, reward, done = env.step(action)
        
        # Critic: evaluate current and next state
        value = critic(state)
        next_value = 0 if done else critic(next_state)
        
        # Compute advantage (one-step TD error)
        advantage = reward + gamma * next_value - value
        
        # Update networks
        # (the advantage is treated as a constant when updating the actor)
        actor_loss = -log(action_probs[action]) * advantage
        critic_loss = advantage ** 2
        
        optimize(actor_loss, critic_loss)
        state = next_state

Recurrent Neural Networks (RNNs)

Why RNNs in RL?

The Problem: Many RL environments are partially observable

  • Agent doesn't see complete state
  • Need to remember past information
  • Must infer hidden state from observation history

Examples:

  • Robot with limited sensors
  • Game with fog of war
  • Financial markets with delayed information

RNN Architecture

Basic RNN:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

GRU (Gated Recurrent Unit) - More Stable:

z_t = σ(W_z * [h_{t-1}, x_t])     # Update gate
r_t = σ(W_r * [h_{t-1}, x_t])     # Reset gate
h̃_t = tanh(W * [r_t * h_{t-1}, x_t])  # Candidate state
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t  # Final state
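
For concreteness, here is a minimal PyTorch sketch of a single GRU step that follows the equations above. MinimalGRUCell is an illustrative name, not PyTorch's optimized nn.GRU; each gate acts on the concatenation [h_{t-1}, x_t] exactly as written above:

import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_z = nn.Linear(hidden_size + input_size, hidden_size)
        self.W_r = nn.Linear(hidden_size + input_size, hidden_size)
        self.W_h = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.W_z(hx))       # update gate
        r_t = torch.sigmoid(self.W_r(hx))       # reset gate
        h_tilde = torch.tanh(self.W_h(torch.cat([r_t * h_prev, x_t], dim=-1)))  # candidate
        return (1 - z_t) * h_prev + z_t * h_tilde  # final hidden state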

RNN in Actor-Critic

Modified Architecture:

import torch.nn as nn
import torch.nn.functional as F

class ActorRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_actions):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, num_actions)
    
    def forward(self, obs, hidden_state):
        # Process the observation sequence through the RNN
        # obs: (batch, seq_len, input_size); hidden_state: (1, batch, hidden_size)
        rnn_out, new_hidden = self.gru(obs, hidden_state)
        
        # Generate action probabilities at every timestep
        action_probs = F.softmax(self.policy_head(rnn_out), dim=-1)
        
        return action_probs, new_hidden

Key Benefits:

  • Memory: Maintains context across timesteps
  • Temporal Patterns: Can learn sequences and timing
  • Partial Observability: Handles incomplete information

Proximal Policy Optimization (PPO)

The Problem PPO Solves

Policy Gradient Challenge:

  • Small policy updates → slow learning
  • Large policy updates → unstable, catastrophic failures
  • Need "just right" update size

Previous Solutions:

  • TRPO: Complex, computationally expensive
  • A3C: Requires many parallel workers

PPO Core Innovation

Clipped Surrogate Objective:

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)  # Probability ratio

L^CLIP(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]

What This Means:

  • r_t(θ): How much the probability of the taken action changed relative to the old policy
  • clip(r_t, 1-ε, 1+ε): Limits the ratio to the [1-ε, 1+ε] range
  • Typical ε: 0.2 (the ratio can move at most 20% away from 1)

PPO Algorithm

# PPO Pseudocode
for iteration in training_iterations:
    # 1. Collect trajectories with current policy
    trajectories = []
    for env in parallel_envs:
        trajectory = collect_trajectory(env, policy, steps=2048)
        trajectories.append(trajectory)
    
    # 2. Compute advantages
    for traj in trajectories:
        values = critic(traj.states)
        advantages = compute_gae(traj.rewards, values, gamma=0.99)
        traj.advantages = advantages
    
    # 3. Update policy for multiple epochs
    for epoch in range(4):  # Typically 4 epochs
        for batch in mini_batches(trajectories):
            # Probability ratios for the actions that were actually taken,
            # computed from log-probabilities for numerical stability
            old_log_probs = batch.action_log_probs   # recorded during collection
            new_log_probs = policy.log_prob(batch.states, batch.actions)
            ratios = exp(new_log_probs - old_log_probs)
            
            # Clipped surrogate loss
            surr1 = ratios * batch.advantages
            surr2 = clip(ratios, 1 - 0.2, 1 + 0.2) * batch.advantages
            policy_loss = -min(surr1, surr2).mean()
            
            # Value loss
            value_loss = ((critic(batch.states) - batch.returns) ** 2).mean()
            
            # Total loss
            total_loss = policy_loss + 0.5 * value_loss
            
            # Update networks
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
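
To make the clipped objective itself runnable, here is a small PyTorch sketch of just the policy-loss term; the tensor names (log_probs, old_log_probs, advantages) are assumptions for the example:

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
    ratios = torch.exp(log_probs - old_log_probs)

    # Unclipped and clipped surrogate terms
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages

    # Take the pessimistic term and negate it for gradient descent
    return -torch.min(surr1, surr2).mean()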

PPO Advantages

Simplicity:

  • Easy to implement
  • Few hyperparameters
  • Stable across different environments

Efficiency:

  • Reuses collected data multiple times
  • No need for many parallel workers
  • Good sample efficiency

Stability:

  • Prevents catastrophic policy updates
  • Reliable convergence
  • Works well out-of-the-box

Combining All Three: PPO + RNN + Actor-Critic

Architecture Overview

import torch.nn as nn

class PPOActorCriticRNN(nn.Module):
    def __init__(self, obs_size, hidden_size, num_actions):
        super().__init__()
        # Shared RNN backbone
        self.rnn = nn.GRU(obs_size, hidden_size, batch_first=True)
        
        # Actor head
        self.actor_head = nn.Linear(hidden_size, num_actions)
        
        # Critic head
        self.critic_head = nn.Linear(hidden_size, 1)
    
    def forward(self, obs, hidden_state):
        # Process through the shared RNN
        # obs: (batch, seq_len, obs_size); hidden_state: (1, batch, hidden_size)
        rnn_out, new_hidden = self.rnn(obs, hidden_state)
        
        # Generate both outputs from the shared features
        action_logits = self.actor_head(rnn_out)
        value = self.critic_head(rnn_out)
        
        return action_logits, value, new_hidden
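
A quick usage sketch, with assumed shapes (8 parallel environments, single-step sequences, 48-dimensional observations, 256 hidden units, 12 actions):

import torch

model = PPOActorCriticRNN(obs_size=48, hidden_size=256, num_actions=12)

obs = torch.zeros(8, 1, 48)        # (batch, seq_len, obs_size)
hidden = torch.zeros(1, 8, 256)    # (num_layers, batch, hidden_size) for nn.GRU

action_logits, value, hidden = model(obs, hidden)
# action_logits: (8, 1, 12), value: (8, 1, 1), hidden: (1, 8, 256)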

Training Loop

def train_ppo_rnn_actor_critic():
    model = PPOActorCriticRNN()
    
    for iteration in range(num_iterations):
        # 1. Collect trajectories
        trajectories = []
        hidden_states = initialize_hidden_states()
        
        for step in range(rollout_length):
            # Keep the hidden state that feeds this step so it can be
            # replayed when the batch is re-evaluated during PPO updates
            step_hidden = hidden_states
            
            # Forward pass
            action_logits, values, hidden_states = model(
                observations, step_hidden
            )
            
            # Sample actions
            actions = sample_actions(action_logits)
            
            # Environment step
            next_obs, rewards, dones = env.step(actions)
            
            # Store transition (the logits recorded here act as the
            # "old" policy during the PPO update)
            trajectories.append({
                'obs': observations,
                'actions': actions,
                'rewards': rewards,
                'values': values,
                'old_action_logits': action_logits,
                'hidden_states': step_hidden,
                'dones': dones
            })
            
            # Reset hidden states on episode end
            hidden_states = reset_hidden_on_done(hidden_states, dones)
            observations = next_obs
        
        # 2. Compute advantages using GAE
        advantages = compute_gae(trajectories)
        
        # 3. PPO updates
        for epoch in range(ppo_epochs):
            for batch in create_mini_batches(trajectories):
                # Recreate forward pass
                action_logits, values, _ = model(
                    batch.obs, batch.hidden_states
                )
                
                # PPO loss computation
                policy_loss = compute_ppo_loss(
                    action_logits, batch.actions, 
                    batch.old_action_logits, batch.advantages
                )
                
                value_loss = compute_value_loss(values, batch.returns)
                
                total_loss = policy_loss + 0.5 * value_loss
                
                # Update
                optimizer.zero_grad()
                total_loss.backward()
                optimizer.step()

Key Implementation Details

Hidden State Management:

  • Reset hidden states on episode boundaries (see the sketch after this list)
  • Carry hidden states across timesteps within episodes
  • Handle variable-length episodes properly
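
A minimal sketch of the reset_hidden_on_done helper used in the training loop above, assuming the hidden state has shape (num_layers, batch, hidden_size) as returned by nn.GRU and dones is a per-environment boolean array:

import torch

def reset_hidden_on_done(hidden_states, dones):
    # dones: True where an environment's episode just ended
    keep = (~torch.as_tensor(dones, dtype=torch.bool)).float()
    # Zero the hidden state of every environment that finished an episode
    return hidden_states * keep.view(1, -1, 1)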

Advantage Computation:

  • Use GAE (Generalized Advantage Estimation)
  • Account for episode boundaries in GAE computation
  • Normalize advantages for stability

Mini-batch Creation:

  • Preserve temporal structure within episodes
  • Handle variable-length sequences
  • Efficient padding/masking
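
One common approach, sketched here with torch.nn.utils.rnn.pad_sequence (the per-episode tensors and return values are assumptions for the example):

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_episodes(episode_obs_list):
    # episode_obs_list: list of tensors, one per episode, each (T_i, obs_size)
    lengths = torch.tensor([len(ep) for ep in episode_obs_list])

    # Pad to a common length: (num_episodes, T_max, obs_size)
    padded = pad_sequence(episode_obs_list, batch_first=True)

    # Mask is 1 for real timesteps, 0 for padding; multiply it into
    # per-timestep losses so padded steps contribute nothing
    t = torch.arange(padded.shape[1])
    mask = (t.unsqueeze(0) < lengths.unsqueeze(1)).float()
    return padded, mask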

Implementation Details

Generalized Advantage Estimation (GAE)

Problem: The one-step advantage A = r + γV(s') - V(s) leans heavily on the critic's estimate, so it is low-variance but biased, while full Monte Carlo returns are unbiased but high-variance.

Solution: GAE blends n-step advantage estimates to trade off bias and variance:

GAE(λ) = (1-λ)[A₁ + λA₂ + λ²A₃ + ...]

where:
A₁ = r + γV(s') - V(s)
A₂ = r + γr' + γ²V(s'') - V(s)
A₃ = r + γr' + γ²r'' + γ³V(s''') - V(s)

λ Parameter:

  • λ = 0: Low variance, high bias (just 1-step)
  • λ = 1: High variance, low bias (Monte Carlo)
  • λ = 0.95: Good balance (commonly used)
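
Translating the formula into code, here is one plausible shape for the compute_gae helper referenced earlier, using the standard recursive form A_t = δ_t + γλ(1 - done_t)A_{t+1}; the exact signature (next_value, dones) is an assumption:

import numpy as np

def compute_gae(rewards, values, next_value, dones, gamma=0.99, lam=0.95):
    # rewards, values, dones: length-T arrays for one rollout;
    # next_value: critic estimate for the state after the last step
    advantages = np.zeros(len(rewards), dtype=np.float64)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        bootstrap = next_value if t == len(rewards) - 1 else values[t + 1]
        # TD error δ_t = r_t + γV(s_{t+1}) - V(s_t), dropping the bootstrap
        # at episode boundaries
        delta = rewards[t] + gamma * bootstrap * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages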

Hyperparameter Guidelines

PPO Hyperparameters:

  • Clip ratio (ε): 0.2
  • PPO epochs: 4
  • Mini-batch size: 64-256
  • Learning rate: 3e-4
  • GAE λ: 0.95
  • Discount γ: 0.99

RNN Hyperparameters:

  • Hidden size: 128-512
  • Sequence length: 128-512
  • Gradient clipping: 0.5-1.0

Training Setup:

  • Parallel environments: 8-2048
  • Rollout length: 2048 steps
  • Total timesteps: 10M-100M

Practical Applications

Robotics (Like Your Humanoid)

Why PPO + RNN + Actor-Critic:

  • Partial Observability: Robot sensors don't see everything
  • Temporal Patterns: Walking requires coordinated sequences
  • Stability: PPO prevents dangerous policy updates
  • Continuous Control: Actor-Critic handles continuous actions

Game AI

Applications:

  • Real-time strategy games (partial information)
  • Fighting games (combo sequences)
  • Racing games (temporal control)

Financial Trading

Use Cases:

  • Market making (memory of recent trades)
  • Portfolio management (long-term dependencies)
  • Risk management (stable policy updates)

Natural Language Processing

Applications:

  • Dialogue systems (conversation context)
  • Text generation (sequential dependencies)
  • Machine translation (sequence-to-sequence)

Summary

Actor-Critic: Combines policy learning (actor) with value estimation (critic) for lower variance and more stable learning.

RNN: Adds memory to handle partial observability and temporal patterns crucial for real-world applications.

PPO: Ensures stable policy updates through clipping, making the training process reliable and efficient.

Together: They create a powerful, stable, and practical algorithm for complex sequential decision-making problems like humanoid locomotion.

The beauty of this combination is that each component solves a fundamental challenge in RL:

  • Actor-Critic → variance reduction
  • RNN → temporal dependencies
  • PPO → stable updates

This is why it's become the go-to method for many modern RL applications, including the humanoid walking code you analyzed!