Online Process Reward Learning (OPRL) Solves Sparse-Reward Mazes with Preference-Driven Shaping

These articles are AI-generated summaries. Please check the original sources for full details.

Online Process Reward Learning (OPRL)

Online Process Reward Learning (OPRL) turns sparse terminal rewards into dense, step-level signals learned from trajectory preferences. In the example here, the agent is reported to reach the goal of an 8×8 maze within 500 training episodes, where the only environment reward is +10 at the goal.

Why This Matters

Sparse-reward environments such as mazes give a reinforcement learning agent almost no feedback until it reaches the goal. Traditional methods then struggle with credit assignment, which leads to unstable training. OPRL addresses this by learning dense rewards from human or algorithmic preference comparisons between trajectories, which enables faster, more stable policy optimization. This approach reduces the need for handcrafted reward functions and is aimed at complex tasks where sparse rewards are unavoidable.
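
To make the preference-comparison idea concrete, here is a minimal sketch of a Bradley-Terry style loss for one pair of trajectories. It assumes the reward model has already produced a list of step-level rewards for each trajectory; the function names are illustrative and do not come from the original source, which may use a different formulation. It uses only the standard library, and a training loop would compute the same quantity on tensors so that gradients reach the reward model.

```python
import math

def segment_score(step_rewards):
    """Sum of predicted step-level rewards over one trajectory."""
    return sum(step_rewards)

def preference_loss(rewards_a, rewards_b, a_preferred):
    """Negative log-likelihood of the observed preference for one pair.

    Assumes P(a preferred over b) = sigmoid(score_a - score_b).
    """
    diff = segment_score(rewards_a) - segment_score(rewards_b)
    x = diff if a_preferred else -diff
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated without overflow.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# The loss is small when the preferred trajectory already scores higher.
agree = preference_loss([0.5, 0.5], [0.0, 0.0], a_preferred=True)
disagree = preference_loss([0.5, 0.5], [0.0, 0.0], a_preferred=False)
```

Minimizing this loss over many labeled pairs pushes the summed step rewards of preferred trajectories above those of rejected ones, which is how a single terminal outcome can be spread into per-step signals.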

Key Insights

  • “Maze environment with 8×8 grid and obstacles, 2025-12-02”: The MazeEnv class defines an 8×8 grid with a wall of obstacles, a start at (0, 0), a goal at (7, 7), and a 60-step episode limit.
  • “Process Reward Model with LayerNorm and Tanh, 2025-12-02”: The ProcessRewardModel stacks Linear, LayerNorm, and ReLU layers and ends in Tanh, so each step-level reward is differentiable and bounded in (-1, 1).
  • “PolicyNetwork with entropy regularization, 2025-12-02”: The policy network adds an entropy bonus to its loss, which discourages the policy from collapsing early onto a reward model that is still being fit to preference data.
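
The entropy bonus in the last point can be illustrated with a small sketch. This is not the article's PolicyNetwork, which the summary does not show; it assumes a plain policy-gradient loss for a single step, and the names policy_loss and entropy_coef are hypothetical.

```python
import math

def softmax(logits):
    """Convert action logits into probabilities (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of an action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def policy_loss(logits, action, advantage, entropy_coef=0.01):
    """Policy-gradient loss for one step, minus an entropy bonus."""
    probs = softmax(logits)
    log_prob = math.log(probs[action])
    # Subtracting the entropy term rewards distributions that stay spread out.
    return -log_prob * advantage - entropy_coef * entropy(probs)

uniform = softmax([0.0, 0.0, 0.0, 0.0])  # four maze actions, equally likely
peaked = softmax([5.0, 0.0, 0.0, 0.0])   # almost always picks action 0
```

With the same advantage, the uniform distribution receives a larger bonus than the peaked one, so the optimizer needs consistent evidence from the learned reward before it commits to a single action.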

Working Example

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

class MazeEnv:
    def __init__(self, size=8):
        self.size = size
        self.start = (0, 0)
        self.goal = (size-1, size-1)
        self.obstacles = set([(i, size//2) for i in range(1, size-2)])
        self.reset()
    def reset(self):
        self.pos = self.start
        self.steps = 0
        return self._get_state()
    def _get_state(self):
        state = np.zeros(self.size * self.size)
        state[self.pos[0] * self.size + self.pos[1]] = 1
        return state
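    def step(self, action):
        # Actions: 0 = up, 1 = right, 2 = down, 3 = left (row, column offsets).
        moves = [(-1,0), (0,1), (1,0), (0,-1)]
        new_pos = (self.pos[0] + moves[action][0],
                   self.pos[1] + moves[action][1])
        # Move only if the target cell is inside the grid and not an obstacle.
        if (0 <= new_pos[0] < self.size and
            0 <= new_pos[1] < self.size and
            new_pos not in self.obstacles):
            self.pos = new_pos
        # A blocked move still consumes a step and returns a transition.
        self.steps += 1
        done = self.pos == self.goal or self.steps >= 60
        # Sparse reward: 10.0 at the goal, 0.0 everywhere else.
        reward = 10.0 if self.pos == self.goal else 0.0
        return self._get_state(), reward, done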
    def render(self):
        # Minimal text view, added because train_oprl calls test_env.render();
        # the source's own render method is not shown. A = agent, G = goal, # = wall.
        rows = []
        for r in range(self.size):
            rows.append(''.join(
                'A' if (r, c) == self.pos else
                'G' if (r, c) == self.goal else
                '#' if (r, c) in self.obstacles else '.'
                for c in range(self.size)))
        return '\n'.join(rows)

class ProcessRewardModel(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.LayerNorm(hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Tanh()
        )
    def forward(self, states):
        return self.net(states)
def train_oprl(episodes=500, render_interval=100):
    env = MazeEnv(size=8)
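    # NOTE: OPRLAgent (policy, reward model, and trajectory buffer) is not
    # defined in this excerpt; see the original source for its implementation.
    agent = OPRLAgent(state_dim=64, action_dim=4, lr=3e-4)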
    returns, reward_losses, policy_losses = [], [], []
    for ep in range(episodes):
        traj = agent.collect_trajectory(env, epsilon=0.1)
        returns.append(traj['return'])
        if ep % 2 == 0 and ep > 10:
            agent.generate_preference()
        if ep > 20 and ep % 2 == 0:
            rew_loss = agent.train_reward_model(n_updates=3)
            reward_losses.append(rew_loss)
        if ep > 10:
            pol_loss = agent.train_policy(n_updates=2)
            policy_losses.append(pol_loss)
        if ep % render_interval == 0 and ep > 0:
            test_env = MazeEnv(size=8)
            agent.collect_trajectory(test_env, epsilon=0)
            print(test_env.render())
    return returns, reward_losses, policy_losses
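
The loop above calls agent.generate_preference(), which this summary does not show. One plausible algorithmic labeling rule is sketched below as a standalone function. It assumes each stored trajectory is a dict with the 'return' key used above plus a hypothetical 'length' key; the tie-breaking rule is an illustration and not necessarily what the original agent does.

```python
import random

def generate_preference(trajectories, rng=random):
    """Sample two stored trajectories and return (preferred, rejected).

    Returns None when fewer than two trajectories exist or the pair ties.
    """
    if len(trajectories) < 2:
        return None
    a, b = rng.sample(trajectories, 2)
    # Rank by return first; break ties in favor of the shorter trajectory.
    key_a = (a["return"], -a["length"])
    key_b = (b["return"], -b["length"])
    if key_a == key_b:
        return None  # the pair carries no preference information
    return (a, b) if key_a > key_b else (b, a)

reached_goal = {"return": 10.0, "length": 14}
timed_out = {"return": 0.0, "length": 60}
pair = generate_preference([reached_goal, timed_out], rng=random.Random(0))
```

Because the maze reward is zero on most early episodes, a rule based only on return would tie frequently; any secondary criterion such as length is a design choice that shapes what the reward model learns.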

Practical Applications

  • Use Case: Maze navigation with sparse rewards (e.g., robotics pathfinding).
  • Pitfall: Over-reliance on preference data may bias reward shaping, leading to suboptimal policies in unseen scenarios.
