Skip to main content

On This Page

Google AI Unveils Supervised Reinforcement Learning (SRL): A Step-Wise Framework for Enhancing Small Language Models

4 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Supervised Reinforcement Learning (SRL): A Breakthrough in Training Small Language Models

Google AI researchers, in collaboration with UCLA, have introduced Supervised Reinforcement Learning (SRL), a framework designed to enhance the reasoning capabilities of small language models (SLMs) by addressing limitations in traditional supervised fine-tuning (SFT) and reinforcement learning (RL). SRL enables models like Qwen2.5 7B Instruct to solve complex tasks—such as advanced math problems and software engineering challenges—by integrating expert trajectories with dense, step-wise rewards. This approach avoids the pitfalls of rote imitation and overfitting, achieving significant performance gains on benchmarks like AMC23, AIME24, and SWE Bench Verified.


Core Principles of SRL

SRL combines elements of supervised learning and reinforcement learning by injecting supervision into the reward channel rather than the loss function. Key components include:

  • Expert Trajectory Parsing: Each expert solution (e.g., from the s1K 1.1 dataset) is decomposed into a sequence of actions (e.g., mathematical steps or code edits). These actions are used to generate training examples.
  • Private Reasoning Span: The model generates an internal reasoning span (e.g., “), followed by an action. Only the action is compared to the expert’s action using sequence similarity metrics (e.g., difflib).
  • Dense Rewards: Every step receives a reward, even if the final answer is incorrect. This ensures the model learns from partial progress, unlike traditional RL, which relies on final outcomes.

Performance Improvements on Mathematical Reasoning

SRL was tested on the s1K 1.1 dataset, which contains challenging math problems. Results compared models initialized from Qwen2.5 7B Instruct:

  • Base Model:
    • AMC23 Greedy: 50.0%
    • AIME24 Greedy: 13.3%
    • AIME25 Greedy: 6.7%
  • SRL Alone:
    • AMC23 Greedy: 50.0%
    • AIME24 Greedy: 16.7% (↑33%)
    • AIME25 Greedy: 13.3% (↑100%)
  • SRL + RLVR (Reinforcement Learning with Value Refinement):
    • AMC23 Greedy: 57.5% (↑15%)
    • AIME24 Greedy: 20.0% (↑50%)
    • AIME25 Greedy: 10.0% (↑50%)

Impact: SRL alone eliminates performance degradation from SFT, while combining SRL with RLVR achieves state-of-the-art results for open-source models on math benchmarks.


Software Engineering Benchmarks: SWE Bench Verified

SRL was also applied to Qwen2.5 Coder 7B Instruct using 5,000 verified agent trajectories from Claude 3 7 Sonnet (20250219). Results on SWE Bench Verified:

  • Base Model:
    • Oracle File Edit Mode: 5.8%
    • End-to-End: 3.2%
  • SWE Gym 7B (SFT Baseline):
    • Oracle File Edit Mode: 8.4%
    • End-to-End: 4.2%
  • SRL:
    • Oracle File Edit Mode: 14.8% (↑161%)
    • End-to-End: 8.6% (↑105%)

Impact: SRL outperforms both the base model and SFT baselines by 2x, demonstrating its effectiveness in agentic software engineering tasks.


Key Takeaways and Technical Advantages

  • Step-Wise Action Generation: SRL frames reasoning as a sequence of actions, rewarding the model for correct intermediate steps rather than final answers.
  • Avoids Overfitting and Collapse:
    • Unlike SFT, SRL does not overfit to long demonstrations.
    • Unlike RLVR, SRL does not collapse when no rollout is correct.
  • Lightweight and Scalable:
    • Uses only expert actions and string similarity (no reward model required).
    • Works on small, hard datasets (e.g., 1,000 or 5,000 items).

Recommendations for Implementation

  • Use Case: Apply SRL to small models (e.g., 7B-scale) for tasks requiring step-wise reasoning (math, coding, etc.).
  • Best Practices:
    • Pair SRL with RLVR for optimal performance on benchmarks.
    • Use verified expert trajectories (e.g., from Claude 3) to ensure quality.
  • Pitfalls to Avoid:
    • Do not rely solely on SRL for final training; combine with RLVR for maximum gains.
    • Ensure trajectories are decomposed into granular actions to avoid ambiguity.

Reference

For further details, including the full paper and implementation guides, visit the MarkTechPost article.

Continue reading

Next article

AWS Introduces Kiro: An AI IDE for Spec-Driven Development

Related Content