AutoAgent: Automating AI Agent Optimization and Harness Engineering
These articles are AI-generated summaries. Please check the original sources for full details.
Meet ‘AutoAgent’: The Open-Source Library That Lets an AI Engineer and Optimize Its Own Agent Harness Overnight
Developed by Kevin Gu at thirdlayer.inc, AutoAgent is an open-source library designed to automate the manual iteration of agent system prompts and tools. In a single 24-hour run, the system achieved a #1 ranking on SpreadsheetBench with a score of 96.5%.
Why This Matters
Traditional agent engineering relies on a tedious manual loop in which humans tweak system prompts and tool definitions after reading benchmark failure traces. AutoAgent instead treats the agent harness, including orchestration and routing logic, as an optimization surface for a meta-agent, which hill-climbs on benchmark scores and, in the reported runs, outperformed human-crafted configurations. Automating the diagnosis and remediation of agent failures addresses the scalability limits of manual engineering.
Key Insights
- AutoAgent scored 55.1% on TerminalBench, reported as the highest score recorded for GPT-5 (2026), by autonomously iterating on agent configurations.
- The system utilizes a ‘ratchet loop’ inspired by Andrej Karpathy’s autoresearch, applying propose-train-evaluate cycles to agent scaffolding rather than model weights.
- A ‘meta-agent’ manages a single agent.py file, rewriting tool definitions and routing logic based on performance data recorded in a results.tsv experiment log.
- The library integrates with the Harbor format, using Docker containers and LLM-as-judge verifiers to provide consistent scoring for complex, non-deterministic tasks.
- Experiments suggest a ‘model empathy’ effect where a Claude-based meta-agent optimizes Claude-based sub-agents more effectively than those based on GPT.
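The ratchet loop described above can be illustrated with a minimal sketch. It is not AutoAgent's actual code: the `Harness` dataclass, the `ratchet` function, the toy proposer and scorer, and the log column names are all assumptions made for illustration. In the real system the proposer would be an LLM meta-agent editing agent.py, and the evaluator would be a benchmark run.

```python
import csv
import io
import random
from dataclasses import dataclass
from typing import Callable, TextIO, Tuple


@dataclass(frozen=True)
class Harness:
    """Hypothetical stand-in for the configuration that agent.py encodes."""
    system_prompt: str
    tools: Tuple[str, ...]


def ratchet(
    initial: Harness,
    propose: Callable[[Harness], Harness],
    evaluate: Callable[[Harness], float],
    steps: int,
    log: TextIO,
) -> Tuple[Harness, float]:
    """Propose-evaluate-keep loop: a candidate replaces the current best
    only if it scores strictly higher, so the best score never regresses."""
    writer = csv.writer(log, delimiter="\t", lineterminator="\n")
    writer.writerow(["step", "score", "kept"])  # assumed column layout
    best, best_score = initial, evaluate(initial)
    writer.writerow([0, f"{best_score:.3f}", "baseline"])
    for step in range(1, steps + 1):
        candidate = propose(best)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        writer.writerow([step, f"{score:.3f}", "yes" if kept else "no"])
    return best, best_score


# Toy benchmark (assumption): score is the fraction of required tools present.
REQUIRED_TOOLS = {"read_sheet", "write_cell", "run_formula"}
TOOL_POOL = ("read_sheet", "write_cell", "run_formula", "web_search")


def toy_evaluate(harness: Harness) -> float:
    return len(REQUIRED_TOOLS & set(harness.tools)) / len(REQUIRED_TOOLS)


def make_toy_proposer(seed: int) -> Callable[[Harness], Harness]:
    """Random proposer standing in for the meta-agent's edits."""
    rng = random.Random(seed)

    def propose(harness: Harness) -> Harness:
        tool = rng.choice(TOOL_POOL)
        if tool in harness.tools:
            return harness  # no-op proposal
        return Harness(harness.system_prompt, harness.tools + (tool,))

    return propose


log = io.StringIO()  # in-memory stand-in for a results.tsv file
best, score = ratchet(
    Harness("You are a spreadsheet agent.", ()),
    make_toy_proposer(seed=0),
    toy_evaluate,
    steps=20,
    log=log,
)
print(log.getvalue())
```

Accepting only strictly better candidates is what makes the loop a ratchet: a noisy or harmful proposal is logged but never replaces the current best. With a non-deterministic evaluator, a practical version would likely need repeated trials per candidate before accepting it.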
Practical Applications
- Spreadsheet Automation: AutoAgent optimized an agent to 96.5% accuracy on SpreadsheetBench. A common pitfall is relying on manual prompt-tuning, which tends to miss edge cases that autonomous iteration surfaces.
- Terminal Task Execution: Using the Harbor adapter, AutoAgent reached 55.1% on TerminalBench. An anti-pattern to avoid is hard-coding tool routing, which often yields brittle agents that fail in complex CLI environments.