OpenAI's Codex CLI Internals Revealed

Codex Agent Loop

OpenAI’s Codex software development agent utilizes a loop that takes input from a user and leverages a Large Language Model (LLM) to generate tool calls or responses, with the inaugural post in the series detailing the internals of the Codex harness. The harness is designed to manage context and reduce prompt cache misses, with strategies informed by lessons learned from user-reported bugs.

Why This Matters

The technical reality of building an agent loop on top of the Open Responses API is fraught with challenges, including LLM inference performance and prompt caching. Ideal models often overlook the complexities of real-world implementation, where issues like quadratic inference performance and cache misses can have significant impacts on the user experience, with potential costs including decreased productivity and increased latency.

Key Insights

The Codex CLI uses the Open Responses API, making it LLM agnostic and capable of utilizing any model wrapped by this API: according to OpenAI, this design can benefit anyone building an agent based on the API.
The agent loop consists of a turn that begins with assembling an initial prompt for the LLM, including instructions, tools, and input, which is then packaged into a JSON object and sent to the Responses API.
Temporal and other workflow management tools can be used to optimize the agent loop, reducing the complexity of managing multiple tools and inputs.

Working Example

import json

# Example of assembling an initial prompt for the LLM
instructions = {"coding_standards": "follow PEP 8"}
tools = ["MCP_server_1", "MCP_server_2"]
input_data = {"text": "Hello, world!", "images": [], "files": []}

prompt = {
    "instructions": instructions,
    "tools": tools,
    "input": input_data
}

# Package the prompt into a JSON object
json_prompt = json.dumps(prompt)

# Send the JSON object to the Responses API
# (Implementation details omitted for brevity)

Practical Applications

Use Case: GitHub uses a similar agent loop to power its code review tools, leveraging LLMs to provide automated feedback and suggestions to developers.
Pitfall: Failing to implement prompt caching can result in quadratic inference performance, leading to decreased user experience and increased latency.

References:

On This Page

Codex Agent Loop

Why This Matters

Key Insights

Working Example

Practical Applications

Continue reading

Related Content

OpenAI Launches Codex CLI for Local Software Development Lifecycle Integration

SVI: A New CLI Tool to Streamline Prompt Engineering for AI-Assisted Coding

AI Agents: The Future of Unified Interfaces in Software Development