I Built a 35-Agent AI Coding Swarm That Runs Overnight

Mathew Dostal engineered an autonomous coding system using 35 concurrent AI sessions distributed across two local machines. The swarm processed over 6,500 runs by scanning project management boards every 120 seconds to generate pull requests without human intervention.

Why This Matters

Scaling AI agents from interactive chat to autonomous swarms exposes a wide gap between what models can do in theory and what runs reliably in production. Dostal’s system showed that without rigorous orchestration, including zombie process detection and deterministic branch naming, autonomous agents can cause ‘infrastructure fires’ such as 124 duplicate pull requests and cost spikes of $65 per day from model misconfiguration.

Key Insights

A 5-layer memory architecture, whose layers include CLAUDE.md, local file memory, and a Qdrant vector database, keeps agents from repeating mistakes already captured in more than 16,000 recorded knowledge points.
Isolated development environments are maintained via ‘git worktree’, allowing 35 concurrent agents to operate on the same repositories without triggering merge conflicts.
Model escalation logic optimizes operational costs by running initial attempts on Claude 3.5 Sonnet and upgrading to Opus only after two consecutive failures.
Rate limit detection via pattern matching is critical for distinguishing between complex logic failures and infrastructure outages, preventing the permanent skipping of viable tickets.
Heterogeneous orchestration across Arch Linux and Mac Studio nodes requires specialized task queuing; Dostal used the Qdrant vector database as a task queue to bypass network reachability issues.
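
The rate-limit insight above can be illustrated with a small classifier. This is a minimal sketch, not the swarm's actual code: the pattern list, `classifyFailure`, and `nextAttemptCount` are hypothetical names, and the source does not say which strings it matches. The idea is that an infrastructure outage should not count against a ticket's attempt budget.

```javascript
// Hypothetical patterns; the source does not list the exact strings it matches.
const RATE_LIMIT_PATTERNS = [
  /rate limit/i,
  /\b429\b/,
  /too many requests/i,
  /overloaded/i,
];

// Classify a failed session from its captured output.
function classifyFailure(output) {
  const text = String(output || '');
  return RATE_LIMIT_PATTERNS.some((re) => re.test(text))
    ? 'rate_limited'
    : 'task_failure';
}

// Only genuine task failures increment the attempt counter, so a ticket
// is never skipped or escalated because of an outage.
function nextAttemptCount(previousAttempts, output) {
  return classifyFailure(output) === 'rate_limited'
    ? previousAttempts
    : previousAttempts + 1;
}
```

With this split, a ticket that hits an outage keeps its attempt count and stays eligible for the next scan.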

Working Examples

Keyword-based ticket routing logic used by the swarm director.

function detectRepo(issue) {
  const text = `${issue.title} ${issue.description || ''}`.toLowerCase();
  for (const repo of ['shindig', 'venues', 'event-api', 'game-library', 'website', 'monitoring']) {
    if (text.includes(repo)) return repo;
  }
  const shindigKeywords = ['maestro', 'e2e test', 'testflight', 'ios', 'android', 'kotlin', 'swift', 'xcode'];
  for (const kw of shindigKeywords) {
    if (text.includes(kw)) return 'shindig';
  }
  return 'unknown';
}
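
One caveat with substring matching: `includes('ios')` also matches words such as "axios" or "scenarios", which would route unrelated tickets to shindig. The variant below is a hedged sketch of whole-word matching; `detectRepoStrict`, `containsWord`, and `escapeRegExp` are hypothetical names that do not appear in the source.

```javascript
const REPOS = ['shindig', 'venues', 'event-api', 'game-library', 'website', 'monitoring'];
const SHINDIG_KEYWORDS = ['maestro', 'e2e test', 'testflight', 'ios', 'android', 'kotlin', 'swift', 'xcode'];

// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// True only when `word` appears delimited by word boundaries.
function containsWord(text, word) {
  return new RegExp(`\\b${escapeRegExp(word)}\\b`).test(text);
}

// Same priority order as detectRepo: explicit repo names first, then keywords.
function detectRepoStrict(issue) {
  const text = `${issue.title} ${issue.description || ''}`.toLowerCase();
  for (const repo of REPOS) {
    if (containsWord(text, repo)) return repo;
  }
  for (const kw of SHINDIG_KEYWORDS) {
    if (containsWord(text, kw)) return 'shindig';
  }
  return 'unknown';
}
```

The trade-off is that whole-word matching misses inflected forms such as "venue" for "venues", so the keyword list may need to grow.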

Director logic for model escalation and fast-fail detection.

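// Escalate after two failed attempts, matching the escalation rule in Key Insights.
const ESCALATION_THRESHOLD = 2;

// FAST_FAIL_THRESHOLD_MS, acquireLock, createWorktree, and runClaudeSession
// are defined elsewhere in the director and are not shown in this excerpt.
async function processTicket(ticket, config, state) {
  const prev = state?.ticketsWorked?.[ticket.identifier];
  const failedAttempts = prev ? prev.attempts : 0;
  // Switch to the escalation model once the default model has failed often enough.
  const escalated = failedAttempts >= ESCALATION_THRESHOLD && config.model !== config.escalationModel;
  const ticketConfig = escalated ? { ...config, model: config.escalationModel } : config;

  acquireLock(ticket.identifier, ticket.repo);
  // Declared locally so concurrent tickets do not share a leaked global.
  const worktree = createWorktree(ticket.repo, ticket.identifier);
  const result = await runClaudeSession(ticket, worktree.worktreePath, worktree, ticketConfig);

  // A session that ends almost immediately without timing out is treated as
  // a rate limit rather than a genuine failure on the ticket.
  if (result.duration_ms < FAST_FAIL_THRESHOLD_MS && !result.timedOut) {
    result.rateLimited = true;
  }
  return result;
}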
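
The `createWorktree` helper is called above but not shown in the source. A minimal sketch of what such a helper might look like follows, assuming one directory and one branch per ticket. The `swarm/` branch prefix, the `/tmp/swarm-worktrees` root, and the `origin/main` base ref are illustrative assumptions; `git worktree add -b <branch> <path> <commit-ish>` is the standard git invocation.

```javascript
const { execFileSync } = require('node:child_process');
const path = require('node:path');

// Build the argument list for `git worktree add`. Kept pure so it can be
// inspected without running git.
function worktreeAddArgs(worktreePath, branch, baseRef) {
  return ['worktree', 'add', '-b', branch, worktreePath, baseRef];
}

// Hypothetical stand-in for createWorktree: each ticket gets its own
// checkout directory and branch, so agents never share a working tree.
function createWorktree(repoDir, ticketId, baseRef = 'origin/main', root = '/tmp/swarm-worktrees') {
  const branch = `swarm/${ticketId.toLowerCase()}`;
  const worktreePath = path.join(root, path.basename(repoDir), ticketId);
  execFileSync('git', worktreeAddArgs(worktreePath, branch, baseRef), { cwd: repoDir });
  return { worktreePath, branch };
}
```

Because every worktree has its own index and HEAD, concurrent sessions in the same repository do not overwrite each other's uncommitted changes.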

Persistent OAuth authentication via bind-mounting host credentials into Podman containers.

volumes:
  - ${HOME}/.claude/.credentials.json:/home/swarm/.claude/.credentials.json:rw

Practical Applications

Use Case: Autonomous PR generation from Linear or Jira tickets. Pitfall: Missing ‘dedup guards’ leading to redundant processing cycles and duplicate PR creation.
Use Case: Distributed QA using Mac Studio LaunchAgents for mobile emulation. Pitfall: Stale OAuth tokens in containers if credentials are copied at build-time rather than shared via bind-mounts.
Use Case: Agentic memory persistence via vector databases. Pitfall: Cheaper models like Haiku confidently hallucinating external configuration requirements without file-level evidence.
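
The dedup pitfall in the first use case pairs naturally with deterministic branch naming. The sketch below is an assumption-laden illustration, not the author's guard: `branchNameFor`, `shouldProcess`, and the `swarm/` prefix are hypothetical, and `openPrBranches` is assumed to be a Set of head-branch names fetched from the hosting provider elsewhere.

```javascript
// Derive a stable branch name from the ticket identifier alone, so every
// retry of the same ticket maps to the same branch (hypothetical scheme).
function branchNameFor(ticketId) {
  const slug = String(ticketId)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
  return `swarm/${slug}`;
}

// Skip the ticket when an open PR already exists for its branch.
function shouldProcess(ticketId, openPrBranches) {
  return !openPrBranches.has(branchNameFor(ticketId));
}
```

Since the name depends only on the ticket, a second scan of the same board finds the existing branch instead of opening another PR.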

References:

https://dev.to/mdostal/i-built-a-35-agent-ai-coding-swarm-that-runs-overnight-440
