Skip to main content

On This Page

Building a Production-Grade Async Job Queue: Engineering Resilience and Backpressure

2 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

I Built a Production-Grade Async Job Queue from Scratch — Here’s Everything That Actually Happened

Macaulay Praise developed a custom async job queue using Python, FastAPI, and Redis Streams to bypass high-level abstractions. The final system features 47 passing tests and a services layer with ~92% coverage.

Why This Matters

While frameworks like Celery abstract infrastructure, they often obscure the mechanics of failure. In production environments, the critical challenge is not basic message delivery but handling ‘zombie jobs’—where workers crash mid-execution without sending an ACK—which can poison metrics and cause silent data loss if not managed by a dedicated recovery service.

Key Insights

  • The ‘Reaper’ service ensures trustworthiness by implementing visibility timeout recovery via XPENDING queries in Redis Streams and zombie detection via PostgreSQL heartbeats.
  • Backpressure oscillation is mitigated using a two-watermark band (high and low) rather than a single threshold to prevent rapid toggling between accepting and rejecting requests.
  • Weighted Fair Scheduling prevents priority starvation by using runtime weights stored in Redis (e.g., critical: 60, high: 30, normal: 10) to distribute worker capacity.
  • Thundering herd problems are solved through exponential backoff combined with full jitter (* random.random()) to spread retry windows.

Working Examples

Exponential backoff with full jitter to prevent synchronized retries.

def _backoff_seconds(retry_count: int, base: float = 1.0, max_delay: float = 60.0) -> float:
    return min(base * (2 ** retry_count), max_delay) * random.random()

Weighted scheduler implementation to ensure no priority level is starved.

async def pick_queue(redis) -> str:
    weights = await get_weights(redis)
    roll = random.randint(1, 100)
    cumulative = 0
    for priority, weight in weights.items():
        cumulative += weight
        if roll <= cumulative:
            return f"queue:{priority}"

Practical Applications

  • .env Management: Using .env.example for committed templates while keeping .env in .gitignore; failure to rotate credentials after accidental commits leads to security breaches.
  • Docker Infrastructure: Utilizing condition: service_healthy in depends_on; simply listing services causes app crashes because containers start before internal services are ready to accept connections.
  • Virtual Environment Mapping: Placing the venv at /opt/venv instead of /app/.venv; bind mounting the application directory overlays the venv and wipes installed packages.

References:

Continue reading

Next article

Implementing State-Based AI Workflows with LangGraph Templates

Related Content