distributed-systems

24 articles in this category

AI News, Distributed Systems, Software Architecture

The Hidden Cost of Auto-Ack: Avoiding Silent Duplicate Processing in Async Queues

Infrastructure costs climbed steadily due to a race condition where messages were processed multiple times despite zero reported errors.

AI News, System Design, Distributed Systems

Scaling a Real-Time Marketplace: Engineering Lessons from Uber's Architecture

Uber manages millions of simultaneous rider-driver interactions through specialized geospatial indexing and real-time event streaming.

AI News, Backend Engineering, Distributed Systems

Building a Production-Grade Async Job Queue: Engineering Resilience and Backpressure

A technical deep dive into building an async job queue with Redis Streams, achieving 85% test coverage and a sustained throughput of 56 req/s.

AI News, DevOps, Distributed Systems

The Shift to Distributed Tracing: How OpenTelemetry Standardized Observability

Distributed tracing replaces logs as the primary source of truth, reducing debugging time from 4 hours to 15 minutes via OpenTelemetry.

AI News, Data Engineering, Distributed Systems

Building Real-Time Streaming Systems with Apache Kafka and Python

Apache Kafka enables distributed systems to process millions of messages per second using scalable brokers and idempotent producers.

AI News, Software Architecture, Distributed Systems

Why Implicit Glue Code Fails: Moving Toward Explicit Workflow State Machines

Brock Claussen details how a single-minute double-charge incident revealed the dangers of implicit state machines in workflow glue code.

AI News, Distributed Systems, Computer Science

CRDTs: How Distributed Systems Agree Without Asking Permission

CRDTs enable Strong Eventual Consistency (SEC), a property defined in 2011 that allows distributed systems to converge without central coordination or locks.

AI News, FinTech, Distributed Systems

How Fiserv Optimized Payment Throughput by 25% Using Apache Kafka

Learn how Fiserv transitioned from synchronous REST APIs to an event-driven Kafka architecture, achieving a 25% throughput increase and zero transaction loss for 600+ enterprise clients.

AI News, Software Engineering, Distributed Systems

The BEAM Runtime: Why Elixir Scales Differently than the JVM

Learn how the BEAM runtime enables Elixir to manage millions of processes with 2KB startup memory and reduction-based preemption for consistent low latency.

AI News, DevOps, Distributed Systems

Measuring Real-World Failover: Django, Celery, and Redis Sentinel Latency

A production failover drill on a Django-Celery stack reveals a 54.7-second task recovery lag despite near-instant Redis Sentinel master election.

AI News, Distributed Systems, Edge Computing

Data Persistence and Recovery: Analyzing Edge Node Failure Scenarios

Edge systems face frequent crashes, yet testing reveals that 45/45 mixed-fault scenarios can pass when durability is verified via Jepsen validation.

AI News, Platform Engineering, Distributed Systems

GitHub Refines Layered Defenses to Reduce False Positives

GitHub engineers resolved spurious 'Too Many Requests' errors caused by outdated abuse-mitigation rules. The errors affected a tiny fraction of total traffic, on the order of a few requests per 100,000.

AI News, Distributed Systems, Caching

Unifying Caching and In-Flight Deduplication with Durable Objects

Cloudflare Durable Objects can eliminate duplicate work during cache misses by treating in-flight requests and completed responses as two states of the same cache entry.

distributed-systems, system-design, software-engineering

Building Systems That Don't Fall Apart: Reliability, Scalability, and Maintainability

A practical guide to the three pillars of distributed systems design. Learn how to handle hardware failures, scale past 10,000 users, and avoid building unmaintainable legacy code from day one.

AI News, Distributed Systems, Cloud Computing

Fast Eventual Consistency: Inside Corrosion, the Distributed System Powering Fly.io

Fly.io built Corrosion, a distributed system for low-latency state replication, achieving p99 latency under 1 second across 800 physical servers.

AI News, Cloud, Distributed Systems

Scaling Cloud and Distributed Applications: Lessons From Chase.com

Chase.com, handling 67M+ active users, achieved a 71% latency reduction through strategies like multi-region isolation and automated infrastructure 'repaving'.

AI News, webdev, distributed systems

Most websites are basically offers

Standardizing websites as 'offer objects' could enable a decentralized marketplace, reducing reliance on centralized platforms.

AI News, Streaming, Distributed Systems

From On-Demand to Live: Netflix Streaming to 100 Million Devices in Under 1 Minute

Netflix’s live streaming pipeline delivers real-time updates to 100 million devices in under a minute, scaling global live events with low-latency architecture.

AI News, Cloud Computing, Distributed Systems

Scaling Cloud and Distributed Applications: Lessons from Chase.com

JPMorgan Chase reduced latency by 71% using edge computing and multi-region architectures in cloud migrations.

AI News, Agentic AI, Distributed Systems

Matrix: A Ray Native Decentralized Framework for Multi Agent Synthetic Data Generation

Meta AI's Matrix framework boosts token throughput for synthetic data generation by 2–15.4x using decentralized peer-to-peer agents.

AI News, Distributed Systems, Data Management

Netflix Tackles Data Deletion at Scale with Centralized Platform Architecture

Netflix’s new data deletion platform processed 76.8 billion row deletions across 1,300 datasets with zero data loss incidents.

AI News, System Design, Distributed Systems

Heartbeats: The Silent Pulse of Distributed System Availability

A silent node failure at 3 a.m. can stall distributed systems. Heartbeats are how engineers turn absence into actionable signals.

AI News, Distributed Systems, Cloud Computing

Temporal Cloud's 2025 Evolution: From OSS to Enterprise AI Workflows

16 of the top 20 AI companies now use Temporal for reliable, scalable workflows.

AI News, Distributed Systems, Software Engineering

Effective Error Handling: A Uniform Strategy for Heterogeneous Distributed Systems

Jenish Shah from Netflix discusses a uniform approach to error handling in distributed systems, including exception categorization, handling different protocols (REST, gRPC, GraphQL), and implementing a reusable error handling library.
