Skip to main content
surviving the spike

Rate Limiting: Algorithms, Redis, and Atomic Operations

10 min read Chapter 28 of 66

Rate Limiting: Algorithms, Redis, and Atomic Operations

Friday, 6:47 PM. Concert ends at Madison Square Garden. 40,000 people open the ride-hailing app within 90 seconds. The ride request endpoint processes 12,000 requests per second. The matching service, designed for 800 RPS, collapses. Drivers see phantom ride assignments. Riders see fares that change three times before the confirmation screen loads. The matching service restarts, processes the backlog, and assigns the same driver to four riders. Every system downstream of the API endpoint is drowning because nothing at the front door said “no.”

Rate limiting is that front door. The goal is straightforward: reject excess requests fast, before they consume resources that paying customers need.

Five algorithms exist for this problem. Each has a mechanical tradeoff that determines where it fits in a ride-hailing platform. This chapter covers all five, implements the right default with Redis and Lua, and measures the result under 10x load.

The Five Algorithms

Every rate limiting algorithm answers the same question: should this request be allowed? They differ in how they track request history and what guarantees they provide.

Token Bucket

A bucket holds tokens up to a maximum capacity. Each request consumes one token. Tokens refill at a fixed rate. If the bucket is empty, the request is rejected.

A bucket configured with capacity 10 and refill rate 5 tokens/second allows a burst of 10 requests, then sustains 5 RPS. A rider opening the app fires 6 requests in the first second (nearby drivers, fare estimate, surge check, promotions, payment methods, trip history). The burst capacity absorbs this. Sustained API calls from the same rider are throttled to 5 per second.

Token bucket is the right default for API endpoints. It handles the bursty nature of mobile clients while enforcing a sustained ceiling.

Token bucket rate limiter showing bucket states, token refill, and request flow

The diagram shows the token bucket mechanism across three states. Tokens refill at a constant rate (5/second) from the top. Each incoming request consumes one token from the bucket. When the bucket is full (10/10), burst traffic is absorbed and all requests pass. In steady state (5/10), the refill rate matches the request rate and the system sustains throughput. When the bucket drains to empty (0/10), every subsequent request is rejected with HTTP 429 until tokens refill. This design naturally accommodates the bursty behavior of mobile clients—a rider opening the app fires 6 requests instantly—while preventing any single client from overwhelming downstream services.

Leaky Bucket

Requests enter a queue. The queue drains at a constant rate regardless of input volume. If the queue is full, new requests are rejected.

Driver location ingestion receives GPS updates from 50,000 active drivers. The geospatial index downstream (R-tree update, spatial hash recalculation) handles 3,000 writes per second before response times degrade. A leaky bucket with queue size 5,000 and drain rate 3,000/s smooths bursts that occur when drivers cross cell tower boundaries and dump 30 seconds of buffered updates at once.

The difference from token bucket: leaky bucket enforces a constant output rate. Token bucket allows bursts up to the capacity. For driver location ingestion, constant output rate protects the downstream index. For rider-facing APIs, burst tolerance matters more.

Fixed Window Counter

Divide time into fixed windows (e.g., 1-minute intervals). Count requests per window. Reject when the count exceeds the limit.

Simple to implement. One counter per window in Redis. The problem: a rider sends 100 requests at 11:59:59 and another 100 at 12:00:01. Both windows allow 100. The rider sent 200 requests in 2 seconds. The boundary between windows creates a gap where the effective rate is 2x the configured limit.

For admin analytics endpoints where precision is not critical, this simplicity is acceptable. For rider-facing APIs, the boundary problem is a production bug waiting for a concert to end.

Sliding Window Log

Store the timestamp of every request. To check the limit, count all timestamps within the last N seconds. Remove expired timestamps.

This is precise. No boundary problem. The cost: O(n) memory per client, where n is the number of requests within the window. At 1,000 active API keys with a 60-second window and 100 requests/second limit, that is 6,000,000 timestamps in Redis. Sorted sets work, but the memory cost scales linearly with traffic.

For the ride-hailing platform, the client count at scale makes this impractical for most endpoints.

Sliding Window Counter

Combine the current window count with a weighted portion of the previous window count. If the current window is 40% elapsed, the effective count is: current_count + previous_count * 0.6.

This approximation eliminates the boundary problem of fixed windows and avoids the memory cost of sliding window logs. The accuracy loss is roughly 0.003% in studies by Cloudflare. For rate limiting, that accuracy is more than sufficient.

The sliding window counter is the right choice for endpoints where token bucket’s burst tolerance is unnecessary and memory efficiency matters. Analytics queries from internal dashboards fit this profile.

Decision Matrix for the Ride-Hailing Platform

EndpointAlgorithmReason
/api/rides/requestToken bucketRiders burst on app open; sustained rate matters
/api/fares/estimateToken bucketSame burst pattern as ride requests
/api/drivers/locationLeaky bucketDownstream index needs constant write rate
/api/drivers/nearbyToken bucketMobile client burst pattern
/api/trips/historySliding window counterLow burst, memory efficient
/api/admin/zones/statsFixed windowLow traffic, simplicity wins

The Baseline: No Rate Limiting

The unprotected ride request endpoint under 10x load:

# load-tests/ch10_no_ratelimit_locustfile.py
from locust import HttpUser, task, between, tag

class SurgeRiderUser(HttpUser):
    wait_time = between(0.1, 0.5)  # 10x normal request rate

    @tag("surge")
    @task(3)
    def request_ride(self):
        self.client.post(
            "/api/rides/request",
            json={
                "rider_id": f"rider-{self.environment.runner.user_count}",
                "pickup_lat": 40.7505,
                "pickup_lng": -73.9934,
                "dropoff_lat": 40.7580,
                "dropoff_lng": -73.9855
            },
            name="/api/rides/request"
        )

    @tag("surge")
    @task(1)
    def fare_estimate(self):
        self.client.post(
            "/api/fares/estimate",
            json={
                "pickup_lat": 40.7505,
                "pickup_lng": -73.9934,
                "dropoff_lat": 40.7580,
                "dropoff_lng": -73.9855
            },
            name="/api/fares/estimate"
        )
locust -f load-tests/ch10_no_ratelimit_locustfile.py \
    --host=http://localhost:8080 \
    --users 2000 \
    --spawn-rate 100 \
    --run-time 3m \
    --headless \
    --csv=load-tests/results/ch10-no-ratelimit

Results without rate limiting:

Name                     # reqs   Avg   Med   Min    Max     p95     p99   RPS    Fail%
/api/rides/request       28410   2840  1200    22  45000   12000   38000  157.8   18.4%
/api/fares/estimate       9470   3200  1400    30  52000   14000   42000   52.6   22.1%
Aggregated               37880   2930  1250    22  52000   12500   39000  210.4   19.3%

p99 at 39 seconds. 19.3% failure rate. The matching service is returning errors because its thread pool is saturated. The fare calculation service is timing out on Redis connections because 2,000 concurrent users are exhausting the connection pool. The system is accepting every request and failing at every layer.

The Fix: Token Bucket with Redis and Lua

The implementation requires three components: a Lua script for atomic token bucket operations, a Spring WebFlux WebFilter for request interception, and rate limit response headers for client cooperation.

The Lua script:

-- rate_limit_token_bucket.lua
-- KEYS[1] = rate limit key (e.g., "rl:rider-1234:/api/rides/request")
-- ARGV[1] = bucket capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = current timestamp (seconds, floating point)
-- ARGV[4] = tokens to consume (usually 1)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens == nil then
    -- First request: initialize full bucket
    tokens = capacity
    last_refill = now
end

-- Calculate tokens to add based on elapsed time
local elapsed = now - last_refill
local new_tokens = elapsed * refill_rate
tokens = math.min(capacity, tokens + new_tokens)

local allowed = 0
local remaining = tokens

if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
    remaining = tokens
end

-- Update bucket state
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / refill_rate) * 2)

-- Return: allowed (0/1), remaining tokens, retry-after seconds
local retry_after = 0
if allowed == 0 then
    retry_after = (requested - tokens) / refill_rate
end

return {allowed, math.floor(remaining), math.ceil(retry_after)}

Why Lua and not MULTI/EXEC? A Redis transaction with MULTI/EXEC cannot read a value and make a decision based on it within the same transaction. WATCH adds optimistic locking but retries under contention. The Lua script executes atomically on the Redis server. No round trips. No race conditions. Two pods checking the same rider’s bucket at the same instant get correct results.

The WebFilter:

// SCALED: Rate limiting WebFilter with Redis token bucket
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class RateLimitWebFilter implements WebFilter {

    private final ReactiveStringRedisTemplate redisTemplate;
    private final RedisScript<List<Long>> tokenBucketScript;

    private static final int BUCKET_CAPACITY = 20;
    private static final int REFILL_RATE = 10; // tokens per second

    public RateLimitWebFilter(ReactiveStringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
        this.tokenBucketScript = RedisScript.of(
            new ClassPathResource("scripts/rate_limit_token_bucket.lua"),
            (Class<List<Long>>) (Class<?>) List.class
        );
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String clientId = extractClientId(exchange);
        String path = exchange.getRequest().getPath().value();
        String key = "rl:" + clientId + ":" + path;

        String now = String.valueOf(System.currentTimeMillis() / 1000.0);

        return redisTemplate.execute(
            tokenBucketScript,
            List.of(key),
            List.of(
                String.valueOf(BUCKET_CAPACITY),
                String.valueOf(REFILL_RATE),
                now,
                "1"
            )
        )
        .single()
        .flatMap(result -> {
            long allowed = result.get(0);
            long remaining = result.get(1);
            long retryAfter = result.get(2);

            ServerHttpResponse response = exchange.getResponse();
            HttpHeaders headers = response.getHeaders();
            headers.set("X-RateLimit-Limit", String.valueOf(BUCKET_CAPACITY));
            headers.set("X-RateLimit-Remaining", String.valueOf(remaining));
            headers.set("X-RateLimit-Reset",
                String.valueOf(System.currentTimeMillis() / 1000 + BUCKET_CAPACITY / REFILL_RATE));

            if (allowed == 1) {
                return chain.filter(exchange);
            }

            // BOTTLENECK: Without this 429, every request hits the matching service
            headers.set("Retry-After", String.valueOf(retryAfter));
            response.setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
            return response.writeWith(Mono.just(
                response.bufferFactory().wrap(
                    "{\"error\":\"rate_limit_exceeded\",\"retry_after\":%d}"
                        .formatted(retryAfter).getBytes()
                )
            ));
        });
    }

    private String extractClientId(ServerWebExchange exchange) {
        // Priority: API key > authenticated user ID > IP address
        String apiKey = exchange.getRequest().getHeaders().getFirst("X-API-Key");
        if (apiKey != null) return apiKey;

        String userId = exchange.getRequest().getHeaders().getFirst("X-User-Id");
        if (userId != null) return userId;

        InetSocketAddress remoteAddress = exchange.getRequest().getRemoteAddress();
        return remoteAddress != null ? remoteAddress.getAddress().getHostAddress() : "unknown";
    }
}

The Proof: 10x Load With Rate Limiting

Same Locust scenario, same 2,000 users, rate limiting active:

locust -f load-tests/ch10_no_ratelimit_locustfile.py \
    --host=http://localhost:8080 \
    --users 2000 \
    --spawn-rate 100 \
    --run-time 3m \
    --headless \
    --csv=load-tests/results/ch10-with-ratelimit

Results with token bucket rate limiting (capacity=20, refill=10/s):

Name                     # reqs   Avg   Med   Min    Max    p95    p99   RPS    Fail%
/api/rides/request       28410    82    45    12    980    320    650  157.8    0.1%
  (of which 429s)        19087     3     2     1     12      8     10  106.0     -
/api/fares/estimate       9470    95    52    14   1100    380    720   52.6    0.1%
  (of which 429s)         6350     3     2     1     14      9     11   35.3     -
Aggregated               37880    85    48    12   1100    340    680  210.4    0.1%

The 429 responses return in 3ms average. They consume no matching service resources, no database connections, no fare calculation cycles. The requests that are allowed through see p99 of 680ms instead of 39,000ms. The failure rate on allowed requests drops from 19.3% to 0.1%.

67% of requests received 429s. That sounds aggressive. In practice, these are the same 2,000 users hammering the endpoint at 10x normal rate. Each user is allowed their fair share (20 burst, 10/s sustained). The system serves the same number of unique riders. It rejects the redundant retries that a panicked mobile client sends when the first request is slow.

The Prometheus query confirms backend protection:

# Matching service request rate during surge
sum(rate(http_server_requests_seconds_count{
  application="ride-hailing",
  uri="/api/rides/request",
  status!="429"
}[1m]))

Without rate limiting: 157 RPS hitting the matching service. With rate limiting: 52 RPS. The matching service operates within its capacity. No cascading failures. No phantom assignments.

What Comes Next

This chapter covered the five algorithms and the end-to-end implementation. The subsections that follow go deeper. CH10-S1 breaks down each algorithm’s mechanics with visual models and the decision logic for every ride-hailing endpoint. CH10-S2 walks through the Lua script line by line, covers EVALSHA and SCRIPT LOAD for production performance, and includes the Kubernetes manifest for a dedicated rate-limiting Redis instance.