Distributed Rate Limiting with Redis and Lua
Distributed Rate Limiting with Redis and Lua
The Symptom
The ride-hailing platform runs 12 pods behind a Kubernetes load balancer. Each pod has an in-memory token bucket. Rider “R-4421” sends a burst of 20 requests. The load balancer distributes them round-robin across 12 pods. Each pod sees 1-2 requests from R-4421. No pod rejects anything. R-4421 effectively has a rate limit of 12x the configured value. During the concert surge, every rider gets 12x their limit. The matching service receives 12x the expected load. The rate limiter is decorative.
The Cause
In-memory rate limiting is per-process state. Distributed systems need shared state. The rate limit counter must live in a single location that all pods consult. Redis is that location.
But shared state introduces a new problem: atomicity. Two pods check R-4421’s token count simultaneously. Both read 1 token remaining. Both allow the request. Both decrement. The count goes to -1. The rate limit is violated.
The check-and-decrement must be a single atomic operation. Redis provides three mechanisms:
- MULTI/EXEC: Batches commands into a transaction. Cannot read a value and branch on it within the transaction. Useless for “if tokens > 0 then decrement.”
- WATCH + MULTI/EXEC: Optimistic locking.
WATCHthe key, read the value, startMULTI, write the update,EXEC. If another client modified the key betweenWATCHandEXEC, the transaction fails and the client retries. Under high contention (concert surge), retry storms make this worse than no rate limiting. - Lua scripting (EVAL): The script executes atomically on the Redis server. No other command runs between the read and the write. No retries. No race conditions. This is the correct approach.
The Lua Script, Line by Line
-- rate_limit_token_bucket.lua
local key = KEYS[1] -- 1
local capacity = tonumber(ARGV[1]) -- 2
local refill_rate = tonumber(ARGV[2]) -- 3
local now = tonumber(ARGV[3]) -- 4
local requested = tonumber(ARGV[4]) -- 5
local bucket = redis.call('HMGET', key, -- 6
'tokens', 'last_refill') -- 7
local tokens = tonumber(bucket[1]) -- 8
local last_refill = tonumber(bucket[2]) -- 9
if tokens == nil then -- 10
tokens = capacity -- 11
last_refill = now -- 12
end -- 13
local elapsed = now - last_refill -- 14
local new_tokens = elapsed * refill_rate -- 15
tokens = math.min(capacity, -- 16
tokens + new_tokens) -- 17
local allowed = 0 -- 18
local remaining = tokens -- 19
if tokens >= requested then -- 20
tokens = tokens - requested -- 21
allowed = 1 -- 22
remaining = tokens -- 23
end -- 24
redis.call('HMSET', key, -- 25
'tokens', tokens, -- 26
'last_refill', now) -- 27
redis.call('EXPIRE', key, -- 28
math.ceil(capacity / refill_rate) * 2) -- 29
local retry_after = 0 -- 30
if allowed == 0 then -- 31
retry_after = (requested - tokens) -- 32
/ refill_rate -- 33
end -- 34
return {allowed, math.floor(remaining), -- 35
math.ceil(retry_after)} -- 36
Lines 1-5: Input parameters. The key identifies the client and endpoint combination. Capacity and refill rate define the bucket shape. The timestamp comes from the application server (not Redis’s clock) to avoid clock skew between app servers and Redis.
Lines 6-9: Read current bucket state. HMGET retrieves both fields in a single command. On first request, both values are nil.
Lines 10-13: First request initialization. A new client starts with a full bucket. This means the first burst of requests (app open) is always allowed, which matches the desired UX.
Lines 14-17: Token refill calculation. Elapsed time since last refill multiplied by the refill rate gives the number of new tokens. math.min caps at capacity. If a client is inactive for an hour, they get a full bucket, not 36,000 tokens.
Lines 20-24: The atomic check-and-decrement. If enough tokens exist, consume them and mark allowed. This is the operation that cannot be split across two Redis commands without a race condition.
Lines 25-29: Persist the updated state. The EXPIRE sets a TTL of twice the time needed to refill a full bucket. Inactive clients’ keys are automatically cleaned up. At capacity=20 and refill_rate=10, TTL is 4 seconds. Generous enough that active clients never lose their bucket state.
Lines 30-34: Calculate Retry-After for rejected requests. If 0.3 tokens remain and 1 is requested, the client needs 0.7 tokens, which arrive in $0.7 / 10 = 0.07$ seconds. The client can retry in 70ms instead of guessing.
Lines 35-36: Return three values. Lua arrays map to Redis multi-bulk replies. The Java client reads them as List<Long>.
EVAL vs EVALSHA vs SCRIPT LOAD
Every EVAL call sends the full Lua script text to Redis. For a script that runs thousands of times per second, this is wasted bandwidth.
EVAL "local key = KEYS[1]..." 1 "rl:rider-1234:/api/rides/request" 20 10 1716480000.123 1
The script text is ~800 bytes. At 10,000 rate limit checks per second, that is 8 MB/s of repeated script text on the Redis connection.
SCRIPT LOAD registers the script once and returns its SHA1 hash:
SCRIPT LOAD "local key = KEYS[1]..."
→ "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2"
EVALSHA executes by hash:
EVALSHA "a1b2c3d4..." 1 "rl:rider-1234:/api/rides/request" 20 10 1716480000.123 1
40-byte hash instead of 800-byte script. At 10,000 RPS: 400 KB/s instead of 8 MB/s.
The failure mode: SCRIPT FLUSH or a Redis restart clears the script cache. EVALSHA returns NOSCRIPT. The client must fall back to EVAL, which re-caches the script. Spring’s RedisScript abstraction handles this automatically:
// SCALED: RedisScript handles EVALSHA with EVAL fallback
@Bean
public RedisScript<List<Long>> tokenBucketScript() {
DefaultRedisScript<List<Long>> script = new DefaultRedisScript<>();
script.setLocation(new ClassPathResource("scripts/rate_limit_token_bucket.lua"));
script.setResultType((Class<List<Long>>) (Class<?>) List.class);
return script;
}
DefaultRedisScript computes the SHA1 at bean creation, uses EVALSHA for every call, and falls back to EVAL on NOSCRIPT. No application code handles the retry.
Spring WebFlux WebFilter Implementation
The WebFilter intercepts every request before it reaches the controller:
// SCALED: Rate limiting WebFilter with configurable per-endpoint limits
@Component
@Order(Ordered.HIGHEST_PRECEDENCE)
public class RateLimitWebFilter implements WebFilter {
private final ReactiveStringRedisTemplate redisTemplate;
private final RedisScript<List<Long>> tokenBucketScript;
private final Map<String, RateLimitConfig> endpointConfigs;
public RateLimitWebFilter(
ReactiveStringRedisTemplate redisTemplate,
RedisScript<List<Long>> tokenBucketScript) {
this.redisTemplate = redisTemplate;
this.tokenBucketScript = tokenBucketScript;
this.endpointConfigs = Map.of(
"/api/rides/request", new RateLimitConfig(20, 10),
"/api/fares/estimate", new RateLimitConfig(20, 10),
"/api/drivers/nearby", new RateLimitConfig(30, 15),
"/api/trips/history", new RateLimitConfig(10, 5)
);
}
@Override
public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
String path = exchange.getRequest().getPath().value();
RateLimitConfig config = endpointConfigs.get(path);
if (config == null) {
return chain.filter(exchange); // No rate limit for this endpoint
}
String clientId = extractClientId(exchange);
String key = "rl:" + clientId + ":" + path;
String now = String.valueOf(System.currentTimeMillis() / 1000.0);
return redisTemplate.execute(
tokenBucketScript,
List.of(key),
List.of(
String.valueOf(config.capacity()),
String.valueOf(config.refillRate()),
now,
"1"
)
)
.single()
.flatMap(result -> handleResult(exchange, chain, config, result));
}
private Mono<Void> handleResult(
ServerWebExchange exchange,
WebFilterChain chain,
RateLimitConfig config,
List<Long> result) {
long allowed = result.get(0);
long remaining = result.get(1);
long retryAfter = result.get(2);
ServerHttpResponse response = exchange.getResponse();
HttpHeaders headers = response.getHeaders();
headers.set("X-RateLimit-Limit", String.valueOf(config.capacity()));
headers.set("X-RateLimit-Remaining", String.valueOf(remaining));
long resetEpoch = System.currentTimeMillis() / 1000
+ config.capacity() / config.refillRate();
headers.set("X-RateLimit-Reset", String.valueOf(resetEpoch));
if (allowed == 1) {
return chain.filter(exchange);
}
// BOTTLENECK: Without this rejection, the request consumes
// a matching service thread, a DB connection, and a Redis slot
headers.set("Retry-After", String.valueOf(retryAfter));
response.setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
byte[] body = ("{\"error\":\"rate_limit_exceeded\","
+ "\"retry_after\":" + retryAfter + "}")
.getBytes(StandardCharsets.UTF_8);
return response.writeWith(
Mono.just(response.bufferFactory().wrap(body))
);
}
private String extractClientId(ServerWebExchange exchange) {
String apiKey = exchange.getRequest().getHeaders()
.getFirst("X-API-Key");
if (apiKey != null) return "key:" + apiKey;
String userId = exchange.getRequest().getHeaders()
.getFirst("X-User-Id");
if (userId != null) return "user:" + userId;
InetSocketAddress remote = exchange.getRequest().getRemoteAddress();
if (remote != null) {
return "ip:" + remote.getAddress().getHostAddress();
}
return "unknown";
}
record RateLimitConfig(int capacity, int refillRate) {}
}
Rate Limit Headers
Three headers communicate rate limit state to clients:
X-RateLimit-Limit: The bucket capacity. Tells the client the maximum burst size.X-RateLimit-Remaining: Tokens remaining. The client can back off proactively when this approaches zero.X-RateLimit-Reset: Unix epoch when the bucket will be full again. The client can schedule retries precisely.
On rejection (HTTP 429):
Retry-After: Seconds until the client should retry. Computed from the Lua script’s remaining token deficit and refill rate. This prevents retry storms: the client waits exactly as long as needed.
A well-behaved mobile client uses these headers:
// Client-side rate limit handling (Android/Kotlin pseudocode)
if (response.code() == 429) {
val retryAfter = response.header("Retry-After")?.toLong() ?: 1
delay(retryAfter * 1000)
retry()
}
A poorly-behaved client ignores them and retries immediately. The rate limiter handles both correctly. The well-behaved client gets served faster. The poorly-behaved client burns through its tokens on retries and gets rejected repeatedly.
Kubernetes Manifest for Rate Limiting Redis
The rate limiting Redis instance must be separate from the caching Redis. Reason: if the caching Redis runs out of memory and evicts keys (using allkeys-lru), rate limit counters are evicted. Every client gets a fresh full bucket. The rate limiter stops working during a memory pressure event, which is exactly when you need it most.
# k8s/redis-ratelimit.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-ratelimit
namespace: ride-hailing
labels:
app: redis-ratelimit
tier: infrastructure
spec:
replicas: 1
selector:
matchLabels:
app: redis-ratelimit
template:
metadata:
labels:
app: redis-ratelimit
spec:
containers:
- name: redis
image: redis:7.4-alpine
ports:
- containerPort: 6379
command: ["redis-server"]
args:
- "--maxmemory"
- "256mb"
- "--maxmemory-policy"
- "noeviction" # Reject writes when full, never evict
- "--save"
- "" # Disable RDB persistence
- "--appendonly"
- "no" # Disable AOF persistence
- "--tcp-backlog"
- "511"
- "--timeout"
- "0"
resources:
requests:
memory: "300Mi"
cpu: "100m"
limits:
memory: "300Mi"
cpu: "500m"
readinessProbe:
exec:
command: ["redis-cli", "ping"]
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
exec:
command: ["redis-cli", "ping"]
initialDelaySeconds: 15
periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
name: redis-ratelimit
namespace: ride-hailing
spec:
selector:
app: redis-ratelimit
ports:
- port: 6379
targetPort: 6379
clusterIP: None # Headless for direct pod addressing
Key configuration decisions:
noeviction: When Redis reaches 256 MB,EVALcalls return errors instead of silently evicting rate limit keys. TheWebFiltercatches this and allows the request through (fail-open). Failing open under Redis memory pressure is safer than failing closed and rejecting all traffic.- No persistence: Rate limit state is ephemeral. If Redis restarts, every client gets a fresh full bucket. This is a brief burst of un-rate-limited traffic (one bucket capacity per client), not a correctness problem. Persistence adds write latency to every rate limit check.
- Separate from caching Redis: Failure domain isolation. The caching Redis can be evicting keys, restarting, or running a
BGSAVEwithout affecting rate limiting.
Spring Boot configuration for connecting to the dedicated instance:
# application.yml
spring:
data:
redis:
host: redis-cache # Default Redis for caching
port: 6379
rate-limit:
redis:
host: redis-ratelimit # Dedicated Redis for rate limiting
port: 6379
// SCALED: Separate RedisTemplate for rate limiting Redis
@Configuration
public class RateLimitRedisConfig {
@Bean
public ReactiveStringRedisTemplate rateLimitRedisTemplate(
@Value("${rate-limit.redis.host}") String host,
@Value("${rate-limit.redis.port}") int port) {
LettuceConnectionFactory factory = new LettuceConnectionFactory(
new RedisStandaloneConfiguration(host, port)
);
factory.afterPropertiesSet();
return new ReactiveStringRedisTemplate(factory);
}
}
The Baseline: 10x Load Without Rate Limiting
The Locust scenario simulates the concert surge:
# load-tests/ch10s2_surge_locustfile.py
from locust import HttpUser, task, between, tag, events
import random
import string
class ConcertSurgeRider(HttpUser):
wait_time = between(0.1, 0.3) # Aggressive: 3-10 requests/second
def on_start(self):
self.rider_id = "rider-" + "".join(
random.choices(string.ascii_lowercase, k=8))
@tag("surge")
@task(5)
def request_ride(self):
self.client.post(
"/api/rides/request",
json={
"rider_id": self.rider_id,
"pickup_lat": 40.7505 + random.uniform(-0.01, 0.01),
"pickup_lng": -73.9934 + random.uniform(-0.01, 0.01),
"dropoff_lat": 40.7580 + random.uniform(-0.05, 0.05),
"dropoff_lng": -73.9855 + random.uniform(-0.05, 0.05)
},
headers={"X-User-Id": self.rider_id},
name="/api/rides/request"
)
@tag("surge")
@task(2)
def fare_estimate(self):
self.client.post(
"/api/fares/estimate",
json={
"pickup_lat": 40.7505,
"pickup_lng": -73.9934,
"dropoff_lat": 40.7580,
"dropoff_lng": -73.9855
},
headers={"X-User-Id": self.rider_id},
name="/api/fares/estimate"
)
@tag("surge")
@task(1)
def nearby_drivers(self):
self.client.get(
"/api/drivers/nearby",
params={"lat": 40.7505, "lng": -73.9934, "radius_km": 3},
headers={"X-User-Id": self.rider_id},
name="/api/drivers/nearby"
)
# Without rate limiting
locust -f load-tests/ch10s2_surge_locustfile.py \
--host=http://localhost:8080 \
--users 2000 \
--spawn-rate 200 \
--run-time 3m \
--headless \
--csv=load-tests/results/ch10s2-no-ratelimit
Results without rate limiting:
Name # reqs Avg Med Min Max p95 p99 RPS Fail%
/api/rides/request 42600 3100 1800 18 48000 14000 41000 236.7 21.3%
/api/fares/estimate 17040 2900 1600 22 44000 12000 38000 94.7 19.8%
/api/drivers/nearby 8520 2400 1200 15 38000 10000 32000 47.3 16.2%
Aggregated 68160 2920 1700 15 48000 13000 40000 378.7 20.1%
p99 at 40 seconds. One in five requests fails. The matching service, fare calculator, and geospatial index are all saturated. Every request that gets through is slow because it competes with requests that will eventually fail anyway.
The Fix: Token Bucket Rate Limiting Active
Same scenario, rate limiting enabled:
# With rate limiting (token bucket: capacity=20, refill=10/s per client)
locust -f load-tests/ch10s2_surge_locustfile.py \
--host=http://localhost:8080 \
--users 2000 \
--spawn-rate 200 \
--run-time 3m \
--headless \
--csv=load-tests/results/ch10s2-with-ratelimit
The Proof
Results with token bucket rate limiting:
Name # reqs Avg Med Min Max p95 p99 RPS Fail%
/api/rides/request 42600 68 38 10 820 280 580 236.7 0.1%
(429 responses) 31200 2 2 1 10 6 8 173.3 -
/api/fares/estimate 17040 74 42 12 900 310 620 94.7 0.1%
(429 responses) 12400 2 2 1 12 7 9 68.9 -
/api/drivers/nearby 8520 52 30 8 680 220 480 47.3 0.0%
(429 responses) 5900 2 2 1 9 5 7 32.8 -
Aggregated 68160 65 36 8 900 280 590 378.7 0.1%
The numbers tell the story:
| Metric | Before | After | Delta |
|---|---|---|---|
| p99 latency (allowed requests) | 40,000ms | 590ms | -98.5% |
| Failure rate (non-429) | 20.1% | 0.1% | -99.5% |
| 429 response time (avg) | n/a | 2ms | - |
| Matching service RPS | 378 | 105 | -72.3% |
The 429 responses cost 2ms each. They touch Redis (one EVALSHA round trip) and return. No matching service call. No database query. No fare calculation. The 72.3% reduction in matching service RPS keeps it within its capacity ceiling of ~150 RPS.
The Prometheus panel confirms the backend is protected:
# Matching service thread pool utilization during surge
hikaricp_connections_active{pool="matching-service-pool"}
/ hikaricp_connections_max{pool="matching-service-pool"}
Without rate limiting: 100% utilization, connection wait times > 30 seconds. With rate limiting: 62% utilization, connection wait times < 5ms.
The rate limiter did not reduce the number of riders served. It reduced the number of redundant requests from each rider. Each of the 2,000 simulated riders got their 20-request burst and 10/s sustained rate. The requests beyond that were retries from impatient clients or automated polling that consumed resources without adding value.
The matching service processes fewer requests but serves the same number of unique ride matches. The riders who get through see p99 of 590ms instead of 40 seconds. The riders who get a 429 see a “please wait” screen with a countdown based on the Retry-After header instead of a spinning loader that eventually times out.