JSON vs Protocol Buffers vs Avro: Serialization Cost in Bytes and CPU
JSON vs Protocol Buffers vs Avro: Serialization Cost in Bytes and CPU
The Black Box
The team chooses JSON “because it’s human-readable.” The Kafka topics carry 50 million messages per day. Each message is a package tracking event averaging 280 bytes in JSON. Nobody has measured the alternative. The daily bandwidth cost is 14 GB. In Protobuf, the same data is 64 bytes per message, 3.2 GB per day. The difference is 10.8 GB of network bandwidth, storage, and processing overhead per day that nobody accounted for.
The Mechanism
Protobuf Varint Encoding
Protobuf uses variable-length integer encoding. Small integers are encoded in fewer bytes:
| Value | Bytes (varint) | Bytes (fixed int32) |
|---|---|---|
| 0 | 1 | 4 |
| 127 | 1 | 4 |
| 128 | 2 | 4 |
| 16383 | 2 | 4 |
| 2097151 | 3 | 4 |
The logistics platform’s package weight in grams (integer, typically 100-50000) encodes in 2-3 bytes with varint vs 4 bytes with fixed int32 vs 5-6 characters in JSON (e.g., "2400").
Each Protobuf field is prefixed with a tag: a varint encoding the field number and wire type. For field numbers 1-15, the tag is 1 byte. For field numbers 16-2047, the tag is 2 bytes. Put the most frequently used fields in positions 1-15.
// Concept: field number optimization
message PackageEvent {
string package_id = 1; // 1-byte tag (field 1, high frequency)
string status = 2; // 1-byte tag (field 2, high frequency)
int64 timestamp_ms = 3; // 1-byte tag (field 3, high frequency)
string warehouse_id = 4; // 1-byte tag
float weight_kg = 5; // 1-byte tag
// Reserve fields 1-15 for high-frequency fields
string notes = 16; // 2-byte tag (field 16, low frequency)
repeated string tags = 17; // 2-byte tag (field 17, low frequency)
}
Avro Binary Encoding
Avro takes a different approach: the schema is not embedded in the data at all. The writer and reader must have the schema to encode/decode. Fields are written in schema order with no field tags or names.
// Concept: Avro schema for package event
{
"type": "record",
"name": "PackageEvent",
"namespace": "com.logistics.events",
"fields": [
{"name": "package_id", "type": "string"},
{"name": "status", "type": "string"},
{"name": "timestamp_ms", "type": "long"},
{"name": "warehouse_id", "type": "string"},
{"name": "weight_kg", "type": "float"}
]
}
An Avro-encoded PackageEvent:
// No field tags. Values written in schema order.
// Strings: length-prefixed (varint length + UTF-8 bytes)
// Longs: zigzag varint encoding
// Floats: 4 bytes, little-endian IEEE 754
12 50 4b 47 2d 34 30 32 39 31 // package_id: "PKG-40291" (length 9 + bytes)
14 49 4e 5f 54 52 41 4e 53 49 54 // status: "IN_TRANSIT" (length 10 + bytes)
86 f8 b4 8f c6 64 // timestamp_ms: 1731675600123 (zigzag varint)
0c 57 48 2d 30 34 32 // warehouse_id: "WH-042" (length 6 + bytes)
9a 99 19 40 // weight_kg: 2.4 (IEEE 754 float)
Total: 24 bytes. Avro is smaller than Protobuf (28 bytes) because there are no field tags. The cost is that both the writer and reader must have the schema.
Schema Evolution
All three formats handle schema evolution, but with different constraints:
// Concept: Protobuf backward-compatible schema evolution
// Version 1:
message PackageEvent {
string package_id = 1;
string status = 2;
}
// Version 2 (backward compatible):
message PackageEvent {
string package_id = 1;
string status = 2;
string warehouse_id = 3; // New field. Old readers ignore it.
float weight_kg = 4; // New field. Old readers ignore it.
// Field 2 NOT removed. Old data still has it.
}
// Rules for backward compatibility:
// - New fields must be optional (proto3: all fields are optional by default)
// - Never reuse a field number
// - Never change a field's type
// - Never rename a field (names are not transmitted, but they affect generated code)
The Observable Consequence
Benchmark comparing serialization of 1 million PackageEvent messages:
| Metric | JSON (Jackson) | Protobuf | Avro |
|---|---|---|---|
| Serialize 1M messages | 2,100 ms | 280 ms | 320 ms |
| Deserialize 1M messages | 2,800 ms | 310 ms | 360 ms |
| Total size (1M messages) | 280 MB | 64 MB | 56 MB |
| Average message size | 280 bytes | 64 bytes | 56 bytes |
For the logistics platform processing 50 million messages/day:
| Metric | JSON | Protobuf | Avro |
|---|---|---|---|
| Daily bandwidth | 14 GB | 3.2 GB | 2.8 GB |
| Kafka storage (7-day retention) | 98 GB | 22.4 GB | 19.6 GB |
| CPU for (de)serialization | ~42 core-hours | ~5 core-hours | ~6 core-hours |
The CPU difference is 37 core-hours per day, over 1 full core dedicated to parsing JSON that could be avoided.
The Decision Rule
Use Protobuf when services are developed by different teams and you need strong schema evolution guarantees with compile-time type safety. The .proto file is the contract between services. gRPC uses Protobuf natively.
Use Avro when the primary consumer is a data pipeline (Kafka to ClickHouse, Kafka to data lake) and you want the smallest possible message size. Avro integrates with the Confluent Schema Registry, which manages schema versions and compatibility checks automatically.
Use JSON when the consumer is a browser, a third-party API integration, or a human debugging a Kafka topic with kafkacat. The readability and universal tooling support of JSON justify the overhead for external interfaces and debugging.
For the logistics platform: Protobuf for inter-service gRPC calls (route optimizer to package service). Avro for Kafka topics consumed by ClickHouse and the data pipeline. JSON for the public API consumed by partner integration systems.