JSON vs Protocol Buffers vs Avro: Serialization Cost in Bytes and CPU

The Black Box

The team chooses JSON “because it’s human-readable.” The Kafka topics carry 50 million messages per day. Each message is a package tracking event averaging 280 bytes in JSON. Nobody has measured the alternative. The daily bandwidth cost is 14 GB. In Protobuf, the same data is 64 bytes per message, 3.2 GB per day. The difference is 10.8 GB of network bandwidth, storage, and processing overhead per day that nobody accounted for.

The Mechanism

Protobuf Varint Encoding

Protobuf uses variable-length integer encoding. Small integers are encoded in fewer bytes:

Value	Bytes (varint)	Bytes (fixed int32)
0	1	4
127	1	4
128	2	4
16383	2	4
2097151	3	4

The logistics platform’s package weight in grams (integer, typically 100-50000) encodes in 2-3 bytes with varint vs 4 bytes with fixed int32 vs 5-6 characters in JSON (e.g., "2400").

Each Protobuf field is prefixed with a tag: a varint encoding the field number and wire type. For field numbers 1-15, the tag is 1 byte. For field numbers 16-2047, the tag is 2 bytes. Put the most frequently used fields in positions 1-15.

// Concept: field number optimization
message PackageEvent {
    string package_id = 1;     // 1-byte tag (field 1, high frequency)
    string status = 2;         // 1-byte tag (field 2, high frequency)
    int64 timestamp_ms = 3;    // 1-byte tag (field 3, high frequency)
    string warehouse_id = 4;   // 1-byte tag
    float weight_kg = 5;       // 1-byte tag
    // Reserve fields 1-15 for high-frequency fields
    string notes = 16;         // 2-byte tag (field 16, low frequency)
    repeated string tags = 17; // 2-byte tag (field 17, low frequency)
}

Avro Binary Encoding

Avro takes a different approach: the schema is not embedded in the data at all. The writer and reader must have the schema to encode/decode. Fields are written in schema order with no field tags or names.

// Concept: Avro schema for package event
{
    "type": "record",
    "name": "PackageEvent",
    "namespace": "com.logistics.events",
    "fields": [
        {"name": "package_id", "type": "string"},
        {"name": "status", "type": "string"},
        {"name": "timestamp_ms", "type": "long"},
        {"name": "warehouse_id", "type": "string"},
        {"name": "weight_kg", "type": "float"}
    ]
}

An Avro-encoded PackageEvent:

// No field tags. Values written in schema order.
// Strings: length-prefixed (varint length + UTF-8 bytes)
// Longs: zigzag varint encoding
// Floats: 4 bytes, little-endian IEEE 754

12 50 4b 47 2d 34 30 32 39 31    // package_id: "PKG-40291" (length 9 + bytes)
14 49 4e 5f 54 52 41 4e 53 49 54  // status: "IN_TRANSIT" (length 10 + bytes)
86 f8 b4 8f c6 64                  // timestamp_ms: 1731675600123 (zigzag varint)
0c 57 48 2d 30 34 32              // warehouse_id: "WH-042" (length 6 + bytes)
9a 99 19 40                        // weight_kg: 2.4 (IEEE 754 float)

Total: 24 bytes. Avro is smaller than Protobuf (28 bytes) because there are no field tags. The cost is that both the writer and reader must have the schema.

Schema Evolution

All three formats handle schema evolution, but with different constraints:

// Concept: Protobuf backward-compatible schema evolution

// Version 1:
message PackageEvent {
    string package_id = 1;
    string status = 2;
}

// Version 2 (backward compatible):
message PackageEvent {
    string package_id = 1;
    string status = 2;
    string warehouse_id = 3;     // New field. Old readers ignore it.
    float weight_kg = 4;         // New field. Old readers ignore it.
    // Field 2 NOT removed. Old data still has it.
}

// Rules for backward compatibility:
// - New fields must be optional (proto3: all fields are optional by default)
// - Never reuse a field number
// - Never change a field's type
// - Never rename a field (names are not transmitted, but they affect generated code)

The Observable Consequence

Benchmark comparing serialization of 1 million PackageEvent messages:

Metric	JSON (Jackson)	Protobuf	Avro
Serialize 1M messages	2,100 ms	280 ms	320 ms
Deserialize 1M messages	2,800 ms	310 ms	360 ms
Total size (1M messages)	280 MB	64 MB	56 MB
Average message size	280 bytes	64 bytes	56 bytes

For the logistics platform processing 50 million messages/day:

Metric	JSON	Protobuf	Avro
Daily bandwidth	14 GB	3.2 GB	2.8 GB
Kafka storage (7-day retention)	98 GB	22.4 GB	19.6 GB
CPU for (de)serialization	~42 core-hours	~5 core-hours	~6 core-hours

The CPU difference is 37 core-hours per day, over 1 full core dedicated to parsing JSON that could be avoided.

The Decision Rule

Use Protobuf when services are developed by different teams and you need strong schema evolution guarantees with compile-time type safety. The .proto file is the contract between services. gRPC uses Protobuf natively.

Use Avro when the primary consumer is a data pipeline (Kafka to ClickHouse, Kafka to data lake) and you want the smallest possible message size. Avro integrates with the Confluent Schema Registry, which manages schema versions and compatibility checks automatically.

Use JSON when the consumer is a browser, a third-party API integration, or a human debugging a Kafka topic with kafkacat. The readability and universal tooling support of JSON justify the overhead for external interfaces and debugging.

For the logistics platform: Protobuf for inter-service gRPC calls (route optimizer to package service). Avro for Kafka topics consumed by ClickHouse and the data pipeline. JSON for the public API consumed by partner integration systems.