Beyond JSON: Protobuf, MessagePack, and Binary Protocols
The main chapter showed Protobuf is 5-6x faster than Jackson JSON for the content platform’s article payloads. This section goes deeper: Protobuf schema design, MessagePack as a middle ground, and the engineering cost of adopting binary protocols.
Protobuf Schema Design for the Content Platform
The content platform’s article domain translates to Protobuf like this:
syntax = "proto3";
package content.v1;
option java_package = "com.contentplatform.proto";
option java_multiple_files = true;
message Article {
string id = 1;
string title = 2;
string body = 3;
repeated string categories = 4;
int64 published_at_millis = 5;
int64 view_count = 6;
ArticleStatus status = 7;
AuthorInfo author = 8;
ContentMetrics metrics = 9;
}
enum ArticleStatus {
ARTICLE_STATUS_UNSPECIFIED = 0;
DRAFT = 1;
PUBLISHED = 2;
ARCHIVED = 3;
}
message AuthorInfo {
string id = 1;
string name = 2;
string avatar_url = 3;
}
message ContentMetrics {
int32 word_count = 1;
int32 reading_time_minutes = 2;
double avg_scroll_depth = 3;
int64 unique_readers = 4;
}
message ArticleFeed {
repeated Article articles = 1;
string next_cursor = 2;
int32 total_count = 3;
}
message ArticleRequest {
string article_id = 1;
repeated string fields = 2; // Field mask for partial responses
}
message ViewEvent {
string article_id = 1;
string session_id = 2;
int64 timestamp_millis = 3;
int32 scroll_depth_percent = 4;
int32 time_on_page_seconds = 5;
}
message ViewEventBatch {
repeated ViewEvent events = 1;
}
Design rules that affect performance:
Use int64 for timestamps, not google.protobuf.Timestamp. The Timestamp well-known type is a nested message (seconds + nanos), which costs an extra tag and length prefix plus two varint fields. If you only need millisecond precision, a single int64 is smaller: a current epoch-millisecond value takes 7 bytes (1 tag byte + 6 varint bytes), while a Timestamp with a nonzero nanos part takes 13-14 bytes.
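Use repeated primitives instead of wrapper messages. repeated string categories costs one tag and one length prefix per element (packed encoding applies only to numeric scalars, not to strings). A repeated Category with message Category { string name = 1; } nests each string inside a message, adding 2 bytes (the outer tag and length) per element.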
Number fields by how often they are set. Protobuf field numbers 1-15 use 1 byte for the tag. Fields 16-2047 use 2 bytes. Because the tag is written every time a field is present on the wire, put the most frequently populated fields in positions 1-15.
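The tag sizes in the last rule follow from how Protobuf encodes a tag: the field number shifted left by three bits, combined with the wire type, written as a base-128 varint. The following is a minimal sketch that reproduces the 1-15 and 16-2047 boundaries. The class and method names (WireSizes, varintSize, tagSize) are illustrative and not part of the Protobuf library.

```java
final class WireSizes {
    /** Bytes needed to encode an unsigned value as a base-128 varint. */
    static int varintSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {   // more than 7 significant bits remain
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /**
     * Bytes in a field's tag. The tag is (fieldNumber << 3) | wireType; the
     * wire type only fills the low three bits, so it never changes the size.
     */
    static int tagSize(int fieldNumber) {
        return varintSize((long) fieldNumber << 3);
    }

    public static void main(String[] args) {
        for (int field : new int[] {1, 15, 16, 2047, 2048}) {
            System.out.println("field " + field + " -> tag bytes: " + tagSize(field));
        }
    }
}
```

Field 2048 is the first to need a 3-byte tag. The same varintSize helper also shows why a small integer such as scroll_depth_percent (at most 100) fits in a single byte.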
Code Generation and Build Integration
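The Protobuf compiler (protoc) generates Java classes from .proto files. Integrate this into the Maven build. The ${os.detected.classifier} property in the configuration is supplied by the os-maven-plugin build extension (kr.motd.maven:os-maven-plugin), which must also be registered in the POM: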
<plugin>
<groupId>org.xolstice.maven.plugins</groupId>
<artifactId>protobuf-maven-plugin</artifactId>
<version>0.6.1</version>
<configuration>
<protocArtifact>
com.google.protobuf:protoc:3.25.1:exe:${os.detected.classifier}
</protocArtifact>
<protoSourceRoot>
${project.basedir}/src/main/proto
</protoSourceRoot>
</configuration>
<executions>
<execution>
<goals><goal>compile</goal></goals>
</execution>
</executions>
</plugin>
The generated Java code is far larger than the source .proto file: for the content platform’s schema (seven messages and one enum in about 50 lines of .proto), the generated code is ~4,000 lines. Because the output is deterministic, it should not be committed to version control. Generate it during the build.
Protobuf Performance at Different Payload Sizes
The serialization advantage of Protobuf varies with payload composition:
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class PayloadSizeBenchmark {
@Param({"tiny", "small", "medium", "large"})
String payloadType;
private byte[] jsonBytes;
private byte[] protobufBytes;
private ObjectMapper mapper;
@Setup(Level.Trial)
public void setup() throws Exception {
mapper = new ObjectMapper()
.registerModule(new JavaTimeModule())
.registerModule(new BlackbirdModule());
switch (payloadType) {
case "tiny" -> {
// View event: mostly integers
jsonBytes = mapper.writeValueAsBytes(createViewEvent());
protobufBytes = createViewEventProto().toByteArray();
}
case "small" -> {
// Article metadata: strings + integers
jsonBytes = mapper.writeValueAsBytes(createMetadata());
protobufBytes = createMetadataProto().toByteArray();
}
case "medium" -> {
// Full article: 4KB body
jsonBytes = mapper.writeValueAsBytes(createArticle());
protobufBytes = createArticleProto().toByteArray();
}
case "large" -> {
// Article feed: 50 articles
jsonBytes = mapper.writeValueAsBytes(createFeed());
protobufBytes = createFeedProto().toByteArray();
}
}
}
@Benchmark
public Object jsonDeserialize() throws Exception {
return switch (payloadType) {
case "tiny" -> mapper.readValue(jsonBytes, ViewEvent.class);
case "small" -> mapper.readValue(jsonBytes, Metadata.class);
case "medium" -> mapper.readValue(jsonBytes, Article.class);
case "large" -> mapper.readValue(jsonBytes,
new TypeReference<List<Article>>() {});
default -> throw new IllegalStateException();
};
}
@Benchmark
public Object protobufDeserialize() throws Exception {
return switch (payloadType) {
case "tiny" -> Proto.ViewEvent.parseFrom(protobufBytes);
case "small" -> Proto.Metadata.parseFrom(protobufBytes);
case "medium" -> Proto.Article.parseFrom(protobufBytes);
case "large" -> Proto.ArticleFeed.parseFrom(protobufBytes);
default -> throw new IllegalStateException();
};
}
}
| Payload | JSON Size | Protobuf Size | JSON Parse | Protobuf Parse | Speed Ratio |
|---|---|---|---|---|---|
| Tiny (view event) | 180 B | 42 B | 310 ns | 48 ns | 6.5x |
| Small (metadata) | 520 B | 215 B | 680 ns | 120 ns | 5.7x |
| Medium (article) | 5,240 B | 3,180 B | 1,820 ns | 310 ns | 5.9x |
| Large (50 articles) | 262 KB | 159 KB | 82 µs | 15 µs | 5.5x |
Key observations:
The speed ratio is consistent across sizes (5.5-6.5x). Protobuf’s advantage does not diminish with payload size. Both formats scale linearly, but Protobuf’s constant factor is lower.
The size ratio varies with content. Tiny payloads (mostly integers) encode to 23% of the JSON size because varints store small numbers in 1-2 bytes and each field is identified by a 1-byte tag, while JSON spells out every quoted field name and writes numbers as ASCII digits. Large payloads (mostly string bodies) encode to 61% because string content takes essentially the same number of bytes in both formats.
The tiny payload result matters most. The content platform processes 50,000 view events per second. At 48 ns vs 310 ns per parse, Protobuf saves 13 ms of CPU per second on view event deserialization alone.
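The savings figure above is straightforward arithmetic on the benchmark numbers. A small sketch makes the calculation explicit. The class and method names are hypothetical, the inputs are the table's per-parse times and the stated event rate, and the monthly figure assumes a 30-day month.

```java
final class ParseSavings {
    /** Nanoseconds of CPU saved per wall-clock second of traffic. */
    static long savedNanosPerSecond(long eventsPerSecond,
                                    long jsonParseNanos,
                                    long protobufParseNanos) {
        return eventsPerSecond * (jsonParseNanos - protobufParseNanos);
    }

    public static void main(String[] args) {
        // 50,000 events/s, 310 ns per JSON parse, 48 ns per Protobuf parse
        long saved = savedNanosPerSecond(50_000, 310, 48);

        double millisPerSecond = saved / 1_000_000.0;
        long secondsIn30Days = 30L * 24 * 60 * 60;
        double cpuSecondsPer30Days = saved * secondsIn30Days / 1_000_000_000.0;

        System.out.println("CPU saved per second:  " + millisPerSecond + " ms");
        System.out.println("CPU saved per 30 days: " + cpuSecondsPer30Days + " s");
    }
}
```

The per-event difference is 262 ns, so the result scales linearly with the event rate.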
MessagePack: Binary JSON Without Schema
MessagePack is a binary encoding of the JSON data model. It uses the same types (maps, arrays, strings, integers) but encodes them compactly. Unlike Protobuf, it requires no schema definition or code generation.
Jackson integrates with MessagePack through jackson-dataformat-msgpack:
// JSON: uses JsonFactory (default)
ObjectMapper jsonMapper = new ObjectMapper();
// MessagePack: swap the factory
ObjectMapper msgpackMapper = new ObjectMapper(new MessagePackFactory());
The same POJOs, annotations, and modules work with both. The only change is the factory.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(2)
@State(Scope.Benchmark)
public class MessagePackBenchmark {
private byte[] jsonBytes;
private byte[] msgpackBytes;
private byte[] protobufBytes;
private ObjectMapper jsonMapper;
private ObjectMapper msgpackMapper;
@Setup(Level.Trial)
public void setup() throws Exception {
jsonMapper = new ObjectMapper()
.registerModule(new JavaTimeModule())
.registerModule(new BlackbirdModule());
msgpackMapper = new ObjectMapper(new MessagePackFactory())
.registerModule(new JavaTimeModule())
.registerModule(new BlackbirdModule());
Article article = createArticle();
jsonBytes = jsonMapper.writeValueAsBytes(article);
msgpackBytes = msgpackMapper.writeValueAsBytes(article);
protobufBytes = createArticleProto().toByteArray();
}
@Benchmark
public Article jsonDeserialize() throws Exception {
return jsonMapper.readValue(jsonBytes, Article.class);
}
@Benchmark
public Article msgpackDeserialize() throws Exception {
return msgpackMapper.readValue(msgpackBytes, Article.class);
}
@Benchmark
public Proto.Article protobufDeserialize() throws Exception {
return Proto.Article.parseFrom(protobufBytes);
}
}
| Format | Deserialize (ns) | Serialize (ns) | Size |
|---|---|---|---|
| JSON (Jackson) | 1,820 | 2,140 | 5,240 B |
| MessagePack | 1,580 | 1,780 | 3,890 B |
| Protobuf | 310 | 420 | 3,180 B |
MessagePack is 13-17% faster than JSON and 26% smaller. Protobuf is still 4-5x faster than MessagePack (5.1x on deserialization, 4.2x on serialization). There are two reasons. First, MessagePack still encodes field names as strings (just in binary), while Protobuf uses integer field tags. Second, MessagePack still goes through Jackson’s databind layer, which matches property names at runtime, while Protobuf uses generated code with direct field access.
MessagePack’s value proposition: you get a 13-17% speedup and a 26% size reduction over JSON with zero schema management cost. On the Jackson side it is a drop-in replacement, provided producers and consumers switch formats together. For teams that cannot take on Protobuf’s schema management overhead, MessagePack is a pragmatic middle ground.
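The field-name overhead can be made concrete by hand-encoding a single integer property in all three formats. This is a sketch under stated assumptions, not how either library is implemented: the encoders below are deliberately minimal (short keys, values up to 16 bits), the property name viewCount is an assumed Jackson name for the POJO field, and field number 6 matches view_count in the schema.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class FieldNameOverhead {
    /** JSON object with one numeric property. */
    static byte[] json(String name, int value) {
        return ("{\"" + name + "\":" + value + "}").getBytes(StandardCharsets.UTF_8);
    }

    /** MessagePack map with one entry; handles short keys and 16-bit values only. */
    static byte[] messagePack(String name, int value) {
        byte[] key = name.getBytes(StandardCharsets.UTF_8);
        if (key.length > 31 || value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("outside the range this sketch handles");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0x81);                 // fixmap header: 1 entry
        out.write(0xA0 | key.length);    // fixstr header: key length in the low 5 bits
        out.write(key, 0, key.length);   // the field name itself
        if (value <= 0x7F) {
            out.write(value);            // positive fixint
        } else if (value <= 0xFF) {
            out.write(0xCC);             // uint 8
            out.write(value);
        } else {
            out.write(0xCD);             // uint 16, big-endian
            out.write(value >>> 8);
            out.write(value & 0xFF);
        }
        return out.toByteArray();
    }

    /** Protobuf varint field: tag followed by the varint-encoded value. */
    static byte[] protobuf(int fieldNumber, long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, (long) fieldNumber << 3);   // wire type 0 (varint)
        writeVarint(out, value);
        return out.toByteArray();
    }

    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));    // 7 payload bits + continuation bit
            v >>>= 7;
        }
        out.write((int) v);
    }

    public static void main(String[] args) {
        System.out.println("JSON:        " + json("viewCount", 1500).length + " bytes");
        System.out.println("MessagePack: " + messagePack("viewCount", 1500).length + " bytes");
        System.out.println("Protobuf:    " + protobuf(6, 1500).length + " bytes");
    }
}
```

For this one property the sizes come out to 18, 14, and 3 bytes. MessagePack removes the punctuation and shrinks the number, but the nine bytes of field name remain. Protobuf replaces them with a one-byte tag.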
Schema Evolution in Practice
The content platform’s article schema evolved three times in six months:
v1 to v2: Added reading_time_minutes field.
// v2: Adding a field is safe
message Article {
string id = 1;
string title = 2;
string body = 3;
repeated string categories = 4;
int64 published_at_millis = 5;
int64 view_count = 6;
int32 reading_time_minutes = 7; // NEW: defaults to 0
}
Old consumers ignore field 7. New consumers read field 7 from new producers and get 0 from old producers. No coordination needed.
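The reason no coordination is needed is visible at the wire level: every field carries its own number, so a reader can skip numbers it does not recognize and fall back to a default for numbers that are absent. The following is a hand-rolled sketch of that mechanism for varint fields only. It is not the Protobuf library's parser, and the class name and the maxKnownField parameter are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

final class UnknownFieldSkipping {
    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static void writeVarintField(ByteArrayOutputStream out, int fieldNumber, long value) {
        writeVarint(out, (long) fieldNumber << 3);   // wire type 0 (varint)
        writeVarint(out, value);
    }

    static long readVarint(byte[] data, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = data[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    /** Decodes varint fields, keeping only field numbers up to maxKnownField. */
    static Map<Integer, Long> decode(byte[] data, int maxKnownField) {
        Map<Integer, Long> fields = new LinkedHashMap<>();
        int[] pos = {0};
        while (pos[0] < data.length) {
            long tag = readVarint(data, pos);
            int fieldNumber = (int) (tag >>> 3);
            if ((tag & 0x7) != 0) {
                throw new IllegalArgumentException("sketch handles varint fields only");
            }
            long value = readVarint(data, pos);      // consumed even when unknown
            if (fieldNumber <= maxKnownField) {
                fields.put(fieldNumber, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream v2 = new ByteArrayOutputStream();
        writeVarintField(v2, 6, 1500);   // view_count
        writeVarintField(v2, 7, 4);      // reading_time_minutes (new in v2)

        // A v1 consumer knows fields 1-6 and skips field 7.
        System.out.println("v1 consumer sees: " + decode(v2.toByteArray(), 6));
    }
}
```

A v2 consumer reading v1 bytes takes the other path: field 7 is simply absent, so a lookup such as getOrDefault(7, 0L) yields the proto3 default of 0.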
v2 to v3: Changed categories from strings to structured objects.
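This is the dangerous evolution. You cannot safely change a field’s type from string to message in place: both use the length-delimited wire type, so old consumers would try to read the encoded Category bytes as a string. The team added a new field instead: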
message Article {
string id = 1;
string title = 2;
string body = 3;
repeated string categories = 4; // DEPRECATED, kept for compat
int64 published_at_millis = 5;
int64 view_count = 6;
int32 reading_time_minutes = 7;
repeated Category structured_categories = 8; // NEW
}
message Category {
string slug = 1;
string display_name = 2;
int32 article_count = 3;
}
Both old and new fields coexist. New producers populate both. New consumers read structured_categories. Old consumers read categories. After all consumers are upgraded, producers stop populating categories.
v3 to v4: Removed the deprecated categories field.
message Article {
reserved 4;
reserved "categories";
// ... remaining fields unchanged
}
The reserved directive prevents accidental reuse of field number 4 or name “categories”.
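Cost of this evolution: three deployment phases (add new field, migrate consumers, remove old field) spread over two weeks. The equivalent JSON change is to add a new field, which consumers that do not know about it ignore, provided they are configured to tolerate unknown properties (Jackson rejects them by default unless FAIL_ON_UNKNOWN_PROPERTIES is disabled). One deployment.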
When Binary Protocols Do Not Justify the Complexity
Binary protocols are not always the right choice. The decision matrix:
| Factor | Choose JSON | Choose Binary |
|---|---|---|
| Consumers | External/browser clients | Internal services only |
| Throughput | < 1,000 msg/s | > 5,000 msg/s |
| Payload size | < 1 KB | > 1 KB |
| Schema changes | Frequent, uncoordinated | Planned, versioned |
| Team size | < 5 engineers | > 5 engineers |
| Debugging | Must read payloads | Can use tooling |
The content platform’s split: JSON for the public API (consumed by browsers, mobile apps, third-party integrations) and Protobuf for internal service communication (article service to search indexer, recommendation engine, analytics pipeline).
Migration Path: JSON to Protobuf
The content platform migrated internal communication in three phases without downtime:
Phase 1: Dual-write. Producers support both JSON and Protobuf, serializing one or the other per request. The request’s Accept header selects the format: Accept: application/protobuf returns Protobuf, and anything else defaults to JSON.
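private static final MediaType PROTOBUF =
        MediaType.parseMediaType("application/protobuf");

@GetMapping("/internal/articles/{id}")
public ResponseEntity<byte[]> getArticle(
        @PathVariable String id,
        @RequestHeader(value = "Accept",
                defaultValue = "application/json") String accept)
        throws JsonProcessingException {
    Article article = articleService.findById(id);
    // Parse the header instead of comparing raw strings, so values such as
    // "application/protobuf, application/json;q=0.5" are still recognized.
    boolean wantsProtobuf = MediaType.parseMediaTypes(accept).stream()
            .anyMatch(PROTOBUF::equalsTypeAndSubtype);
    if (wantsProtobuf) {
        byte[] proto = toProtobuf(article).toByteArray();
        return ResponseEntity.ok().contentType(PROTOBUF).body(proto);
    }
    // writeValueAsBytes throws the checked JsonProcessingException,
    // hence the throws clause on this method.
    byte[] json = objectMapper.writeValueAsBytes(article);
    return ResponseEntity.ok()
            .contentType(MediaType.APPLICATION_JSON)
            .body(json);
}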
Phase 2: Consumer migration. Each consuming service switches to Accept: application/protobuf. Monitor error rates per consumer. Roll back individual consumers if issues arise.
Phase 3: Remove JSON path. After all internal consumers use Protobuf, remove the JSON serialization code from internal endpoints. Keep JSON for public endpoints.
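On the consumer side, Phase 2 amounts to changing one request header, and keeping that choice behind a flag is what makes per-consumer rollback cheap. The following is a minimal sketch using the JDK's java.net.http client. The class name, the useProtobuf flag, and the article-service.internal host are assumptions for illustration, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class ArticleClientRequests {
    static final String PROTOBUF = "application/protobuf";
    static final String JSON = "application/json";

    /**
     * Builds the GET request for one article. Flipping useProtobuf back to
     * false rolls this consumer back to JSON without a producer change.
     */
    static HttpRequest articleRequest(String baseUrl, String articleId, boolean useProtobuf) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/internal/articles/" + articleId))
                .header("Accept", useProtobuf ? PROTOBUF : JSON)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request =
                articleRequest("http://article-service.internal", "42", true);
        System.out.println(request.method() + " " + request.uri());
        System.out.println("Accept: " + request.headers().firstValue("Accept").orElse(""));
    }
}
```

In practice the flag would come from per-service configuration, so each consumer can be switched and monitored independently.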
Total migration time: 4 weeks. CPU savings on the article service: 9% reduction in serialization overhead. Network bandwidth savings on the analytics pipeline: 40% reduction from Protobuf’s compact encoding of view events.
The content platform now processes 50,000 view events per second over Protobuf at 48 ns per parse. The same volume over JSON would cost 310 ns per parse. At 50,000 events/s, that is 13.1 ms/s of CPU saved. Over a month, that is 34,000 seconds of CPU time, which translates directly to infrastructure cost.