Columnar Storage and Encoding

Columnar storage is a paradigm optimized for analytic workloads, characterized by read-mostly access patterns, complex queries, and the need for high throughput. Unlike row-oriented storage, which stores all column values for a single row contiguously, columnar storage organizes data by columns, storing all values of a single column together. This layout enables efficient compression, reduces I/O for queries that only access a subset of columns, and facilitates vectorized query execution, which processes data in batches to leverage CPU SIMD instructions.

Compression Techniques for Columnar Data

Several compression techniques are particularly effective for columnar data due to its homogeneous nature. Run-Length Encoding (RLE) is a simple yet effective method for columns with repeated values, storing each sequence of identical values as a single value and a count. Dictionary Encoding maps distinct values in a column to compact integer IDs, reducing storage size, especially for columns with low cardinality. Bitmap Indexes are used for filtering on low-cardinality columns, representing each distinct value as a bitmap where each bit corresponds to a row, enabling fast bitwise operations for filtering.

Example: Run-Length Encoding Implementation

# Example of Run-Length Encoding (RLE) for a column of integers.
# Input column (already sorted for best compression): [1,1,1,2,2,3,3,3,3]
# RLE encoded as list of (value, run_length) pairs.

input_column = [1, 1, 1, 2, 2, 3, 3, 3, 3]

def rle_encode(column):
    if not column:
        return []
    encoded = []
    current_val = column[0]
    run_length = 1
    for val in column[1:]:
        if val == current_val:
            run_length += 1
        else:
            encoded.append((current_val, run_length))
            current_val = val
            run_length = 1
    encoded.append((current_val, run_length))
    return encoded

encoded = rle_encode(input_column)
print(f"Encoded RLE: {encoded}")
# Output: [(1, 3), (2, 2), (3, 4)]

# Decoding function
def rle_decode(encoded):
    decoded = []
    for value, count in encoded:
        decoded.extend([value] * count)
    return decoded

decoded = rle_decode(encoded)
print(f"Decoded column: {decoded}")
assert decoded == input_column

Vectorized Query Execution

Vectorized query execution is a processing technique where operations are performed on batches of values (vectors) at once, rather than on single values (scalars). This approach reduces per-tuple overhead and leverages CPU SIMD (Single Instruction, Multiple Data) instructions for high throughput. Modern analytical databases and data warehouses often implement vectorized execution engines to improve performance on columnar data.

Example: Vectorized Filter Operation

# Pseudocode for a vectorized filter operation (late materialization style)
# Processes a batch (vector) of column values and a selection bitmap.

import numpy as np

# Column data for 8 rows (a vector)
age_vector = np.array([25, 42, 31, 28, 55, 19, 33, 60], dtype=np.int32)

# Vectorized comparison: produces a boolean mask
predicate_mask = age_vector > 30
print(f"Predicate mask (age > 30): {predicate_mask}")
# Output: [False, True, True, False, True, False, True, True]

# Use mask to select positions (row IDs) that pass the filter
selected_positions = np.where(predicate_mask)[0]
print(f"Selected row positions (IDs): {selected_positions}")
# Output: [1, 2, 4, 6, 7]

# Late materialization: We now have positions, not full rows.
# Subsequent operators can use these positions to fetch other column values.
# E.g., fetch 'salary' for those positions:
salary_vector = np.array([50000, 75000, 60000, 48000, 90000, 35000, 80000, 95000], dtype=np.int32)
selected_salaries = salary_vector[selected_positions]
print(f"Salaries for selected rows: {selected_salaries}")

Materialization Strategies

Materialization refers to the process of assembling a full row from columnar data by combining values from different columns. There are two primary strategies: late materialization and early materialization. Late materialization delays the assembly of full rows until the final result set is needed, reducing I/O and memory bandwidth by only materializing rows that pass all filters. Early materialization assembles rows early in the query pipeline, which can be beneficial for operations that need many columns from a row or for final result output but may increase memory usage and I/O for intermediate results.

Diagram: Late vs. Early Materialization

Diagram: Late vs. Early Materialization in a Columnar Query Pipeline

Description:
- **Query**: SELECT name, salary FROM employees WHERE age > 30 AND dept = 'Sales';
- **Late Materialization Path**:
  1. Scan column 'age' → apply predicate >30 → output bitmap B1 of qualifying row IDs.
  2. Scan column 'dept' → apply predicate = 'Sales' → output bitmap B2.
  3. Bitwise AND(B1, B2) → final bitmap F of row IDs matching both predicates.
  4. **Only now materialize**: Use bitmap F to fetch values from 'name' and 'salary' columns and assemble final result rows.
- **Early Materialization Path**:
  1. Scan and assemble full rows (age, dept, name, salary) from all columns.
  2. Apply predicates (>30 and ='Sales') on the materialized rows, filtering in memory.
  3. Project to output columns (name, salary).
- **Comparison Arrow**: Late materialization minimizes data movement (only moves column data for filtered rows). Early materialization moves all column data for all rows early.

Comparison of Storage Layouts

Workload Characteristic	Transactional (OLTP)	Analytic (OLAP)
Primary Operation	Point reads/writes (by key)	Aggregations, scans, joins
Data Access Pattern	Random I/O (index lookups)	Sequential I/O (full/partial scans)
Write Frequency	High (frequent inserts/updates)	Low (batch loads, infrequent updates)
Read Pattern	Small, selective reads	Large, bulk reads
Storage Layout	Row-oriented (optimized for whole-row access)	Column-oriented (optimized for column scans)
Index Type	B-tree / LSM-tree (for point queries)	Bitmap, columnar projections, zone maps
Optimization Goal	Low latency, high concurrency	High throughput, efficient compression
Example Systems	PostgreSQL, MySQL, Oracle DB	Amazon Redshift, Google BigQuery, Snowflake

Encoding Techniques

Encoding Technique	Best For	Compression Mechanism	Example (Before → After)
Run-Length Encoding (RLE)	Sorted columns, low cardinality	Replace consecutive identical values with (value, count)	[A,A,A,B,B] → [(A,3),(B,2)]
Dictionary Encoding	Columns with repeated values	Map distinct values to compact integer IDs	[‘Red’,‘Blue’,‘Red’] → [0,1,0] + Dict{0:‘Red’,1:‘Blue’}
Bit-Packing	Small-range integers	Store values using minimal bits per value (e.g., 5 bits for values 0-31)	-
Delta Encoding	Sorted, sequential values (e.g., timestamps)	Store difference from previous value	[1000, 1002, 1005] → [1000, +2, +3]
Frame Of Reference	Integer columns with local clustering	Store values relative to a base (min value in block)	Block values [103, 105, 104] → base=100, encoded [3,5,4]

Conclusion

Columnar storage and encoding are foundational elements of modern analytical databases and data warehouses, offering significant advantages in terms of compression, query performance, and storage efficiency. By understanding the principles behind columnar storage, vectorized query execution, and materialization strategies, developers and data analysts can better leverage these technologies to support complex analytics and business intelligence applications.

Sources

This section has drawn on internal knowledge and examples to illustrate key concepts in columnar storage and encoding. For further reading and deeper dives into specific technologies and implementations, readers are encouraged to explore the documentation and research papers related to Amazon Redshift, Google BigQuery, Snowflake, and other leading analytical databases and data warehouses.