When Iceberg Beats Parquet+Projection on AWS Glue: A Performance Comparison

When does Iceberg beat Parquet+projection on AWS Glue, and when doesn’t it?

The comparison runs AWS Glue streaming and batch jobs over stock ticker data ingested through Kinesis, storing it in both Iceberg and Parquet formats. Both formats are evaluated on anomaly detection and OHLC (open, high, low, close) computation under controlled scenarios: stable, trend, and spike patterns.

Why This Matters

Partition projection works only when tables are queried through Athena. Spark-based Glue jobs cannot use it, so they fall back to standard catalog metadata and may trigger exhaustive S3 scans. Iceberg avoids this by recording per-file column statistics (min/max) in its manifests, which lets query engines skip files without opening them. The performance advantage becomes significant as data scales to the 50-100 GB range.
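The file-skipping idea can be sketched in plain Python. This is only an illustration of min/max pruning, not Iceberg's actual manifest format: the FileStats class, the file paths, and the symbol ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file column bounds, roughly what a manifest entry records (hypothetical shape)."""
    path: str
    min_symbol: str
    max_symbol: str


def files_to_read(manifest: list[FileStats], symbol: str) -> list[str]:
    """Keep only the files whose [min, max] range could contain `symbol`."""
    return [f.path for f in manifest if f.min_symbol <= symbol <= f.max_symbol]


# Hypothetical manifest: three data files with their ticker_symbol bounds.
manifest = [
    FileStats("data/file-a.parquet", "AAPL", "GOOG"),
    FileStats("data/file-b.parquet", "IBM", "MSFT"),
    FileStats("data/file-c.parquet", "NFLX", "TSLA"),
]

# Only file-a can hold AMZN, so the other two are never opened.
print(files_to_read(manifest, "AMZN"))
```

The decision uses metadata alone, which is why no data file has to be opened to rule it out.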

Key Insights

Iceberg manifest pruning is O(1) over partition count, whereas standard Parquet S3 LIST operations scale at O(n).
Firehose data format conversion enforces a minimum 64 MB buffering size, while native Iceberg ingestion supports 1 MB buffers.
The OpenXJsonSerDe used by Firehose fails on ISO 8601 timestamps, requiring a string-type raw layer and casting in the Spark transformation layer.
Spark SQL extensions for Iceberg must be applied before SparkSession initialization; runtime configuration changes via spark.conf.set are ignored.
Iceberg’s column statistics enable file skipping on non-partition filters, such as ticker_symbol, which a plain Parquet table cannot do without opening every file to read its footer statistics.
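To make the O(n) listing cost concrete, here is a rough count of S3 LIST requests for a partitioned Parquet table. It assumes a simplified reader that lists each partition prefix separately. The partition layout and file counts are hypothetical, and real engines may parallelize or cache listings.

```python
import math

S3_LIST_PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per request


def list_requests(files_per_partition: list[int]) -> int:
    """LIST calls needed to enumerate every partition prefix, one prefix at a time."""
    return sum(max(1, math.ceil(n / S3_LIST_PAGE_SIZE)) for n in files_per_partition)


# Hypothetical layout: 24 hourly partitions per day, 40 files in each.
one_day = [40] * 24
one_year = [40] * 24 * 365

# The request count grows linearly with the number of partitions.
print(list_requests(one_day), list_requests(one_year))
```

An Iceberg reader takes file paths from the table's manifests instead of listing prefixes, so this particular cost does not grow with the partition count.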

Working Examples

Terraform configuration for injecting Iceberg Spark settings into Glue job default arguments.

locals {
  iceberg_spark_conf = join(" --conf ", [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse=s3://${data.aws_s3_bucket.main.id}/iceberg/",
    "spark.sql.defaultCatalog=glue_catalog",
  ])
}
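The join can be easier to follow once the resulting string is visible. The Python sketch below mimics it for three of the settings. It assumes the local is passed as the value of the job's "--conf" default argument, which the Terraform excerpt does not show. The glue_conf_argument helper is hypothetical.

```python
def glue_conf_argument(settings: list[str]) -> str:
    """Mimic Terraform's join(" --conf ", settings)."""
    return " --conf ".join(settings)


settings = [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.defaultCatalog=glue_catalog",
]

# Assumption: the joined string becomes the value of the single "--conf" key.
default_arguments = {"--conf": glue_conf_argument(settings)}
print(default_arguments["--conf"])
```

The first setting carries no "--conf" prefix because the argument key supplies it. Every later setting brings its own prefix inside the value.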

Python method for injecting Iceberg configuration by restarting the SparkContext before SparkSession creation.

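from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Copy the configuration of the initial context, then add the Iceberg settings.
sc = SparkContext()
conf = sc.getConf()
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
# Restart the context so the settings are in place before any SparkSession exists.
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)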

Practical Applications

Use Case: Deploying Iceberg for datasets where queries frequently filter on non-partitioned columns to leverage O(1) file skipping. Pitfall: Attempting to register Iceberg configurations after SparkSession init results in ‘Catalog plugin class not found’ errors.
Use Case: Implementing Parquet with partition projection for Athena-only workloads to minimize Glue Crawler costs. Pitfall: Reading projection-based tables from Spark without manual partition registration leads to empty results or full scans.
Use Case: Using Firehose for real-time ingestion with format conversion. Pitfall: Setting buffering_size below 64 MB triggers an InvalidArgumentException when Parquet conversion is enabled.
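The Firehose pitfall can be caught before deployment with a small pre-flight check. This is a minimal sketch that uses the 64 MB minimum stated above. The function name and its parameters are hypothetical and not part of any AWS SDK.

```python
MIN_CONVERSION_BUFFER_MB = 64  # minimum from the text when format conversion is enabled


def check_buffer_size(buffer_mb: int, format_conversion: bool) -> None:
    """Fail early instead of waiting for Firehose to reject the configuration."""
    if format_conversion and buffer_mb < MIN_CONVERSION_BUFFER_MB:
        raise ValueError(
            f"buffering_size={buffer_mb} MB is below the "
            f"{MIN_CONVERSION_BUFFER_MB} MB minimum for format conversion"
        )


check_buffer_size(64, format_conversion=True)   # accepted
check_buffer_size(1, format_conversion=False)   # accepted: no conversion involved
```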

References:

https://dev.to/bilardi/when-does-iceberg-beat-parquetprojection-on-aws-glue-and-when-doesnt—2g2
