When Iceberg Beats Parquet+Projection on AWS Glue: A Performance Comparison
These articles are AI-generated summaries. Please check the original sources for full details.
When does Iceberg beat Parquet+projection on AWS Glue, and when doesn't it?
The comparison runs AWS Glue streaming and batch jobs over stock ticker data ingested through Kinesis and stored in both Iceberg and Parquet. Anomaly detection and OHLC (open, high, low, close) computation are evaluated under three controlled scenarios: stable, trend, and spike patterns.
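For readers unfamiliar with OHLC, the computation can be sketched in plain Python. This is an illustration only, not the Glue job from the original article: the function name `ohlc_by_minute`, the one-minute window, and the sample ticks are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def ohlc_by_minute(ticks):
    """Group (symbol, iso_timestamp, price) ticks into per-minute OHLC bars."""
    buckets = defaultdict(list)
    for symbol, ts, price in ticks:
        when = datetime.fromisoformat(ts)
        minute = when.replace(second=0, microsecond=0)
        buckets[(symbol, minute)].append((when, price))

    bars = {}
    for key, points in buckets.items():
        points.sort(key=lambda p: p[0])  # order by event time, not arrival order
        prices = [price for _, price in points]
        bars[key] = {
            "open": prices[0],
            "high": max(prices),
            "low": min(prices),
            "close": prices[-1],
        }
    return bars


# Made-up sample data; note the third tick arrives out of order.
ticks = [
    ("AMZN", "2024-01-02T09:30:05", 150.0),
    ("AMZN", "2024-01-02T09:30:40", 152.5),
    ("AMZN", "2024-01-02T09:30:20", 149.0),
    ("AMZN", "2024-01-02T09:31:01", 151.0),
]
bars = ohlc_by_minute(ticks)
```

Sorting by event time before taking the first and last price matters for streaming input, where records may arrive out of order.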
Why This Matters
Partition projection is only usable when tables are queried through Athena, forcing Spark-based Glue jobs to fall back to standard metadata and potentially triggering exhaustive S3 scans. Iceberg resolves this by maintaining manifest files that record per-file column statistics (min/max), allowing query engines to skip files without opening them. This provides a significant performance advantage as data scales to the 50-100 GB range.
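The idea behind min/max file skipping can be shown with a toy model. This is a conceptual sketch, not Iceberg's API: `FileStats`, `files_to_scan`, and the file paths are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file metadata, loosely modelled on a manifest entry (hypothetical)."""
    path: str
    min_symbol: str
    max_symbol: str


def files_to_scan(manifest, symbol):
    """Keep only files whose [min, max] range could contain `symbol`."""
    return [f.path for f in manifest if f.min_symbol <= symbol <= f.max_symbol]


manifest = [
    FileStats("data/part-0.parquet", "AAPL", "AMZN"),
    FileStats("data/part-1.parquet", "GOOG", "MSFT"),
    FileStats("data/part-2.parquet", "NFLX", "TSLA"),
]

# Only one file's range can contain "INTC"; the other two are skipped unopened.
print(files_to_scan(manifest, "INTC"))  # → ['data/part-1.parquet']
```

A range match only means the file may contain the value, so the engine still reads the surviving files. The saving comes from the files it never has to open.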
Key Insights
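- Iceberg manifest pruning is effectively O(1) over partition count, because planning reads manifest metadata instead of listing S3 prefixes, whereas standard Parquet S3 LIST operations scale at O(n) with the number of partitions.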
- Firehose data format conversion enforces a minimum 64 MB buffering size, while native Iceberg ingestion supports 1 MB buffers.
- The OpenXJsonSerDe used by Firehose fails on ISO 8601 timestamps, requiring a string-type raw layer and casting in the Spark transformation layer.
- Spark SQL extensions for Iceberg must be applied before SparkSession initialization; runtime configuration changes via spark.conf.set are ignored.
- Iceberg’s column statistics enable file skipping on non-partition filters, such as ticker_symbol, which Parquet cannot achieve without full file reads.
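The string-first approach to timestamps mentioned above can be sketched in plain Python. In the Glue job the equivalent step would be a cast inside the Spark transformation. The helper `cast_event_time`, the `event_time` field name, and the sample rows are assumptions for illustration.

```python
from datetime import datetime, timezone


def cast_event_time(raw):
    """Parse an ISO 8601 string; return None when the value is malformed."""
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return None
    # Treat naive timestamps as UTC (an assumption of this sketch).
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


# Raw layer: every field stays a string, so ingestion never rejects a record.
raw_rows = [
    {"ticker_symbol": "AMZN", "event_time": "2024-01-02T09:30:05Z"},
    {"ticker_symbol": "AMZN", "event_time": "not-a-timestamp"},
]

# Transformation layer: cast to a real timestamp, keeping bad values as None.
typed_rows = [
    {**row, "event_time": cast_event_time(row["event_time"])} for row in raw_rows
]
```

Deferring the cast keeps malformed values visible in the raw layer, where they can be inspected, instead of failing at ingestion.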
Working Examples
Terraform configuration for injecting Iceberg Spark settings into Glue job default arguments.
locals {
  iceberg_spark_conf = join(" --conf ", [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse=s3://${data.aws_s3_bucket.main.id}/iceberg/",
    "spark.sql.defaultCatalog=glue_catalog",
  ])
}
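The join on " --conf " can look odd at first. One reading, assuming the local is assigned to the job's "--conf" default argument (that assignment is not shown in the source), is that the argument map can hold only one --conf key, so every further setting is chained inside its value. A plain Python mirror of the join, using a shortened settings list, shows the resulting string:

```python
# Shortened copy of the settings list, for illustration only.
settings = [
    "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.defaultCatalog=glue_catalog",
]

# Same operation as Terraform's join(" --conf ", [...]).
conf_value = " --conf ".join(settings)

# Assumed shape of the job's default arguments: a single "--conf" key.
default_arguments = {"--conf": conf_value}

print(conf_value)
```

The first setting carries no prefix because the map key itself supplies the leading --conf; each later setting brings its own.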
Python method for injecting Iceberg configuration by restarting the SparkContext before SparkSession creation.
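from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Start a context only to obtain the job's base configuration.
sc = SparkContext()
conf = sc.getConf()
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
# Stop and recreate the context so the Iceberg settings are in place
# before any SparkSession is created.
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)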
Practical Applications
- Use Case: Deploying Iceberg for datasets where queries frequently filter on non-partitioned columns to leverage O(1) file skipping. Pitfall: Attempting to register Iceberg configurations after SparkSession init results in ‘Catalog plugin class not found’ errors.
- Use Case: Implementing Parquet with partition projection for Athena-only workloads to minimize Glue Crawler costs. Pitfall: Reading projection-based tables from Spark without manual partition registration leads to empty results or full scans.
- Use Case: Using Firehose for real-time ingestion with format conversion. Pitfall: Setting buffering_size below 64 MB triggers an InvalidArgumentException when Parquet conversion is enabled.
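The Firehose pitfall above can be caught before deployment with a small pre-flight check. This is a hypothetical helper, not part of any AWS SDK; the 64 MB threshold is the figure cited in the text, and the function and constant names are invented for the sketch.

```python
# Minimum buffer size cited in the text when format conversion is enabled.
MIN_CONVERSION_BUFFER_MB = 64


def check_buffer_size(size_mb, format_conversion_enabled):
    """Return size_mb if it is acceptable, otherwise raise ValueError."""
    if format_conversion_enabled and size_mb < MIN_CONVERSION_BUFFER_MB:
        raise ValueError(
            f"buffering size {size_mb} MB is below the "
            f"{MIN_CONVERSION_BUFFER_MB} MB minimum for format conversion"
        )
    return size_mb


# Accepted: conversion enabled with a large enough buffer.
check_buffer_size(128, format_conversion_enabled=True)

# Accepted: a small buffer is fine when conversion is off.
check_buffer_size(1, format_conversion_enabled=False)
```

Running a check like this in CI or a deployment script surfaces the misconfiguration locally, rather than as an error from the service at apply time.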