Optimizing Image Comparison with AVX512 Assembly in odiff
These articles are AI-generated summaries. Please check the original sources for full details.
Optimizing Image Comparison with AVX512 Assembly in odiff
This article details the author’s contribution to the odiff image comparison tool by rewriting its core algorithm using AVX512 assembly to achieve significant performance gains. The implementation, named vxdiff, leverages SIMD (Single Instruction, Multiple Data) capabilities to process image data in parallel, reducing computation time for pixel-by-pixel comparisons. The work was part of the 2025 Hacktoberfest challenge and integrates into odiff via the --enable-asm flag.
Key Concepts and Implementation Details
1. Background on odiff
- Purpose: odiff is a high-performance image comparison tool that uses the YIQ color space (instead of RGB) to better detect perceptual differences between images.
- Rewritten in Zig: Originally in OCaml, odiff was transitioned to Zig for better SIMD support, but the core algorithm remained unoptimized for AVX512.
- vxdiff: A pure AVX512 assembly reimplementation of odiff’s core logic, focusing on the default comparison options.
2. SIMD and AVX512 Overview
- SIMD Architecture: Traditional CPUs process one data element at a time, but SIMD extensions (like AVX512) allow parallel operations on 64-byte (512-bit) chunks of data.
- AVX512 Capabilities:
- Processes 16 RGBA pixels at once (each pixel is 4 bytes).
- Supports advanced masking and vector operations for efficient data manipulation.
- Enables array-like operations at the hardware level, similar to functional programming languages.
3. Core Algorithm in AVX512 Assembly
The algorithm processes images in RGBA8 format (3 color channels + 1 alpha channel, 8 bits per channel). Key steps include:
Step 1: Image Preprocessing
- Load image data into registers (
xmm1,xmm2) usingvmovdqu8. - Replace transparent pixels (alpha = 0) with white using mask registers (
k6,k7) andvpcmpequbfor comparison, followed byvmovdqu8to apply the replacement.
kxor k6, k6, k6
vpcmpequb k6 {k4}, xmm1, xmm0 ; Compare alpha channel to 0
kshiftlb k6, k6, 1 ; Propagate mask to RGB channels
vmovdqu8 xmm1 {k7}, xmm31 ; Replace masked pixels with white
Step 2: Data Conversion and Normalization
- Convert bytes to 32-bit integers with
vpmovzxbd, then to floats usingvcvtudq2ps. - Normalize alpha values from
0-255to0.0-1.0usingvmulpswith a scaling factor.
vpmovzxbd zmm1, xmm1
vcvtudq2ps zmm1, zmm1
vmulps zmm1 {k4}, zmm1, zmm3 ; Scale alpha to [0.0, 1.0]
Step 3: Alpha Blending and YIQ Conversion
- Blend RGB channels with white using alpha via
(RGB - 255.0) * A + 255.0. - Convert RGB to YIQ using precomputed matrices (
zmm7,zmm8,zmm9) andvmulps.
vmulps zmm10, zmm1, zmm7 ; Y component
vmulps zmm11, zmm1, zmm8 ; I component
vmulps zmm12, zmm1, zmm9 ; Q component
Step 4: Difference Calculation
- Compute the difference between base and comparison image YIQ values.
- Square the differences and multiply by a delta coefficient (
zmm29).
vsubps zmm16, zmm16, zmm26 ; YIQ difference
vmulps zmm16, zmm16, zmm16 ; Square the differences
vmulps zmm16, zmm16, zmm29 ; Multiply by delta coefficient
Step 5: Summing Differences and Counting Pixels
- Aggregate YIQ channel differences into a single value using
vaddps. - Compare against a threshold (
xmm28) and count differing pixels withpopcnt.
vcmpgtps k6, xmm16, xmm28
kmov eax, k6
popcnt eax, eax ; Count differing pixels
Working Example (AVX512 Assembly)
section .data
; Precomputed YIQ conversion matrix
zmm7: dd 0.299, 0.587, 0.114, 0.0, 0.0, 0.0, 0.0, 0.0, ...
zmm8: dd 0.5957, -0.2744, -0.3213, 0.0, 0.0, 0.0, 0.0, 0.0, ...
zmm9: dd 0.2114, -0.5226, 0.3112, 0.0, 0.0, 0.0, 0.0, 0.0, ...
section .text
global vxdiff
vxdiff:
; Load image data into registers
vmovdqu8 xmm1, [rdi] ; Base image
vmovdqu8 xmm2, [rsi] ; Comparison image
; Replace alpha=0 with white
kxor k6, k6, k6
vpcmpequb k6 {k4}, xmm1, xmm0
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
kshiftlb k6, k6, 1
kor k7, k7, k6
vmovdqu8 xmm1 {k7}, xmm31
; Convert to float and normalize alpha
vpmovzxbd zmm1, xmm1
vcvtudq2ps zmm1, zmm1
vmulps zmm1 {k4}, zmm1, zmm3 ; zmm3 = 1/255.0
; Alpha blending and YIQ conversion
; ... (omitted for brevity)
; Final difference count
vcmpgtps k6, xmm16, xmm28
kmov eax, k6
popcnt eax, eax
ret
Recommendations for AVX512 Implementation
- Start Simple: Focus on core functionality (e.g., basic pixel comparison) before adding advanced features like alpha blending or YIQ conversion.
- Use Mask Registers: Leverage AVX512 masks (
k0–k7) to conditionally apply operations to specific data elements (e.g., alpha channel). - Test for Compatibility: Ensure your code works on CPUs that support AVX512 (Intel Xeon Silver/Gold, AMD EPYC Zen 3+).
- Profile Performance: Use tools like
perforvalgrindto identify bottlenecks and optimize register usage. - Avoid Over-Vectorization: Balance between parallelism and readability; overly complex vector operations can reduce maintainability.
Potential Pitfalls
- Mask Register Misuse: Incorrect mask propagation (e.g.,
kshiftlb) can lead to unintended data modifications. - Data Alignment: AVX512 requires 64-byte aligned memory for optimal performance (use
align 64in assembly). - Floating-Point Precision: Conversions between integers and floats (
vpmovzxbd,vcvtudq2ps) may introduce rounding errors. - Portability: AVX512 is not supported on ARM or older x86 CPUs; use
#ifdefguards or fallback implementations if needed.
Integration with odiff
vxdiff is now part of the odiff project. To use it, activate the AVX512 backend with:
odiff --enable-asm image1.png image2.png
This provides a significant speed boost for large-scale image comparisons, leveraging hardware-level parallelism.
Reference
Continue reading
Next article
Comparing the Top 6 OCR Models in 2025: A Comprehensive Analysis
Related Content
Building Advanced Technical Analysis and Backtesting Workflows with pandas-ta-classic
Learn to implement a complete trading workflow using pandas-ta-classic, including RSI-based signals and Sharpe ratio performance metrics.
Cloud Resume Challenge - Chunk 4: Professional DevOps Practices with Terraform and AWS
This article details the implementation of infrastructure-as-code, supply chain security, and AWS best practices for a production-ready Cloud Resume project using Terraform, GitHub Actions, and AWS services.
Mastering AWS Lambda for Real-Time Pipelines: A Technical Deep Dive
Optimize AWS Lambda performance using memory-CPU scaling, VPC integration, and Kinesis stream processing with a 15-minute execution limit.